Commit 5cff306

[Doc] Add developer guide of EPLB. (#3759)

### What this PR does / why we need it?
Add developer guide of EPLB.

- vLLM version: v0.11.0
- vLLM main: vllm-project/vllm@83f478b

Signed-off-by: offline0806 <[email protected]>
Co-authored-by: offline0806 <[email protected]>

1 parent e0c23cb

3 files changed: +223 -0 lines changed

docs/source/assets/eplb.png (25.1 KB)

New developer guide file: 222 additions, 0 deletions

# Expert Parallelism Load Balancer (EPLB)

## Why Do We Need EPLB?
When using Expert Parallelism (EP), different experts are assigned to different NPUs. Given that the load of various experts may vary depending on the current workload, it is crucial to maintain balanced loads across different NPUs. We adopt a redundant experts strategy by duplicating heavily-loaded experts. Then, we heuristically pack these duplicated experts onto NPUs to ensure load balancing across them. Moreover, thanks to the group-limited expert routing used in MoE models, we also attempt to place experts of the same group on the same node to reduce inter-node data traffic, whenever possible.

To facilitate reproduction and deployment, vLLM Ascend provides the deployed EP load-balancing algorithms in `vllm_ascend/eplb/core/policy`. The algorithm computes a balanced expert replication and placement plan based on the estimated expert loads. Note that the exact method for predicting expert loads is outside the scope of this repository. A common method is to use a moving average of historical statistics.

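For illustration, here is a minimal sketch of such a moving-average load predictor, assuming per-step expert token counts are available as a `(num_layers, num_experts)` tensor. The class and attribute names are hypothetical and not part of vllm_ascend:

```python
import torch

class MovingAverageLoadEstimator:
    """Hypothetical sketch: smooth per-expert load with an exponential moving average."""

    def __init__(self, num_layers: int, num_experts: int, alpha: float = 0.9):
        self.alpha = alpha
        # Smoothed load estimate per (layer, logical expert).
        self.ema_load = torch.zeros(num_layers, num_experts)

    def update(self, step_load: torch.Tensor) -> torch.Tensor:
        # step_load: tokens routed to each expert in the last window,
        # shape (num_layers, num_experts).
        self.ema_load = self.alpha * self.ema_load + (1 - self.alpha) * step_load
        return self.ema_load
```

The smoothed estimate can then be handed to a policy as the `expert_workload` input described below.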

![eplb](../../assets/eplb.png)

## How to Use EPLB?
Please refer to the EPLB section of the user guide for detailed information: [How to Use EPLB](../../user_guide/feature_guide/eplb_swift_balancer.md)

## How Does It Work?

**EPLB Module Architecture**

```
vllm_ascend
├── eplb
│   ├── adaptor
│   │   ├── abstract_adaptor.py
│   │   ├── vllm_adaptor.py
│   ├── core
│   │   ├── policy
│   │   │   ├── policy_abstract.py
│   │   │   ├── policy_dynamic_ep.py
│   │   │   ├── policy_dynamic_ep_v2.py
│   │   │   ├── policy_factory.py
│   │   │   ├── policy_flashlb.py
│   │   ├── eplb_device_transfer_loader.py
│   │   ├── eplb_utils.py
│   │   ├── eplb_worker.py
│   ├── eplb_updator.py
│   ├── utils.py
└───────────
```

**1. Adaptor Module**
*Handles registration and adaptation for different MoE model types*

- `abstract_adaptor.py`
  Abstract base class defining unified registration interfaces for EPLB adapters
- `vllm_adaptor.py`
  Implementation supporting Qwen3-MoE and DeepSeek models, standardizing parameter handling for policy algorithms

**2. Core Module**
*Implements core algorithms, updates, and asynchronous processing*

- **Policy Submodule**
  *Load balancing algorithms with factory pattern instantiation*
  - `policy_abstract.py`
    Abstract class for load balancing strategy interfaces
  - `policy_dynamic_ep.py`
    Default implementation of the open-source EPLB paper algorithm
  - `policy_dynamic_ep_v2.py`
    Enhanced version optimizing expert swaps for low-bandwidth devices (e.g., A2)
  - `policy_flashlb.py`
    Threshold-based adjustment reducing operational costs through layer-wise fluctuation detection
  - `policy_factory.py`
    Strategy factory for automatic algorithm instantiation
- `eplb_device_transfer_loader.py`
  Manages expert table/weight transmission and updates
- `eplb_utils.py`
  Utilities for expert table initialization and mapping
- `eplb_worker.py`
  Asynchronous algorithm orchestration and result processing

**3. System Components**

- `eplb_updator.py`
  Central coordinator for load balancing during inference workflows
- `utils.py`
  General utilities for EPLB interface registration

### Default Algorithm

#### Hierarchical Load Balancing
When the number of server nodes evenly divides the number of expert groups, we use the hierarchical load balancing policy to leverage group-limited expert routing. We first pack the expert groups onto nodes evenly, ensuring balanced loads across different nodes. Then, we replicate the experts within each node. Finally, we pack the replicated experts onto individual NPUs to ensure load balancing across them. The hierarchical load balancing policy can be used in the prefilling stage with a smaller expert-parallel size.

#### Global Load Balancing
In other cases, we use the global load balancing policy, which replicates experts globally regardless of expert groups, and packs the replicated experts onto individual NPUs. This policy can be adopted in the decoding stage with a larger expert-parallel size.

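For intuition, the following is a simplified sketch of the global policy's core idea: replicate the hottest experts (judged by load per replica) and then pack the replicas onto NPUs least-loaded-first. It is an illustration only, not the actual `policy_dynamic_ep.py` implementation, and the function name and signature are made up for the example:

```python
import heapq

def global_balance_sketch(expert_load, num_npus, slots_per_npu):
    """Illustrative only: replicate hot experts, then greedily pack replicas onto NPUs.

    expert_load: estimated load per logical expert (one float per expert).
    Assumes num_npus * slots_per_npu >= len(expert_load).
    Returns a list of expert ids placed on each NPU.
    """
    num_experts = len(expert_load)
    num_slots = num_npus * slots_per_npu

    # Step 1: one replica per expert, then give extra replicas to the experts
    # with the highest load per replica until all slots are used.
    replicas = [1] * num_experts
    heap = [(-expert_load[e], e) for e in range(num_experts)]
    heapq.heapify(heap)
    for _ in range(num_slots - num_experts):
        _, e = heapq.heappop(heap)
        replicas[e] += 1
        heapq.heappush(heap, (-expert_load[e] / replicas[e], e))

    # Step 2: pack replicas onto NPUs, heaviest replica first, always choosing
    # the least-loaded NPU that still has a free slot.
    replica_list = sorted(
        [(expert_load[e] / replicas[e], e)
         for e in range(num_experts) for _ in range(replicas[e])],
        reverse=True)
    placement = [[] for _ in range(num_npus)]
    npu_load = [0.0] * num_npus
    for load, e in replica_list:
        free = [n for n in range(num_npus) if len(placement[n]) < slots_per_npu]
        target = min(free, key=lambda n: npu_load[n])
        placement[target].append(e)
        npu_load[target] += load
    return placement
```

The hierarchical policy follows the same replicate-then-pack pattern, but first assigns whole expert groups to nodes before balancing within each node.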

### Add a New EPLB Policy
If you want to add a new EPLB policy to vllm_ascend, follow these steps:

1. Inherit the `EplbPolicy` abstract class in `policy_abstract.py` and override the `rebalance_experts` interface, keeping the input parameters (`current_expert_table`, `expert_workload`) and the return value (`newplacement`) consistent.

   For example:

   ```python
   import copy

   class RandomLoadBalance(EplbPolicy):

       def __init__(self, config: DynamicConfig):
           super().__init__(config)

       def rebalance_experts(self, current_expert_table, expert_workload):
           new_table = copy.deepcopy(current_expert_table)
           num_layers = len(current_expert_table)

           for i in range(num_layers):
               # choose two cards whose redundant expert slots will be swapped
               # (hard-coded here; e.g. indices = random.sample(range(num_card), 2))
               indices = [3, 1]

               # swap the redundant (last) expert slot between the two cards
               expert_id_to_exchange = new_table[i][indices[0]][-1].clone()
               new_table[i][indices[0]][-1] = new_table[i][indices[1]][-1]
               new_table[i][indices[1]][-1] = expert_id_to_exchange

           return 1, [-i for i in range(num_layers)], new_table
   ```

2. To add the new EPLB algorithm, register the policy type and its corresponding implementation class in the `PolicyFactory` of `policy_factory.py`, as sketched below.

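The exact contents of `policy_factory.py` are not reproduced here, so the following is only a hypothetical sketch of what such a registration might look like; the method name, policy-type ids, and class names other than `RandomLoadBalance` are assumptions:

```python
# Hypothetical sketch -- the real PolicyFactory in policy_factory.py may differ.
from typing import Dict, Type

class PolicyFactory:
    # Map a policy type id to its implementation class; extend this dict
    # with your new policy so it can be selected via configuration.
    _policies: Dict[int, Type[EplbPolicy]] = {
        0: DynamicEplb,        # assumed name for policy_dynamic_ep.py
        1: DynamicEplbV2,      # assumed name for policy_dynamic_ep_v2.py
        2: FlashLB,            # assumed name for policy_flashlb.py
        3: RandomLoadBalance,  # the new policy from the example above
    }

    @classmethod
    def generate_policy(cls, policy_type: int, config: DynamicConfig) -> EplbPolicy:
        # Fall back to the default policy if the type is unknown.
        return cls._policies.get(policy_type, DynamicEplb)(config)
```
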
### Add a New MoE Model

**Implementation Guide for Model Integration**

1. **Adapter File Modification**
   - Inherit or modify `vllm_ascend/eplb/adaptor/vllm_adaptor.py`
   - Add processing logic for key parameters:
     - `num_dense_layers`
     - `global_expert_num`
     - `num_moe_layers`
   - Ensure parameter synchronization in the `model_register` function.

   For example:

   Modify `__init__` of `vllm_adaptor.py` to add the new MoE model's EPLB parameters:

   ```python
   if self.model.config.model_type == "qwen3_moe":
       self.num_dense_layers = 0
       self.global_expert_num = self.model.config.num_experts
   ```

   Modify `model_register` of `vllm_adaptor.py` to register the EPLB parameters for the new MoE model:

   ```python
   if config.model_type == "qwen3_moe":
       model.num_moe_layers = config.num_hidden_layers
   ```

2. **MoE Feature Integration**
   - Extend `vllm_ascend/eplb/utils.py` with MoE-specific methods
   - Implement required functionality for expert routing or weight management

3. **Registration Logic Update**
   - Add patch logic within the `model_register` function
   - Maintain backward compatibility with existing model types

4. **Validation & Testing**
   - Verify parameter consistency across layers
   - Test cross-device communication for expert tables
   - Benchmark against baseline implementations (e.g., Qwen3-MoE)

*Key Implementation Notes:*
- Preserve existing interface contracts in abstract classes
- Use decorators for non-intrusive patch integration (see the sketch below)
- Leverage `eplb_utils.py` for shared expert mapping operations

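As a hedged illustration of the decorator-based patching mentioned above, a non-intrusive patch might look like the sketch below. The decorator and function names are hypothetical and do not reflect the actual `model_register` internals:

```python
# Hypothetical sketch of non-intrusive patching; the real registration logic
# in vllm_ascend may be organized differently.
import functools

def register_eplb_params(model_type: str):
    """Attach EPLB bookkeeping after the wrapped setup function runs."""
    def decorator(setup_fn):
        @functools.wraps(setup_fn)
        def wrapper(model, config, *args, **kwargs):
            result = setup_fn(model, config, *args, **kwargs)
            if config.model_type == model_type:
                # Attach per-model EPLB attributes without touching the
                # original model implementation.
                model.num_moe_layers = config.num_hidden_layers
            return result
        return wrapper
    return decorator
```

Applying such a decorator to the existing setup path keeps the patch additive, so other model types keep their original behavior.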

## DFX

### Parameter Validation

#### Integer Parameters
All integer input parameters must have explicit maximum and minimum bounds and must be validated against them. For example, `num_iterations_eplb_update` must be greater than 0:

```python
import sys

@staticmethod
def check_iterations(iterations):
    if not isinstance(iterations, int):
        raise TypeError(f"The value {iterations} is not an int.")
    if iterations <= 0:
        raise ValueError(
            f"The value {iterations} cannot be less than or equal to 0.")
    if iterations > sys.maxsize:
        raise ValueError(
            f"The value {iterations} cannot be larger than {sys.maxsize}.")
```

#### File Path
The file path used by EPLB must be validated: check that the path is legal and that it has the appropriate read and write permissions. For example:

```python
import os

@staticmethod
def check_expert_map_path(expert_map):
    if expert_map is None:
        return
    if not isinstance(expert_map, str):
        raise TypeError("The expert_map is not a str.")
    if not expert_map.strip():
        raise ValueError("The expert_map is empty.")
    _, ext = os.path.splitext(expert_map)
    if ext.lower() != ".json":
        raise TypeError("The expert_map is not a json file.")
    if not os.path.exists(expert_map):
        raise ValueError("The expert_map does not exist.")
    try:
        # Open read-only; opening with "w" would truncate the file.
        with open(expert_map, "r", encoding='utf-8') as f:
            f.read()
    except Exception as e:
        raise IOError(
            f"Failed to read expert info from {expert_map}, please check the read permission of {expert_map}: {e}"
        )
```

### Function Specifications

#### Initialization Function
All EPLB parameters must be initialized with default values during initialization, with their parameter types specified, so that they can be handled properly.

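As a hedged illustration, typed defaults might look like the sketch below; the class and field names are hypothetical placeholders, not the actual vllm_ascend configuration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EplbConfig:
    # Hypothetical fields: every parameter carries an explicit type and default.
    dynamic_eplb: bool = False
    num_iterations_eplb_update: int = 1   # placeholder default value
    expert_map_path: Optional[str] = None
```
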
#### General Functions
All method arguments must specify parameter types and default values, and functions must include default return-value handling for those default arguments. It is recommended to wrap the function body in a `try-except` block, specifying the type of exception to capture and how failures are handled (e.g., logging the exception or returning a failure status).

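For example, a minimal sketch of the recommended pattern; the function and logger names are illustrative only:

```python
import logging

logger = logging.getLogger(__name__)

def load_expert_map(path: str = "", default=None):
    """Illustrative only: typed arguments with defaults, plus explicit failure handling."""
    if not path:
        # Default-argument handling: return the default instead of failing.
        return default
    try:
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    except (OSError, ValueError) as e:
        # Capture specific exception types, log them, and report failure.
        logger.exception("Failed to load expert map from %s: %s", path, e)
        return default
```
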

### Consistency

#### Expert Map
The expert map must be globally unique during initialization and updates. In a multi-node scenario, distributed communication should be used at initialization to verify that the expert maps are consistent across ranks. If they are inconsistent, the user should be told which ranks hold inconsistent maps.

During an update, if only a few layers or the expert table of a single rank has changed, the updated expert table must be synchronized into the EPLB context to keep it globally consistent.

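A minimal sketch of such a consistency check, assuming an initialized `torch.distributed` process group and an expert map held as a tensor; the function name is illustrative only:

```python
import torch
import torch.distributed as dist

def verify_expert_map_consistency(expert_map: torch.Tensor) -> None:
    """Illustrative only: check that every rank holds the same expert map."""
    world_size = dist.get_world_size()
    gathered = [torch.empty_like(expert_map) for _ in range(world_size)]
    dist.all_gather(gathered, expert_map)
    bad_ranks = [r for r, m in enumerate(gathered) if not torch.equal(m, gathered[0])]
    if bad_ranks:
        raise RuntimeError(f"Expert map differs on ranks {bad_ranks} (vs rank 0).")
```
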
#### Expert Weight
When updating expert weights, ensure that the memory allocated for the old expert weights has been released, or that the old expert weights are no longer in use.

## Limitation
Before using EPLB, add `export DYNAMIC_EPLB="true"` to the launch script.
Before collecting expert load data (or performance data), add `export EXPERT_MAP_RECORD="true"` to the launch script.

docs/source/developer_guide/feature_guide/index.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -7,6 +7,7 @@ This section provides an overview of the features implemented in vLLM Ascend. De
 :maxdepth: 1
 patch
 ModelRunner_prepare_inputs
+eplb_swift_balancer.md
 Multi_Token_Prediction
 ACL_Graph
 KV_Cache_Pool_Guide
```
