[v0.11.0] [Bugfix] [MoE]fix error in deepseek when using allgather (#3827)
### What this PR does / why we need it?
After the refactor of vllm_ascend/models and FusedMoE, `gate` can no longer be passed from deepseekv2.py to `AscendFusedMoE.forward`, which results in an error when running DeepSeek V3/R1 with allgather. Hence, this PR removes the `gate`-related computations from the FusedMoE module in eager/aclgraph mode.
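For context, a minimal sketch of the changed call shape (the class and attribute names below are illustrative assumptions; only `AscendFusedMoE.forward` as the callee comes from this PR):

```python
# Sketch only: shows where the gate/router computation lives after this change.
import torch
import torch.nn as nn


class DeepseekMoEBlock(nn.Module):
    """Hypothetical MoE block; `self.experts` stands in for an AscendFusedMoE."""

    def __init__(self, hidden_size: int, num_experts: int, experts: nn.Module):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = experts

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Previously the gate module was handed to AscendFusedMoE.forward so the
        # fused-MoE layer could compute router_logits itself (the basis of the
        # rm_router_logits optimization). After the refactor the model computes
        # router_logits up front and the fused-MoE layer no longer sees the gate.
        router_logits = self.gate(hidden_states)
        return self.experts(hidden_states, router_logits)
```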
### Does this PR introduce _any_ user-facing change?
`rm_router_logits` is deprecated in eager/aclgraph mode.
### How was this patch tested?
e2e & ut
Signed-off-by: Pr0Wh1teGivee <[email protected]>
Added (new location, lines 254-266):

+ # In theory this optimization only applies to AllGather and AllGatherEP: in the dp
+ # scenario the previous order was gate + two communications, and it is now one
+ # communication + gate, which saves some communication time. In theory every MoE
+ # AllGather/AllGatherEP path could follow this logic, but the dp paths of other MoE
+ # models (e.g. qwen3-235b) have not been adjusted yet, so a switch guards it to
+ # avoid breaking them.
+ # The fusion operator torch_npu.npu_grouped_matmul_finalize_routing called by
+ # allgather ep only supports deepseek v3/r1.
+ if dp_size > 1:
+     if (envs_ascend.VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP and ep_size > 1
+             and is_deepseek_v3_r1):
+         return True
+ elif ep_size == 1 and is_deepseek_v3_r1:
+     return True
+ return False
+
+
+ # TODO(ttanzhiqiang): all_reduce merge
+ # While all_reduce_merge is in effect, shared_experts does not do all_reduce inside
+ # mlp; the all_reduce is deferred until shared_experts + router_experts have both
+ # completed.
+ # Currently, all_reduce_merge is enabled by default in the AllGather, AllGatherEP
+ # and NaiveMulticast scenarios of the deepseek model.
Removed (old location, lines 529-541): the identical comment block and dispatch logic shown above, deleted from its previous position.
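For readability, the moved check can also be read as a standalone predicate. This is a sketch only; the helper name, its argument list, and treating the env switch as a plain boolean are assumptions, not the actual vllm-ascend API:

```python
# Hypothetical restatement of the dispatch check above (names are illustrative).
def use_fused_allgather_experts(dp_size: int, ep_size: int,
                                is_deepseek_v3_r1: bool,
                                allgather_ep_enabled: bool) -> bool:
    """True when the npu_grouped_matmul_finalize_routing allgather path applies."""
    if dp_size > 1:
        # AllGatherEP: requires the env switch, expert parallelism and a
        # DeepSeek V3/R1 model, because the fused operator only supports it.
        if allgather_ep_enabled and ep_size > 1 and is_deepseek_v3_r1:
            return True
    elif ep_size == 1 and is_deepseek_v3_r1:
        # Plain AllGather without expert parallelism, DeepSeek V3/R1 only.
        return True
    return False
```

For example, dp_size=2 and ep_size=4 with the switch enabled on DeepSeek V3/R1 returns True, while the same setup with the switch disabled returns False and the caller takes a different MoE path.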
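The all_reduce_merge TODO describes deferring the tensor-parallel all_reduce until both expert branches have finished. A minimal sketch of that idea, assuming an initialized process group and hypothetical `shared_experts` / `routed_experts` callables (not the actual vllm-ascend modules):

```python
# Sketch of the merged all_reduce: one collective over the summed outputs
# instead of a separate all_reduce inside the shared-expert MLP.
import torch
import torch.distributed as dist


def moe_forward_with_merged_all_reduce(hidden_states: torch.Tensor,
                                       router_logits: torch.Tensor,
                                       shared_experts, routed_experts) -> torch.Tensor:
    shared_out = shared_experts(hidden_states)                 # no all_reduce inside
    routed_out = routed_experts(hidden_states, router_logits)  # no all_reduce inside
    merged = shared_out + routed_out
    dist.all_reduce(merged)  # single collective for shared + routed experts
    return merged
```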