[EPLB][Ops] Integrate grouped_matmul_swiglu_quant_weight_nz_tensor_list operator into dynamic EPLB #4216
Open · 845473182 wants to merge 66 commits into vllm-project:main from 845473182:gmm_swiglu_quant_tensor_list (base: main)
+142
−53
Conversation
Signed-off-by: 白永斌 <[email protected]>
…_mlp Signed-off-by: 白永斌 <[email protected]>
…oading model phase Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: 欧派果奶我还要 <[email protected]>
The main purposes of this PR are as follows: 1. Remove the multicast-related code. Reasons: 1. In scenarios like A2 dual-system back-to-back networking, performance is worse than all_gather: before the modification, the e2e test ran at 3 tps; after the modification, it runs at 10 tps. 2. We usually enable the SP feature, which is consistent with the current logic. 3. The advantage of broadcast communication is that it does not suffer from uneven DP load and does not require the prefill ACL graph to be enabled; however, we recently added prefill ACL graph support, so there is no need to keep multicast as an option for MoE communication. Performance benefits are as follows: without enable_flashcomm1, TTFT stays stable at around 43000 ms, roughly 15000 ms faster than before the modification; with enable_flashcomm1, there is no difference, and TTFT stays stable at around 29000 ms. - vLLM version: v0.11.0 - vLLM main: vllm-project/vllm@2918c1b --------- Signed-off-by: weijinqian_v1 <[email protected]> Signed-off-by: weijinqian0 <[email protected]> Co-authored-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it? Temporarily fixes the OOM issue; will align with vLLM's plan later. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? e2e & ut - vLLM version: v0.11.0 - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: Pr0Wh1teGivee <[email protected]>
…#4241) ### What this PR does / why we need it? vllm-ascend needs to dump data during model execution to debug precision problems; msprobe provides the corresponding abilities, so msprobe joins vllm-ascend to make debugging easier. ### Does this PR introduce _any_ user-facing change? ``` 'dump_config': '/path/to/config.json' ``` - vLLM version: v0.11.0 - vLLM main: vllm-project/vllm@2918c1b --------- Signed-off-by: Tjh-UKN <[email protected]>
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6. - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…vllm-project#4392) ### What this PR does / why we need it? When cudagraph_mode is set to FULL_DECODE_ONLY and dp > 1, the dummy-run process is triggered. When calling the update_attn_params function, the num_tokens parameter needs to be passed; this value was obtained from positions.shape[0]. However, multimodal models use mRope (multi-dimensional rotary positional embeddings), which makes positions 2-dimensional, so the value obtained from positions.shape[0] is incorrect. We solve this problem by replacing positions.shape[0] with num_tokens. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? vLLM version: v0.11.0rc3 vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 - vLLM version: v0.11.0 - vLLM main: vllm-project/vllm@2918c1b --------- Signed-off-by: wujinyuan1 <[email protected]> Co-authored-by: wujinyuan1 <[email protected]>
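The shape mismatch described above can be illustrated without any vLLM code. A minimal numpy sketch follows; the shapes are assumed from the description (the three rope sections are illustrative, not the actual vLLM layout):

```python
import numpy as np

num_tokens = 8

# Standard rope: one position id per token -> shape (num_tokens,)
positions_1d = np.arange(num_tokens)
assert positions_1d.shape[0] == num_tokens  # correct token count

# mRope: one row per rotary section (e.g. temporal/height/width) -> 2-D
positions_mrope = np.tile(np.arange(num_tokens), (3, 1))
print(positions_mrope.shape[0])  # 3, NOT num_tokens -- the bug

# The fix in the PR: pass num_tokens explicitly instead of positions.shape[0]
```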
### What this PR does / why we need it? The "g" at the beginning of the current sentence is redundant and needs to be deleted. "MindIE Turbo" no longer needs to be displayed. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ut - vLLM main: vllm-project/vllm@2918c1b --------- Signed-off-by: herizhen <[email protected]> Co-authored-by: herizhen <[email protected]>
### What this PR does / why we need it? Fix a bug caused by this PR: vllm-project#4223. The bug caused the vllm-ascend/vllm_ascend/patch/platform/patch_multiproc_executor.py patch to be applied in a wrong way. ### How was this patch tested? Tested on a single node. When the environment variable DYNAMIC_EPLB is set to true, the patch works correctly; when it is set to false, the patch is not applied. - vLLM version: v0.11.0 - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: 白永斌 <[email protected]> Co-authored-by: 白永斌 <[email protected]>
…oject#4423) ### What this PR does / why we need it? This PR pins the transformers dependency to 4.57.1. Reason: CI tests (specifically test_completion_with_prompt_embeds.py) are failing with an AttributeError: 'dict' object has no attribute 'model_type' when using newer versions of transformers. The issue stems from a bug in tokenization_utils_base.py where the code attempts to access the model_type field of a configuration dictionary (_config) using dot notation (_config.model_type) instead of dictionary key lookup (_config["model_type"] or _config.get("model_type")). This occurs in the logic block checking for transformers_version <= 4.57.2. Pinning the version to 4.57.1 bypasses this buggy code path and restores CI stability. Error Traceback: ``` shell /usr/local/python3.11.13/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2419: if _is_local and _config.model_type not in [ E AttributeError: 'dict' object has no attribute 'model_type' ``` - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: MrZ20 <[email protected]>
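The failure mode behind the pin above can be reproduced in isolation: accessing a plain dict with attribute syntax raises the exact AttributeError seen in the traceback, while key lookup works. The config contents are hypothetical:

```python
# Hypothetical config dict standing in for transformers' _config value.
_config = {"model_type": "llama"}

try:
    _config.model_type  # what the buggy transformers code path does
except AttributeError as e:
    print(e)  # 'dict' object has no attribute 'model_type'

# Safe alternatives (what the fix would use):
assert _config["model_type"] == "llama"
assert _config.get("model_type") == "llama"
```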
…llm-project#4354) ### What this PR does / why we need it? **Problem**: The Qwen3Next model implementation currently imports chunk_gated_delta_rule directly using `from ... import ...` In frameworks like `verl`, the model file is often imported before `vllm-ascend` initializes and applies its patches. This causes the model to permanently hold a reference to the original (unpatched) vLLM kernel, resulting in execution errors on Ascend devices even if the patch runs later. **Solution**: Changed the import style to `from vllm...ops import chunk` and call `chunk.chunk_gated_delta_rule().` This ensures that the function lookup happens at runtime (dynamic dispatch), allowing the model to correctly pick up the patched function regardless of import order. - vLLM version: v0.11.0 - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: zjchenn <[email protected]>
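The import-order problem described above can be sketched with a stand-in module (names are illustrative, not the real vLLM paths): binding the function at import time freezes the original reference, while going through the module attribute picks up later monkey-patches.

```python
import types

# Stand-in for the vLLM ops module before vllm-ascend patches it.
ops = types.ModuleType("ops")
ops.chunk_gated_delta_rule = lambda: "original kernel"

# Early binding: the model file grabs a direct reference ...
early_ref = ops.chunk_gated_delta_rule

# ... then a plugin patches the module attribute afterwards.
ops.chunk_gated_delta_rule = lambda: "patched Ascend kernel"

print(early_ref())                   # original kernel  (stale reference)
print(ops.chunk_gated_delta_rule())  # patched Ascend kernel (runtime lookup)
```

This is why the PR switches the model code to `chunk.chunk_gated_delta_rule()` style calls: the attribute lookup happens at call time, after the patch has run.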
### What this PR does / why we need it? To fix the ops test, where `model_config` has been set to `None` and doesn't have an `hf_config` attribute, we have added a check for `model_config` to guarantee it is not `NoneType`. - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: shen-shanshan <[email protected]>
Torch-npu 2.7.1 has fixed the device check bug. This patch can be removed now. - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: wangxiyuan <[email protected]>
### What this PR does / why we need it? Delete useless comments. ### Does this PR introduce _any_ user-facing change? No - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: GDzhu01 <[email protected]>
### What this PR does / why we need it? Create a triton package directory and move the triton files into it. - vLLM version: v0.11.0 - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: shiyuan680 <[email protected]>
Bump vLLM version to v0.11.2. What's broken and changed by vLLM:
1. structured_output is broken by vllm-project/vllm#26866
2. get_mrope_input_positions is broken by vllm-project/vllm#28399
3. graph mode is broken by vllm-project/vllm#25110; we'll upgrade torch to 2.8 to fix the problem later
4. embedding is broken by vllm-project/vllm#27583
5. `get_attn_backend_cls` and the attention backend are broken by vllm-project/vllm#28534
6. spec decode is broken by vllm-project/vllm#28771
7. sp feature is broken by vllm-project/vllm#27126
8. mtp is broken by vllm-project/vllm#27922
9. lora is broken by vllm-project/vllm#21068
10. execute_model is broken by vllm-project/vllm#26866
11. `VLLM_DISABLE_SHARED_EXPERTS_STREAM` env is broken by vllm-project/vllm#28159
12. kv cache is broken by vllm-project/vllm#27753
13. dp is broken by vllm-project/vllm#25110
What's broken and changed by ourselves:
1. qwen vl is broken by vllm-project/vllm#28455; we'll remove the model files in the future to avoid this kind of error
2. Engine core is broken by vllm-project/vllm#23691; we'll remove the patch file in the future
3. The Ascend scheduler is broken by vllm-project/vllm#28733; we'll remove the Ascend scheduler later
4. qwen3-next is broken by vllm-project/vllm#28083; we'll remove the model files in the future to avoid this kind of error
5. qwen vl is broken by vllm-project/vllm#27764; we'll remove the model files in the future
Known issues:
1. ray doesn't work
2. the accuracy of qwen3-next is not correct
3. qwen3-vl is broken
4. prefix cache + ascend scheduler + deepseek v2 lite is broken
Co-authored-by: MengqingCao <[email protected]> Co-authored-by: hfadzxy <[email protected]> Co-authored-by: leo-pony <[email protected]> Co-authored-by: 22dimensions <[email protected]> Co-authored-by: shen-shanshan <[email protected]> - vLLM version: v0.11.2 --------- Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: MengqingCao <[email protected]> Signed-off-by: hfadzxy <[email protected]> Signed-off-by: leo-pony <[email protected]> Co-authored-by: MengqingCao <[email protected]> Co-authored-by: hfadzxy <[email protected]> Co-authored-by: leo-pony <[email protected]>
1. Run the 4-card test only when the single-card and 2-card tests pass 2. Rename files to make them clearer 3. Remove the useless pd workflow; it is already covered by the nightly test. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: wangxiyuan <[email protected]>
Currently, there are two code paths for judging the chip type: `get_ascend_soc_version` uses the `get_soc_version` API in torch_npu, while `is_310p` uses `_build_info.__soc_version__`, which is generated at install time. We need to unify these two paths, based on the following points: 1. Chip-type judgment must be consistent between compile time and run time; 2. At compile time we need the exact chip type to compile the ops, but at run time we only need the device type (910B/910_93/310P/910_95/etc.) for code-branch decisions; 3. At compile time, torch_npu may not have been installed yet, so we can't use torch_npu's API. Based on the above points, we have made the following changes: 1. When the user sets the env `SOC_VERSION`, use it; when not set, query the SoC version via `npu-smi`; 2. Generate the device type from the SoC version at compile time, and write `__device_type__` instead of `__soc_version__` into `_build_info.py`; 3. At run time, use `__device_type__` for code-branch decisions. When the env `SOC_VERSION` is not set, it no longer defaults to `ASCEND910B1`; the SoC version is queried via `npu-smi`. The env `SOC_VERSION` must be in the `soc_to_device` list in `setup.py`. - vLLM version: v0.11.0 - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: zzzzwwjj <[email protected]>
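The resolution order described above (env override first, hardware probe as fallback, mapping table as validator) can be sketched as follows. The mapping entries and function names are hypothetical, not the real `setup.py` table:

```python
import os

# Ensure a deterministic demo regardless of the caller's environment.
os.environ.pop("SOC_VERSION", None)

# Hypothetical subset of the soc_to_device table in setup.py.
SOC_TO_DEVICE = {
    "ASCEND910B1": "910B",
    "ASCEND310P3": "310P",
}

def resolve_device_type(query_soc_version):
    """Prefer the SOC_VERSION env var; otherwise fall back to a probe
    (e.g. parsing npu-smi output) supplied by the caller."""
    soc = os.environ.get("SOC_VERSION") or query_soc_version()
    if soc not in SOC_TO_DEVICE:
        raise ValueError(f"unsupported SOC_VERSION: {soc}")
    return SOC_TO_DEVICE[soc]

# Env unset: the probe result wins.
print(resolve_device_type(lambda: "ASCEND310P3"))  # 310P

# Env set: it overrides the probe.
os.environ["SOC_VERSION"] = "ASCEND910B1"
print(resolve_device_type(lambda: "ASCEND310P3"))  # 910B
```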
### What this PR does / why we need it? When running 'python example.py', connection issues often occur. The solution is to comment out the first line of the code. Complete the specific names of machines A2 and A3. Standardize the document format: a space should be added after the colon. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: v0.11.2 --------- Signed-off-by: herizhen <[email protected]> Co-authored-by: herizhen <[email protected]>
Labels: module:ops, module:quantization, module:tests, ready (ready for review), ready-for-test (start test by label for PR)
What this PR does / why we need it?
Integrate the grouped_matmul_swiglu_quant_weight_nz_tensor_list operator into dynamic EPLB to support list-type parameters.
This PR also modifies the model-loading logic in the dynamic-EPLB scenario.
The operator is based on this PR: #3804
Does this PR introduce any user-facing change?
No
How was this patch tested?
input & output: 2k / 2k
This PR vs. baseline: (benchmark results were attached as screenshots and are not reproduced here)