
Conversation


@845473182 845473182 commented Nov 17, 2025

What this PR does / why we need it?

Integrate grouped_matmul_swiglu_quant_weight_nz_tensor_list into dynamic EPLB to support list-type parameters.
This PR also modifies the model-loading logic in the dynamic-EPLB scenario.
The operator is based on this PR: #3804

Does this PR introduce any user-facing change?

no

How was this patch tested?

vllm serve /home/weight/DeepSeek-V3.1_w8a8mix_mtp \
    --max_num_seqs 8 \
    --max-model-len 8192 \
    --max-num-batched-tokens 16384 \
    --tensor-parallel-size 8 \
    --data-parallel-size 2 \
    --enable-expert-parallel \
    --served-model-name ds_r1 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --no-enable-prefix-caching \
    --port 8999 \
    --quantization "ascend" \
    --gpu-memory-utilization 0.85 \
    --trust-remote-code \
    --compilation_config '{"cudagraph_capture_sizes":[1,2,4,8,16,32]}' \
    --additional-config='{"dynamic_eplb":true, "num_iterations_eplb_update":100, "num_wait_worker_iterations":100}'
 

input & output: 2k / 2k
This PR:
(benchmark screenshot: "fusion")

Baseline:
(benchmark screenshot: "baseline")

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

白永斌 added 2 commits November 18, 2025 09:07
Signed-off-by: 白永斌 <[email protected]>
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

白永斌 added 4 commits November 27, 2025 11:22
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
@845473182 845473182 changed the title Integrate grouped_matmul_swiglu_quant_weight_nz_tensor_list into dynamic EPLB [EPLB][Ops] Integrate grouped_matmul_swiglu_quant_weight_nz_tensor_list operator into dynamic EPLB Nov 28, 2025
@845473182 845473182 marked this pull request as ready for review November 28, 2025 07:48
白永斌 added 6 commits November 29, 2025 14:31
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
白永斌 added 2 commits November 29, 2025 18:42
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
@weijinqian0 weijinqian0 added the ready (ready for review) and ready-for-test (start test by label for PR) labels Nov 29, 2025
weijinqian0 and others added 30 commits November 29, 2025 23:28
The main purposes of this PR are as follows:
1. Remove the multicast-related code.

Reason:
1. In scenarios like A2 dual-system back-to-back networking, the
performance is worse than all_gather. Before the modification, the e2e
test showed 3 tps; after the modification, 10 tps.
2. We usually enable the SP feature, which is consistent
with the current logic.
3. The advantage of broadcast communication is that it does not suffer
from uneven DP load and does not require the prefill ACL graph to be
enabled. However, we recently added support for the prefill ACL graph.

So we think there is no need to maintain multicast as a choice in
MoE communication.

Performance benefits are as follows:
With enable_flashcomm1 off, TTFT remains relatively stable at around
43000 ms, which is approximately 15000 ms faster than before the
modification.

With enable_flashcomm1 on, there is no difference; TTFT remains
relatively stable at around 29000 ms.


- vLLM version: v0.11.0
- vLLM main:
vllm-project/vllm@2918c1b

---------

Signed-off-by: weijinqian_v1 <[email protected]>
Signed-off-by: weijinqian0 <[email protected]>
Co-authored-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it?
Temporarily fix the oom issue, will update to vllm's plan later.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e&ut

- vLLM version: v0.11.0
- vLLM main:
vllm-project/vllm@2918c1b

Signed-off-by: Pr0Wh1teGivee <[email protected]>
…#4241)

### What this PR does / why we need it?
vllm-ascend needs to dump data during model execution to debug some
precision problems. msprobe provides the corresponding abilities, so
msprobe will be integrated into vllm-ascend to make debugging easier.

### Does this PR introduce _any_ user-facing change?
```
'dump_config': '/path/to/config.json'
```



- vLLM version: v0.11.0
- vLLM main:
vllm-project/vllm@2918c1b

---------

Signed-off-by: Tjh-UKN <[email protected]>
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6.

- vLLM main:
vllm-project/vllm@2918c1b

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…vllm-project#4392)

### What this PR does / why we need it?
When cudagraph_mode is set to FULL_DECODE_ONLY and dp > 1, the dummy-run
process is triggered. When calling the update_attn_params function, the
num_tokens parameter needs to be passed; this value was obtained from
positions.shape[0]. However, the multimodal model uses mRope
(multi-dimensional rotary positional embeddings), which makes positions
a 2-D tensor, so the value obtained from positions.shape[0] is no longer
the token count. We solve this problem by passing num_tokens directly
instead of positions.shape[0].
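The shape mismatch can be illustrated with a small NumPy sketch (the number of rotary sections, 3 here, is a hypothetical value for illustration):

```python
import numpy as np

num_tokens = 8

# Standard RoPE: positions is 1-D, one entry per token.
positions_1d = np.arange(num_tokens)
print(positions_1d.shape[0])  # 8 -- equals the token count

# mRoPE: positions becomes 2-D, one row per rotary section
# (3 sections is a hypothetical count for illustration).
positions_mrope = np.stack([np.arange(num_tokens)] * 3)
print(positions_mrope.shape)     # (3, 8)
print(positions_mrope.shape[0])  # 3 -- no longer the token count

# The fix: pass num_tokens explicitly instead of reading positions.shape[0].
```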

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
vLLM version: v0.11.0rc3
vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

- vLLM version: v0.11.0
- vLLM main:
vllm-project/vllm@2918c1b

---------

Signed-off-by: wujinyuan1 <[email protected]>
Co-authored-by: wujinyuan1 <[email protected]>
### What this PR does / why we need it?
The "g" at the beginning of the current sentence is redundant and needs
to be deleted.
"MindIE Turbo" no longer needs to be displayed.

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut

- vLLM main:
vllm-project/vllm@2918c1b

---------

Signed-off-by: herizhen <[email protected]>
Co-authored-by: herizhen <[email protected]>
### What this PR does / why we need it?
Fix a bug caused by this PR:
vllm-project#4223
The bug makes the
vllm-ascend/vllm_ascend/patch/platform/patch_multiproc_executor.py patch
apply in the wrong way.

### How was this patch tested?
Tested on a single node. When the environment variable DYNAMIC_EPLB is
set to true, the patch works correctly; when it is set to false, the
patch is not applied.
- vLLM version: v0.11.0
- vLLM main:
vllm-project/vllm@2918c1b

Signed-off-by: 白永斌 <[email protected]>
Co-authored-by: 白永斌 <[email protected]>
…oject#4423)

### What this PR does / why we need it?
This PR pins the transformers dependency to 4.57.1.

Reason: CI tests (specifically test_completion_with_prompt_embeds.py)
are failing with an AttributeError: 'dict' object has no attribute
'model_type' when using newer versions of transformers.

The issue stems from a bug in tokenization_utils_base.py where the code
attempts to access the model_type field of a configuration dictionary
(_config) using dot notation (_config.model_type) instead of dictionary
key lookup (_config["model_type"] or _config.get("model_type")). This
occurs in the logic block checking for transformers_version <= 4.57.2.

Pinning the version to 4.57.1 bypasses this buggy code path and restores
CI stability.

Error Traceback:
``` shell
/usr/local/python3.11.13/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2419: 
if _is_local and _config.model_type not in [
E   AttributeError: 'dict' object has no attribute 'model_type'
```

- vLLM main:
vllm-project/vllm@2918c1b

Signed-off-by: MrZ20 <[email protected]>
…llm-project#4354)

### What this PR does / why we need it?

**Problem**: The Qwen3Next model implementation currently imports
chunk_gated_delta_rule directly using `from ... import ...`

In frameworks like `verl`, the model file is often imported before
`vllm-ascend` initializes and applies its patches. This causes the model
to permanently hold a reference to the original (unpatched) vLLM kernel,
resulting in execution errors on Ascend devices even if the patch runs
later.

**Solution**: Changed the import style to `from vllm...ops import chunk`
and the call site to `chunk.chunk_gated_delta_rule()`.

This ensures that the function lookup happens at runtime (dynamic
dispatch), allowing the model to correctly pick up the patched function
regardless of import order.
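The import-order hazard can be demonstrated with a stand-in module (hypothetical names; not the real vLLM ops module):

```python
import types

# Stand-in "kernel module" with an original implementation.
kernels = types.ModuleType("kernels")
kernels.chunk_gated_delta_rule = lambda *args: "original"

# Early binding: `from kernels import chunk_gated_delta_rule` resolves the
# name once and keeps a direct reference to the original function.
early_ref = kernels.chunk_gated_delta_rule

# A patch applied later (as vllm-ascend's patching does) replaces the
# module attribute.
kernels.chunk_gated_delta_rule = lambda *args: "patched"

print(early_ref())                       # original -- stale reference
print(kernels.chunk_gated_delta_rule())  # patched  -- runtime lookup wins
```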

- vLLM version: v0.11.0
- vLLM main:
vllm-project/vllm@2918c1b

Signed-off-by: zjchenn <[email protected]>
### What this PR does / why we need it?

To fix the ops test, where `model_config` has been set to `None` and
therefore has no `hf_config` attribute, we have added a check on
`model_config` to guarantee it is not `NoneType`.

- vLLM main:
vllm-project/vllm@2918c1b

Signed-off-by: shen-shanshan <[email protected]>
Torch-npu 2.7.1 has fixed the device check bug. This patch can be
removed now.

- vLLM main:
vllm-project/vllm@2918c1b

Signed-off-by: wangxiyuan <[email protected]>
### What this PR does / why we need it?
Delete useless comments.
### Does this PR introduce _any_ user-facing change?
No

- vLLM main:
vllm-project/vllm@2918c1b

Signed-off-by: GDzhu01 <[email protected]>
### What this PR does / why we need it?
Create a triton package directory and move the triton files into it.

- vLLM version: v0.11.0
- vLLM main:
vllm-project/vllm@2918c1b

Signed-off-by: shiyuan680 <[email protected]>
Bump vLLM version to v0.11.2

What's broken and changed by vLLM:
1. structured_output is broken by
vllm-project/vllm#26866
2. get_mrope_input_positions is broken by
vllm-project/vllm#28399
3. graph mode is broken by
vllm-project/vllm#25110 we'll upgrade torch to
2.8 to fix the problem later
4. embedding is broken by
vllm-project/vllm#27583
5. `get_attn_backend_cls` and the attention backend are broken by
vllm-project/vllm#28534
6. spec decode is broken by
vllm-project/vllm#28771
7. sp feature is broken by
vllm-project/vllm#27126
8. mtp is broken by vllm-project/vllm#27922
9. lora is broken by vllm-project/vllm#21068
10. execute_model is broken by
vllm-project/vllm#26866
11. `VLLM_DISABLE_SHARED_EXPERTS_STREAM` env is broken by
vllm-project/vllm#28159
12. KV cache is broken by vllm-project/vllm#27753
13. dp is broken by vllm-project/vllm#25110

 
What's broken and changed by ourselves:
1. qwen vl is broken by vllm-project/vllm#28455
We'll remove model files in the future to avoid this kind of error
2. Engine core is broken by
vllm-project/vllm#23691 We'll remove the patch
file in the future.
3. Ascend scheduler is broken by
vllm-project/vllm#28733 We'll remove the Ascend
scheduler later.
4. qwen3-next is broken by
vllm-project/vllm#28083 We'll remove model files
in the future to avoid this kind of error
5. qwen vl is broken by vllm-project/vllm#27764.
We'll remove model files in the future

Known issues:
1. ray doesn't work
2. the accuracy of qwen3-next is incorrect
3. qwen3-vl is broken
4. prefix cache + Ascend scheduler + deepseek v2 lite is broken.

Co-authored-by: MengqingCao <[email protected]>
Co-authored-by: hfadzxy <[email protected]>
Co-authored-by: leo-pony <[email protected]>
Co-authored-by: 22dimensions <[email protected]>
Co-authored-by: shen-shanshan <[email protected]>


- vLLM version: v0.11.2

---------

Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: MengqingCao <[email protected]>
Signed-off-by: hfadzxy <[email protected]>
Signed-off-by: leo-pony <[email protected]>
Co-authored-by: MengqingCao <[email protected]>
Co-authored-by: hfadzxy <[email protected]>
Co-authored-by: leo-pony <[email protected]>
1. Run the 4-card test only when the single-card and 2-card tests pass
2. Rename files to make them clearer
3. Remove the useless pd workflow; it is already managed by the nightly
test.

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: wangxiyuan <[email protected]>
Currently, there are two paths to determine the chip type in the code:
`get_ascend_soc_version` uses the `get_soc_version` API in torch_npu,
and `is_310p` uses `_build_info.__soc_version__`, which is generated at
install time. We need to unify the two paths.

We need to unify these code paths based on the following points:

1. We need to ensure consistency in chip-type judgment between compile
time and runtime;
2. At compile time, we need the chip type to complete op compilation,
but at runtime, we only need the device type
(910B/910_93/310P/910_95/etc.) to make code-branch decisions;
3. At compile time, torch_npu may not have been installed yet, so we
can't use torch_npu's API.

Based on the above points, we have made the following changes:

1. When the user sets the env `SOC_VERSION`, use it; when not set, query
soc_version via `npu-smi`;
2. Generate the device_type from soc_version at compile time, and write
`__device_type__` instead of `__soc_version__` into `_build_info.py`;
3. At runtime, use `__device_type__` to select the code branch.

When the env `SOC_VERSION` is not set, it will not default to
`ASCEND910B1`; we will query soc_version via `npu-smi`. The env
`SOC_VERSION` must be in the `soc_to_device` list in `setup.py`.
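The scheme above could be sketched like this (the mapping entries, the npu-smi invocation, and its output format are assumptions for illustration; the real table is `soc_to_device` in `setup.py`):

```python
import os
import re
import subprocess

# Hypothetical soc -> device-type mapping; the real one lives in setup.py.
SOC_TO_DEVICE = {
    "ASCEND910B1": "910B",
    "ASCEND310P3": "310P",
}

def resolve_device_type() -> str:
    """Resolve the device type at build time, per the scheme above."""
    soc = os.environ.get("SOC_VERSION")
    if soc is None:
        # No env override: query the driver (npu-smi output format assumed).
        out = subprocess.run(["npu-smi", "info"], capture_output=True,
                             text=True, check=True).stdout
        match = re.search(r"Chip\s*Name\s*:?\s*(\S+)", out)
        if match is None:
            raise RuntimeError("could not parse soc_version from npu-smi")
        soc = match.group(1)
    if soc not in SOC_TO_DEVICE:
        raise ValueError(f"unsupported SOC_VERSION: {soc}")
    return SOC_TO_DEVICE[soc]

def write_build_info(device_type: str, path: str = "_build_info.py") -> None:
    # Record __device_type__ (not __soc_version__) for runtime branching.
    with open(path, "w") as f:
        f.write(f'__device_type__ = "{device_type}"\n')
```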

- vLLM version: v0.11.0
- vLLM main:
vllm-project/vllm@2918c1b

Signed-off-by: zzzzwwjj <[email protected]>
### What this PR does / why we need it?
When running `python example.py`, connection issues often occur. The
solution is to comment out the first line of the code.
Complete the specific names of machines A2 and A3.
Standardize the document format: a space should be added after a colon.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut

- vLLM version: v0.11.2

---------

Signed-off-by: herizhen <[email protected]>
Co-authored-by: herizhen <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: 白永斌 <[email protected]>