[EPLB][Ops] Integrate grouped_matmul_swiglu_quant_weight_nz_tensor_list operator into dynamic EPLB #4216
Open · 845473182 wants to merge 66 commits into vllm-project:main from 845473182:gmm_swiglu_quant_tensor_list (base: main)
+142
−53
Conversation
Signed-off-by: 白永斌 <[email protected]>
…_mlp Signed-off-by: 白永斌 <[email protected]>
…oading model phase Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: 欧派果奶我还要 <[email protected]>
The main purposes of this PR are as follows: 1. Remove the multicast-related code. Reasons: 1. In scenarios like A2 dual-system back-to-back networking, performance is worse than all_gather: before the modification, the e2e test ran at 3 tps; after the modification, it runs at 10 tps. 2. We usually enable the SP feature, which is consistent with the current logic. 3. The advantage of broadcast communication is that it does not suffer from uneven DP load and does not require the prefill ACL graph to be enabled; however, we recently added prefill ACL graph support, so there is no need to keep multicast as an option for MoE communication. Performance benefits are as follows: without enable_flashcomm1, TTFT stays stable at around 43000 ms, roughly 15000 ms faster than before the modification; with enable_flashcomm1, there is no difference, and TTFT stays stable at around 29000 ms. - vLLM version: v0.11.0 - vLLM main: vllm-project/vllm@2918c1b --------- Signed-off-by: weijinqian_v1 <[email protected]> Signed-off-by: weijinqian0 <[email protected]> Co-authored-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it? Temporarily fixes the OOM issue; will align with vLLM's plan later. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? e2e & ut - vLLM version: v0.11.0 - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: Pr0Wh1teGivee <[email protected]>
…#4241) ### What this PR does / why we need it? vllm-ascend needs to dump data during model execution to debug precision problems; msprobe provides the corresponding abilities, so msprobe joins vllm-ascend to make debugging easier. ### Does this PR introduce _any_ user-facing change? ``` 'dump_config': '/path/to/config.json' ``` - vLLM version: v0.11.0 - vLLM main: vllm-project/vllm@2918c1b --------- Signed-off-by: Tjh-UKN <[email protected]>
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6. - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…vllm-project#4392) ### What this PR does / why we need it? When cudagraph_mode is set to FULL_DECODE_ONLY and dp > 1, the dummy-run process is triggered. When calling the update_attn_params function, the num_tokens parameter needs to be passed; this value was obtained from positions.shape[0]. However, multimodal models use mRope (multi-dimensional rotary positional embeddings), which makes positions 2-dimensional, so the value obtained from positions.shape[0] is incorrect. We solve this problem by replacing positions.shape[0] with num_tokens. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? vLLM version: v0.11.0rc3 vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 - vLLM version: v0.11.0 - vLLM main: vllm-project/vllm@2918c1b --------- Signed-off-by: wujinyuan1 <[email protected]> Co-authored-by: wujinyuan1 <[email protected]>
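The shape mismatch described above can be illustrated without any vLLM code. A minimal numpy sketch follows; the shapes are assumed from the description (the three rope sections are illustrative, not the actual vLLM layout):

```python
import numpy as np

num_tokens = 8

# Standard rope: one position id per token -> shape (num_tokens,)
positions_1d = np.arange(num_tokens)
assert positions_1d.shape[0] == num_tokens  # correct token count

# mRope: one row per rotary section (e.g. temporal/height/width) -> 2-D
positions_mrope = np.tile(np.arange(num_tokens), (3, 1))
print(positions_mrope.shape[0])  # 3, NOT num_tokens -- the bug

# The fix in the PR: pass num_tokens explicitly instead of positions.shape[0]
```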
### What this PR does / why we need it? The "g" at the beginning of the current sentence is redundant and needs to be deleted. "MindIE Turbo" no longer needs to be displayed. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ut - vLLM main: vllm-project/vllm@2918c1b --------- Signed-off-by: herizhen <[email protected]> Co-authored-by: herizhen <[email protected]>
### What this PR does / why we need it? Fix a bug caused by this PR: vllm-project#4223. The bug caused the vllm-ascend/vllm_ascend/patch/platform/patch_multiproc_executor.py patch to be applied in a wrong way. ### How was this patch tested? Tested on a single node. When the environment variable DYNAMIC_EPLB is set to true, the patch works correctly; when it is set to false, the patch is not applied. - vLLM version: v0.11.0 - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: 白永斌 <[email protected]> Co-authored-by: 白永斌 <[email protected]>
…oject#4423) ### What this PR does / why we need it? This PR pins the transformers dependency to 4.57.1. Reason: CI tests (specifically test_completion_with_prompt_embeds.py) are failing with an AttributeError: 'dict' object has no attribute 'model_type' when using newer versions of transformers. The issue stems from a bug in tokenization_utils_base.py where the code attempts to access the model_type field of a configuration dictionary (_config) using dot notation (_config.model_type) instead of dictionary key lookup (_config["model_type"] or _config.get("model_type")). This occurs in the logic block checking for transformers_version <= 4.57.2. Pinning the version to 4.57.1 bypasses this buggy code path and restores CI stability. Error Traceback: ``` shell /usr/local/python3.11.13/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2419: if _is_local and _config.model_type not in [ E AttributeError: 'dict' object has no attribute 'model_type' ``` - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: MrZ20 <[email protected]>
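The failure mode behind the pin above can be reproduced in isolation: accessing a plain dict with attribute syntax raises the exact AttributeError seen in the traceback, while key lookup works. The config contents are hypothetical:

```python
# Hypothetical config dict standing in for transformers' _config value.
_config = {"model_type": "llama"}

try:
    _config.model_type  # what the buggy transformers code path does
except AttributeError as e:
    print(e)  # 'dict' object has no attribute 'model_type'

# Safe alternatives (what the fix would use):
assert _config["model_type"] == "llama"
assert _config.get("model_type") == "llama"
```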
…llm-project#4354) ### What this PR does / why we need it? **Problem**: The Qwen3Next model implementation currently imports chunk_gated_delta_rule directly using `from ... import ...` In frameworks like `verl`, the model file is often imported before `vllm-ascend` initializes and applies its patches. This causes the model to permanently hold a reference to the original (unpatched) vLLM kernel, resulting in execution errors on Ascend devices even if the patch runs later. **Solution**: Changed the import style to `from vllm...ops import chunk` and call `chunk.chunk_gated_delta_rule().` This ensures that the function lookup happens at runtime (dynamic dispatch), allowing the model to correctly pick up the patched function regardless of import order. - vLLM version: v0.11.0 - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: zjchenn <[email protected]>
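The import-order problem described above can be sketched with a stand-in module (names are illustrative, not the real vLLM paths): binding the function at import time freezes the original reference, while going through the module attribute picks up later monkey-patches.

```python
import types

# Stand-in for the vLLM ops module before vllm-ascend patches it.
ops = types.ModuleType("ops")
ops.chunk_gated_delta_rule = lambda: "original kernel"

# Early binding: the model file grabs a direct reference ...
early_ref = ops.chunk_gated_delta_rule

# ... then a plugin patches the module attribute afterwards.
ops.chunk_gated_delta_rule = lambda: "patched Ascend kernel"

print(early_ref())                   # original kernel  (stale reference)
print(ops.chunk_gated_delta_rule())  # patched Ascend kernel (runtime lookup)
```

This is why the PR switches the model code to `chunk.chunk_gated_delta_rule()` style calls: the attribute lookup happens at call time, after the patch has run.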
### What this PR does / why we need it? To fix the ops test, where `model_config` has been set to `None` and doesn't have an `hf_config` attribute, we have added a check for `model_config` to guarantee it is not `NoneType`. - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: shen-shanshan <[email protected]>
Torch-npu 2.7.1 has fixed the device check bug. This patch can be removed now. - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: wangxiyuan <[email protected]>
### What this PR does / why we need it? Delete useless comments. ### Does this PR introduce _any_ user-facing change? No - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: GDzhu01 <[email protected]>
### What this PR does / why we need it? Create a triton package directory and move the triton files into it. - vLLM version: v0.11.0 - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: shiyuan680 <[email protected]>
Bump vLLM version to v0.11.2. What's broken and changed by vLLM:
1. structured_output is broken by vllm-project/vllm#26866
2. get_mrope_input_positions is broken by vllm-project/vllm#28399
3. graph mode is broken by vllm-project/vllm#25110; we'll upgrade torch to 2.8 to fix the problem later
4. embedding is broken by vllm-project/vllm#27583
5. `get_attn_backend_cls` and the attention backend are broken by vllm-project/vllm#28534
6. spec decode is broken by vllm-project/vllm#28771
7. sp feature is broken by vllm-project/vllm#27126
8. mtp is broken by vllm-project/vllm#27922
9. lora is broken by vllm-project/vllm#21068
10. execute_model is broken by vllm-project/vllm#26866
11. `VLLM_DISABLE_SHARED_EXPERTS_STREAM` env is broken by vllm-project/vllm#28159
12. kv cache is broken by vllm-project/vllm#27753
13. dp is broken by vllm-project/vllm#25110
What's broken and changed by ourselves:
1. qwen vl is broken by vllm-project/vllm#28455; we'll remove the model files in the future to avoid this kind of error
2. Engine core is broken by vllm-project/vllm#23691; we'll remove the patch file in the future
3. The Ascend scheduler is broken by vllm-project/vllm#28733; we'll remove the Ascend scheduler later
4. qwen3-next is broken by vllm-project/vllm#28083; we'll remove the model files in the future to avoid this kind of error
5. qwen vl is broken by vllm-project/vllm#27764; we'll remove the model files in the future
Known issues:
1. ray doesn't work
2. the accuracy of qwen3-next is not correct
3. qwen3-vl is broken
4. prefix cache + ascend scheduler + deepseek v2 lite is broken
Co-authored-by: MengqingCao <[email protected]> Co-authored-by: hfadzxy <[email protected]> Co-authored-by: leo-pony <[email protected]> Co-authored-by: 22dimensions <[email protected]> Co-authored-by: shen-shanshan <[email protected]> - vLLM version: v0.11.2 --------- Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: MengqingCao <[email protected]> Signed-off-by: hfadzxy <[email protected]> Signed-off-by: leo-pony <[email protected]> Co-authored-by: MengqingCao <[email protected]> Co-authored-by: hfadzxy <[email protected]> Co-authored-by: leo-pony <[email protected]>
1. Run the 4-card test only when the single-card and 2-card tests pass 2. Rename files to make them clearer 3. Remove the useless pd workflow; it is already covered by the nightly test. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: wangxiyuan <[email protected]>
Currently, there are two code paths for judging the chip type: `get_ascend_soc_version` uses the `get_soc_version` API in torch_npu, while `is_310p` uses `_build_info.__soc_version__`, which is generated at install time. We need to unify these two paths, based on the following points: 1. Chip-type judgment must be consistent between compile time and run time; 2. At compile time we need the exact chip type to compile the ops, but at run time we only need the device type (910B/910_93/310P/910_95/etc.) for code-branch decisions; 3. At compile time, torch_npu may not have been installed yet, so we can't use torch_npu's API. Based on the above points, we have made the following changes: 1. When the user sets the env `SOC_VERSION`, use it; when not set, query the SoC version via `npu-smi`; 2. Generate the device type from the SoC version at compile time, and write `__device_type__` instead of `__soc_version__` into `_build_info.py`; 3. At run time, use `__device_type__` for code-branch decisions. When the env `SOC_VERSION` is not set, it no longer defaults to `ASCEND910B1`; the SoC version is queried via `npu-smi`. The env `SOC_VERSION` must be in the `soc_to_device` list in `setup.py`. - vLLM version: v0.11.0 - vLLM main: vllm-project/vllm@2918c1b Signed-off-by: zzzzwwjj <[email protected]>
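The resolution order described above (env override first, hardware probe as fallback, mapping table as validator) can be sketched as follows. The mapping entries and function names are hypothetical, not the real `setup.py` table:

```python
import os

# Ensure a deterministic demo regardless of the caller's environment.
os.environ.pop("SOC_VERSION", None)

# Hypothetical subset of the soc_to_device table in setup.py.
SOC_TO_DEVICE = {
    "ASCEND910B1": "910B",
    "ASCEND310P3": "310P",
}

def resolve_device_type(query_soc_version):
    """Prefer the SOC_VERSION env var; otherwise fall back to a probe
    (e.g. parsing npu-smi output) supplied by the caller."""
    soc = os.environ.get("SOC_VERSION") or query_soc_version()
    if soc not in SOC_TO_DEVICE:
        raise ValueError(f"unsupported SOC_VERSION: {soc}")
    return SOC_TO_DEVICE[soc]

# Env unset: the probe result wins.
print(resolve_device_type(lambda: "ASCEND310P3"))  # 310P

# Env set: it overrides the probe.
os.environ["SOC_VERSION"] = "ASCEND910B1"
print(resolve_device_type(lambda: "ASCEND310P3"))  # 910B
```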
### What this PR does / why we need it? When running 'python example.py', connection issues often occur. The solution is to comment out the first line of the code. Complete the specific names of machines A2 and A3. Standardize the document format: a space should be added after the colon. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: v0.11.2 --------- Signed-off-by: herizhen <[email protected]> Co-authored-by: herizhen <[email protected]>
Labels: module:ops, module:quantization, module:tests, ready (ready for review), ready-for-test (start test by label for PR)
What this PR does / why we need it?
Integrate the grouped_matmul_swiglu_quant_weight_nz_tensor_list operator into dynamic EPLB to support list-type parameters.
This PR also modifies the model-loading logic in the dynamic-EPLB scenario.
The operator is based on this PR: #3804
Does this PR introduce any user-facing change?
No
How was this patch tested?
input & output: 2k / 2k
This PR vs. baseline: (benchmark results were attached as screenshots and are not reproduced here)