
Conversation

@momo609 (Collaborator) commented Nov 3, 2025

What this PR does / why we need it?

Support qwen3-next full_decode_only mode (bs=1, max_token=1024):

| branch | tps | e2e time |
| --- | --- | --- |
| piecewise | 3.06 | 8.15 |
| fulldecodeonly | 7.2 | 3.47 |

How was this patch tested?

Does this PR introduce any user-facing change?


github-actions bot commented Nov 3, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message to fulfill the PR description, helping reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds support for qwen3-next in full_decode_only mode by handling mixed attention types, specifically linear_attn. The changes in _build_dummy_attn_metadata correctly differentiate between attention builders to generate appropriate metadata for different layers. However, there is a block of redundant code that re-calculates attn_state, which should be removed to improve code clarity and maintainability.
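The dispatch described above can be sketched as follows. This is a hypothetical illustration, not the actual vllm-ascend `_build_dummy_attn_metadata` implementation: builder registries, layer names, and signatures are invented for the example. The idea is simply that each layer's metadata comes from the builder registered for its attention type, so layer types without a registered builder (e.g. `linear_attn` here) are skipped.

```python
def build_dummy_attn_metadata(builders, layer_attn_types, num_tokens):
    """Build per-layer dummy metadata, dispatching on attention type.

    builders: dict mapping attention type -> builder callable
    layer_attn_types: dict mapping layer name -> attention type
    """
    metadata = {}
    for layer_name, attn_type in layer_attn_types.items():
        builder = builders.get(attn_type)
        if builder is None:
            # No builder registered for this attention type
            # (e.g. linear_attn in this toy setup) -- skip the layer.
            continue
        metadata[layer_name] = builder(layer_name, num_tokens)
    return metadata

# Example usage with toy builders:
builders = {"full_attn": lambda name, n: {"layer": name, "tokens": n}}
layer_types = {"layers.0": "full_attn", "layers.1": "linear_attn"}
dummy = build_dummy_attn_metadata(builders, layer_types, 8)
```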

Comment on lines 2742 to 2747

```python
attn_state = AscendAttentionState.DecodeOnly
if self.speculative_config and \
        self.speculative_config.method == "deepseek_mtp":
    attn_state = AscendAttentionState.SpecDecoding
```
gemini-code-assist bot (Contributor), severity: high

This block of code is redundant as it re-calculates attn_state with the same logic as in lines 2720-2723. This can lead to confusion and potential maintenance issues. Please remove this duplicated block and use the attn_state variable that was already defined.
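The de-duplication the reviewer asks for amounts to computing `attn_state` once and reusing it. A minimal sketch, with a stand-in enum and a `SimpleNamespace` in place of the real `speculative_config` object (both are assumptions for illustration, not the vllm-ascend types):

```python
from enum import Enum
from types import SimpleNamespace


class AscendAttentionState(Enum):
    DecodeOnly = "decode_only"
    SpecDecoding = "spec_decoding"


def resolve_attn_state(speculative_config):
    """Compute attn_state once; callers reuse the result instead of
    repeating this branch, which is what the review asks for."""
    if speculative_config and speculative_config.method == "deepseek_mtp":
        return AscendAttentionState.SpecDecoding
    return AscendAttentionState.DecodeOnly


# Example: a deepseek_mtp speculative config selects SpecDecoding.
state = resolve_attn_state(SimpleNamespace(method="deepseek_mtp"))
```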

@momo609 momo609 force-pushed the fulloptimze branch 3 times, most recently from c69a4f5 to 63db91a Compare November 3, 2025 07:28
@yiz-liu (Collaborator) commented Nov 3, 2025

@momo609 Please elaborate on why we do not need to update the linear attention params in FULL mode and why zip naturally filters out those layers. Also, please add an E2E test case.
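The `zip` behaviour the question refers to is presumably Python's shortest-input truncation: `zip` pairs elements positionally and stops at the shorter iterable, so entries without a counterpart are silently dropped rather than raising. A toy illustration (the layer names and update list are invented, not the actual PR data structures):

```python
# Only full-attention layers appear in the first list; if the second
# list has extra entries, zip drops them rather than erroring.
full_attn_layers = ["layers.0.attn", "layers.2.attn"]
per_layer_updates = ["update_a", "update_b", "update_c"]
paired = list(zip(full_attn_layers, per_layer_updates))
```

Whether this implicit truncation is safe here is exactly what the reviewer is asking the author to justify; `zip(..., strict=True)` (Python 3.10+) would raise instead of silently dropping entries.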

@momo609 momo609 force-pushed the fulloptimze branch 5 times, most recently from 97bc2b0 to 1cce890 Compare November 4, 2025 06:16
@weijinqian0 weijinqian0 added the `ready` (read for review) and `ready-for-test` (start test by label for PR) labels Nov 4, 2025
Signed-off-by: wangxiaoxin-sherie <[email protected]>
@wangxiyuan wangxiyuan merged commit 738bf2b into vllm-project:main Nov 5, 2025
24 checks passed
Pz1116 pushed a commit to Pz1116/vllm-ascend that referenced this pull request Nov 5, 2025
### What this PR does / why we need it?
support qwen3-next full_decode_only mode.
bs=1, max_token=1024
| branch| tps| e2e time|
| --- | --- | --- |
|piecewise  |3.06  | 8.15 |
|fulldecodeonly | 7.2 | 3.47 |

- vLLM version: v0.11.0
- vLLM main:
vllm-project/vllm@83f478b

Signed-off-by: wangxiaoxin-sherie <[email protected]>
Co-authored-by: wangxiaoxin-sherie <[email protected]>
Signed-off-by: Pz1116 <[email protected]>
luolun pushed a commit to luolun/vllm-ascend that referenced this pull request Nov 19, 2025
hwhaokun pushed a commit to hwhaokun/vllm-ascend that referenced this pull request Nov 19, 2025
NSDie pushed a commit to NSDie/vllm-ascend that referenced this pull request Nov 24, 2025

Labels

`module:tests`, `ready` (read for review), `ready-for-test` (start test by label for PR)


4 participants