vllm-ascend support Ascend950 with Qwen dense model. #4228
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request adds support for the Ascend 950 (A5) platform and Qwen3 dense models. The changes primarily involve adding conditional code paths for the A5 architecture, utilizing new or different NPU-specific operators for attention mechanisms and other operations. While most of the changes correctly add support for the new hardware, I've identified a few critical issues. There's a potential bug in KV cache handling for prefill operations on A5, where tokens might be re-cached redundantly. Another issue is in attention mask creation, where a local modification to seq_lens has no effect, indicating dead code or a logic error. I've also pointed out a risky use of assert for a runtime check that could fail silently in production.
    if is_A5():  # The code here has changed a lot and needs to be re-adapted
        num_tokens = attn_metadata.slot_mapping.shape[0]
        torch_npu.npu_scatter_pa_kv_cache(
            key=key[:num_tokens],
            value=value[:num_tokens].contiguous(),
            slot_mapping=attn_metadata.slot_mapping,
            out=(self.key_cache, self.value_cache)
        )
The current implementation for is_A5() in the has_prefill block appears to be re-caching all tokens, including decode tokens that were already handled in the has_decode block. This is inefficient and differs from the else branch, which correctly caches only the prefill tokens. This could lead to performance issues and potential correctness problems.
Suggested change — replace:

    if is_A5():  # The code here has changed a lot and needs to be re-adapted
        num_tokens = attn_metadata.slot_mapping.shape[0]
        torch_npu.npu_scatter_pa_kv_cache(
            key=key[:num_tokens],
            value=value[:num_tokens].contiguous(),
            slot_mapping=attn_metadata.slot_mapping,
            out=(self.key_cache, self.value_cache)
        )

with:

    if is_A5():  # The code here has changed a lot and needs to be re-adapted
        start_idx = self.pcp_size * num_decode_tokens
        end_idx = attn_metadata.num_actual_tokens_pcp_padded
        torch_npu.npu_scatter_pa_kv_cache(
            key=key[start_idx:end_idx],
            value=value[start_idx:end_idx].contiguous(),
            slot_mapping=attn_metadata.slot_mapping[start_idx:end_idx],
            out=(self.key_cache, self.value_cache)
        )
    new_element = torch.tensor([max_seq_len])
    seq_lens = torch.cat([seq_lens, new_element], dim=0)
The modification to seq_lens via torch.cat is local to this function, and the new seq_lens variable is not used afterwards. This means the operation has no effect outside of this function, which is likely a bug. If the caller needs the modified seq_lens, the function signature should be changed to return it. Otherwise, this is dead code and should be removed.
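For illustration, a minimal sketch of the return-value option; the function name pad_seq_lens is hypothetical, since the enclosing function is not shown in this hunk:

    import torch

    def pad_seq_lens(seq_lens: torch.Tensor, max_seq_len: int) -> torch.Tensor:
        # Hypothetical helper: append the padded length and return the result,
        # instead of rebinding a local variable the caller never sees.
        new_element = torch.tensor([max_seq_len], dtype=seq_lens.dtype,
                                   device=seq_lens.device)
        return torch.cat([seq_lens, new_element], dim=0)

    # Caller side: rebind explicitly so the padded tensor is actually used.
    # seq_lens = pad_seq_lens(seq_lens, max_seq_len)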
        input_layout="TND",
        softmax_scale=self.scale
    )
    assert output is not None
The use of assert for runtime checks is risky. assert statements are disabled when Python is run in optimized mode (with the -O flag), which can lead to silent failures in production if npu_fused_infer_attention_score_v2 can return None. It's safer to use an explicit conditional check to ensure correctness.
Suggested change — replace:

    assert output is not None

with:

    if output is None:
        raise RuntimeError("npu_fused_infer_attention_score_v2 returned None")
Force-pushed from b688146 to 21a48b2.
Force-pushed from 92050f5 to 0b1cb6f.
Force-pushed from 0b1cb6f to cedb14c.
yiz-liu left a comment:
Maybe refactor those attention into one single _forward_ascend_950.
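For reference, a rough sketch of what a single entry point could look like; the dispatcher shape and the helper method names below are assumptions for illustration, not code from this PR:

    def _forward_ascend_950(self, query, key, value, attn_metadata, output):
        # Hypothetical dispatcher: route all Ascend 950 attention states through
        # one method so the A5-specific branches live in a single place.
        if attn_metadata.attn_state == AscendAttentionState.PrefillNoCache:
            return self._forward_950_prefill(query, key, value,
                                             attn_metadata, output)
        # Remaining states (decode-only, mixed prefill/decode) would dispatch
        # to their own assumed helpers here.
        return self._forward_950_decode(query, key, value,
                                        attn_metadata, output)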
Force-pushed from b040e39 to fa1227c.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
yiz-liu left a comment:
Overall, this is OK, I have no further comments. But @wangyao-i you should check with @zzzzwwjj as soon as possible regarding #4359
            output: torch.Tensor) -> torch.Tensor:
    num_tokens = attn_metadata.query_start_loc[-1]
    if attn_metadata.attn_state == AscendAttentionState.PrefillNoCache:
        output_data, _ = torch_npu.npu_fused_infer_attention_score_v2(
why not use npu_fused_infer_attention_score_v2?
yiz-liu left a comment:
After discussion, I withdraw my approval, since the current design does not support ACL Graph; we need to change to the .out API and let vLLM Ascend handle the workspace. If we can't implement this in this PR, we should at least have another PR addressing this issue before we merge this one.
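For context, the general out-variant pattern being requested — caller-owned output buffers so graph capture sees stable addresses — sketched here with a plain torch op rather than the torch_npu attention API, whose out-style signature is not shown in this thread:

    import torch

    # Illustrative only: allocate the output buffer once, outside the captured
    # region, so its address stays fixed across replays.
    q = torch.randn(4, 8)
    k = torch.randn(8, 16)
    scores = torch.empty(4, 16)

    # Inside the captured/replayed region, the out= form writes into the
    # caller-owned buffer instead of allocating a new tensor on every call.
    torch.matmul(q, k, out=scores)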
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from d4bfca6 to 63bbe41.
Force-pushed from 63bbe41 to 394c9c5.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from 394c9c5 to fd5562f.
Force-pushed from fd5562f to c8d5929.
Recently, the community completed the integration of the ATB interface for the attention operator in PR #4531. Ascend 950 can now directly reuse the operator capabilities of A2/A3, so we have rolled back the adaptation for this part.
Force-pushed from 04d9d54 to 8c287f5.
Signed-off-by: wangyao <[email protected]>
Force-pushed from 8c287f5 to d92a81c.
What this PR does / why we need it?
Add support for Ascend950 with Qwen dense models in vllm-ascend.
Does this PR introduce any user-facing change?
How was this patch tested?