
Conversation


@wangyao-i wangyao-i commented Nov 17, 2025

What this PR does / why we need it?

vllm-ascend supports Ascend 950 with Qwen dense models.

Does this PR introduce any user-facing change?

How was this patch tested?

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the Ascend 950 (A5) platform and Qwen3 dense models. The changes primarily involve adding conditional code paths for the A5 architecture, utilizing new or different NPU-specific operators for attention mechanisms and other operations. While most of the changes correctly add support for the new hardware, I've identified a few critical issues. There's a potential bug in KV cache handling for prefill operations on A5, where tokens might be re-cached redundantly. Another issue is in attention mask creation, where a local modification to seq_lens has no effect, indicating dead code or a logic error. I've also pointed out a risky use of assert for a runtime check that could fail silently in production.

Comment on lines 1621 to 1628
if is_A5():  # the code here has changed significantly and needs to be re-adapted
    num_tokens = attn_metadata.slot_mapping.shape[0]
    torch_npu.npu_scatter_pa_kv_cache(
        key=key[:num_tokens],
        value=value[:num_tokens].contiguous(),
        slot_mapping=attn_metadata.slot_mapping,
        out=(self.key_cache, self.value_cache)
    )

critical

The current implementation for is_A5() in the has_prefill block appears to be re-caching all tokens, including decode tokens that were already handled in the has_decode block. This is inefficient and differs from the else branch, which correctly caches only the prefill tokens. This could lead to performance issues and potential correctness problems.

Suggested change
-if is_A5():  # the code here has changed significantly and needs to be re-adapted
-    num_tokens = attn_metadata.slot_mapping.shape[0]
-    torch_npu.npu_scatter_pa_kv_cache(
-        key=key[:num_tokens],
-        value=value[:num_tokens].contiguous(),
-        slot_mapping=attn_metadata.slot_mapping,
-        out=(self.key_cache, self.value_cache)
-    )
+if is_A5():  # the code here has changed significantly and needs to be re-adapted
+    start_idx = self.pcp_size * num_decode_tokens
+    end_idx = attn_metadata.num_actual_tokens_pcp_padded
+    torch_npu.npu_scatter_pa_kv_cache(
+        key=key[start_idx:end_idx],
+        value=value[start_idx:end_idx].contiguous(),
+        slot_mapping=attn_metadata.slot_mapping[start_idx:end_idx],
+        out=(self.key_cache, self.value_cache)
+    )

Comment on lines 1026 to 1027
new_element = torch.tensor([max_seq_len])
seq_lens = torch.cat([seq_lens, new_element], dim=0)

critical

The modification to seq_lens via torch.cat is local to this function, and the new seq_lens variable is not used afterwards. This means the operation has no effect outside of this function, which is likely a bug. If the caller needs the modified seq_lens, the function signature should be changed to return it. Otherwise, this is dead code and should be removed.
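
For illustration, a minimal sketch of the return-based fix; the helper name and call site below are hypothetical, not the PR's actual code:

import torch

def extend_seq_lens(seq_lens: torch.Tensor, max_seq_len: int) -> torch.Tensor:
    # torch.cat allocates a new tensor; rebinding the local name inside the
    # mask-building function never reaches the caller, so the extended tensor
    # has to be returned (or the concatenation done at the call site).
    new_element = torch.tensor([max_seq_len], dtype=seq_lens.dtype, device=seq_lens.device)
    return torch.cat([seq_lens, new_element], dim=0)

# Caller side: rebind explicitly.
seq_lens = extend_seq_lens(torch.tensor([3, 5, 7]), max_seq_len=7)  # tensor([3, 5, 7, 7])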

    input_layout="TND",
    softmax_scale=self.scale
)
assert output is not None

high

The use of assert for runtime checks is risky. assert statements are disabled when Python is run in optimized mode (with the -O flag), which can lead to silent failures in production if npu_fused_infer_attention_score_v2 can return None. It's safer to use an explicit conditional check to ensure correctness.

Suggested change
-assert output is not None
+if output is None:
+    raise RuntimeError("npu_fused_infer_attention_score_v2 returned None")

@wangyao-i wangyao-i force-pushed the br_qwen3_0_6B_1114 branch 2 times, most recently from b688146 to 21a48b2 on November 17, 2025 12:01
@wangyao-i wangyao-i changed the title from "vllm-ascend supports 950 with Qwen3 dense model." to "vllm-ascend support Ascend950 with Qwen dense model." on Nov 17, 2025
@wangyao-i wangyao-i force-pushed the br_qwen3_0_6B_1114 branch 16 times, most recently from 92050f5 to 0b1cb6f on November 18, 2025 07:52
Collaborator

@yiz-liu yiz-liu left a comment


Maybe refactor those attention paths into one single _forward_ascend_950.
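
For reference, a rough sketch of what that consolidation could look like. The enum, class, and helper names below are illustrative stand-ins (not this PR's actual code); each stubbed helper would wrap the corresponding torch_npu call for its attention state.

from enum import Enum, auto

class AttnState(Enum):
    # Stand-in for vllm-ascend's AscendAttentionState; member names are illustrative.
    PrefillNoCache = auto()
    PrefillCacheHit = auto()
    DecodeOnly = auto()

class Ascend950Attention:
    # Stubs standing in for the per-state torch_npu calls from the PR.
    def _prefill_no_cache_950(self, query, key, value, attn_metadata, output):
        return output

    def _prefill_cache_hit_950(self, query, key, value, attn_metadata, output):
        return output

    def _decode_950(self, query, attn_metadata, output):
        return output

    def _forward_ascend_950(self, query, key, value, attn_metadata, output):
        # Single dispatch point for the Ascend 950 paths, so the is_A5()
        # branches scattered through forward() collapse into one place.
        state = attn_metadata.attn_state
        if state == AttnState.PrefillNoCache:
            return self._prefill_no_cache_950(query, key, value, attn_metadata, output)
        if state == AttnState.PrefillCacheHit:
            return self._prefill_cache_hit_950(query, key, value, attn_metadata, output)
        return self._decode_950(query, attn_metadata, output)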

@wangyao-i wangyao-i force-pushed the br_qwen3_0_6B_1114 branch 2 times, most recently from b040e39 to fa1227c on November 19, 2025 12:42
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Collaborator

@yiz-liu yiz-liu left a comment


Overall, this is OK; I have no further comments. But @wangyao-i, you should check with @zzzzwwjj as soon as possible regarding #4359.

@yiz-liu yiz-liu added the ready (read for review) and ready-for-test (start test by label for PR) labels on Nov 24, 2025
                output: torch.Tensor) -> torch.Tensor:
        num_tokens = attn_metadata.query_start_loc[-1]
        if attn_metadata.attn_state == AscendAttentionState.PrefillNoCache:
            output_data, _ = torch_npu.npu_fused_infer_attention_score_v2(
Collaborator

why not use npu_fused_infer_attention_score_v2?

Collaborator

@yiz-liu yiz-liu left a comment


After discussion, I withdraw my approval, since the current design does not support ACL Graph; we need to change to the .out API and let vLLM Ascend handle the workspace. If we can't implement this in this PR, we should at least have another PR addressing this issue before we merge this one.
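
For context, a rough sketch of the pattern being asked for. The names below are placeholders (fused_attention_out stands in for an out-variant kernel; the real torch_npu API and workspace handling will differ); the point is that every buffer is allocated once, outside graph capture, and the kernel writes into it via an out argument.

import torch

def fused_attention_out(query, key, value, *, out):
    # Placeholder for an out-variant NPU attention kernel: the contract that
    # matters is writing into a caller-provided buffer instead of allocating
    # a fresh output, which would break ACL Graph capture/replay.
    out.copy_(query)  # placeholder math, not real attention

class Ascend950Attention:
    def __init__(self, max_num_tokens: int, hidden_size: int):
        # Allocate the output buffer once, before any graph is captured.
        self.output = torch.empty(max_num_tokens, hidden_size)

    def forward(self, query, key, value):
        num_tokens = query.shape[0]
        out_view = self.output[:num_tokens]
        fused_attention_out(query, key, value, out=out_view)
        return out_view

# Usage: the same preallocated buffer is reused across steps, so replaying a
# captured graph never triggers an allocation inside the attention path.
attn = Ascend950Attention(max_num_tokens=64, hidden_size=128)
q = k = v = torch.randn(8, 128)
print(attn.forward(q, k, v).shape)  # torch.Size([8, 128])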

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.


github-actions bot commented Dec 2, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@wangyao-i
Author

After discussion, I withdraw my approval, since the current design does not support ACL Graph; we need to change to the .out API and let vLLM Ascend handle the workspace. If we can't implement this in this PR, we should at least have another PR addressing this issue before we merge this one.

The community has recently completed the integration of the ATB interface for the attention operator in PR #4531. Ascend 950 can now directly reuse the A2/A3 operator capabilities, so we have rolled back the adaptation for this part.

@wangyao-i wangyao-i closed this Dec 3, 2025
@wangyao-i wangyao-i reopened this Dec 3, 2025
@wangyao-i wangyao-i force-pushed the br_qwen3_0_6B_1114 branch 2 times, most recently from 04d9d54 to 8c287f5 on December 3, 2025 10:34