vllm-ascend support Ascend950 with Qwen dense model. #4228
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request adds support for the Ascend 950 (A5) platform and Qwen3 dense models. The changes primarily involve adding conditional code paths for the A5 architecture, utilizing new or different NPU-specific operators for attention mechanisms and other operations. While most of the changes correctly add support for the new hardware, I've identified a few critical issues. There's a potential bug in KV cache handling for prefill operations on A5, where tokens might be re-cached redundantly. Another issue is in attention mask creation, where a local modification to seq_lens has no effect, indicating dead code or a logic error. I've also pointed out a risky use of assert for a runtime check that could fail silently in production.
    if is_A5():  # The code here has changed a lot and needs to be re-adapted
        num_tokens = attn_metadata.slot_mapping.shape[0]
        torch_npu.npu_scatter_pa_kv_cache(
            key=key[:num_tokens],
            value=value[:num_tokens].contiguous(),
            slot_mapping=attn_metadata.slot_mapping,
            out=(self.key_cache, self.value_cache)
        )
The current implementation for is_A5() in the has_prefill block appears to be re-caching all tokens, including decode tokens that were already handled in the has_decode block. This is inefficient and differs from the else branch, which correctly caches only the prefill tokens. This could lead to performance issues and potential correctness problems.
Suggested change — replace:

    if is_A5():  # The code here has changed a lot and needs to be re-adapted
        num_tokens = attn_metadata.slot_mapping.shape[0]
        torch_npu.npu_scatter_pa_kv_cache(
            key=key[:num_tokens],
            value=value[:num_tokens].contiguous(),
            slot_mapping=attn_metadata.slot_mapping,
            out=(self.key_cache, self.value_cache)
        )

with:

    if is_A5():  # The code here has changed a lot and needs to be re-adapted
        start_idx = self.pcp_size * num_decode_tokens
        end_idx = attn_metadata.num_actual_tokens_pcp_padded
        torch_npu.npu_scatter_pa_kv_cache(
            key=key[start_idx:end_idx],
            value=value[start_idx:end_idx].contiguous(),
            slot_mapping=attn_metadata.slot_mapping[start_idx:end_idx],
            out=(self.key_cache, self.value_cache)
        )
    new_element = torch.tensor([max_seq_len])
    seq_lens = torch.cat([seq_lens, new_element], dim=0)
The modification to seq_lens via torch.cat is local to this function, and the new seq_lens variable is not used afterwards. This means the operation has no effect outside of this function, which is likely a bug. If the caller needs the modified seq_lens, the function signature should be changed to return it. Otherwise, this is dead code and should be removed.
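For illustration, a minimal sketch of the return-value option; the function name pad_seq_lens is hypothetical, since the enclosing function is not shown in this hunk:

    import torch

    def pad_seq_lens(seq_lens: torch.Tensor, max_seq_len: int) -> torch.Tensor:
        # Hypothetical helper: append the padded length and return the result,
        # instead of rebinding a local variable the caller never sees.
        new_element = torch.tensor([max_seq_len], dtype=seq_lens.dtype,
                                   device=seq_lens.device)
        return torch.cat([seq_lens, new_element], dim=0)

    # Caller side: rebind explicitly so the padded tensor is actually used.
    # seq_lens = pad_seq_lens(seq_lens, max_seq_len)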
        input_layout="TND",
        softmax_scale=self.scale
    )
    assert output is not None
The use of assert for runtime checks is risky. assert statements are disabled when Python is run in optimized mode (with the -O flag), which can lead to silent failures in production if npu_fused_infer_attention_score_v2 can return None. It's safer to use an explicit conditional check to ensure correctness.
Suggested change — replace:

    assert output is not None

with:

    if output is None:
        raise RuntimeError("npu_fused_infer_attention_score_v2 returned None")
Force-pushed from b688146 to 21a48b2.
Force-pushed from 92050f5 to 0b1cb6f.
Force-pushed from 0b1cb6f to cedb14c.
yiz-liu left a comment:
Maybe refactor those attention into one single _forward_ascend_950.
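For reference, a rough sketch of what a single entry point could look like; the dispatcher shape and the helper method names below are assumptions for illustration, not code from this PR:

    def _forward_ascend_950(self, query, key, value, attn_metadata, output):
        # Hypothetical dispatcher: route all Ascend 950 attention states through
        # one method so the A5-specific branches live in a single place.
        if attn_metadata.attn_state == AscendAttentionState.PrefillNoCache:
            return self._forward_950_prefill(query, key, value,
                                             attn_metadata, output)
        # Remaining states (decode-only, mixed prefill/decode) would dispatch
        # to their own assumed helpers here.
        return self._forward_950_decode(query, key, value,
                                        attn_metadata, output)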
Force-pushed from b040e39 to fa1227c.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
yiz-liu left a comment:
Overall, this is OK, I have no further comments. But @wangyao-i you should check with @zzzzwwjj as soon as possible regarding #4359
            output: torch.Tensor) -> torch.Tensor:
    num_tokens = attn_metadata.query_start_loc[-1]
    if attn_metadata.attn_state == AscendAttentionState.PrefillNoCache:
        output_data, _ = torch_npu.npu_fused_infer_attention_score_v2(
why not use npu_fused_infer_attention_score_v2?
yiz-liu left a comment:
After discussion, I withdraw my approval, since the current design does not support ACL Graph; we need to change to the .out API and let vLLM Ascend handle the workspace. If we can't implement this in this PR, we should at least have another PR addressing this issue before we merge this one.
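For context, the general out-variant pattern being requested — caller-owned output buffers so graph capture sees stable addresses — sketched here with a plain torch op rather than the torch_npu attention API, whose out-style signature is not shown in this thread:

    import torch

    # Illustrative only: allocate the output buffer once, outside the captured
    # region, so its address stays fixed across replays.
    q = torch.randn(4, 8)
    k = torch.randn(8, 16)
    scores = torch.empty(4, 16)

    # Inside the captured/replayed region, the out= form writes into the
    # caller-owned buffer instead of allocating a new tensor on every call.
    torch.matmul(q, k, out=scores)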
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from d4bfca6 to 63bbe41.
Force-pushed from 63bbe41 to 394c9c5.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from 394c9c5 to fd5562f.
Force-pushed from fd5562f to c8d5929.
Recently, the community completed the integration of the ATB interface for the attention operator in PR #4531. Ascend 950 can now directly reuse the operator capabilities of A2/A3, so we have rolled back the adaptation for this part.
Force-pushed from 04d9d54 to 8c287f5.
Signed-off-by: wangyao <[email protected]>
Force-pushed from 8c287f5 to d92a81c.
What this PR does / why we need it?
Add support for Ascend950 with Qwen dense models in vllm-ascend.
Does this PR introduce any user-facing change?
How was this patch tested?