
Conversation

@pichangping (Contributor) commented Oct 27, 2025

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

@github-actions

This pull request has conflicts; please resolve them before we can evaluate the pull request.

@gemini-code-assist (Contributor, bot) left a comment

Code Review

This pull request refactors the attention mechanism to use the TND tensor format directly, removing the need for packing and unpacking between TND and BSND formats. It also replaces a manual attention update implementation with a more optimized npu_attention_update kernel.
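
For context, here is a minimal sketch (illustrative only; the tensor names and shapes are assumptions, not code from this PR) of the difference between the padded BSND layout and the packed TND layout this refactor adopts:

    # Hypothetical illustration of BSND vs. TND attention input layouts.
    import torch

    num_heads, head_dim = 8, 128
    seq_lens = [3, 5, 2]                    # tokens per request
    batch, max_seq = len(seq_lens), max(seq_lens)

    # BSND: padded per-request layout (batch, seq, num_heads, head_dim).
    bsnd = torch.zeros(batch, max_seq, num_heads, head_dim)

    # TND: packed layout (total_tokens, num_heads, head_dim). No padding,
    # so the pack/unpack step around the attention kernel goes away.
    tnd = torch.cat([bsnd[i, :n] for i, n in enumerate(seq_lens)], dim=0)
    assert tnd.shape == (sum(seq_lens), num_heads, head_dim)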

I've found a critical bug in the graph capture path for decoding, where several typos would lead to an AttributeError. Additionally, I've identified a potential issue with a hardcoded attention mask size that could cause problems for models with long context lengths.

Comment on lines +922 to +923
    self.key_cache.shape[1], attn_metadata.decode.
        num_computed_tokens_of_cp_dcp[:, self.cp_rank, self.dcp_rank],

critical

There are several typos on these lines within the graph capture logic that will cause an AttributeError at runtime:

  1. attn_metadata.decode should be attn_metadata.decode_meta.
  2. num_computed_tokens_of_cp_dcp should be num_computed_tokens_of_pcp_dcp.
  3. self.cp_rank should be self.pcp_rank.

These seem to be typos introduced during refactoring, as the previous version of the code used the correct names.

Suggested change

    - self.key_cache.shape[1], attn_metadata.decode.
    -     num_computed_tokens_of_cp_dcp[:, self.cp_rank, self.dcp_rank],
    + self.key_cache.shape[1], attn_metadata.decode_meta.
    +     num_computed_tokens_of_pcp_dcp[:, self.pcp_rank, self.dcp_rank],

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description, to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
