Conversation

@weijinqian0
Collaborator

[Refactor] Remove redundant attention operator branches.

Reason:

  1. We replace the other attention ops with fused_infer_attention_score, except in the decode_only state (see the sketch after this list).
  2. Clean up the code and remove 310P support.
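
For illustration only, a minimal sketch of the control flow this refactor aims for. The helper names _forward_prefill and _forward_decode come from the review below; the attention-state check and the signatures here are assumptions, not the actual vllm-ascend code:

    def forward(self, query, key, value, kv_cache, attn_metadata, output):
        # Decode-only batches keep their dedicated decode kernel path.
        if attn_metadata.attn_state == AscendAttentionState.DecodeOnly:
            return self._forward_decode(query, key, value, kv_cache,
                                        attn_metadata, output)
        # Every other state (prefill, chunked prefill, mixed) goes through
        # the single fused_infer_attention_score based prefill path.
        return self._forward_prefill(query, key, value, kv_cache,
                                     attn_metadata)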

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request refactors the attention operator branches by unifying prefill logic into a single _forward_prefill method and removing support for 310P devices. The changes simplify the control flow in the main forward method, making the code cleaner and easier to maintain. My review focuses on the correctness and clarity of this refactoring. I've identified one high-severity issue related to an unused parameter in the new _forward_prefill method, which should be addressed to improve code quality.

Comment on lines 319 to 326
    def _forward_prefill(self,
                         query: torch.Tensor,
                         key: torch.Tensor,
                         value: torch.Tensor,
                         kv_cache: Tuple[torch.Tensor],
                         attn_metadata: AscendMetadata,
                         output: torch.Tensor,
                         num_tokens=0):

Severity: high

The output parameter in the _forward_prefill method is unused. The value passed to it is immediately shadowed by the assignment output, _ = torch_npu.npu_fused_infer_attention_score(...) on line 357. This is misleading as it suggests an in-place operation which is not happening, and the pre-allocated tensor is wasted.

To avoid confusion and make the code cleaner, the output parameter should be removed from the function signature. The call site at line 592 should also be updated to no longer pass this parameter.

    def _forward_prefill(self,
                         query: torch.Tensor,
                         key: torch.Tensor,
                         value: torch.Tensor,
                         kv_cache: Tuple[torch.Tensor],
                         attn_metadata: AscendMetadata,
                         num_tokens=0):
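
For reference, a hedged sketch of how the call site mentioned above might change once the parameter is removed (the surrounding variable names are assumptions; the actual code at line 592 may differ):

    # Before: a pre-allocated tensor was passed as `output` but then
    # shadowed inside _forward_prefill, so it was never written to.
    # output = self._forward_prefill(query, key, value, kv_cache,
    #                                attn_metadata, output, num_tokens)
    # After: use the tensor returned by the fused kernel directly.
    output = self._forward_prefill(query, key, value, kv_cache,
                                   attn_metadata, num_tokens)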

@weijinqian0 added the ready (read for review) and ready-for-test (start test by label for PR) labels on Nov 27, 2025
weijinqian0 added a commit that referenced this pull request Dec 2, 2025
[Refactor] Remove redundant attention operator branches.

Reason:

We replace the other attention ops with fused_infer_attention_score, except in the
decode_only state.
Clean up the code and remove 310P support.

#4455


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: weijinqian_v1 <[email protected]>
Co-authored-by: weijinqian_v1 <[email protected]>
@wangxiyuan
Collaborator

Replaced by #4524

@wangxiyuan closed this on Dec 2, 2025
ChenCangtao pushed a commit to ChenCangtao/vllm-ascend that referenced this pull request Dec 3, 2025
[Refactor] Remove redundant attention operator branches. (…t#4531)

Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
[Refactor] Remove redundant attention operator branches. (…t#4531)

Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
[Refactor] Remove redundant attention operator branches. (…t#4531)

Meihan-chen pushed a commit to Meihan-chen/vllm-ascend that referenced this pull request Dec 5, 2025
[Refactor] Remove redundant attention operator branches. (…t#4531)