[Refactor] Remove redundant attention operator branches. #4455
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request refactors the attention operator branches by unifying prefill logic into a single _forward_prefill method and removing support for 310P devices. The changes simplify the control flow in the main forward method, making the code cleaner and easier to maintain. My review focuses on the correctness and clarity of this refactoring. I've identified one high-severity issue related to an unused parameter in the new _forward_prefill method, which should be addressed to improve code quality.
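As a rough, hedged illustration of the simplified control flow described above (the enum values, stub methods, and the attn_metadata.attn_state field are assumptions for this sketch, not the code merged in this PR):

```python
# Minimal, hypothetical sketch of the dispatch this PR moves toward:
# everything except the decode-only state goes through one prefill path
# built on torch_npu.npu_fused_infer_attention_score. Names below mirror
# the discussion in this PR but are illustrative only.
from enum import Enum, auto


class AscendAttentionState(Enum):
    DecodeOnly = auto()
    PrefillNoCache = auto()
    ChunkedPrefill = auto()


class AttentionImplSketch:

    def forward(self, query, key, value, kv_cache, attn_metadata, output):
        if attn_metadata.attn_state == AscendAttentionState.DecodeOnly:
            # Decode keeps its dedicated paged-attention path.
            return self._forward_decode_only(query, kv_cache, attn_metadata,
                                             output)
        # All other states share a single prefill path; no pre-allocated
        # output tensor is passed (see the review comment below).
        return self._forward_prefill(query, key, value, kv_cache,
                                     attn_metadata)

    def _forward_decode_only(self, query, kv_cache, attn_metadata, output):
        raise NotImplementedError  # placeholder for the decode kernel call

    def _forward_prefill(self, query, key, value, kv_cache, attn_metadata,
                         num_tokens=0):
        raise NotImplementedError  # placeholder for the fused prefill call
```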
def _forward_prefill(self,
                     query: torch.Tensor,
                     key: torch.Tensor,
                     value: torch.Tensor,
                     kv_cache: Tuple[torch.Tensor],
                     attn_metadata: AscendMetadata,
                     output: torch.Tensor,
                     num_tokens=0):
The `output` parameter of the `_forward_prefill` method is unused. The value passed in is immediately shadowed by the assignment `output, _ = torch_npu.npu_fused_infer_attention_score(...)` on line 357. This is misleading because it suggests an in-place write that never happens, and the pre-allocated tensor is wasted.
To avoid confusion and make the code cleaner, the `output` parameter should be removed from the function signature, and the call site at line 592 should be updated to no longer pass this argument.
def _forward_prefill(self,
                     query: torch.Tensor,
                     key: torch.Tensor,
                     value: torch.Tensor,
                     kv_cache: Tuple[torch.Tensor],
                     attn_metadata: AscendMetadata,
                     num_tokens=0):
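For completeness, a hedged before/after sketch of what the matching call-site update might look like (the surrounding variable names and exact argument list are assumptions; only dropping the output argument reflects the suggestion above):

```python
# Hypothetical call site before the suggestion: a pre-allocated tensor is
# passed in, only to be shadowed inside _forward_prefill.
# output = self._forward_prefill(query, key, value, kv_cache,
#                                attn_metadata, output, num_tokens)

# Hypothetical call site after the suggestion: keep the tensor returned by
# npu_fused_infer_attention_score via _forward_prefill instead.
output = self._forward_prefill(query, key, value, kv_cache,
                               attn_metadata, num_tokens)
```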
[Refactor] Remove redundant attention operator branches. Reason: We replace the other attention ops with fused_infer_attention_score, except for the decode_only state. This cleans up the code and removes 310P support. #4455 - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: weijinqian_v1 <[email protected]> Co-authored-by: weijinqian_v1 <[email protected]>
Replaced by #4524
…t#4531) [Refactor] Remove redundant attention operator branches. Reason: We replace the other attention ops with fused_infer_attention_score, except for the decode_only state. This cleans up the code and removes 310P support. vllm-project#4455 - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: weijinqian_v1 <[email protected]> Co-authored-by: weijinqian_v1 <[email protected]> Signed-off-by: Che Ruan <[email protected]>
[Refactor] Remove redundant attention operator branches.
Reason: We replace the other attention ops with fused_infer_attention_score, except for the decode_only state. This cleans up the code and removes 310P support.