[Refactor] Remove redundant attention operator branches. #4531
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request refactors the attention mechanism by removing redundant operator branches and support for the 310P device. The changes unify different prefill attention paths into a single _forward_prefill method using npu_fused_infer_attention_score, which simplifies the codebase significantly. The _forward_v1_style method has been removed, and a new _forward_encode method has been introduced to handle encoder-only attention, making the main forward method cleaner and more readable. The logic for creating attention masks has also been simplified. Overall, the changes are well-executed, improve code maintainability, and appear to be correct. I have no major concerns.
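To make the summary above concrete, here is a minimal sketch of what a unified prefill path built on torch_npu.npu_fused_infer_attention_score can look like. The function name echoes the _forward_prefill mentioned above; the keyword arguments (atten_mask, actual_seq_lengths, num_key_value_heads, scale, input_layout) and the TND layout are assumptions based on the public torch_npu kernel documentation, not a copy of the vllm-ascend implementation.

```python
import torch
import torch_npu  # Ascend NPU extension providing the fused kernel


def forward_prefill_sketch(
    query: torch.Tensor,      # [total_tokens, num_heads, head_dim]
    key: torch.Tensor,        # [total_tokens, num_kv_heads, head_dim]
    value: torch.Tensor,      # [total_tokens, num_kv_heads, head_dim]
    attn_mask: torch.Tensor,  # boolean causal mask (True = do not attend)
    seq_lens: list[int],      # per-request sequence lengths
    num_heads: int,
    num_kv_heads: int,
    scale: float,
) -> torch.Tensor:
    # One fused kernel call replaces the previous prefill-specific branches
    # (flash-attention style, v1-style, etc.). Keyword names are assumed
    # from the torch_npu documentation and may differ slightly in practice.
    output, _lse = torch_npu.npu_fused_infer_attention_score(
        query,
        key,
        value,
        atten_mask=attn_mask,
        actual_seq_lengths=seq_lens,
        actual_seq_lengths_kv=seq_lens,
        num_heads=num_heads,
        num_key_value_heads=num_kv_heads,
        scale=scale,
        input_layout="TND",
    )
    return output
```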
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Review thread on the new helper:
    def _make_fia_attention_mask(self) -> torch.Tensor:
fia_mask can also be deleted.
It's used in the pcp branch. It will be removed in the next PR.
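For context, the helper under discussion builds the boolean mask passed to the fused kernel's atten_mask argument. Below is a minimal, hypothetical sketch of such a causal mask builder; the max_seq_len parameter and the "True means masked" convention are assumptions, not the project's actual _make_fia_attention_mask.

```python
import torch


def make_fia_attention_mask_sketch(max_seq_len: int,
                                    device: torch.device) -> torch.Tensor:
    # Upper-triangular boolean mask: True marks the future positions a query
    # token must not attend to, the convention commonly expected by
    # fused-infer-attention style kernels.
    mask = torch.ones(max_seq_len, max_seq_len, dtype=torch.bool, device=device)
    return torch.triu(mask, diagonal=1)
```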
…t#4531) [Refactor] Remove redundant attention operator branches. Reason: We replace the other attention ops with fused_infer_attention_score, except for the decode_only state. Clean up the code and remove 310P support. vllm-project#4455 - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: weijinqian_v1 <[email protected]> Co-authored-by: weijinqian_v1 <[email protected]>
@weijinqian0 #4713 The gemma-2-9b-it & gemma-3-4b-it accuracy tests failed after this PR.
[Refactor] Remove redundant attention operator branches.
Reason:
We replace the other attention ops with fused_infer_attention_score, except for the decode_only state.
Clean up the code and remove 310P support.
#4455
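To make the "except for the decode_only state" wording concrete, the sketch below shows the kind of dispatch the description implies: the decode-only state keeps its dedicated path, while every other state is routed through the single fused-infer-attention prefill path. The enum values and the _forward_* helper names are assumptions drawn from the PR text and the review summary, not the actual vllm-ascend source.

```python
# Illustrative dispatch only; names are assumptions, not vllm-ascend code.
from enum import Enum, auto


class AttnState(Enum):
    DECODE_ONLY = auto()
    ENCODE_ONLY = auto()
    PREFILL = auto()  # stands in for all remaining prefill-style states


class AttentionBackendSketch:
    # The three helpers are stubs standing in for the real kernels.
    def _forward_decode(self, q, k, v, meta):   # paged-attention decode path
        raise NotImplementedError

    def _forward_encode(self, q, k, v, meta):   # encoder-only attention path
        raise NotImplementedError

    def _forward_prefill(self, q, k, v, meta):  # fused-infer-attention path
        raise NotImplementedError

    def forward(self, q, k, v, meta):
        if meta.attn_state == AttnState.DECODE_ONLY:
            return self._forward_decode(q, k, v, meta)
        if meta.attn_state == AttnState.ENCODE_ONLY:
            return self._forward_encode(q, k, v, meta)
        # Every other state collapses into the single fused prefill path.
        return self._forward_prefill(q, k, v, meta)
```

The point of the refactor is that only the decode branch remains special-cased; the other prefill variants funnel into one fused call instead of separate operator branches.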