-
Notifications
You must be signed in to change notification settings - Fork 624
[Bugfix] Fixes aclnnApplyRotaryPosEmbV2 Op Tiling Error Due To UB Size Exceeded
#3702
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…f required UB size exceeding Signed-off-by: zhoux77899 <[email protected]>
|
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according Contributing and Testing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a bugfix for an operator tiling error on Ascend NPUs by checking for potential Unified Buffer (UB) size overflow and providing a fallback path. The core logic of the fix is sound. My review includes suggestions to improve the maintainability and efficiency of the newly added code, specifically by simplifying a complex calculation and reordering operations to avoid unnecessary work.
| ub_required = (q_n + k_n) * 128 * cast_size * 2 + 128 * dtype_size * 4 + ( | ||
| q_n + k_n) * 128 * cast_size + ( | ||
| q_n + k_n) * 128 * cast_size * 2 + cast * 128 * 4 * 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The calculation for ub_required is complex and contains repeated terms, making it hard to read and verify. It can be mathematically simplified to a more concise form. This improves readability and maintainability.
The original expression:
(q_n + k_n) * 128 * cast_size * 2 + 128 * dtype_size * 4 + (q_n + k_n) * 128 * cast_size + (q_n + k_n) * 128 * cast_size * 2 + cast * 128 * 4 * 2
can be simplified to:
(q_n + k_n) * 128 * cast_size * 5 + 128 * dtype_size * 4 + cast * 128 * 8
ub_required = (q_n + k_n) * 128 * cast_size * 5 + 128 * dtype_size * 4 + cast * 128 * 8Signed-off-by: zhoux77899 <[email protected]>
Signed-off-by: zhoux77899 <[email protected]>
… for avoiding repeated executing Signed-off-by: zhoux77899 <[email protected]>
Signed-off-by: zhoux77899 <[email protected]>
|
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
Signed-off-by: Ruri <[email protected]>
Signed-off-by: zhoux77899 <[email protected]>
Signed-off-by: zhoux77899 <[email protected]>
|
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
What this PR does / why we need it?
aclnnApplyRotaryPosEmbV2([main] Optimize rope in Qwen Models #2571) didn't consider UB sizeaclnnApplyRotaryPosEmbV2op tiling error due to required UB size is too large, exceeding the UB size of NPUaclnnApplyRotaryPosEmbV2delivers the best performance, followed by the ATBRopeOperation, while invokingaclnnRotaryPositionEmbeddingtwice yields the poorest resultsaclnnApplyRotaryPosEmbV2are not met, we fall back to using the ATBRopeOperationas an alternativeDoes this PR introduce any user-facing change?
None
How was this patch tested?
Performance tested on

Qwen3-Coder-480B-A35B-Instructwithtp=1,dp=16,ep=16andtp=2,dp=8,ep=16on A3