
Conversation

@whx-sjtu (Collaborator) commented on Nov 24, 2025

What this PR does / why we need it?

This PR adds a Triton rope kernel which supports scenarios where `rope_dim != head_dim`. This removes the split op before rope and the concat op after rope. Profiling shows a clear improvement (rope-related kernel time drops from 35us to 12us in the case profiled below).

Original implementation (2 split + 2 rope + 2 slice + 2 concat):
[profiling screenshot]
Because we currently only support piecewise aclgraph for DS 3.2, there are plenty of free bubbles. The total compute time of all rope-related kernels is 35us:
[profiling screenshot]

New Triton rope: 12us
[profiling screenshots]
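To make the saved ops concrete, here is a minimal PyTorch-level sketch of the two call patterns. The shapes, helper names, and the `rope_forward_triton` signature in the trailing comment are illustrative assumptions, not the code from this PR.

```python
import torch

def rope_with_split_concat(q, k, cos, sin, rope_dim, rope_fn):
    """Baseline: partial RoPE emulated with split -> rope -> concat."""
    q_rot, q_pass = q[..., :rope_dim], q[..., rope_dim:]
    k_rot, k_pass = k[..., :rope_dim], k[..., rope_dim:]
    q_rot, k_rot = rope_fn(q_rot, k_rot, cos, sin)   # rotate only the rope slice
    q = torch.cat([q_rot, q_pass], dim=-1)           # extra copy
    k = torch.cat([k_rot, k_pass], dim=-1)           # extra copy
    return q, k

# With a fused partial-rope kernel the whole sequence collapses into a single
# in-place call that only touches the first rope_dim elements of each head, e.g.
#   q, k = rope_forward_triton(positions, q, k)   # hypothetical signature
```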

Does this PR introduce any user-facing change?

None

How was this patch tested?

I will add related unit tests once CI is integrated with Triton.
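Until those tests land, the natural shape of such a unit test is to compare the fused path against a plain split-rope-concat PyTorch reference. A rough sketch, assuming a NEOX-style rotation and a placeholder `fused_partial_rope` callable (not the PR's actual API, with device/dtype handling elided):

```python
import torch

def reference_partial_rope(x, cos, sin, rope_dim):
    """Pure-PyTorch reference: rotate only the first rope_dim dims of each head."""
    x_rot, x_pass = x[..., :rope_dim].clone(), x[..., rope_dim:]
    half = rope_dim // 2
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    cos = cos.unsqueeze(1)  # broadcast over the head dimension
    sin = sin.unsqueeze(1)
    rot = torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)
    return torch.cat([rot, x_pass], dim=-1)

def test_partial_rope_matches_reference(fused_partial_rope):
    torch.manual_seed(0)
    num_tokens, num_heads, head_dim, rope_dim = 8, 4, 192, 64
    x = torch.randn(num_tokens, num_heads, head_dim, dtype=torch.float32)
    freqs = torch.randn(num_tokens, rope_dim // 2)
    cos, sin = freqs.cos(), freqs.sin()

    expected = reference_partial_rope(x, cos, sin, rope_dim)
    actual = fused_partial_rope(x.clone(), cos, sin, rope_dim)
    torch.testing.assert_close(actual, expected, rtol=1e-3, atol=1e-3)
```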

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist (Contributor, bot) left a comment


Code Review

This pull request introduces a Triton kernel for Rotary Positional Embedding (RoPE) to support partial RoPE, where the RoPE dimension is not equal to the head dimension. This is a valuable performance optimization as it avoids explicit split and concat operations. The implementation includes a new Triton kernel _triton_rope and a wrapper function rope_forward_triton. The changes in sfa_v1.py correctly use this new kernel when Triton is available, with a fallback to the existing implementation. The Triton kernel itself appears to be well-written, handling both NEOX and non-NEOX styles, and correctly deals with padding and masking for variable dimensions. The upcasting to float32 for intermediate computations is a good practice for maintaining precision. I have one comment regarding a docstring that could be improved for clarity. Overall, the changes are logical and well-structured.
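For readers who have not opened the diff, here is a heavily simplified sketch of how a partial-RoPE Triton kernel can be structured. It is NEOX-style only, assumes a contiguous `[num_tokens, num_heads, head_dim]` layout, and uses illustrative names and signatures rather than the actual `_triton_rope` / `rope_forward_triton` from this PR:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _partial_rope_sketch(
    x_ptr,              # [num_tokens, num_heads, head_dim], rotated in place
    cos_ptr,            # [num_tokens, rope_dim // 2]
    sin_ptr,            # [num_tokens, rope_dim // 2]
    num_heads,
    head_dim: tl.constexpr,
    rope_dim: tl.constexpr,
    BLOCK: tl.constexpr,
):
    # One program per (token, head). Only the first rope_dim elements of the
    # head are rotated; the trailing head_dim - rope_dim elements are never
    # read or written, which is what removes the split/concat around RoPE.
    token = tl.program_id(0)
    head = tl.program_id(1)

    half = rope_dim // 2
    offs = tl.arange(0, BLOCK)
    mask = offs < half

    base = (token * num_heads + head) * head_dim
    # Upcast to fp32 for the rotation, matching the precision note above.
    x1 = tl.load(x_ptr + base + offs, mask=mask, other=0.0).to(tl.float32)
    x2 = tl.load(x_ptr + base + half + offs, mask=mask, other=0.0).to(tl.float32)
    cos = tl.load(cos_ptr + token * half + offs, mask=mask, other=0.0).to(tl.float32)
    sin = tl.load(sin_ptr + token * half + offs, mask=mask, other=0.0).to(tl.float32)

    # NEOX-style rotation of the rope slice.
    o1 = x1 * cos - x2 * sin
    o2 = x2 * cos + x1 * sin

    tl.store(x_ptr + base + offs, o1.to(x_ptr.dtype.element_ty), mask=mask)
    tl.store(x_ptr + base + half + offs, o2.to(x_ptr.dtype.element_ty), mask=mask)


def partial_rope_(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor, rope_dim: int):
    """In-place partial RoPE; x is [num_tokens, num_heads, head_dim], contiguous."""
    num_tokens, num_heads, head_dim = x.shape
    grid = (num_tokens, num_heads)
    _partial_rope_sketch[grid](
        x, cos, sin, num_heads,
        head_dim=head_dim, rope_dim=rope_dim,
        BLOCK=triton.next_power_of_2(rope_dim // 2),
    )
    return x
```

The point the review highlights is visible here: only the first `rope_dim` elements of each head are loaded, rotated in fp32, and stored back, so the pass-through tail of the head is never copied.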

@whx-sjtu force-pushed the partial_rope_triton branch from a3225c4 to 2d30141 on November 27, 2025 05:11
@whx-sjtu (Collaborator, Author) commented on Nov 27, 2025

For the prefill stage with a larger batch size, the performance gain remains substantial. Here is an example with bs=1001 on DS V3.2:

Without Triton rope: 109us
[profiling screenshot]

With Triton rope: 38us
[profiling screenshot]

@whx-sjtu force-pushed the partial_rope_triton branch 3 times, most recently from b07e1a4 to 77c3431 on December 1, 2025 03:39
@whx-sjtu added the ready (read for review) and ready-for-test (start test by label for PR) labels on Dec 1, 2025
@whx-sjtu force-pushed the partial_rope_triton branch 2 times, most recently from 816d029 to b0f7e4a on December 1, 2025 12:25
Signed-off-by: whx-sjtu <[email protected]>
@wangxiyuan merged commit 96b2cdf into vllm-project:main on Dec 2, 2025
22 checks passed
ChenCangtao pushed a commit to ChenCangtao/vllm-ascend that referenced this pull request Dec 3, 2025
…ect#4413)

### What this PR does / why we need it?
This PR adds a Triton rope kernel which supports scenarios where
`rope_dim != head_dim`. This removes the split op before rope and the
concat op after rope. Profiling shows a clear improvement.

### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
I will add related unit tests once CI is integrated with Triton.


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: whx-sjtu <[email protected]>
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
…ect#4413)

### What this PR does / why we need it?
This PR adds a Triton rope kernel which supports scenarios where
`rope_dim != head_dim`. This removes the split op before rope and the
concat op after rope. Profiling shows a clear improvement.

### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
I will add related unit tests once CI is integrated with Triton.

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: whx-sjtu <[email protected]>
Signed-off-by: Che Ruan <[email protected]>
Meihan-chen pushed a commit to Meihan-chen/vllm-ascend that referenced this pull request Dec 5, 2025
…ect#4413)

### What this PR does / why we need it?
This PR adds a Triton rope kernel which supports scenarios where
`rope_dim != head_dim`. This removes the split op before rope and the
concat op after rope. Profiling shows a clear improvement.

### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
I will add related unit tests once CI is integrated with Triton.


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: whx-sjtu <[email protected]>

Labels

module:ops, module:tests, ready (read for review), ready-for-test (start test by label for PR)

Projects

None yet


2 participants