
Conversation

@yuxingcyx commented Dec 9, 2025

What this PR does / why we need it?

This PR introduces optimized Triton implementations of the rejection_greedy_sample_kernel and expand_kernel. The new kernels maintain full functional accuracy while delivering significant performance improvements over the existing Triton implementations across various batch sizes and MTP configurations.

Does this PR introduce any user-facing change?

Yes, this PR modifies rejection_sampler.py to use optimized Triton kernels:

  • rejection_greedy_sample_kernel is enhanced with rejection_greedy_sample_spec_len_1_triton and rejection_greedy_sample_triton implementations

  • expand_kernel receives a performance-optimized Triton version

These changes provide substantial performance improvements while maintaining backward compatibility.

github-actions bot commented Dec 9, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description, to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces new Triton kernels for rejection sampling to improve performance, and the benchmarks provided show significant speedups. However, my review has identified several critical issues in the new Triton kernel implementations. The kernels currently use anti-patterns such as serializing work within a thread block by looping over elements and using data-dependent loop bounds. These patterns can lead to correctness issues and severely limit performance. Refactoring these kernels to use proper vectorization with masking is highly recommended to achieve optimal performance and ensure correctness. Additionally, there is duplicated code for calculating grid and block sizes and the use of magic numbers, which should be addressed to improve maintainability.

Comment on lines 619 to 628

```python
for pos in tl.arange(0, BLOCK_SIZE):
    draft_token_id1 = tl.get_element(draft_token_id, (pos, ))
    target_argmax1 = tl.get_element(target_argmax_id, (pos, ))
    position = block_idx * BLOCK_SIZE + pos
    if draft_token_id1 == target_argmax1:
        bonus_renew_1(
            bonus_token_ids_ptr,
            position,
            output_token_ids_ptr,
        )
```

critical

This loop iterates over BLOCK_SIZE and processes elements serially using tl.get_element. This is a Triton anti-pattern that prevents vectorization and causes thread divergence, which significantly degrades performance on a GPU. The logic can be fully vectorized using masks for much better efficiency. The bonus_renew_1 helper function would also become unnecessary.

```python
accepted_mask = (draft_token_id == target_argmax_id) & mask
# The bonus_renew_1 function can be inlined here for vectorization.
# Load bonus tokens only for accepted lanes.
bonus_token_id = tl.load(bonus_token_ids_ptr + offset, mask=accepted_mask)
# Store bonus tokens only for accepted lanes.
tl.store(output_token_ids_ptr + offset * 2 + 1, bonus_token_id, mask=accepted_mask)
```
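The masked pattern suggested above can be illustrated outside Triton with a small NumPy sketch (the data and block shape here are hypothetical; masked `tl.load`/`tl.store` behave like these masked gathers and scatters):

```python
import numpy as np

BLOCK_SIZE = 4
offset = np.arange(BLOCK_SIZE)
mask = offset < 3                       # only 3 valid lanes in this block

draft_token_id = np.array([7, 5, 9, 0])
target_argmax_id = np.array([7, 5, 2, 0])
bonus_token_ids = np.array([11, 12, 13, 14])
output_token_ids = np.full(BLOCK_SIZE * 2, -1)

# Accepted lanes: draft matches target argmax AND the lane is in bounds.
accepted_mask = (draft_token_id == target_argmax_id) & mask

# Masked "store": bonus tokens are written only for accepted lanes,
# at the odd output slots, with no per-element loop.
slots = offset * 2 + 1
output_token_ids[slots] = np.where(accepted_mask, bonus_token_ids,
                                   output_token_ids[slots])
```

All lanes are processed at once; the scalar `if` from the original loop becomes the `accepted_mask` predicate.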

Comment on lines +787 to +872

```python
for i in tl.range(0, BLOCK_SIZE):
    num_tokens1 = tl.get_element(num_tokens, (i, ))
    start_idx1 = tl.get_element(start_idx, (i, ))
    src_val1 = tl.get_element(src_val, (i, ))
    offset1 = tl.arange(0, MAX_NUM_TOKENS)
    tl.store(output_ptr + start_idx1 + offset1,
             src_val1,
             mask=offset1 < num_tokens1)
```

critical

This loop for i in tl.range(0, BLOCK_SIZE) serializes the processing of elements within a thread block by using tl.get_element. This is a Triton anti-pattern that underutilizes the GPU's parallel processing capabilities. While vectorizing a scatter operation with variable-sized segments is non-trivial, the current implementation is highly inefficient. A more performant, vectorized approach should be used to avoid this serialization.
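For intuition, when the start offsets are the exclusive prefix sum of the segment lengths (as with cumulative token counts), the variable-length fill collapses into a single vectorized repeat. A NumPy sketch with hypothetical per-request values:

```python
import numpy as np

num_tokens = np.array([2, 0, 3])                # tokens to expand per request
src_val = np.array([10, 20, 30])                # value replicated per request
start_idx = np.cumsum(num_tokens) - num_tokens  # exclusive prefix sum of lengths

# One vectorized operation writes every segment at once, instead of
# looping lane by lane: each src_val is repeated num_tokens times, and
# the segments land back to back exactly at start_idx.
output = np.repeat(src_val, num_tokens)
```

Requests with `num_tokens == 0` simply contribute nothing, so no masking or branching is needed.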

Comment on lines 154 to 238

```python
vec_len = batch_size
n = cu_num_draft_tokens.numel()
BLOCK_SIZE = 2
grid = triton.cdiv(n, BLOCK_SIZE)
if n >= 40:
    grid = 40
    BLOCK_SIZE = triton.next_power_of_2(n // grid)
```

high

This logic for calculating grid and BLOCK_SIZE contains a hardcoded 'magic number' 40 (line 158), which hurts readability and maintainability. It is also duplicated later in this file for expand_kernel (lines 294-300).

To improve the code, I suggest:

  1. Defining 40 as a named constant with a comment explaining its origin (e.g., TRITON_GRID_SIZE = 40 # Empirically tuned value...).
  2. Extracting this entire calculation into a helper function to avoid code duplication and ensure consistency.

This will make the code cleaner and easier to tune in the future.
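A sketch of the suggested refactor (the constant and function names here are hypothetical, not from the PR; `next_power_of_2` mirrors `triton.next_power_of_2`):

```python
TRITON_GRID_SIZE = 40  # Empirically tuned upper bound on the launch grid.


def next_power_of_2(x: int) -> int:
    """Smallest power of two >= x (mirrors triton.next_power_of_2)."""
    return 1 if x <= 1 else 1 << (x - 1).bit_length()


def compute_grid_and_block(n: int) -> tuple[int, int]:
    """Shared grid/BLOCK_SIZE calculation for both kernels."""
    block_size = 2
    grid = -(-n // block_size)  # ceiling division, like triton.cdiv
    if n >= TRITON_GRID_SIZE:
        grid = TRITON_GRID_SIZE
        block_size = next_power_of_2(n // grid)
    return grid, block_size
```

Both launch sites would then call `compute_grid_and_block(n)` instead of duplicating the logic.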

@yuxingcyx (Author)
How was this patch tested?

Performance Benchmarking (vs original Triton implementations)

rejection_greedy_sample_spec_len_1_triton

| Batch Size | MTP | Triton old (μs) | Triton new (μs) |
|-----------:|----:|----------------:|----------------:|
| 2048 | 1 | 152.787 | 17.278 |
| 1024 | 1 | 77.089 | 10.786 |
| 512 | 1 | 39.209 | 8.302 |
| 256 | 1 | 20.07 | 6.847 |
| 128 | 1 | 10.918 | 5.974 |
| 64 | 1 | 6.343 | 4.373 |
| 32 | 1 | 3.614 | 2.991 |
| 8 | 1 | 1.913 | 1.845 |
| 1 | 1 | 2.776 | 2.101 |

rejection_greedy_sample_triton

| Batch Size | MTP | Triton old (μs) | Triton new (μs) |
|-----------:|----:|----------------:|----------------:|
| 2048 | 2 | 150.746 | 17.57 |
| 1024 | 2 | 76.854 | 11.102 |
| 512 | 2 | 38.973 | 7.66 |
| 256 | 2 | 20.092 | 6.06 |
| 128 | 2 | 10.614 | 5.16 |
| 64 | 2 | 6.294 | 5.418 |
| 32 | 2 | 3.659 | 3.007 |
| 8 | 2 | 1.933 | 2.03 |
| 1 | 2 | 2.774 | 2.203 |

expand_kernel

| Batch Size | MTP | Triton old (μs) | Triton new (μs) |
|-----------:|----:|----------------:|----------------:|
| 2048 | 1 | 153.107 | 18.099 |
| 1024 | 1 | 76.195 | 11.172 |
| 512 | 1 | 39.148 | 7.882 |
| 256 | 1 | 19.676 | 6.863 |
| 128 | 1 | 10.639 | 6.196 |
| 64 | 1 | 6.155 | 4.912 |
| 32 | 1 | 3.615 | 3.325 |
| 8 | 1 | 2.283 | 2.212 |
| 1 | 1 | 2.688 | 2.162 |
| 2048 | 2 | 152.795 | 17.016 |
| 1024 | 2 | 75.96 | 10.567 |
| 512 | 2 | 39.191 | 7.909 |
| 256 | 2 | 19.957 | 6.211 |
| 128 | 2 | 10.51 | 5.541 |
| 64 | 2 | 6.101 | 4.765 |
| 32 | 2 | 3.679 | 3.524 |
| 8 | 2 | 2.25 | 2.123 |
| 1 | 2 | 2.683 | 2.222 |

Accuracy Testing:

The new Triton implementations have passed comprehensive accuracy tests, ensuring full functional equivalence with the original Triton kernels while delivering superior performance.
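The semantics being verified can be sketched with a pure-Python reference model of greedy rejection sampling (a simplified sketch for accuracy checks, not the actual vLLM Ascend kernel code): the target's greedy token is emitted at every position, the first draft/target mismatch rejects the remaining tokens, and the bonus token survives only when every draft token is accepted.

```python
def rejection_greedy_sample_ref(draft_ids, target_argmax_ids, bonus_id):
    """Reference output for one request (hypothetical helper)."""
    out = []
    for draft, target in zip(draft_ids, target_argmax_ids):
        out.append(target)       # target's greedy token is always emitted
        if draft != target:      # first mismatch rejects the rest
            return out
    out.append(bonus_id)         # all drafts accepted: keep the bonus token
    return out
```

An accuracy test can then compare the old and new kernels' outputs against this reference for a sweep of batch sizes and MTP settings.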

@whx-sjtu added the ready (read for review) and ready-for-test (start test by label for PR) labels on Dec 9, 2025
@whx-sjtu (Collaborator) left a comment

LGTM

@gao12312

How to use Triton for the test?

```
(Worker_TP6_EP6 pid=432498) ERROR 12-10 10:47:59 [multiproc_executor.py:822] for pos in tl.arange(0, BLOCK_SIZE):
(Worker_TP7_EP7 pid=432957) ERROR 12-10 10:47:59 [multiproc_executor.py:822] RuntimeError('Only range and static_range iterators are currently supported')
```

@yuxingcyx force-pushed the triton-cyx-new branch 5 times, most recently from c64f377 to a60e4c8 on December 11, 2025 03:32
github-actions bot commented

This pull request has conflicts; please resolve them before we can evaluate the pull request.

Comment on lines +771 to +777

```python
if not rejected:
    bonus_renew(
        bonus_token_ids_ptr,
        position,
        output_token_ids_ptr,
        max_spec_len,
        num_tokens1,
```
Contributor:

Please add a comment to state why you encapsulate this method.

@yuxingcyx (Author)

For certain inputs, no request actually takes the 'if not rejected' branch. Profiling after the optimization showed that this branch still accounted for a significant portion of MTE3 transfer time, and removing it does not affect accuracy while improving performance by 40%. Since the branch may still be executed for other inputs, its steps were moved into a separate operator, bonus_renew, which rejection_greedy_sample_kernel calls internally. Experiments show performance improvements of varying degrees across different inputs. (In the profiling view after this optimization, the right-hand part shows the internal calls of the rejection_greedy_sample_kernel operator to the bonus_renew operator on the left.)
