@elvischenv (Contributor) commented Dec 14, 2025

Purpose

Remove the explicit padding and slicing around the fused MoE computation for GPT-OSS on the Flashinfer MXFP4/MXFP8 (SM100) backend: padding is handled inside the quantization kernel, and the MoE kernel writes directly into an unpadded output buffer, which unlocks the all-reduce + RMSNorm fusion.

Test Plan && Test Result (GPT-OSS-120b, TP8)

Accuracy

PR:

[{'eval_name': 'gpqa', 'model_name': 'gpt-oss-120b-high_temp1.0_20251213_233136', 'metric': 0.7803030303030303}]
[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-high_temp1.0_20251213_234320', 'metric': 0.8875}]

main:

[{'eval_name': 'gpqa', 'model_name': 'gpt-oss-120b-high_temp1.0_20251214_002509', 'metric': 0.7891414141414141}]
[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-high_temp1.0_20251214_001505', 'metric': 0.8875}]

AIME25 matches exactly and the GPQA delta (~0.9 points) is within typical run-to-run variance at temperature 1.0, so accuracy is effectively unchanged.

Kernel

PR:

void cublasLt::splitKreduce_kernel                          2.400 μs
void tensorrt_llm::kernels::quantize_with_block_size        2.944 μs
void moe::dev::routing::routingRenormalize                  5.216 μs
bmm_MxE4m3_MxE2m1MxE4m3_Fp32_t128x16x256u2                 13.216 μs
bmm_Bfloat16_MxE2m1MxE4m3_Fp32_t128x16x256u2                9.504 μs
void moe::dev::finalize::finalizeKernel                     3.168 μs
void flashinfer::trtllm_allreduce_fusion                    7.744 μs (ar+norm)
nvjet_tst_32x64_64x16_4x1_v_bz_splitK_TNN                   4.448 μs

main:

void cublasLt::splitKreduce_kernel                          2.048 μs
triton_poi_fused_constant_pad_nd_moe_forward_0              1.407 μs (pad)
void tensorrt_llm::kernels::quantize_with_block_size        2.432 μs
void moe::dev::routing::routingRenormalize                  5.056 μs
bmm_MxE4m3_MxE2m1MxE4m3_Fp32_t128x16x256u2                 10.368 μs
bmm_Bfloat16_MxE2m1MxE4m3_Fp32_t128x16x256u2                8.512 μs
void moe::dev::finalize::finalizeKernel                     2.112 μs
void vllm::cross_device_reduce_1stage                       8.320 μs (ar)
triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_slice_1    2.336 μs (norm, slice)
nvjet_tst_32x64_64x16_4x1_v_bz_splitK_TNN                   4.448 μs
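
Reading the two traces side by side: main launches an explicit pad kernel (triton_poi_fused_constant_pad_nd_moe_forward_0) before quantization and a fused norm+slice kernel after an unfused all-reduce, while the PR folds the padding into the quantization kernel, lets the MoE write an unpadded buffer, and fuses the all-reduce with the norm (flashinfer::trtllm_allreduce_fusion). A minimal sketch of the two execution paths, with hypothetical helper names standing in for the real vLLM/flashinfer kernels:

```python
from typing import Optional

import torch
import torch.nn.functional as F

ALIGN = 256  # assumed MXFP4 alignment; illustrative only


def run_moe_kernel(x: torch.Tensor, out: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Stand-in for quantize + routed GEMMs + finalize; identity math here."""
    if out is None:
        return x.clone()                    # main path: result keeps the padded width
    out.copy_(x[..., : out.shape[-1]])      # PR path: finalize writes only the unpadded columns
    return out


def moe_forward_main(x: torch.Tensor) -> torch.Tensor:
    """main branch: pad -> MoE at the padded width -> slice back."""
    hidden = x.shape[-1]
    padded = (hidden + ALIGN - 1) // ALIGN * ALIGN   # e.g. 2880 -> 3072
    x_pad = F.pad(x, (0, padded - hidden))           # the extra triton pad kernel
    out = run_moe_kernel(x_pad)
    return out[..., :hidden]                         # trailing slice blocks allreduce+norm fusion


def moe_forward_pr(x: torch.Tensor) -> torch.Tensor:
    """PR: quantize pads internally; MoE writes straight into an unpadded buffer."""
    out = torch.empty_like(x)       # pre-allocated, unpadded output
    run_moe_kernel(x, out=out)      # padding handled inside the quantize kernel
    return out                      # no slice -> allreduce + RMSNorm can fuse


if __name__ == "__main__":
    x = torch.randn(8, 2880)        # GPT-OSS hidden size
    assert torch.allclose(moe_forward_main(x), moe_forward_pr(x))
```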

Perf (GPT-OSS-120b, TP8, concurrency 8)

PR: ~6% end-to-end improvement over main (mean E2EL 3037.5 ms vs 3224.1 ms; see the check after the tables)

============ Serving Benchmark Result ============
Successful requests:                     40
Benchmark duration (s):                  15.20
Total input tokens:                      40960
Total generated tokens:                  40960
Request throughput (req/s):              2.63
Output token throughput (tok/s):         2695.06
Total Token throughput (tok/s):          5390.13
---------------Time to First Token----------------
Mean TTFT (ms):                          49.07
Median TTFT (ms):                        51.67
P99 TTFT (ms):                           62.84
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.92
Median TPOT (ms):                        2.92
P99 TPOT (ms):                           2.98
---------------Inter-token Latency----------------
Mean ITL (ms):                           57.47
Median ITL (ms):                         58.57
P99 ITL (ms):                            59.54
----------------End-to-end Latency----------------
Mean E2EL (ms):                          3037.51
Median E2EL (ms):                        3036.09
P99 E2EL (ms):                           3075.32
==================================================

main:

============ Serving Benchmark Result ============
Successful requests:                     40
Benchmark duration (s):                  16.13
Total input tokens:                      40960
Total generated tokens:                  40960
Request throughput (req/s):              2.48
Output token throughput (tok/s):         2539.61
Total Token throughput (tok/s):          5079.22
---------------Time to First Token----------------
Mean TTFT (ms):                          60.40
Median TTFT (ms):                        63.65
P99 TTFT (ms):                           95.39
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          3.09
Median TPOT (ms):                        3.10
P99 TPOT (ms):                           3.15
---------------Inter-token Latency----------------
Mean ITL (ms):                           60.84
Median ITL (ms):                         61.63
P99 ITL (ms):                            63.25
----------------End-to-end Latency----------------
Mean E2EL (ms):                          3224.12
Median E2EL (ms):                        3237.20
P99 E2EL (ms):                           3256.98
==================================================
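
Sanity-checking the headline number against the two tables (values copied from the benchmark output above):

```python
pr_e2el, main_e2el = 3037.51, 3224.12     # mean E2EL (ms)
pr_tput, main_tput = 2695.06, 2539.61     # output token throughput (tok/s)

print(f"E2E latency: {(main_e2el - pr_e2el) / main_e2el:.1%} lower")    # ~5.8%
print(f"Output tput: {(pr_tput - main_tput) / main_tput:.1%} higher")   # ~6.1%
```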

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a performance optimization for Mixture-of-Experts (MoE) layers in GPT-OSS models using Flashinfer with MXFP4/MXFP8 quantization. The key changes involve eliminating explicit padding and slicing operations around the MoE computation. This is achieved by leveraging new capabilities in the Flashinfer library to handle padding within the quantization kernel and to write to an unpadded output buffer directly.

The main changes are:

  1. Elimination of Padding/Slicing: The FusedMoE layer no longer performs manual padding before the MoE kernel for supported backends. Instead, padding is handled by flashinfer::mxfp8_quantize, and the subsequent slice becomes unnecessary because the MoE kernel writes directly to a smaller, pre-allocated output tensor. This enables better fusion opportunities, as seen by the all-reduce + norm fusion now taking effect.
  2. Code Refactoring: The logic for rounding up hidden sizes for MXFP4 quantization has been moved from the generic fused_moe/layer.py to the more appropriate quantization/mxfp4.py, removing duplicated code and improving modularity (a sketch of the round-up arithmetic follows this list).
  3. Conditional Logic: The new behavior is controlled by a support_padded_mxfp8_quant flag, ensuring that it only applies to the SM100_FI_MXFP4_MXFP8_TRTLLM backend on Blackwell GPUs, maintaining compatibility with other configurations.
  4. Testing: New test cases have been added to test_fusions_e2e.py to validate the fusions and performance improvements for GPT-OSS models on Blackwell.
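
For point 2, the moved logic is next-multiple round-up of the hidden size. A minimal sketch of the arithmetic (the actual helper and its alignment constant live in vLLM's quantization/mxfp4.py and may differ):

```python
def round_up(x: int, align: int) -> int:
    """Round x up to the next multiple of align."""
    return (x + align - 1) // align * align


# GPT-OSS hidden size padded for the MXFP4 weight layout
# (256 is an assumed alignment, shown for illustration only):
assert round_up(2880, 256) == 3072
```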

The changes are well-implemented and align with the stated goals of improving performance. The code is clean and the new logic is properly encapsulated. The performance benchmarks in the PR description show a significant 6% end-to-end improvement, which is a great result.

I have reviewed the code and found no critical or high-severity issues. The changes are correct and contribute to better performance and code structure.
