@elvischenv (Contributor) commented Dec 14, 2025

Purpose

Remove the explicit padding and slicing around the fused MoE computation for GPT-OSS on the Flashinfer MXFP4/MXFP8 (SM100) backend: padding is handled inside the quantization kernel, and the MoE kernel writes directly into an unpadded output buffer, which unlocks the all-reduce + RMSNorm fusion.

Test Plan && Test Result (GPT-OSS-120b, TP8)

Accuracy

PR:

[{'eval_name': 'gpqa', 'model_name': 'gpt-oss-120b-high_temp1.0_20251213_233136', 'metric': 0.7803030303030303}]
[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-high_temp1.0_20251213_234320', 'metric': 0.8875}]

main:

[{'eval_name': 'gpqa', 'model_name': 'gpt-oss-120b-high_temp1.0_20251214_002509', 'metric': 0.7891414141414141}]
[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-high_temp1.0_20251214_001505', 'metric': 0.8875}]

AIME25 matches exactly and the GPQA delta (~0.9 points) is within typical run-to-run variance at temperature 1.0, so accuracy is effectively unchanged.

Kernel

PR:

void cublasLt::splitKreduce_kernel                          2.400 μs
void tensorrt_llm::kernels::quantize_with_block_size        2.944 μs
void moe::dev::routing::routingRenormalize                  5.216 μs
bmm_MxE4m3_MxE2m1MxE4m3_Fp32_t128x16x256u2                 13.216 μs
bmm_Bfloat16_MxE2m1MxE4m3_Fp32_t128x16x256u2                9.504 μs
void moe::dev::finalize::finalizeKernel                     3.168 μs
void flashinfer::trtllm_allreduce_fusion                    7.744 μs (ar+norm)
nvjet_tst_32x64_64x16_4x1_v_bz_splitK_TNN                   4.448 μs

main:

void cublasLt::splitKreduce_kernel                          2.048 μs
triton_poi_fused_constant_pad_nd_moe_forward_0              1.407 μs (pad)
void tensorrt_llm::kernels::quantize_with_block_size        2.432 μs
void moe::dev::routing::routingRenormalize                  5.056 μs
bmm_MxE4m3_MxE2m1MxE4m3_Fp32_t128x16x256u2                 10.368 μs
bmm_Bfloat16_MxE2m1MxE4m3_Fp32_t128x16x256u2                8.512 μs
void moe::dev::finalize::finalizeKernel                     2.112 μs
void vllm::cross_device_reduce_1stage                       8.320 μs (ar)
triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_slice_1    2.336 μs (norm, slice)
nvjet_tst_32x64_64x16_4x1_v_bz_splitK_TNN                   4.448 μs
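
Reading the two traces side by side: main launches an explicit pad kernel (triton_poi_fused_constant_pad_nd_moe_forward_0) before quantization and a fused norm+slice kernel after an unfused all-reduce, while the PR folds the padding into the quantization kernel, lets the MoE write an unpadded buffer, and fuses the all-reduce with the norm (flashinfer::trtllm_allreduce_fusion). A minimal sketch of the two execution paths, with hypothetical helper names standing in for the real vLLM/flashinfer kernels:

```python
from typing import Optional

import torch
import torch.nn.functional as F

ALIGN = 256  # assumed MXFP4 alignment; illustrative only


def run_moe_kernel(x: torch.Tensor, out: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Stand-in for quantize + routed GEMMs + finalize; identity math here."""
    if out is None:
        return x.clone()                    # main path: result keeps the padded width
    out.copy_(x[..., : out.shape[-1]])      # PR path: finalize writes only the unpadded columns
    return out


def moe_forward_main(x: torch.Tensor) -> torch.Tensor:
    """main branch: pad -> MoE at the padded width -> slice back."""
    hidden = x.shape[-1]
    padded = (hidden + ALIGN - 1) // ALIGN * ALIGN   # e.g. 2880 -> 3072
    x_pad = F.pad(x, (0, padded - hidden))           # the extra triton pad kernel
    out = run_moe_kernel(x_pad)
    return out[..., :hidden]                         # trailing slice blocks allreduce+norm fusion


def moe_forward_pr(x: torch.Tensor) -> torch.Tensor:
    """PR: quantize pads internally; MoE writes straight into an unpadded buffer."""
    out = torch.empty_like(x)       # pre-allocated, unpadded output
    run_moe_kernel(x, out=out)      # padding handled inside the quantize kernel
    return out                      # no slice -> allreduce + RMSNorm can fuse


if __name__ == "__main__":
    x = torch.randn(8, 2880)        # GPT-OSS hidden size
    assert torch.allclose(moe_forward_main(x), moe_forward_pr(x))
```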

Perf (GPT-OSS-120b, TP8, concurrency 8)

PR: ~6% end-to-end improvement over main (mean E2EL 3037.5 ms vs 3224.1 ms; see the check after the tables)

============ Serving Benchmark Result ============
Successful requests:                     40
Benchmark duration (s):                  15.20
Total input tokens:                      40960
Total generated tokens:                  40960
Request throughput (req/s):              2.63
Output token throughput (tok/s):         2695.06
Total Token throughput (tok/s):          5390.13
---------------Time to First Token----------------
Mean TTFT (ms):                          49.07
Median TTFT (ms):                        51.67
P99 TTFT (ms):                           62.84
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.92
Median TPOT (ms):                        2.92
P99 TPOT (ms):                           2.98
---------------Inter-token Latency----------------
Mean ITL (ms):                           57.47
Median ITL (ms):                         58.57
P99 ITL (ms):                            59.54
----------------End-to-end Latency----------------
Mean E2EL (ms):                          3037.51
Median E2EL (ms):                        3036.09
P99 E2EL (ms):                           3075.32
==================================================

main:

============ Serving Benchmark Result ============
Successful requests:                     40
Benchmark duration (s):                  16.13
Total input tokens:                      40960
Total generated tokens:                  40960
Request throughput (req/s):              2.48
Output token throughput (tok/s):         2539.61
Total Token throughput (tok/s):          5079.22
---------------Time to First Token----------------
Mean TTFT (ms):                          60.40
Median TTFT (ms):                        63.65
P99 TTFT (ms):                           95.39
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          3.09
Median TPOT (ms):                        3.10
P99 TPOT (ms):                           3.15
---------------Inter-token Latency----------------
Mean ITL (ms):                           60.84
Median ITL (ms):                         61.63
P99 ITL (ms):                            63.25
----------------End-to-end Latency----------------
Mean E2EL (ms):                          3224.12
Median E2EL (ms):                        3237.20
P99 E2EL (ms):                           3256.98
==================================================
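
Sanity-checking the headline number against the two tables (values copied from the benchmark output above):

```python
pr_e2el, main_e2el = 3037.51, 3224.12     # mean E2EL (ms)
pr_tput, main_tput = 2695.06, 2539.61     # output token throughput (tok/s)

print(f"E2E latency: {(main_e2el - pr_e2el) / main_e2el:.1%} lower")    # ~5.8%
print(f"Output tput: {(pr_tput - main_tput) / main_tput:.1%} higher")   # ~6.1%
```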

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a performance optimization for Mixture-of-Experts (MoE) layers in GPT-OSS models using Flashinfer with MXFP4/MXFP8 quantization. The key changes involve eliminating explicit padding and slicing operations around the MoE computation. This is achieved by leveraging new capabilities in the Flashinfer library to handle padding within the quantization kernel and to write to an unpadded output buffer directly.

The main changes are:

  1. Elimination of Padding/Slicing: The FusedMoE layer no longer performs manual padding before the MoE kernel for supported backends. Instead, padding is handled by flashinfer::mxfp8_quantize, and the subsequent slice becomes unnecessary because the MoE kernel writes directly to a smaller, pre-allocated output tensor. This enables better fusion opportunities, as seen by the all-reduce + norm fusion now taking effect.
  2. Code Refactoring: The logic for rounding up hidden sizes for MXFP4 quantization has been moved from the generic fused_moe/layer.py to the more appropriate quantization/mxfp4.py, removing duplicated code and improving modularity (a sketch of the round-up arithmetic follows this list).
  3. Conditional Logic: The new behavior is controlled by a support_padded_mxfp8_quant flag, ensuring that it only applies to the SM100_FI_MXFP4_MXFP8_TRTLLM backend on Blackwell GPUs, maintaining compatibility with other configurations.
  4. Testing: New test cases have been added to test_fusions_e2e.py to validate the fusions and performance improvements for GPT-OSS models on Blackwell.
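
For point 2, the moved logic is next-multiple round-up of the hidden size. A minimal sketch of the arithmetic (the actual helper and its alignment constant live in vLLM's quantization/mxfp4.py and may differ):

```python
def round_up(x: int, align: int) -> int:
    """Round x up to the next multiple of align."""
    return (x + align - 1) // align * align


# GPT-OSS hidden size padded for the MXFP4 weight layout
# (256 is an assumed alignment, shown for illustration only):
assert round_up(2880, 256) == 3072
```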

The changes are well-implemented and align with the stated goals of improving performance. The code is clean and the new logic is properly encapsulated. The performance benchmarks in the PR description show a significant 6% end-to-end improvement, which is a great result.

I have reviewed the code and found no critical or high-severity issues. The changes are correct and contribute to better performance and code structure.
