[Perf] Eliminate padding and slicing op for GPT-OSS with Flashinfer MXFP4 MXFP8 MoE #30647
base: main
Conversation
Signed-off-by: elvischenv <[email protected]>
Code Review
This pull request introduces a performance optimization for Mixture-of-Experts (MoE) layers in GPT-OSS models using Flashinfer with MXFP4/MXFP8 quantization. The key changes involve eliminating explicit padding and slicing operations around the MoE computation. This is achieved by leveraging new capabilities in the Flashinfer library to handle padding within the quantization kernel and to write to an unpadded output buffer directly.
The main changes are:
- Elimination of Padding/Slicing: The `FusedMoE` layer no longer performs manual padding before the MoE kernel for supported backends. Instead, the padding is handled by `flashinfer::mxfp8_quantize`, and the subsequent slicing is effectively done by the MoE kernel writing to a smaller, pre-allocated output tensor. This change enables better fusion opportunities, as seen by the `all-reduce + norm` fusion now being possible.
- Code Refactoring: The logic for rounding up hidden sizes for MXFP4 quantization has been moved from the generic `fused_moe/layer.py` to the specific `quantization/mxfp4.py`, which is a more appropriate location. This removes duplicated code and improves modularity (see the sketch after this list).
- Conditional Logic: The new behavior is controlled by a `support_padded_mxfp8_quant` flag, ensuring that it only applies to the `SM100_FI_MXFP4_MXFP8_TRTLLM` backend on Blackwell GPUs, maintaining compatibility with other configurations.
- Testing: New test cases have been added to `test_fusions_e2e.py` to validate the fusions and performance improvements for GPT-OSS models on Blackwell.
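For the refactoring point, here is a sketch of what moving the hidden-size roundup into the quantization method's `create_weights()` might look like. The block size of 128, the class name, and the returned dict are assumptions for illustration, not vLLM's actual code.

```python
def round_up(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple (typical shape of MoE padding helpers)."""
    return (x + multiple - 1) // multiple * multiple


class Mxfp4MoEMethodSketch:
    # Assumed alignment required by the MXFP4 weight layout; illustrative only.
    HIDDEN_BLOCK = 128

    def create_weights(self, hidden_size: int) -> dict:
        padded = round_up(hidden_size, self.HIDDEN_BLOCK)
        # Weights (and the kernel's expected activation width) are sized to the
        # padded value here, so a generic maybe_roundup_hidden_size() in
        # fused_moe/layer.py would duplicate this logic.
        return {"hidden_size": hidden_size, "padded_hidden_size": padded}


print(Mxfp4MoEMethodSketch().create_weights(2880))
# -> {'hidden_size': 2880, 'padded_hidden_size': 2944}
```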
The changes are well-implemented and align with the stated goals of improving performance. The code is clean and the new logic is properly encapsulated. The performance benchmarks in the PR description show a significant 6% end-to-end improvement, which is a great result.
I have reviewed the code and found no critical or high-severity issues. The changes are correct and contribute to better performance and code structure.
Purpose
Use the new Flashinfer capabilities to eliminate the explicit padding and slicing ops around the MXFP4/MXFP8 MoE path for GPT-OSS: padding is handled inside the Flashinfer MXFP8 quantization step, and the MoE kernel writes to an unpadded output buffer. Also, with the hidden size roundup handled in `create_weights()`, the `maybe_roundup_hidden_size()` in `vllm/model_executor/layers/fused_moe/layer.py` seems like a dup.

Test Plan && Test Result (GPT-OSS-120b TP8)
Accuracy
PR:
main:
Kernel
PR:
main:
Perf (GPT-OSS-120b TP8 con8)
PR: 6% E2E improvement
main:
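For context on the test setup, a minimal offline run of the same model and tensor-parallel size could look like the sketch below; the model id, TP value, and prompt are assumptions, not the author's benchmark or accuracy commands.

```python
from vllm import LLM, SamplingParams

# Assumed model id and TP size matching the "GPT-OSS-120b TP8" setting above.
llm = LLM(model="openai/gpt-oss-120b", tensor_parallel_size=8)
outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=16),
)
print(outputs[0].outputs[0].text)
```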