Proposal to improve performance
This issue tracks the ongoing and pending performance optimizations for GPT-OSS on B200/GB200.
Max-Throughput (e.g. TP1 conc1024)
- Enable FlashInfer autotuning. Done in [Kernel] Add nvfp4 gemm flashinfer backends #22346
- Enable FlashInfer FP8-QKV attention with sink. Done in [Flashinfer][gpt-oss] Support FP8-qkv Flashinfer TRTLLM Sinks Attention #25674
- Support stream_interval to reduce host overhead at high concurrency. Done in [Perf] Support stream interval for reducing host overhead #27869
Min-Latency (e.g. TP8 conc8)
- Avoid additional Slice before AR+Norm fused kernel. Done in [Bugfix] Defunctionalize TRTLLM AR+Norm op for avoiding extra clone kernel before it #29631
- Fuse Pad with MXFP8-Quantize (~1% perf gain, assigned to @elvischenv)
- Fuse MoE Finalize with Slice (~1% perf gain, assigned to @elvischenv)
- FlashInfer MXFP4 MoE supports fusion with Slice. See feat: Support unpadded output hidden size for trtllm_fp4_block_scale_moe flashinfer-ai/flashinfer#2217
- See: [Perf] Eliminate padding and slicing op for GPT-OSS with Flashinfer MXFP4 MXFP8 MoE #30647
- RoPE+Q+CacheUpdate fusion (~2% perf gain, will be tracked in [Performance]: ROPE + KV-Cache-Write + pre-attn prepare-ops fusion #24678)
- We can use FlashInfer's rope_quantize_fp8_append_paged_kv_cache(); a minimal unfused sketch of the three ops it collapses is shown after this list.
- Use a special BF16 gemm for the router gemm (1-2% perf gain, not assigned)
- The current BF16 gemm breaks PDL, and in principle this gemm needs only a few CUDA blocks, so it could run in parallel with MXFP8 quantization (see the stream-overlap sketch after this list).
- Use a special BF16 gemm for fc_qkv/fc_o_proj (1-2% perf gain, not assigned)
- The current BF16 gemms break PDL and are not efficient at small concurrency.
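For orientation, here is a minimal, unfused PyTorch sketch of the three ops that the RoPE+Q+CacheUpdate item above would collapse into a single launch (ultimately via FlashInfer's rope_quantize_fp8_append_paged_kv_cache()). The tensor shapes, cache layout, and argument names are illustrative assumptions, not vLLM's actual implementation.

```python
import torch

def rope_quant_append_unfused(q, k, v, cos, sin, kv_cache, slot_mapping, kv_scale):
    """Unfused reference: three kernel launches that the fused kernel would merge.

    Assumed shapes (illustrative only):
      q            : [num_tokens, num_q_heads, head_dim] (bf16)
      k, v         : [num_tokens, num_kv_heads, head_dim] (bf16)
      cos, sin     : [num_tokens, head_dim // 2]
      kv_cache     : [num_pages, 2, page_size, num_kv_heads, head_dim] (fp8)
      slot_mapping : [num_tokens], flat slot index per token
    """
    # 1) Rotary embedding on q and k (one kernel today).
    def apply_rope(x):
        x1, x2 = x[..., 0::2], x[..., 1::2]
        c, s = cos.unsqueeze(1), sin.unsqueeze(1)  # broadcast over heads
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * c - x2 * s
        out[..., 1::2] = x1 * s + x2 * c
        return out

    q, k = apply_rope(q), apply_rope(k)

    # 2) FP8 (e4m3) quantization of k/v (a second kernel).
    k_fp8 = (k / kv_scale).clamp(-448, 448).to(torch.float8_e4m3fn)
    v_fp8 = (v / kv_scale).clamp(-448, 448).to(torch.float8_e4m3fn)

    # 3) Scatter the new tokens into the paged KV cache (a third kernel).
    #    index_put on fp8 dtypes is spotty across torch versions, so write
    #    through a same-size uint8 view of the cache instead.
    page_size = kv_cache.shape[2]
    page_idx, page_off = slot_mapping // page_size, slot_mapping % page_size
    kv_u8 = kv_cache.view(torch.uint8)
    kv_u8[page_idx, 0, page_off] = k_fp8.view(torch.uint8)
    kv_u8[page_idx, 1, page_off] = v_fp8.view(torch.uint8)
    return q
```

The PDL note above cannot be reproduced from Python (programmatic dependent launch is a kernel-launch feature), but the overlap argument itself can be illustrated with CUDA streams: the router gemm is small enough that it does not need many CUDA blocks, so it need not serialize behind activation quantization. Shapes below are illustrative assumptions.

```python
import torch

hidden = torch.randn(8, 2880, device="cuda", dtype=torch.bfloat16)      # e.g. 8 decode tokens
w_router = torch.randn(2880, 128, device="cuda", dtype=torch.bfloat16)  # hidden -> num_experts

side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side):
    router_logits = hidden @ w_router  # tiny gemm: only a handful of CUDA blocks

# Stand-in for MXFP8 quantization of the same activations on the main stream.
hidden_fp8 = (hidden / hidden.abs().amax().clamp(min=1e-6)).to(torch.float8_e4m3fn)

torch.cuda.current_stream().wait_stream(side)  # rejoin before the MoE uses router_logits
```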
Mid-Concurrency (e.g. DEP2 conc128)
- Suboptimal trtllm MoE tactic selection (>20% perf gain, assigned to @nvjullin)
Spec Decode
- Support nvidia/gpt-oss-120b-Eagle3-short-context EAGLE model. Done in [Spec Decode] Add support for EAGLE3 heads that do not use_aux_hidden_states #27688
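For reference, a minimal offline-inference sketch of wiring up that EAGLE3 head with vLLM's dict-based speculative config; the draft-model id comes from the item above, but the remaining values (tensor-parallel size, number of speculative tokens) are assumptions and should be taken from the updated recipe referenced below.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",
    tensor_parallel_size=8,           # assumption: min-latency style TP8 deployment
    speculative_config={
        "model": "nvidia/gpt-oss-120b-Eagle3-short-context",
        "method": "eagle3",
        "num_speculative_tokens": 3,  # assumption: tune per the recipe
    },
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```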
Recipe update
- Update the cookbook with the recommended commands. See: https://github.com/vllm-project/recipes/blob/main/OpenAI/GPT-OSS.md