Proposal to improve performance
This issue tracks the ongoing and pending performance optimizations for GPT-OSS on B200/GB200.
Max-Throughput (e.g. TP1 conc1024)
- Enable FlashInfer autotuning. Done in [Kernel] Add nvfp4 gemm flashinfer backends #22346
- Enable FlashInfer FP8-QKV attention with sink. Done in [Flashinfer][gpt-oss] Support FP8-qkv Flashinfer TRTLLM Sinks Attention #25674
- Support stream_interval to reduce host overhead at high concurrency. Done in [Perf] Support stream interval for reducing host overhead #27869
Min-Latency (e.g. TP8 conc8)
- Avoid additional Slice before AR+Norm fused kernel. Done in [Bugfix] Defunctionalize TRTLLM AR+Norm op for avoiding extra clone kernel before it #29631
- Fuse Pad with MXFP8-Quantize (~1% perf gain, assigned to @elvischenv)
- Fuse MoE Finalize with Slice (~1% perf gain, assigned to @elvischenv)
- FlashInfer MXFP4 MoE supports fusion with Slice. See feat: Support unpadded output hidden size for trtllm_fp4_block_scale_moe flashinfer-ai/flashinfer#2217
- See: [Perf] Eliminate padding and slicing op for GPT-OSS with Flashinfer MXFP4 MXFP8 MoE #30647
- RoPE+Q+CacheUpdate fusion (~2% perf gain, will be tracked in [Performance]: ROPE + KV-Cache-Write + pre-attn prepare-ops fusion #24678)
- We can use FlashInfer's rope_quantize_fp8_append_paged_kv_cache(); a minimal unfused sketch of the three ops it collapses is shown after this list.
- Use a special BF16 gemm for the router gemm (1-2% perf gain, not assigned)
- The current BF16 gemm breaks PDL, and in principle this gemm needs only a few CUDA blocks, so it could run in parallel with MXFP8 quantization (see the stream-overlap sketch after this list).
- Use a special BF16 gemm for fc_qkv/fc_o_proj (1-2% perf gain, not assigned)
- The current BF16 gemms break PDL and are not efficient at small concurrency.
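For orientation, here is a minimal, unfused PyTorch sketch of the three ops that the RoPE+Q+CacheUpdate item above would collapse into a single launch (ultimately via FlashInfer's rope_quantize_fp8_append_paged_kv_cache()). The tensor shapes, cache layout, and argument names are illustrative assumptions, not vLLM's actual implementation.

```python
import torch

def rope_quant_append_unfused(q, k, v, cos, sin, kv_cache, slot_mapping, kv_scale):
    """Unfused reference: three kernel launches that the fused kernel would merge.

    Assumed shapes (illustrative only):
      q            : [num_tokens, num_q_heads, head_dim] (bf16)
      k, v         : [num_tokens, num_kv_heads, head_dim] (bf16)
      cos, sin     : [num_tokens, head_dim // 2]
      kv_cache     : [num_pages, 2, page_size, num_kv_heads, head_dim] (fp8)
      slot_mapping : [num_tokens], flat slot index per token
    """
    # 1) Rotary embedding on q and k (one kernel today).
    def apply_rope(x):
        x1, x2 = x[..., 0::2], x[..., 1::2]
        c, s = cos.unsqueeze(1), sin.unsqueeze(1)  # broadcast over heads
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * c - x2 * s
        out[..., 1::2] = x1 * s + x2 * c
        return out

    q, k = apply_rope(q), apply_rope(k)

    # 2) FP8 (e4m3) quantization of k/v (a second kernel).
    k_fp8 = (k / kv_scale).clamp(-448, 448).to(torch.float8_e4m3fn)
    v_fp8 = (v / kv_scale).clamp(-448, 448).to(torch.float8_e4m3fn)

    # 3) Scatter the new tokens into the paged KV cache (a third kernel).
    #    index_put on fp8 dtypes is spotty across torch versions, so write
    #    through a same-size uint8 view of the cache instead.
    page_size = kv_cache.shape[2]
    page_idx, page_off = slot_mapping // page_size, slot_mapping % page_size
    kv_u8 = kv_cache.view(torch.uint8)
    kv_u8[page_idx, 0, page_off] = k_fp8.view(torch.uint8)
    kv_u8[page_idx, 1, page_off] = v_fp8.view(torch.uint8)
    return q
```

The PDL note above cannot be reproduced from Python (programmatic dependent launch is a kernel-launch feature), but the overlap argument itself can be illustrated with CUDA streams: the router gemm is small enough that it does not need many CUDA blocks, so it need not serialize behind activation quantization. Shapes below are illustrative assumptions.

```python
import torch

hidden = torch.randn(8, 2880, device="cuda", dtype=torch.bfloat16)      # e.g. 8 decode tokens
w_router = torch.randn(2880, 128, device="cuda", dtype=torch.bfloat16)  # hidden -> num_experts

side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side):
    router_logits = hidden @ w_router  # tiny gemm: only a handful of CUDA blocks

# Stand-in for MXFP8 quantization of the same activations on the main stream.
hidden_fp8 = (hidden / hidden.abs().amax().clamp(min=1e-6)).to(torch.float8_e4m3fn)

torch.cuda.current_stream().wait_stream(side)  # rejoin before the MoE uses router_logits
```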
Mid-Concurrency (e.g. DEP2 conc128)
- Suboptimal trtllm MoE tactic selection (>20% perf gain, assigned to @nvjullin)
Spec Decode
- Support nvidia/gpt-oss-120b-Eagle3-short-context EAGLE model. Done in [Spec Decode] Add support for EAGLE3 heads that do not use_aux_hidden_states #27688
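For reference, a minimal offline-inference sketch of wiring up that EAGLE3 head with vLLM's dict-based speculative config; the draft-model id comes from the item above, but the remaining values (tensor-parallel size, number of speculative tokens) are assumptions and should be taken from the updated recipe referenced below.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",
    tensor_parallel_size=8,           # assumption: min-latency style TP8 deployment
    speculative_config={
        "model": "nvidia/gpt-oss-120b-Eagle3-short-context",
        "method": "eagle3",
        "num_speculative_tokens": 3,  # assumption: tune per the recipe
    },
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```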
Recipe update
- Update the cookbook with the recommended commands. See: https://github.com/vllm-project/recipes/blob/main/OpenAI/GPT-OSS.md