
Conversation


@varun-sundar-rabindranath varun-sundar-rabindranath commented Nov 1, 2025

Purpose

Bug:
On main + B200: vllm serve openai/gpt-oss-20b --enforce-eager fails.
On main + H100: VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 vllm serve openai/gpt-oss-20b --enforce-eager fails.

Both failures are assertion errors in the flashinfer code base:

(EngineCore_DP0 pid=3490083)   File "/home/varun/code/vllm/vllm/model_executor/layers/quantization/mxfp4.py", line 1109, in apply
(EngineCore_DP0 pid=3490083)     _ = flashinfer_cutlass_fused_moe(
(EngineCore_DP0 pid=3490083)   File "/home/varun/code/vllm/vllm/utils/flashinfer.py", line 84, in wrapper
(EngineCore_DP0 pid=3490083)     return impl(*args, **kwargs)
(EngineCore_DP0 pid=3490083)   File "/home/varun/code/vllm/vllm-test/lib/python3.10/site-packages/flashinfer/fused_moe/core.py", line 790, in cutlass_fused_moe
(EngineCore_DP0 pid=3490083)     return get_cutlass_fused_moe_module(device_arch).cutlass_fused_moe(
(EngineCore_DP0 pid=3490083)   File "/home/varun/code/vllm/vllm-test/lib/python3.10/site-packages/flashinfer/fused_moe/core.py", line 460, in cutlass_fused_moe
(EngineCore_DP0 pid=3490083)     _, gemm_tactic_1 = tuner.choose_one(
(EngineCore_DP0 pid=3490083)   File "/home/varun/code/vllm/vllm-test/lib/python3.10/site-packages/flashinfer/autotuner.py", line 457, in choose_one
(EngineCore_DP0 pid=3490083)     profiles = self._generate_optimization_profiles(tuning_config, inputs)
(EngineCore_DP0 pid=3490083)   File "/home/varun/code/vllm/vllm-test/lib/python3.10/site-packages/flashinfer/autotuner.py", line 643, in _generate_optimization_profiles
(EngineCore_DP0 pid=3490083)     assert len(opt_shapes) > 0, "Empty tuning buckets are not allowed"
(EngineCore_DP0 pid=3490083) AssertionError: Empty tuning buckets are not allowed

Note that this is the same error reported in #27751

Fix:
Our calls to the flashinfer MoE kernels set tune_max_num_tokens to the CUDAGraph capture size. When CUDAGraphs are disabled, max_capture_size is 0 and the autotuner asserts. This PR sets tune_max_num_tokens to 1 when CUDAGraphs are disabled (i.e., eager mode); see the sketch below.
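
To make the fix concrete, here is a minimal, self-contained sketch of the clamp applied at each flashinfer MoE call site; the helper name is illustrative only and is not code from this PR:

# Sketch only: the PR inlines this as max(self.max_capture_size, 1) at each
# call site; the helper below just names the idea.
def tune_max_num_tokens_for(max_capture_size: int) -> int:
    # With --enforce-eager the CUDAGraph capture size is 0, and the flashinfer
    # autotuner asserts on an empty set of tuning buckets, so clamp the bound
    # to at least one token.
    return max(max_capture_size, 1)

assert tune_max_num_tokens_for(0) == 1      # CUDAGraphs disabled (eager mode)
assert tune_max_num_tokens_for(512) == 512  # CUDAGraphs enabled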

Note:
Initially, this issue was thought to manifest only in specific scenarios, and we resorted to skipping autotuning for those cases in PRs #27762 and #26729. This PR reverts the skip logic introduced in those PRs.

Fixes #27751

Test Plan

Manually run vllm serve openai/gpt-oss-20b --enforce-eager on B200.
CI

Test Result

Tests Pass

Signed-off-by: Varun Sundar Rabindranath <[email protected]>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively resolves a crash that occurred when running MoE models in eager mode by ensuring tune_max_num_tokens is at least 1. The fix is correctly applied across multiple flashinfer kernel invocation sites in trtllm_moe.py and mxfp4.py. Additionally, the removal of the now-redundant flashinfer_autotune_supported function and its associated logic simplifies the codebase and re-enables autotuning for all scenarios, which is a great improvement. The test suite has been updated appropriately to validate the fix. The changes are well-targeted and correct.

@varun-sundar-rabindranath
Contributor Author

cc @zyongye @nvpohanh @mgoin PTAL. Sorry about the confusion with intermediate fixes.

"do_finalize": True,
"output": output,
"tune_max_num_tokens": self.max_capture_size,
"tune_max_num_tokens": max(self.max_capture_size, 1),
Contributor Author

Why were we setting this to self.max_capture_size? Shouldn't we set this to max_num_batched_tokens at least?
Just curious. cc @pavanimajety @nvpohanh

Member

Ohh I see, very interesting. Yes, I have the same question about why we don't use the max batch size, since we will want to autotune not only for CUDAGraphs but for prefill as well.

Contributor

@nvjullin Could you review this PR and comment on this? Thanks!

Contributor

It comes from PR #23608. After a quick look at flashinfer, I believe this parameter is needed because autotuning on a dummy input won't exercise the maximum number of tokens at each EP rank. I agree max_num_batched_tokens makes more sense.
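
A hypothetical sketch of that alternative (not part of this PR), assuming the scheduler settings are reachable via vllm.config.get_current_vllm_config():

# Hypothetical, for discussion only: bound autotuning by the scheduler's
# max_num_batched_tokens so prefill-sized batches are covered as well,
# still clamped to at least 1 for eager mode.
from vllm.config import get_current_vllm_config

def tune_max_num_tokens_from_scheduler() -> int:
    scheduler_config = get_current_vllm_config().scheduler_config
    return max(scheduler_config.max_num_batched_tokens, 1)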

@mgoin mgoin added the ready (ONLY add when PR is ready to merge/full CI is needed) and nvidia labels on Nov 1, 2025
Varun Sundar Rabindranath added 2 commits November 1, 2025 12:22
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
# Enable autotune when,
# https://github.com/flashinfer-ai/flashinfer/issues/2023 is
# resolved.
trtllm_fp4_block_scale_routed_moe(**kwargs)
Contributor Author

cc @nvpohanh
cc @mgoin changes since you last reviewed.

Member

@yewentao256 yewentao256 left a comment

LGTM, thanks for the work!

from vllm.utils.flashinfer import autotune

with autotune(False):
    # Enable autotune when,
Member

Suggested change
- # Enable autotune when,
+ # TODO: Enable autotune when,

@mgoin mgoin merged commit 4022a9d into vllm-project:main Nov 4, 2025
55 checks passed

Labels

nvidia, ready (ONLY add when PR is ready to merge/full CI is needed)


Development

Successfully merging this pull request may close these issues.

[Bug]: Issue with Flashinfer Autotune + DP or TP + Eager-Mode

5 participants