[BugFix][Performance] Restore flashinfer autotuning for all scenarios #27904
Conversation
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Code Review
This pull request effectively resolves a crash that occurred when running MoE models in eager mode by ensuring tune_max_num_tokens is at least 1. The fix is correctly applied across multiple flashinfer kernel invocation sites in trtllm_moe.py and mxfp4.py. Additionally, the removal of the now-redundant flashinfer_autotune_supported function and its associated logic simplifies the codebase and re-enables autotuning for all scenarios, which is a great improvement. The test suite has been updated appropriately to validate the fix. The changes are well-targeted and correct.
| "do_finalize": True, | ||
| "output": output, | ||
| "tune_max_num_tokens": self.max_capture_size, | ||
| "tune_max_num_tokens": max(self.max_capture_size, 1), |
Why were we setting this to `self.max_capture_size`? Shouldn't we set this to `max_num_batched_tokens` at least?
Just curious, cc @pavanimajety @nvpohanh
Ohh, I see, very interesting. Yes, I have the same question of why not use the max batch size, since we will want to autotune not only for CUDA graphs but for prefill as well.
@nvjullin Could you review this PR and comment on this? Thanks!
It comes from PR #23608. After a quick look at flashinfer, I believe this parameter is needed because autotuning on a dummy input won't reach the maximum number of tokens at each EP rank. I agree that `max_num_batched_tokens` makes more sense.
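For context, a hedged sketch of the alternative being discussed: bounding autotuning by the scheduler's token budget instead of the CUDA graph capture size. The `tune_bound` helper and the `scheduler_config.max_num_batched_tokens` attribute path shown here are assumptions for illustration, not code from this PR:

```python
# Hypothetical sketch of the alternative discussed above: bound autotuning by
# the scheduler's batched-token limit (so prefill-sized batches are covered),
# still clamped to at least 1 so eager mode never passes 0 to the autotuner.
def tune_bound(max_num_batched_tokens: int) -> int:
    return max(max_num_batched_tokens, 1)


# Possible usage (attribute path is an assumption for illustration):
# kwargs["tune_max_num_tokens"] = tune_bound(
#     vllm_config.scheduler_config.max_num_batched_tokens)
```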
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
```python
# Enable autotune when,
# https://github.com/flashinfer-ai/flashinfer/issues/2023 is
# resolved.
trtllm_fp4_block_scale_routed_moe(**kwargs)
```
LGTM, thanks for the work!
```python
from vllm.utils.flashinfer import autotune


with autotune(False):
    # Enable autotune when,
```
Suggested change:
```diff
-    # Enable autotune when,
+    # TODO: Enable autotune when,
```
Purpose
Bug:
- On `main` + B200: `vllm serve openai/gpt-oss-20b --enforce-eager` fails.
- On `main` + H100: `VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 vllm serve openai/gpt-oss-20b --enforce-eager` fails.

Both failures are asserts in the flashinfer code base.
Note that this is the same error reported in #27751
Fix:
Our calls to the flashinfer MoE kernels set `tune_max_num_tokens` to the CUDA graph capture size. When CUDA graphs are disabled, `max_capture_size` is 0 and the autotuner asserts. This PR sets `tune_max_num_tokens` to 1 when CUDA graphs are disabled (i.e. eager mode), as in the sketch below.
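A minimal sketch of the guard, assuming a hypothetical `moe_kernel_kwargs` helper; only `tune_max_num_tokens` is shown and the other kernel arguments are elided:

```python
def moe_kernel_kwargs(max_capture_size: int, output) -> dict:
    """Hypothetical helper: build flashinfer MoE kwargs with a guarded bound.

    With --enforce-eager, CUDA graphs are off and max_capture_size is 0,
    which previously tripped an assert inside flashinfer's autotuner.
    """
    return {
        # ... other kernel arguments elided ...
        "output": output,
        "tune_max_num_tokens": max(max_capture_size, 1),
    }
```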
Note:
Initially, this issue was thought to manifest only in specific scenarios, and we resorted to skipping autotuning for those cases in PRs #27762 and #26729. This PR reverts the skip logic introduced in those PRs.
Fixes #27751
Test Plan
- Manually run `vllm serve openai/gpt-oss-20b --enforce-eager` on B200.
- CI
Test Result
Tests Pass