[Bugfix] Flashinfer block size for hybrid ssm models #27843

heheda12345 · 2025-10-30T22:01:51Z

Purpose

Hybrid SSM models may use a block size that differs from kv_cache_spec.block_size when kv_cache_spec.block_size isn't supported by the attention backend. However, the backend currently has no way to know which block_size was actually selected. As a temporary solution, hardcode FlashInfer's block_size to a single value so that the model runner always uses this block_size.

Test Plan

VLLM_ATTENTION_BACKEND=FLASHINFER CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct -tp 4 --enable-expert-parallel --no-enable-prefix-caching --async-scheduling

 lm_eval --model local-completions --model_args model=Qwen/Qwen3-Next-80B-A3B-Instruct,base_url=http://localhost:8000/v1/completions -t gsm8k --num_fewshot 5 --batch_size 250

Test Result

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8522|±  |0.0098|
|     |       |strict-match    |     5|exact_match|↑  |0.8089|±  |0.0108|

Signed-off-by: Chen Zhang <[email protected]>

mergify · 2025-10-30T22:02:34Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @heheda12345.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

gemini-code-assist

Code Review

This pull request introduces a temporary fix for hybrid SSM models using the FlashInfer backend by hardcoding the block size to 32. This ensures that the model runner consistently uses the correct block size, addressing an issue where a different, unsupported block size might be used. The changes are consistent across config.py and flashinfer.py, correctly enforcing the block size of 32. My main feedback is a minor but important correction to a comment in flashinfer.py to avoid future confusion.

gemini-code-assist · 2025-10-30T22:03:40Z

vllm/v1/attention/backends/flashinfer.py

        self._workspace_buffer = None
        self._prefill_wrapper = None  # Wrapper for prefill/append
        self._decode_wrapper = None  # Wrapper for decode (general shape)
+        block_size = 32  # Note(Chen): Hardcode the block_size as 16 temporarily.


The comment here states that the block size is hardcoded to 16, but the value is set to 32. This seems to be a typo in the comment. To avoid confusion, the comment should be updated to reflect the actual value.

Suggested change

-        block_size = 32  # Note(Chen): Hardcode the block_size as 16 temporarily.
+        block_size = 32  # Note(Chen): Hardcode the block_size as 32 temporarily.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

chatgpt-codex-connector · 2025-10-30T22:06:46Z

vllm/v1/attention/backends/flashinfer.py

    @staticmethod
    def get_supported_kernel_block_size() -> list[int | MultipleOf]:
-        # Note: Not sure for all platforms,
-        # but on Blackwell, only support a page size of
-        # 16, 32, 64
-        return [16, 32, 64]
+        # Note(Chen): FlashInfer backend supports other block_sizes. But as
+        # the backend doesn't know the block_size selected, we hardcode it as only
+        # supports 32 for now.
+        return [32]


Restricting FlashInfer to 32-token blocks breaks default configs

The backend now reports only 32 as a supported block size, but CUDA platforms still initialize cache_config.block_size to 16 by default. When a user runs any non-hybrid model with VLLM_ATTENTION_BACKEND=FLASHINFER, _find_compatible_block_sizes in the GPU model runner queries the backend and fails because 16 is not divisible by 32, raising `ValueError("No compatible block size for 16")` before the model starts. This regression removes support for the common 16-token block size that previously worked. Either the backend needs to continue advertising 16 (and 64), or the default cache block size must be bumped to 32 when FlashInfer is selected.
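The divisibility rule described in this comment can be illustrated with a minimal sketch. The helper name `pick_kernel_block_size` is hypothetical and is not vLLM's `_find_compatible_block_sizes`; it assumes plain integer sizes only (no `MultipleOf` entries) and assumes the largest dividing size is preferred.

```python
def pick_kernel_block_size(cache_block_size: int, supported: list[int]) -> int:
    """Hypothetical helper: return the largest supported kernel block size
    that evenly divides the cache block size."""
    candidates = [size for size in supported if cache_block_size % size == 0]
    if not candidates:
        raise ValueError(f"No compatible block size for {cache_block_size}")
    return max(candidates)


# Before the change: the backend advertises 16, 32 and 64.
print(pick_kernel_block_size(16, [16, 32, 64]))

# After the change: only 32 is advertised, so a 16-token cache block fails.
try:
    pick_kernel_block_size(16, [32])
except ValueError as exc:
    print(exc)
```

Under these assumptions, a cache block size of 64 would still work with a backend that advertises only 32, because 64 is divisible by 32.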

vadiklyutiy · 2025-10-30T22:36:47Z

What GPU was the test run on?

vadiklyutiy · 2025-10-30T22:54:21Z

The changes in vllm/model_executor/models/config.py correspond to #27704, except that this version aligns by 32, whereas #27704 aligns to the next power of two. I don’t have a strong preference either way.
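The difference between the two alignment strategies mentioned here can be sketched as follows. Both function names and the sample sizes are hypothetical illustrations, not code or values taken from either PR.

```python
def align_to_multiple(size: int, multiple: int = 32) -> int:
    """Round size up to the nearest multiple (hypothetical helper)."""
    return -(-size // multiple) * multiple


def next_power_of_two(size: int) -> int:
    """Round size up to the next power of two (hypothetical helper, size >= 1)."""
    return 1 << (size - 1).bit_length()


# Illustrative sizes only; they are not taken from any real model config.
for size in (33, 130, 544):
    print(size, align_to_multiple(size), next_power_of_two(size))
```

Aligning to a multiple of 32 never wastes more than 31 tokens per block, while rounding up to a power of two can nearly double the block size.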

vadiklyutiy · 2025-10-30T23:07:50Z

vllm/v1/attention/backends/flashinfer.py

        self._workspace_buffer = None
        self._prefill_wrapper = None  # Wrapper for prefill/append
        self._decode_wrapper = None  # Wrapper for decode (general shape)
+        block_size = 32  # Note(Chen): Hardcode the block_size as 32 temporarily.


I believe #27547 proposes a more universal approach: it reads the actual supported sizes instead of hard-coding them here. Otherwise, if someone changes FlashInferBackend.get_supported_kernel_block_size, they would have no way to know this file needs to be updated as well.
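The maintenance concern raised here can be illustrated with a minimal sketch in which the consumer derives its block size from the backend instead of repeating the literal. The class names below are hypothetical stand-ins; this does not reproduce the actual change in #27547.

```python
class SketchFlashInferBackend:
    """Hypothetical stand-in for the backend class; not vLLM code."""

    @staticmethod
    def get_supported_kernel_block_size() -> list[int]:
        return [32]


class SketchMetadataBuilder:
    """Hypothetical consumer that reads its block size from the backend."""

    def __init__(self, backend: type[SketchFlashInferBackend]) -> None:
        supported = backend.get_supported_kernel_block_size()
        # With a single advertised size, there is exactly one possible choice.
        if len(supported) != 1:
            raise ValueError(f"expected exactly one block size, got {supported}")
        self.block_size = supported[0]


builder = SketchMetadataBuilder(SketchFlashInferBackend)
print(builder.block_size)
```

With this arrangement, changing the list returned by the backend either propagates automatically or fails loudly, rather than silently disagreeing with a second hardcoded value.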

vadiklyutiy · 2025-10-30T23:08:19Z

vllm/v1/attention/backends/flashinfer.py

+        # Note(Chen): FlashInfer backend supports other block_sizes. But as
+        # the backend doesn't know the block_size selected, we hardcode it as only
+        # supports 32 for now.
+        return [32]


Do you have any evidence that something is wrong with 64?

heheda12345 · 2025-10-30

No problem with 64. But we need to allow only one block_size here. Happy to change it to 64 if that is better.

heheda12345 · 2025-10-30T23:38:57Z

It's H100.

heheda12345 · 2025-10-31T00:35:10Z

Let's iterate on #27547 instead.

heheda12345 added 2 commits October 30, 2025 14:31

fix flashinfer block size

5b3d3b0

Signed-off-by: Chen Zhang <[email protected]>

hardcode to 32

0cb1549

Signed-off-by: Chen Zhang <[email protected]>

heheda12345 requested review from mgoin and pavanimajety as code owners October 30, 2025 22:01

mergify bot added the v1 label Oct 30, 2025

mergify bot added the needs-rebase label Oct 30, 2025

gemini-code-assist bot reviewed Oct 30, 2025

heheda12345 requested a review from tdoublep October 30, 2025 22:03

Merge branch 'main' of github.com:vllm-project/vllm into fix_flashinf…

d232b81

…er_block_size Signed-off-by: Chen Zhang <[email protected]>

chatgpt-codex-connector bot reviewed Oct 30, 2025

fix typo

26568aa

Signed-off-by: Chen Zhang <[email protected]>

mergify bot removed the needs-rebase label Oct 30, 2025

heheda12345 mentioned this pull request Oct 30, 2025

[BUGFIX] Adjust kv block sizes #27704

Open

vadiklyutiy reviewed Oct 30, 2025

less hardcode

35e8fc3

Signed-off-by: Chen Zhang <[email protected]>

heheda12345 added the ready label Oct 30, 2025

heheda12345 marked this pull request as draft October 31, 2025 00:34
