
Conversation


@MengqingCao MengqingCao commented Sep 22, 2025

What this PR does / why we need it?

Refactor kv cache tensor initialization logic.

  1. Unify the kvcache tensor initialization logic of DeepSeek and normal models
  2. Split `initialize_kv_cache_tensors` into `_allocate_kv_cache_tensors` and `_reshape_kv_cache_tensors`, following the GPU model runner in vLLM

After this PR, all the kvcache-related info is summarized in the table below:

| Model | KVCacheSpec | page_size_bytes | Caches | num_blocks |
| --- | --- | --- | --- | --- |
| DeepSeek MLA | MLAAttentionSpec | `self.block_size * self.num_kv_heads * self.head_size * get_dtype_size(self.dtype)` | k_cache + v_cache | `(k_cache_size + v_cache_size) // kv_cache_spec.page_size_bytes` |
| DeepSeek MLA + SFA | FullAttentionSpec | `2 * self.block_size * self.num_kv_heads * self.head_size * get_dtype_size(self.dtype)` | k_cache + v_cache + dsa_k_cache (dsa_k_cache_size = k_cache_size + v_cache_size) | `(k_cache_size + v_cache_size + dsa_k_cache_size) // kv_cache_spec.page_size_bytes` |
| Qwen3-Next linear attn | MambaSpec | `sum(prod(shape) * get_dtype_size(dtype) for (shape, dtype) in zip(self.shapes, self.dtypes))` | mamba_cache | `mamba_cache_size // kv_cache_spec.page_size_bytes` |
| Others | FullAttentionSpec | `2 * self.block_size * self.num_kv_heads * self.head_size * get_dtype_size(self.dtype)` | k_cache + v_cache | `(k_cache_size + v_cache_size) // kv_cache_spec.page_size_bytes` |
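
For context, here is a minimal sketch of how the two phases could fit together after the split. It assumes a simplified `kv_cache_config` whose `kv_cache_tensors` entries carry a byte `size` and a `shared_by` layer list, and whose `kv_cache_groups` carry a `kv_cache_spec`; only the `_allocate_kv_cache_tensors` / `_reshape_kv_cache_tensors` names come from this PR, everything else is illustrative rather than the actual vllm-ascend implementation.

```python
import torch


def _allocate_kv_cache_tensors(kv_cache_config, device: str = "npu") -> dict:
    """Phase 1: allocate one flat byte buffer per KV cache tensor entry and map
    it to every layer in shared_by, so layers sharing a spec reuse one buffer.
    (device="npu" assumes torch_npu is loaded; use "cpu"/"cuda" elsewhere.)"""
    kv_cache_raw_tensors: dict[str, torch.Tensor] = {}
    for tensor_cfg in kv_cache_config.kv_cache_tensors:
        buffer = torch.zeros(tensor_cfg.size, dtype=torch.int8, device=device)
        for layer_name in tensor_cfg.shared_by:
            kv_cache_raw_tensors[layer_name] = buffer
    return kv_cache_raw_tensors


def _reshape_kv_cache_tensors(kv_cache_config, kv_cache_raw_tensors) -> dict:
    """Phase 2: view each flat buffer with the layout its spec expects; the
    full-attention layout below is illustrative, MLA/Mamba layouts differ."""
    kv_caches: dict[str, torch.Tensor] = {}
    for group in kv_cache_config.kv_cache_groups:
        spec = group.kv_cache_spec
        for layer_name in group.layer_names:
            raw = kv_cache_raw_tensors[layer_name]
            # num_blocks follows the table above: buffer bytes // page_size_bytes
            num_blocks = raw.numel() // spec.page_size_bytes
            kv_caches[layer_name] = raw.view(spec.dtype).view(
                2, num_blocks, spec.block_size, spec.num_kv_heads, spec.head_size)
    return kv_caches
```

Keeping allocation separate from per-spec reshaping is what lets MLA, SFA, Mamba, and full attention share one initialization path: only the second phase needs to know each spec's layout.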

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

CI passed with existing tests:

  1. Prefill disaggregation scenario
  2. DeepSeek + aclgraph/eager mode
  3. Qwen3-Next

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@MengqingCao MengqingCao marked this pull request as ready for review September 23, 2025 07:26
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.


yiz-liu pushed a commit that referenced this pull request Oct 29, 2025
…ype (#3760)

### What this PR does / why we need it?
Part of #3106
Fix a hybrid kvcache sharing bug for layers with the same attention type.
Change the `shared_by` logic so that layers with the same attention spec share
the same buffer instead of allocating more HBM.
After this PR, kvcache memory usage on Qwen3-Next is reduced by 50% compared
with before (`self_attn:linear_attn=1:3` in an `attn_group`), and
`gpu_memory_utilization` can be increased to `0.8` on Qwen3-Next when
running on A2 64G/card with tp4.

![image](https://github.com/user-attachments/assets/2a91fa99-fb0f-447c-9e8b-acd587890fbe)
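
For illustration only (not the actual patch), the idea behind the `shared_by` change can be sketched as follows: when building the KV cache tensor entries, layers whose attention specs compare equal are appended to the `shared_by` list of an existing entry instead of receiving a fresh allocation, so one HBM buffer backs all of them. All names and sizes below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SpecKey:
    """Hypothetical stand-in for an attention spec (type + page size)."""
    spec_type: str
    page_size_bytes: int


@dataclass
class KVCacheTensorCfg:
    size: int                                      # buffer size in bytes
    shared_by: list = field(default_factory=list)  # layer names reusing this buffer


def build_kv_cache_tensors(layer_specs: dict, num_blocks: int) -> list:
    """Layers whose specs compare equal are appended to the shared_by list of an
    existing entry, so a single HBM buffer backs all of them."""
    tensors: dict[SpecKey, KVCacheTensorCfg] = {}
    for layer_name, spec in layer_specs.items():
        if spec not in tensors:
            tensors[spec] = KVCacheTensorCfg(size=spec.page_size_bytes * num_blocks)
        tensors[spec].shared_by.append(layer_name)
    return list(tensors.values())


# Toy example mirroring self_attn:linear_attn = 1:3 within one attn_group.
full = SpecKey("full_attention", 128 * 1024)     # page sizes are made up
linear = SpecKey("linear_attention", 96 * 1024)
cfg = build_kv_cache_tensors(
    {"layers.0": full, "layers.1": linear, "layers.2": linear, "layers.3": linear},
    num_blocks=1024,
)
assert len(cfg) == 2 and len(cfg[1].shared_by) == 3  # the 3 linear layers share one buffer
```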

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Tests pass with the latest e2e test case on Qwen3-Next.

- vLLM version: v0.11.0rc3
- vLLM main:
vllm-project/vllm@c9461e0

---------

Signed-off-by: MengqingCao <[email protected]>
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

  * Unify the kvcache tensor initialization logic of DeepSeek and normal models
  * Split initialize_kv_cache_tensors into _allocate_kv_cache_tensors and _reshape_kv_cache_tensors, following the GPU model runner in vLLM
  * Fix the shared_by logic so that the same attention spec can share the same buffer instead of allocating more HBM

Signed-off-by: MengqingCao <[email protected]>
Signed-off-by: MengqingCao <[email protected]>
Signed-off-by: MengqingCao <[email protected]>
@MengqingCao MengqingCao added the ready (read for review) and ready-for-test (start test by label for PR) labels Oct 31, 2025
Signed-off-by: MengqingCao <[email protected]>
@wangxiyuan wangxiyuan merged commit 5fed166 into vllm-project:main Nov 4, 2025
24 checks passed
Pz1116 pushed a commit to Pz1116/vllm-ascend that referenced this pull request Nov 5, 2025
…vllm-project#3106)

### What this PR does / why we need it?
Refactor kv cache tensor initialization logic.
1. Unify the kvcache tensor initialization logic of DeepSeek and normal models
2. Split `initialize_kv_cache_tensors` into `_allocate_kv_cache_tensors` and `_reshape_kv_cache_tensors`, following the GPU model runner in vLLM

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.
1. Prefill disaggregation scenario
2. DeepSeek + aclgraph/eager mode
3. Qwen3-Next

- vLLM version: v0.11.0
- vLLM main:
vllm-project/vllm@83f478b

---------

Signed-off-by: MengqingCao <[email protected]>
Signed-off-by: Pz1116 <[email protected]>
luolun pushed a commit to luolun/vllm-ascend that referenced this pull request Nov 19, 2025
…ype (vllm-project#3760)

luolun pushed a commit to luolun/vllm-ascend that referenced this pull request Nov 19, 2025
…vllm-project#3106)

hwhaokun pushed a commit to hwhaokun/vllm-ascend that referenced this pull request Nov 19, 2025
…ype (vllm-project#3760)

hwhaokun pushed a commit to hwhaokun/vllm-ascend that referenced this pull request Nov 19, 2025
…vllm-project#3106)

NSDie pushed a commit to NSDie/vllm-ascend that referenced this pull request Nov 24, 2025
…ype (vllm-project#3760)

NSDie pushed a commit to NSDie/vllm-ascend that referenced this pull request Nov 24, 2025
…vllm-project#3106)

Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 10, 2025
…ype (vllm-project#3760)

Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 10, 2025
…vllm-project#3106)
