
Conversation


@MengqingCao MengqingCao commented Sep 22, 2025

What this PR does / why we need it?

Refactor kv cache tensor initialization logic.

  1. Unify the kvcache tensor initialization logic of DeepSeek and normal models
  2. Split `initialize_kv_cache_tensors` into `_allocate_kv_cache_tensors` and `_reshape_kv_cache_tensors`, following the GPU model runner in vLLM

After this PR, all the kvcache-related info is summarized in the table below:

| Model | KVCacheSpec | page_size_bytes | Caches | num_blocks |
| --- | --- | --- | --- | --- |
| DeepSeek MLA | MLAAttentionSpec | `self.block_size * self.num_kv_heads * self.head_size * get_dtype_size(self.dtype)` | k_cache + v_cache | `(k_cache_size + v_cache_size) // kv_cache_spec.page_size_bytes` |
| DeepSeek MLA + SFA | FullAttentionSpec | `2 * self.block_size * self.num_kv_heads * self.head_size * get_dtype_size(self.dtype)` | k_cache + v_cache + dsa_k_cache (dsa_k_cache_size = k_cache_size + v_cache_size) | `(k_cache_size + v_cache_size + dsa_k_cache_size) // kv_cache_spec.page_size_bytes` |
| Qwen3-Next linear attn | MambaSpec | `sum(prod(shape) * get_dtype_size(dtype) for (shape, dtype) in zip(self.shapes, self.dtypes))` | mamba_cache | `mamba_cache_size // kv_cache_spec.page_size_bytes` |
| Others | FullAttentionSpec | `2 * self.block_size * self.num_kv_heads * self.head_size * get_dtype_size(self.dtype)` | k_cache + v_cache | `(k_cache_size + v_cache_size) // kv_cache_spec.page_size_bytes` |
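
For context, here is a minimal sketch of how the two phases could fit together after the split. It assumes a simplified `kv_cache_config` whose `kv_cache_tensors` entries carry a byte `size` and a `shared_by` layer list, and whose `kv_cache_groups` carry a `kv_cache_spec`; only the `_allocate_kv_cache_tensors` / `_reshape_kv_cache_tensors` names come from this PR, everything else is illustrative rather than the actual vllm-ascend implementation.

```python
import torch


def _allocate_kv_cache_tensors(kv_cache_config, device: str = "npu") -> dict:
    """Phase 1: allocate one flat byte buffer per KV cache tensor entry and map
    it to every layer in shared_by, so layers sharing a spec reuse one buffer.
    (device="npu" assumes torch_npu is loaded; use "cpu"/"cuda" elsewhere.)"""
    kv_cache_raw_tensors: dict[str, torch.Tensor] = {}
    for tensor_cfg in kv_cache_config.kv_cache_tensors:
        buffer = torch.zeros(tensor_cfg.size, dtype=torch.int8, device=device)
        for layer_name in tensor_cfg.shared_by:
            kv_cache_raw_tensors[layer_name] = buffer
    return kv_cache_raw_tensors


def _reshape_kv_cache_tensors(kv_cache_config, kv_cache_raw_tensors) -> dict:
    """Phase 2: view each flat buffer with the layout its spec expects; the
    full-attention layout below is illustrative, MLA/Mamba layouts differ."""
    kv_caches: dict[str, torch.Tensor] = {}
    for group in kv_cache_config.kv_cache_groups:
        spec = group.kv_cache_spec
        for layer_name in group.layer_names:
            raw = kv_cache_raw_tensors[layer_name]
            # num_blocks follows the table above: buffer bytes // page_size_bytes
            num_blocks = raw.numel() // spec.page_size_bytes
            kv_caches[layer_name] = raw.view(spec.dtype).view(
                2, num_blocks, spec.block_size, spec.num_kv_heads, spec.head_size)
    return kv_caches
```

Keeping allocation separate from per-spec reshaping is what lets MLA, SFA, Mamba, and full attention share one initialization path: only the second phase needs to know each spec's layout.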

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

CI passed with existing tests:

  1. Prefill disaggregation scenario
  2. DeepSeek + aclgraph/eager mode
  3. Qwen3-Next

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@MengqingCao MengqingCao marked this pull request as ready for review September 23, 2025 07:26
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.


yiz-liu pushed a commit that referenced this pull request Oct 29, 2025
…ype (#3760)

### What this PR does / why we need it?
Part of #3106
Fix a hybrid kvcache sharing bug for layers with the same attention type.
Change the `shared_by` logic so that layers with the same attention spec share
the same buffer instead of allocating more HBM.
After this PR, kvcache memory usage on Qwen3-Next is reduced by 50% compared
with before (`self_attn:linear_attn=1:3` in an `attn_group`), and
`gpu_memory_utilization` can be increased to `0.8` on Qwen3-Next when
running on A2 64G/card with tp4.

![image](https://github.com/user-attachments/assets/2a91fa99-fb0f-447c-9e8b-acd587890fbe)
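
For illustration only (not the actual patch), the idea behind the `shared_by` change can be sketched as follows: when building the KV cache tensor entries, layers whose attention specs compare equal are appended to the `shared_by` list of an existing entry instead of receiving a fresh allocation, so one HBM buffer backs all of them. All names and sizes below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SpecKey:
    """Hypothetical stand-in for an attention spec (type + page size)."""
    spec_type: str
    page_size_bytes: int


@dataclass
class KVCacheTensorCfg:
    size: int                                      # buffer size in bytes
    shared_by: list = field(default_factory=list)  # layer names reusing this buffer


def build_kv_cache_tensors(layer_specs: dict, num_blocks: int) -> list:
    """Layers whose specs compare equal are appended to the shared_by list of an
    existing entry, so a single HBM buffer backs all of them."""
    tensors: dict[SpecKey, KVCacheTensorCfg] = {}
    for layer_name, spec in layer_specs.items():
        if spec not in tensors:
            tensors[spec] = KVCacheTensorCfg(size=spec.page_size_bytes * num_blocks)
        tensors[spec].shared_by.append(layer_name)
    return list(tensors.values())


# Toy example mirroring self_attn:linear_attn = 1:3 within one attn_group.
full = SpecKey("full_attention", 128 * 1024)     # page sizes are made up
linear = SpecKey("linear_attention", 96 * 1024)
cfg = build_kv_cache_tensors(
    {"layers.0": full, "layers.1": linear, "layers.2": linear, "layers.3": linear},
    num_blocks=1024,
)
assert len(cfg) == 2 and len(cfg[1].shared_by) == 3  # the 3 linear layers share one buffer
```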

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Tests pass with the latest e2e test case on Qwen3-Next.

- vLLM version: v0.11.0rc3
- vLLM main:
vllm-project/vllm@c9461e0

---------

Signed-off-by: MengqingCao <[email protected]>
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

  * Unify the kvcache tensor initialization logic of DeepSeek and normal models
  * Split initialize_kv_cache_tensors into _allocate_kv_cache_tensors and _reshape_kv_cache_tensors, following the GPU model runner in vLLM
  * Fix the shared_by logic so that the same attention spec can share the same buffer instead of allocating more HBM

Signed-off-by: MengqingCao <[email protected]>
Signed-off-by: MengqingCao <[email protected]>
Signed-off-by: MengqingCao <[email protected]>
@MengqingCao MengqingCao added the ready (read for review) and ready-for-test (start test by label for PR) labels Oct 31, 2025
Signed-off-by: MengqingCao <[email protected]>
@wangxiyuan wangxiyuan merged commit 5fed166 into vllm-project:main Nov 4, 2025
24 checks passed
Pz1116 pushed a commit to Pz1116/vllm-ascend that referenced this pull request Nov 5, 2025
…vllm-project#3106)

### What this PR does / why we need it?
Refactor kv cache tensor initialization logic.
1. Unify the kvcache tensor initialization logic of DeepSeek and normal models
2. Split `initialize_kv_cache_tensors` into `_allocate_kv_cache_tensors` and `_reshape_kv_cache_tensors`, following the GPU model runner in vLLM

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.
1. Prefill disaggregation scenario
2. DeepSeek + aclgraph/eager mode
3. Qwen3-Next

- vLLM version: v0.11.0
- vLLM main:
vllm-project/vllm@83f478b

---------

Signed-off-by: MengqingCao <[email protected]>
Signed-off-by: Pz1116 <[email protected]>
luolun pushed a commit to luolun/vllm-ascend that referenced this pull request Nov 19, 2025
…ype (vllm-project#3760)

luolun pushed a commit to luolun/vllm-ascend that referenced this pull request Nov 19, 2025
…vllm-project#3106)

hwhaokun pushed a commit to hwhaokun/vllm-ascend that referenced this pull request Nov 19, 2025
…ype (vllm-project#3760)

hwhaokun pushed a commit to hwhaokun/vllm-ascend that referenced this pull request Nov 19, 2025
…vllm-project#3106)

NSDie pushed a commit to NSDie/vllm-ascend that referenced this pull request Nov 24, 2025
…ype (vllm-project#3760)

NSDie pushed a commit to NSDie/vllm-ascend that referenced this pull request Nov 24, 2025
…vllm-project#3106)

Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 10, 2025
…ype (vllm-project#3760)

Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 10, 2025
…vllm-project#3106)
