
Conversation


@heheda12345 heheda12345 commented Aug 30, 2025

Purpose

Different PP stages can have different layers, so the ratio of full-attention to sliding-window (full:sw) layers can differ between stages. For example, #23883 has 10 full & 11 sw layers in stage 0, and 11 full & 10 sw layers in stage 1. This PR supports this case by generating the KV cache groups based on the full:sw ratio of the full model.
Fixes #23883

UPD 2025/09/10
We need to be careful about how the layers are partitioned into groups.
In the PP case, say we have:

  • stage 0: full.0, sw.0, sw.1
  • stage 1: full.1, sw.2, sw.3

We should have 3 groups: (full.0, full.1), (sw.0, sw.2), (sw.1, sw.3).
It can't be (full.0, full.1), (sw.0, sw.1), (sw.2, sw.3), because then the 3 groups on stage 0 would be (full.0), (sw.0, sw.1), and an empty group, which would be padded to (full.0, padding), (sw.0, sw.1), (padding, padding) to keep the number of layers in each group equal, wasting memory. To avoid this, we assign layers[i::num_groups] to the i-th group instead of layers[i * group_size : (i + 1) * group_size].
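
A minimal sketch of the difference, using the toy layer names from the example above (illustrative only, not the actual vLLM implementation):

```python
# Illustrative sketch only, not vLLM's implementation: compare contiguous vs.
# strided assignment of sliding-window layers to KV cache groups.

sw_layers = ["sw.0", "sw.1", "sw.2", "sw.3"]  # all sliding-window layers of the full model
num_groups = 2                                # number of groups the sw layers are split into

# Contiguous chunks: group i = layers[i * group_size : (i + 1) * group_size]
group_size = len(sw_layers) // num_groups
contiguous = [sw_layers[i * group_size:(i + 1) * group_size] for i in range(num_groups)]
print(contiguous)  # [['sw.0', 'sw.1'], ['sw.2', 'sw.3']]
# Stage 0 only hosts sw.0 and sw.1, so its local view of these groups is
# [['sw.0', 'sw.1'], []]: the empty group has to be padded, wasting KV cache memory.

# Strided assignment: group i = layers[i::num_groups]
strided = [sw_layers[i::num_groups] for i in range(num_groups)]
print(strided)  # [['sw.0', 'sw.2'], ['sw.1', 'sw.3']]
# Stage 0's local view becomes [['sw.0'], ['sw.1']]: every group has exactly one
# local layer, so no padding is needed on either stage.
```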

I've updated the comments in kv_cache_interface and will update the related figures in the design doc in a follow-up PR.

Test Plan

  1. vllm serve -pp 2 BAAI/bge-multilingual-gemma2 --enforce-eager
  2. Several unit tests for TP / PP
  3. Add a unit test for the attention-free case
  4. Re-enable the BAAI/bge-multilingual-gemma2 PP test

Test Result

  1. Can launch the server
  2-3. Pass locally
  4. Let's wait for CI

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Chen Zhang <[email protected]>
@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for pipeline parallelism where different stages can have varying layer compositions. The core change involves refactoring the KV cache configuration logic to first determine a global grouping strategy based on the entire model's layers, and then applying this to each worker. This is a solid approach that enhances flexibility for pipeline parallel setups. The associated tests are comprehensive and cover various scenarios including TP, PP, and hybrid models.

I've identified one potential issue in the new logic that could lead to a crash in an edge case involving workers with no KV cache layers. A fix is suggested below. Overall, the changes are well-structured and move the project in the right direction.
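
Roughly, the global-then-per-worker flow described above could be pictured like this (the helper name is hypothetical, not the refactored vLLM API):

```python
# Illustrative sketch only; restrict_groups_to_worker is a hypothetical helper.

def restrict_groups_to_worker(global_groups: list[list[str]],
                              worker_layers: set[str]) -> list[list[str]]:
    """Project globally computed KV cache groups onto one PP worker: every
    worker keeps the same group structure, containing only its local layers."""
    return [[layer for layer in group if layer in worker_layers]
            for group in global_groups]


# Grouping decided once from the full model (example from this PR description):
global_groups = [["full.0", "full.1"], ["sw.0", "sw.2"], ["sw.1", "sw.3"]]

# Stage 0 hosts full.0, sw.0, sw.1; stage 1 hosts the remaining layers.
print(restrict_groups_to_worker(global_groups, {"full.0", "sw.0", "sw.1"}))
# [['full.0'], ['sw.0'], ['sw.1']]  -> balanced groups on every stage, no padding
```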


@maxdebayser maxdebayser left a comment


Thanks for the fix @heheda12345. I'm not very familiar with the inner workings of this part of the code, but it makes sense to me. I like that it's cleaner and that you added detailed comments.


mergify bot commented Sep 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @heheda12345.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 1, 2025
@heheda12345 heheda12345 added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 2, 2025
@mergify mergify bot removed the needs-rebase label Sep 2, 2025
@heheda12345 heheda12345 enabled auto-merge (squash) September 2, 2025 17:08
@mergify mergify bot added the tpu Related to Google TPUs label Sep 2, 2025

mergify bot commented Sep 8, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @heheda12345.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 8, 2025
@heheda12345

I'm waiting for [buildkite/ci/pr/distributed-tests-2-gpus](https://buildkite.com/vllm/ci/builds/29500#01991676-e49f-4d44-b9cd-f2e5b8295673) to be fixed on main.

@mergify mergify bot removed the needs-rebase label Sep 10, 2025
@heheda12345 heheda12345 merged commit 8e5cdcd into vllm-project:main Sep 14, 2025
47 checks passed
@heheda12345 heheda12345 deleted the hybrid_pp branch September 14, 2025 22:55
dsxsteven pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Sep 15, 2025
bbartels pushed a commit to bbartels/vllm that referenced this pull request Sep 15, 2025
xuechendi added a commit to vllm-project/vllm-gaudi that referenced this pull request Sep 15, 2025
[PR #23974](vllm-project/vllm#23974) updated the
vllm.v1.core.kv_cache_utils.get_kv_cache_configs api.

Signed-off-by: Chendi Xue <[email protected]>
slokesha pushed a commit to slokesha/vllm-gaudi that referenced this pull request Sep 24, 2025
[PR #23974](vllm-project/vllm#23974) updated the
vllm.v1.core.kv_cache_utils.get_kv_cache_configs api.

Signed-off-by: Chendi Xue <[email protected]>
Signed-off-by: slokesha <[email protected]>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025

Labels

  • ready: ONLY add when PR is ready to merge/full CI is needed
  • tpu: Related to Google TPUs
  • v1


Development

Successfully merging this pull request may close these issues.

[Bug]: KV cache specs are not equal across rank

3 participants