[Bugfix] Prevent engine hang during KVCacheSendingThread startup #4754

jianzs · 2025-12-05T14:13:41Z

What this PR does / why we need it?

Previously, if the KVCacheSendingThread couldn't create a socket because of port conflicts or other problems, the main thread would wait endlessly for the ready_event signal, causing the entire engine initialization to freeze. This update fixes the issue by adding timeouts for thread startup and handling unexpected thread exits, so the initialization process no longer gets stuck indefinitely.
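The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not the connector's code: `worker` and its `fail` flag are hypothetical stand-ins for a sending thread whose socket setup fails before it can signal readiness.

```python
import threading


def worker(ready_event: threading.Event, fail: bool) -> None:
    """Hypothetical worker: signals readiness only if setup succeeds."""
    if fail:
        # Stand-in for socket creation failing (e.g. a port conflict).
        # The thread exits here and ready_event is never set.
        return
    ready_event.set()


ready = threading.Event()
thread = threading.Thread(target=worker, args=(ready, True), daemon=True)
thread.start()
thread.join()

# A bare ready.wait() would block forever at this point.
# With a timeout, wait() returns False and the caller can react.
signalled = ready.wait(timeout=0.1)
print(signalled, thread.is_alive())
```

Because the worker has already exited, nothing will ever set the event, which is why an untimed wait in the main thread freezes initialization.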

Does this PR introduce any user-facing change?

No

How was this patch tested?

vLLM version: v0.12.0
vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: Jade Zheng <[email protected]>

gemini-code-assist

Code Review

This pull request effectively addresses a potential hang during engine initialization by improving the startup logic for the KVCacheSendingThread. The introduction of a try...except block in the thread's run method ensures that socket creation failures are gracefully handled, preventing the thread from dying silently. Complementing this, the main thread now uses a timed wait loop with a check for the worker thread's liveness, which robustly prevents indefinite hangs. The changes are well-implemented and significantly improve the system's resilience. I have one suggestion to further enhance the robustness against a potential race condition during thread startup.
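The worker-side half of the pattern described in this review might look roughly like the sketch below. The class name `SenderThread`, the `setup` callable, and the `error` attribute are assumptions made for illustration; the actual `KVCacheSendingThread.run` in `mooncake_connector.py` is not shown in this conversation.

```python
import logging
import threading

logger = logging.getLogger(__name__)


class SenderThread(threading.Thread):
    """Hypothetical stand-in for a thread that must create a socket on startup."""

    def __init__(self, ready_event: threading.Event, setup) -> None:
        super().__init__(daemon=True)
        self.ready_event = ready_event
        self.setup = setup  # assumed callable that performs socket creation
        self.error = None   # records the startup failure, if any

    def run(self) -> None:
        try:
            self.setup()
        except Exception as exc:
            # Log the failure instead of letting the thread die silently.
            self.error = exc
            logger.error("Sender thread failed during startup: %s", exc)
            return
        # Only reached when setup succeeded.
        self.ready_event.set()


def broken_setup() -> None:
    # Simulated port conflict.
    raise OSError("Address already in use")


ready = threading.Event()
thread = SenderThread(ready, broken_setup)
thread.start()
thread.join(timeout=5)
```

After the failed startup the thread is no longer alive and the event is still unset, which is exactly the state the main thread's liveness check is meant to detect.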

gemini-code-assist · 2025-12-05T14:15:43Z

vllm_ascend/distributed/mooncake_connector.py

+        while not ready_event.is_set():
+            if not thread.is_alive():
+                raise RuntimeError(
+                    "KV Cache sending/receiving thread failed to start.")
+            if time.time() - start_wait_time > 5 * 60:
+                raise RuntimeError(
+                    "Timeout waiting for KV Cache thread to be ready.")
+            time.sleep(3)


There's a potential race condition here. The worker thread could set the ready_event and then immediately terminate due to an error. The main thread would then exit the wait loop, believing the worker is ready, while it's actually dead. This could lead to subsequent failures.

To make this more robust, you should add another check for thread.is_alive() after the while loop to ensure the thread is still running after it has signaled readiness.

        while not ready_event.is_set():
            if not thread.is_alive():
                raise RuntimeError(
                    "KV Cache sending/receiving thread failed to start.")
            if time.time() - start_wait_time > 5 * 60:
                raise RuntimeError(
                    "Timeout waiting for KV Cache thread to be ready.")
            time.sleep(3)
        if not thread.is_alive():
            raise RuntimeError(
                "KV Cache sending/receiving thread died unexpectedly after startup.")
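For experimenting with this suggestion outside the connector, the wait logic can be pulled into a standalone helper. The sketch below is a variation rather than the PR's code: the function name `wait_until_ready` and its parameters are hypothetical, it uses `time.monotonic()` for the deadline, and it polls with `Event.wait(timeout=...)` instead of `time.sleep(3)` so the loop wakes as soon as the event is set.

```python
import threading
import time


def wait_until_ready(thread: threading.Thread,
                     ready_event: threading.Event,
                     timeout_s: float = 300.0,
                     poll_s: float = 3.0) -> None:
    """Block until ready_event is set, failing fast if the thread dies or times out."""
    deadline = time.monotonic() + timeout_s
    while not ready_event.wait(timeout=poll_s):
        if not thread.is_alive():
            raise RuntimeError("thread failed to start")
        if time.monotonic() > deadline:
            raise RuntimeError("timeout waiting for thread to be ready")
    # Post-loop check: the thread may have set the event and then exited.
    if not thread.is_alive():
        raise RuntimeError("thread died unexpectedly after startup")


def outcome_of(target) -> str:
    """Run target in a thread, let it finish, and report what the helper does."""
    event = threading.Event()
    thread = threading.Thread(target=target, args=(event,), daemon=True)
    thread.start()
    thread.join()
    try:
        wait_until_ready(thread, event, timeout_s=1.0, poll_s=0.05)
        return "ready"
    except RuntimeError as exc:
        return str(exc)


# Worker that signals readiness and then exits immediately.
set_then_exit = outcome_of(lambda event: event.set())
# Worker that exits without ever signalling readiness.
never_ready = outcome_of(lambda event: None)
print(set_then_exit)
print(never_ready)
```

Note that the post-loop check narrows the window but cannot close it: a thread can still die immediately after the final `is_alive()` call, so later code paths need their own error handling.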

github-actions · 2025-12-05T14:48:18Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

A PR should do only one thing; smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Signed-off-by: Jade Zheng <[email protected]>

[Bugfix] Improve error handling and logging in KVCacheSendingThread

85b86bb

Signed-off-by: Jade Zheng <[email protected]>

jianzs added the ready (read for review) and ready-for-test (start test by label for PR) labels Dec 5, 2025

gemini-code-assist bot reviewed Dec 5, 2025


jianzs added 2 commits December 5, 2025 23:12

update

27d2fc2

Signed-off-by: Jade Zheng <[email protected]>

update

0eb8785

Signed-off-by: Jade Zheng <[email protected]>
