Skip to content

Conversation

@jianzs
Copy link
Collaborator

@jianzs jianzs commented Dec 5, 2025

What this PR does / why we need it?

Previously, if the KVCacheSendingThread couldn't create a socket because of port conflicts or other problems, the main thread would wait endlessly for the ready_event signal, causing the entire engine initialization to freeze. This update fixes the issue by adding timeouts for thread startup and handling unexpected thread exits, so the initialization process no longer gets stuck indefinitely.

Does this PR introduce any user-facing change?

No

How was this patch tested?

@jianzs jianzs added ready read for review ready-for-test start test by label for PR labels Dec 5, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses a potential hang during engine initialization by improving the startup logic for the KVCacheSendingThread. The introduction of a try...except block in the thread's run method ensures that socket creation failures are gracefully handled, preventing the thread from dying silently. Complementing this, the main thread now uses a timed wait loop with a check for the worker thread's liveness, which robustly prevents indefinite hangs. The changes are well-implemented and significantly improve the system's resilience. I have one suggestion to further enhance the robustness against a potential race condition during thread startup.

Comment on lines +1129 to +1136
while not ready_event.is_set():
if not thread.is_alive():
raise RuntimeError(
"KV Cache sending/receiving thread failed to start.")
if time.time() - start_wait_time > 5 * 60:
raise RuntimeError(
"Timeout waiting for KV Cache thread to be ready.")
time.sleep(3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There's a potential race condition here. The worker thread could set the ready_event and then immediately terminate due to an error. The main thread would then exit the wait loop, believing the worker is ready, while it's actually dead. This could lead to subsequent failures.

To make this more robust, you should add another check for thread.is_alive() after the while loop to ensure the thread is still running after it has signaled readiness.

        while not ready_event.is_set():
            if not thread.is_alive():
                raise RuntimeError(
                    "KV Cache sending/receiving thread failed to start.")
            if time.time() - start_wait_time > 5 * 60:
                raise RuntimeError(
                    "Timeout waiting for KV Cache thread to be ready.")
            time.sleep(3)

        if not thread.is_alive():
            raise RuntimeError(
                "KV Cache sending/receiving thread died unexpectedly after startup.")

@github-actions
Copy link

github-actions bot commented Dec 5, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:‌‌

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests ‌to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description to help reviewer and future developers understand.

If CI fails, you can run linting and testing checks locally according Contributing and Testing.

Signed-off-by: Jade Zheng <[email protected]>
Signed-off-by: Jade Zheng <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready read for review ready-for-test start test by label for PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant