-
Notifications
You must be signed in to change notification settings - Fork 629
[Bugfix] Prevent engine hang during KVCacheSendingThread startup #4754
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Jade Zheng <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request effectively addresses a potential hang during engine initialization by improving the startup logic for the KVCacheSendingThread. The introduction of a try...except block in the thread's run method ensures that socket creation failures are gracefully handled, preventing the thread from dying silently. Complementing this, the main thread now uses a timed wait loop with a check for the worker thread's liveness, which robustly prevents indefinite hangs. The changes are well-implemented and significantly improve the system's resilience. I have one suggestion to further enhance the robustness against a potential race condition during thread startup.
| while not ready_event.is_set(): | ||
| if not thread.is_alive(): | ||
| raise RuntimeError( | ||
| "KV Cache sending/receiving thread failed to start.") | ||
| if time.time() - start_wait_time > 5 * 60: | ||
| raise RuntimeError( | ||
| "Timeout waiting for KV Cache thread to be ready.") | ||
| time.sleep(3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a potential race condition here. The worker thread could set the ready_event and then immediately terminate due to an error. The main thread would then exit the wait loop, believing the worker is ready, while it's actually dead. This could lead to subsequent failures.
To make this more robust, you should add another check for thread.is_alive() after the while loop to ensure the thread is still running after it has signaled readiness.
while not ready_event.is_set():
if not thread.is_alive():
raise RuntimeError(
"KV Cache sending/receiving thread failed to start.")
if time.time() - start_wait_time > 5 * 60:
raise RuntimeError(
"Timeout waiting for KV Cache thread to be ready.")
time.sleep(3)
if not thread.is_alive():
raise RuntimeError(
"KV Cache sending/receiving thread died unexpectedly after startup.")|
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according Contributing and Testing. |
Signed-off-by: Jade Zheng <[email protected]>
Signed-off-by: Jade Zheng <[email protected]>
What this PR does / why we need it?
Previously, if the KVCacheSendingThread couldn't create a socket because of port conflicts or other problems, the main thread would wait endlessly for the ready_event signal, causing the entire engine initialization to freeze. This update fixes the issue by adding timeouts for thread startup and handling unexpected thread exits, so the initialization process no longer gets stuck indefinitely.
Does this PR introduce any user-facing change?
No
How was this patch tested?