Fix(llm): Abort orphaned requests when llm.chat() batch fails Fixes #26081 #27420
Conversation
Force-pushed from 31a6f39 to bd01420.
💡 Codex Review
Here are some automated review suggestions for this pull request.
vllm/entrypoints/llm.py (outdated)

    try:
        for i, prompt in enumerate(it):
            if isinstance(prompt, dict):
                self._validate_mm_data_and_uuids(
                    prompt.get("multi_modal_data"), prompt.get("multi_modal_uuids")
                )
                request_id = self._add_request(
                    prompt,
                    params[i] if isinstance(params, Sequence) else params,
                    lora_request=lora_request[i]
                    if isinstance(lora_request, Sequence)
                    else lora_request,
                    priority=priority[i] if priority else 0,
                )
                added_request_ids.append(request_id)
Skip adding non-dict prompts in _validate_and_add_requests
The new try/except only calls _add_request inside the isinstance(prompt, dict) branch. Any prompt that is a string, list of tokens, etc. now bypasses _add_request entirely. For LLM.generate, LLM.embed, LLM.encode, and any other path that passes plain text prompts, this method returns without enqueuing anything, so _run_engine sees zero requests and the call produces no output. This is a regression from previous behavior and breaks the core API for all non-dict prompts.
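For illustration only (not part of this PR), a minimal script that would expose the regression described above; the model name is an assumption chosen for a quick local check:

```python
# Hypothetical repro sketch: under the reviewed change, a plain-string prompt
# never reaches _add_request, so generate() would return no outputs.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # small model, assumed for illustration
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=8))

# Correct behavior: one output per prompt. Under the reviewed change,
# this assertion would fail because len(outputs) == 0.
assert len(outputs) == 1
```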
/gemini review
Code Review
This pull request introduces a fix to prevent orphaned requests in the scheduler queue when a batch of requests to llm.chat() fails mid-validation. The solution makes the batch addition transactional by wrapping the request-adding loop in a try...except block and aborting any successfully enqueued requests from the batch if an exception occurs. While the fix is logically sound and is accompanied by a good regression test, it introduces a critical regression. The change incorrectly moves the request addition logic inside a conditional block that only handles dictionary-based prompts, causing non-dictionary prompts (like simple strings) to be silently ignored.
vllm/entrypoints/llm.py (outdated)

    if isinstance(prompt, dict):
        self._validate_mm_data_and_uuids(
            prompt.get("multi_modal_data"), prompt.get("multi_modal_uuids")
        )
        request_id = self._add_request(
            prompt,
            params[i] if isinstance(params, Sequence) else params,
            lora_request=lora_request[i]
            if isinstance(lora_request, Sequence)
            else lora_request,
            priority=priority[i] if priority else 0,
        )
        added_request_ids.append(request_id)
This change introduces a regression. The call to _add_request and appending to added_request_ids has been moved inside the if isinstance(prompt, dict): block. This means that prompts that are not dictionaries (e.g., simple strings passed to llm.generate()) will be silently ignored instead of being added to the engine, breaking existing functionality.
The _add_request call and the append to added_request_ids should be moved outside the if block to process all prompt types, restoring the original behavior while keeping the transactional logic.
Suggested change:

Current:

    if isinstance(prompt, dict):
        self._validate_mm_data_and_uuids(
            prompt.get("multi_modal_data"), prompt.get("multi_modal_uuids")
        )
        request_id = self._add_request(
            prompt,
            params[i] if isinstance(params, Sequence) else params,
            lora_request=lora_request[i]
            if isinstance(lora_request, Sequence)
            else lora_request,
            priority=priority[i] if priority else 0,
        )
        added_request_ids.append(request_id)

Suggested:

    if isinstance(prompt, dict):
        self._validate_mm_data_and_uuids(
            prompt.get("multi_modal_data"), prompt.get("multi_modal_uuids")
        )
    request_id = self._add_request(
        prompt,
        params[i] if isinstance(params, Sequence) else params,
        lora_request=lora_request[i]
        if isinstance(lora_request, Sequence)
        else lora_request,
        priority=priority[i] if priority else 0,
    )
    added_request_ids.append(request_id)
Please address this
Hi @DarkLight1337, I just saw the message. Okay, I'm making the changes now. Once it's tested on the GPU and works fine, I'll push the code.
Hi @DarkLight1337, I just updated the code logic and pushed it. The screenshots below show the execution of several methods, including llm.chat, llm.generate, and llm.embed.
…-project#26081 Signed-off-by: vensenmu <[email protected]>
Force-pushed from bd01420 to b2114f6.
/gemini review
Code Review
This pull request introduces a crucial fix to prevent orphaned requests from polluting the scheduler queue when a batch operation in llm.chat() fails. The approach of making the batch addition transactional by tracking added requests and aborting them on failure is sound and well-implemented. The accompanying regression test is also well-designed and effectively verifies the fix.
I've identified one area for improvement in the new exception handling logic to make it more robust, ensuring that the original exception is not lost if the cleanup operation itself fails. Overall, this is a great contribution that improves the reliability of batch processing.
    except Exception as e:
        if added_request_ids:
            self.llm_engine.abort_request(added_request_ids)
        raise e
The current exception handling in this except block has a potential issue. If self.llm_engine.abort_request(added_request_ids) itself raises an exception, the original exception e that triggered this cleanup logic will be lost. The new exception from abort_request will propagate, masking the root cause of the failure.
To make the error handling more robust, it's better to wrap the cleanup call in its own try...except block. This ensures that even if the cleanup fails, the original, more important exception is the one that gets re-raised. The cleanup failure can be logged for debugging purposes.
Also, it's a common Python best practice to use a bare raise statement inside an except block to re-raise the original exception, as it preserves the full traceback.
Suggested change:

Current:

    except Exception as e:
        if added_request_ids:
            self.llm_engine.abort_request(added_request_ids)
        raise e

Suggested:

    except Exception:
        if added_request_ids:
            try:
                self.llm_engine.abort_request(added_request_ids)
            except Exception as abort_exc:
                logger.warning(
                    "Failed to abort orphaned requests after an error during "
                    "batch addition. The original error will be raised.",
                    exc_info=abort_exc)
        raise
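For reference, a standalone sketch (independent of vLLM, with purely illustrative names) of the pattern suggested above: the cleanup failure is logged, and the bare raise re-raises the original error with its traceback.

```python
# Illustrative sketch of "don't let cleanup mask the original error";
# do_work, cleanup, and logger are placeholder names, not vLLM APIs.
import logging

logger = logging.getLogger(__name__)

def do_work():
    raise ValueError("original failure")

def cleanup():
    raise RuntimeError("cleanup also failed")

try:
    do_work()
except Exception:
    try:
        cleanup()
    except Exception as cleanup_exc:
        # Record the cleanup failure without replacing the root cause.
        logger.warning("Cleanup failed; re-raising original error",
                       exc_info=cleanup_exc)
    raise  # bare raise propagates ValueError("original failure")
```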
LGTM now, thanks for fixing!
Hi @DarkLight1337, thanks for your approval and help.
Purpose
Fixes #26081
This PR addresses a bug where `llm.chat()` pollutes the scheduler queue if a batch of requests fails validation mid-way.

The Bug:
When a batch (e.g., 4 requests) is sent to `llm.chat()` and one request (3) is invalid (too long), the engine enqueues the valid requests (1, 2) and then throws an exception for 3. These enqueued requests (1, 2) become "orphaned" and are not cleaned up.

The Consequence:
The next call to `llm.chat()` (in a recursive retry logic) picks up these orphaned requests, leading to more outputs than inputs and breaking the API contract.

The Fix:
This PR makes the batch addition in `_validate_and_add_requests` transactional (a simplified sketch follows):
- The request-adding loop is wrapped in a `try...except` block.
- The `request_id` of each successfully enqueued request from the current batch is tracked.
- On failure, the `except` block iterates through the tracked IDs and calls `self.llm_engine.abort_request()` on each "orphan" before re-raising the exception.
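A condensed sketch of this transactional pattern, paraphrased from the diff in this PR (per-request LoRA, priority, and multi-modal validation are elided here; this is not a verbatim excerpt):

```python
# Condensed sketch of the transactional batch addition described above.
added_request_ids: list[str] = []
try:
    for i, prompt in enumerate(prompts):
        request_id = self._add_request(
            prompt,
            params[i] if isinstance(params, Sequence) else params,
        )
        added_request_ids.append(request_id)
except Exception:
    # Abort every request from this batch that was already enqueued,
    # then re-raise so the caller still sees the original failure.
    if added_request_ids:
        self.llm_engine.abort_request(added_request_ids)
    raise
```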
Test Plan
- Manual reproduction: used the `infer()` function from [Bug]: .chat() does not clean up in case of validation failure #26081 to demonstrate the bug (orphaned requests causing incorrect output counts) and verify the fix (correct output counts). See results below.
- Automated regression test: added `test_chat_batch_failure_cleanup` to `tests/entrypoints/llm/test_chat.py`. This test (a) triggers a batch failure with an invalid prompt, (b) immediately sends a new, valid batch, and (c) asserts that the output count for the new batch is correct, confirming the queue was cleaned. A rough sketch of such a test is included at the end of this description.

Test Result
The fix is confirmed by both the manual reproduction script and the new automated test.
1. Manual Reproduction (Before vs. After)
Before Fix (on `main` branch): Bug Reproduced
The recursive call processed 5 real outputs instead of the expected 3, due to 2 orphaned requests.
After Fix (on this branch): Fix Verified
The recursive call now correctly processes 3 real outputs and 1 error, as the orphaned requests were successfully aborted.
2. Automated Regression Test (Pytest)
The new `pytest` case passes, ensuring this bug does not regress.

Command:
pytest tests/entrypoints/llm/test_chat.py -k test_chat_batch_failure_cleanup
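For illustration, a rough sketch of a regression test along the lines described in the Test Plan; the actual `test_chat_batch_failure_cleanup` may differ, and the model name, prompt contents, and length limit here are assumptions:

```python
# Rough sketch only: the real test in tests/entrypoints/llm/test_chat.py may
# use different fixtures, model, prompts, and limits.
import pytest
from vllm import LLM, SamplingParams


def test_chat_batch_failure_cleanup():
    llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", max_model_len=128)  # assumed
    params = SamplingParams(max_tokens=8)

    good = [{"role": "user", "content": "Hello"}]
    too_long = [{"role": "user", "content": "word " * 10_000}]  # exceeds max_model_len

    # (a) A batch containing an invalid prompt should fail validation.
    with pytest.raises(Exception):
        llm.chat([good, good, too_long], params)

    # (b) Immediately send a fresh, valid batch.
    outputs = llm.chat([good, good, good], params)

    # (c) Exactly one output per input proves no orphaned requests from the
    #     failed batch were left in the scheduler queue.
    assert len(outputs) == 3
```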