Conversation

@Flink-ddd (Contributor) commented Oct 23, 2025

Purpose

Fixes #26081

This PR addresses a bug where llm.chat() pollutes the scheduler queue if a batch of requests fails validation mid-way.

The Bug:
When a batch (e.g., 4 requests) is sent to llm.chat() and one request (#3) is invalid (e.g., its prompt is too long), the engine enqueues the valid requests that precede it (#1, #2) and then raises an exception for #3. The already-enqueued requests (#1, #2) become "orphaned" and are never cleaned up.

The Consequence:
The next call to llm.chat() (for example, in recursive retry logic) picks up these orphaned requests, producing more outputs than inputs and breaking the API contract.
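
For illustration, a retry wrapper like the following exposes the bug (a hypothetical sketch of the pattern described above, not the exact infer() script from #26081; is_too_long() is a placeholder filter, and the validation error is assumed to surface as a ValueError):

# Hypothetical sketch of the recursive retry pattern described above.
from vllm import LLM, SamplingParams

def is_too_long(conversation, limit: int = 4096) -> bool:
    # Placeholder validity check (not a vLLM API): crude word-count estimate.
    return sum(len(m["content"].split()) for m in conversation) > limit

def infer(llm: LLM, conversations: list, params: SamplingParams):
    try:
        return llm.chat(conversations, sampling_params=params)
    except ValueError:
        # Drop prompts that fail the local check and retry. Before this fix,
        # the retry also drained requests orphaned by the failed call, so
        # len(outputs) could exceed len(conversations).
        remaining = [c for c in conversations if not is_too_long(c)]
        return infer(llm, remaining, params)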

The Fix:
This PR makes the batch addition in _validate_and_add_requests transactional.

  1. It wraps the request-adding loop in a try...except block.
  2. It tracks the request_id of each successfully enqueued request from the current batch.
  3. If an exception occurs, the except block aborts the tracked "orphan" requests via self.llm_engine.abort_request() before re-raising the exception (see the sketch below).
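
In code, the pattern looks roughly like this (an illustrative, self-contained sketch, not the exact vLLM method; engine is a stand-in object, not the real LLMEngine class):

# Illustrative sketch of the transactional batch-add pattern described above.
from typing import Any, Sequence

def add_batch_transactionally(engine: Any, prompts: Sequence[Any], params: Any) -> list[str]:
    added_request_ids: list[str] = []
    try:
        for prompt in prompts:
            # May raise on invalid input (e.g., a prompt that is too long).
            request_id = engine.add_request(prompt, params)
            added_request_ids.append(request_id)
    except Exception:
        # Roll back: abort the requests from this batch that were already
        # enqueued, then re-raise so the caller still sees the original error.
        if added_request_ids:
            engine.abort_request(added_request_ids)
        raise
    return added_request_ids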

Test Plan

  1. Manual Reproduction (OP's Example): Used the recursive infer() function from issue #26081 ([Bug]: .chat() does not clean up in case of validation failure) to demonstrate the bug (orphaned requests causing incorrect output counts) and to verify the fix (correct output counts). See results below.
  2. Automated Regression Test: Added test_chat_batch_failure_cleanup to tests/entrypoints/llm/test_chat.py. This test (a) triggers a batch failure with an invalid prompt, (b) immediately sends a new, valid batch, and (c) asserts that the output count for the new batch is correct, confirming the queue was cleaned (an outline is sketched below).
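
An outline of such a test might look like this (a hypothetical sketch; the real test is test_chat_batch_failure_cleanup in tests/entrypoints/llm/test_chat.py, and the model name, prompts, and expected exception type are placeholders):

# Hypothetical outline of the regression test described in item 2 above.
import pytest
from vllm import LLM

def test_chat_batch_failure_cleanup():
    llm = LLM(model="facebook/opt-125m", max_model_len=128)
    valid = [{"role": "user", "content": "Hello"}]
    too_long = [{"role": "user", "content": "word " * 10_000}]

    # (a) A batch containing an over-long prompt should fail validation.
    with pytest.raises(ValueError):
        llm.chat([valid, too_long])

    # (b) A follow-up valid batch must not pick up orphaned requests,
    # (c) so the output count matches the input count exactly.
    outputs = llm.chat([valid, valid])
    assert len(outputs) == 2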

Test Result

The fix is confirmed by both the manual reproduction script and the new automated test.

1. Manual Reproduction (Before vs. After)

Before Fix (on main branch): Bug Reproduced
The recursive call processed 5 real outputs instead of the expected 3, due to 2 orphaned requests.

[Screenshot: 2025-10-23 22:39:38]

After Fix (on this branch): Fix Verified
The recursive call now correctly processes 3 real outputs and 1 error, as the orphaned requests were successfully aborted.

[Screenshot: 2025-10-23 22:41:21]

2. Automated Regression Test (Pytest)

The new pytest case passes, ensuring this bug does not regress.

Command:
pytest tests/entrypoints/llm/test_chat.py -k test_chat_batch_failure_cleanup

[Screenshot: 2025-10-23 22:29:14]

@mergify bot added the frontend label on Oct 23, 2025.
@Flink-ddd force-pushed the fix/26081-chat-batch-cleanup branch 4 times, most recently from 31a6f39 to bd01420, on October 23, 2025 at 15:43.
@Flink-ddd marked this pull request as ready for review on November 2, 2025 at 05:04.

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines 1565 to 1579
try:
    for i, prompt in enumerate(it):
        if isinstance(prompt, dict):
            self._validate_mm_data_and_uuids(
                prompt.get("multi_modal_data"), prompt.get("multi_modal_uuids")
            )
            request_id = self._add_request(
                prompt,
                params[i] if isinstance(params, Sequence) else params,
                lora_request=lora_request[i]
                if isinstance(lora_request, Sequence)
                else lora_request,
                priority=priority[i] if priority else 0,
            )
            added_request_ids.append(request_id)

P0: Skip adding non-dict prompts in _validate_and_add_requests

The new try/except only calls _add_request inside the isinstance(prompt, dict) branch. Any prompt that is a string, list of tokens, etc. now bypasses _add_request entirely. For LLM.generate, LLM.embed, LLM.encode, and any other path that passes plain text prompts, this method returns without enqueuing anything, so _run_engine sees zero requests and the call produces no output. This is a regression from previous behavior and breaks the core API for all non-dict prompts.
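
To illustrate the flagged regression (hypothetical example; the model name is a placeholder): with _add_request gated behind isinstance(prompt, dict), a plain-string prompt would never be enqueued, so a call like the following would return zero outputs instead of one.

# Hypothetical illustration of the flagged regression; model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate("Hello, world", SamplingParams(max_tokens=8))
assert len(outputs) == 1  # would fail (len(outputs) == 0) with the regression in place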

@DarkLight1337 (Member) commented:

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a fix to prevent orphaned requests in the scheduler queue when a batch of requests to llm.chat() fails mid-validation. The solution makes the batch addition transactional by wrapping the request-adding loop in a try...except block and aborting any successfully enqueued requests from the batch if an exception occurs. While the fix is logically sound and is accompanied by a good regression test, it introduces a critical regression. The change incorrectly moves the request addition logic inside a conditional block that only handles dictionary-based prompts, causing non-dictionary prompts (like simple strings) to be silently ignored.

Comment on lines 1567 to 1579
if isinstance(prompt, dict):
    self._validate_mm_data_and_uuids(
        prompt.get("multi_modal_data"), prompt.get("multi_modal_uuids")
    )
    request_id = self._add_request(
        prompt,
        params[i] if isinstance(params, Sequence) else params,
        lora_request=lora_request[i]
        if isinstance(lora_request, Sequence)
        else lora_request,
        priority=priority[i] if priority else 0,
    )
    added_request_ids.append(request_id)

critical

This change introduces a regression. The call to _add_request and the append to added_request_ids have been moved inside the if isinstance(prompt, dict): block. This means that prompts that are not dictionaries (e.g., simple strings passed to llm.generate()) will be silently ignored instead of being added to the engine, breaking existing functionality.

The _add_request call and the append to added_request_ids should be moved outside the if block to process all prompt types, restoring the original behavior while keeping the transactional logic.

Suggested change

Current:

if isinstance(prompt, dict):
    self._validate_mm_data_and_uuids(
        prompt.get("multi_modal_data"), prompt.get("multi_modal_uuids")
    )
    request_id = self._add_request(
        prompt,
        params[i] if isinstance(params, Sequence) else params,
        lora_request=lora_request[i]
        if isinstance(lora_request, Sequence)
        else lora_request,
        priority=priority[i] if priority else 0,
    )
    added_request_ids.append(request_id)

Suggested:

if isinstance(prompt, dict):
    self._validate_mm_data_and_uuids(
        prompt.get("multi_modal_data"), prompt.get("multi_modal_uuids")
    )
request_id = self._add_request(
    prompt,
    params[i] if isinstance(params, Sequence) else params,
    lora_request=lora_request[i]
    if isinstance(lora_request, Sequence)
    else lora_request,
    priority=priority[i] if priority else 0,
)
added_request_ids.append(request_id)

Member commented:

Please address this

Contributor Author commented:

Hi @DarkLight1337, I just saw your message. I'm making the changes now; once they're tested on the GPU and working, I'll push the code.

Contributor Author commented:

Hi @DarkLight1337, I just updated the code logic and pushed it. The screenshots below show the execution of several methods, including llm.chat, llm.generate, and llm.embed.

[Screenshots: 2025-11-02 19:59:25 and 2025-11-02 20:06:31]

@Flink-ddd force-pushed the fix/26081-chat-batch-cleanup branch from bd01420 to b2114f6 on November 2, 2025 at 12:07.

@DarkLight1337 (Member) commented:

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a crucial fix to prevent orphaned requests from polluting the scheduler queue when a batch operation in llm.chat() fails. The approach of making the batch addition transactional by tracking added requests and aborting them on failure is sound and well-implemented. The accompanying regression test is also well-designed and effectively verifies the fix.

I've identified one area for improvement in the new exception handling logic to make it more robust, ensuring that the original exception is not lost if the cleanup operation itself fails. Overall, this is a great contribution that improves the reliability of batch processing.

Comment on lines +1580 to +1583
except Exception as e:
    if added_request_ids:
        self.llm_engine.abort_request(added_request_ids)
    raise e

high

The current exception handling in this except block has a potential issue. If self.llm_engine.abort_request(added_request_ids) itself raises an exception, the original exception e that triggered this cleanup logic will be lost. The new exception from abort_request will propagate, masking the root cause of the failure.

To make the error handling more robust, it's better to wrap the cleanup call in its own try...except block. This ensures that even if the cleanup fails, the original, more important exception is the one that gets re-raised. The cleanup failure can be logged for debugging purposes.

Also, it's a common Python best practice to use a bare raise statement inside an except block to re-raise the original exception, as it preserves the full traceback.

Suggested change

Current:

except Exception as e:
    if added_request_ids:
        self.llm_engine.abort_request(added_request_ids)
    raise e

Suggested:

except Exception:
    if added_request_ids:
        try:
            self.llm_engine.abort_request(added_request_ids)
        except Exception as abort_exc:
            logger.warning(
                "Failed to abort orphaned requests after an error during "
                "batch addition. The original error will be raised.",
                exc_info=abort_exc)
    raise

@DarkLight1337 (Member) left a comment

LGTM now, thanks for fixing!

@DarkLight1337 enabled auto-merge (squash) on November 2, 2025 at 12:19.
@github-actions bot added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) on Nov 2, 2025.

@Flink-ddd (Contributor, Author) commented:

Hi @DarkLight1337, thanks for your approval and help.

@DarkLight1337 merged commit 0ce743f into vllm-project:main on Nov 2, 2025. 47 checks passed.