bug: vllm-router throws exception when using disaggregated prefill #747

@surajssd

Description

Describe the bug

When you don't provide max_tokens in the request, the router throws an exception:

➜  curl http://localhost:30080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "Explain the origin of Llamas?"
    }' | jq

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   110    0     0  100   110      0     62  0:00:01  0:00:01 --:--:--    62
curl: (18) transfer closed with outstanding read data remaining

On the router side I see this:

✗  kubectl logs -f pd-deployment-router-6589f888d6-zm9cb
...
INFO:     10.224.0.5:46048 - "GET /health HTTP/1.1" 200 OK
[2025-10-31 21:08:42,434] INFO: Prefiller time (TTFT): 0.5266 (request.py:308:vllm_router.services.request_service.request)
INFO:     127.0.0.1:40698 - "POST /v1/completions HTTP/1.1" 200 OK
ERROR:    Exception in ASGI application
  + Exception Group Traceback (most recent call last):
  |   File "/usr/local/lib/python3.12/site-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
  |     result = await app(  # type: ignore[func-returns-value]
  |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/usr/local/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
  |     return await self.app(scope, receive, send)
  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/usr/local/lib/python3.12/site-packages/fastapi/applications.py", line 1054, in __call__
  |     await super().__call__(scope, receive, send)
  |   File "/usr/local/lib/python3.12/site-packages/starlette/applications.py", line 112, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/usr/local/lib/python3.12/site-packages/starlette/middleware/errors.py", line 187, in __call__
  |     raise exc
  |   File "/usr/local/lib/python3.12/site-packages/starlette/middleware/errors.py", line 165, in __call__
  |     await self.app(scope, receive, _send)
  |   File "/usr/local/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
  |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  |   File "/usr/local/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
  |     raise exc
  |   File "/usr/local/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 715, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 735, in app
  |     await route.handle(scope, receive, send)
  |   File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 288, in handle
  |     await self.app(scope, receive, send)
  |   File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 76, in app
  |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  |   File "/usr/local/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
  |     raise exc
  |   File "/usr/local/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 74, in app
  |     await response(scope, receive, send)
  |   File "/usr/local/lib/python3.12/site-packages/starlette/responses.py", line 261, in __call__
  |     async with anyio.create_task_group() as task_group:
  |                ^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/usr/local/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 772, in __aexit__
  |     raise BaseExceptionGroup(
  | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/usr/local/lib/python3.12/site-packages/starlette/responses.py", line 264, in wrap
    |     await func()
    |   File "/usr/local/lib/python3.12/site-packages/starlette/responses.py", line 245, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/usr/local/lib/python3.12/site-packages/vllm_router/services/request_service/request.py", line 312, in generate_stream
    |     async for chunk in stream_service_response(
    |   File "/usr/local/lib/python3.12/site-packages/vllm_router/services/request_service/request.py", line 287, in stream_service_response
    |     response.raise_for_status()
    |   File "/usr/local/lib/python3.12/site-packages/httpx/_models.py", line 829, in raise_for_status
    |     raise HTTPStatusError(message, request=request, response=self)
    | httpx.HTTPStatusError: Client error '400 Bad Request' for url 'http://10.244.1.164:8000/v1/completions'
    | For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400
    +------------------------------------
INFO:     10.224.0.5:37106 - "GET /health HTTP/1.1" 200 OK
...
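
A note on the symptom: by the time stream_service_response calls response.raise_for_status(), the router has already sent its own 200 response start for the streaming reply, so the upstream 400 can only surface as a dropped stream, which is what curl reports as error (18). Below is a minimal sketch (not the actual vllm_router code, just the same Starlette streaming pattern) that reproduces this behaviour; the endpoint name and port are made up for illustration.

# Sketch: an exception raised inside a StreamingResponse body iterator after the
# 200 status has been committed closes the connection mid-stream, which curl
# reports as "(18) transfer closed with outstanding read data remaining".
import uvicorn
from starlette.applications import Starlette
from starlette.responses import StreamingResponse
from starlette.routing import Route


async def body_iterator():
    yield b"data: first chunk\n\n"  # the 200 status line is already on the wire
    # Stand-in for the httpx.HTTPStatusError raised by response.raise_for_status()
    # in stream_service_response when the decode instance answers 400.
    raise RuntimeError("upstream returned 400 Bad Request")


async def completions(request):
    # The access log records "200 OK" here, before the iterator ever fails.
    return StreamingResponse(body_iterator(), media_type="text/event-stream")


app = Starlette(routes=[Route("/v1/completions", completions, methods=["POST"])])

if __name__ == "__main__":
    # curl -X POST http://127.0.0.1:8001/v1/completions logs "200 OK" followed by
    # "Exception in ASGI application", matching the router log above.
    uvicorn.run(app, host="127.0.0.1", port=8001)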

On the prefill side:

✗  kubectl logs pd-llama-prefill-deployment-vllm-59d64d89bd-jd5zw
...
INFO:     10.224.0.5:36338 - "GET /health HTTP/1.1" 200 OK
INFO 10-31 21:08:41 [logger.py:39] Received request cmpl-82a2f7b4642147c0b2e9c6445d772794-0: prompt: 'Explain the origin of Llamas?', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.9, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: [128000, 849, 21435, 279, 6371, 315, 445, 24705, 300, 30], lora_request: None, prompt_adapter_request: None.
INFO 10-31 21:08:41 [async_llm.py:256] Added request cmpl-82a2f7b4642147c0b2e9c6445d772794-0.
[2025-10-31 21:08:41,928] LMCache INFO: Storing KV cache for 10 out of 10 tokens for request cmpl-82a2f7b4642147c0b2e9c6445d772794-0 (vllm_v1_adapter.py:634:lmcache.integration.vllm.vllm_v1_adapter)
[2025-10-31 21:08:41,928] LMCache DEBUG: Sent the request with 1 keys (nixl_connector_v2.py:532:lmcache.experimental.storage_backend.connector.nixl_connector_v2)
[2025-10-31 21:08:41,929] LMCache DEBUG: Received ACK from remote peer with UUID: 88dbe307e3af41e287bc7c7cd79a0678 (nixl_connector_v2.py:382:lmcache.experimental.storage_backend.connector.nixl_connector_v2)
[2025-10-31 21:08:41,929] LMCache DEBUG: Committing write of 1.25 MB with 1 transfers (nixl_connector_v2.py:320:lmcache.experimental.storage_backend.connector.nixl_connector_v2)
[2025-10-31 21:08:42,428] LMCache DEBUG: Transfer 88dbe307e3af41e287bc7c7cd79a0678 completed in 499.1316 ms, creating the transfer: 0.0224 ms, transfer time: 499.1092 ms, pure transfer throughput: 0.0024 GB/s (nixl_connector_v2.py:349:lmcache.experimental.storage_backend.connector.nixl_connector_v2)
[2025-10-31 21:08:42,429] LMCache INFO: Store 10 tokens takes: 500.5628 ms, throughput: 0.0024 GB/s; offload_time: 0.3823 ms, put_time: 499.8056 ms (cache_engine.py:191:lmcache.experimental.cache_engine)
[2025-10-31 21:08:42,431] LMCache WARNING: In connector.start_load_kv, but the attn_metadata is None (vllm_v1_adapter.py:424:lmcache.integration.vllm.vllm_v1_adapter)
INFO:     10.244.1.183:56036 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     10.224.0.5:34448 - "GET /health HTTP/1.1" 200 OK
...

On the decode side:

✗  kubectl logs pd-llama-decode-deployment-vllm-77fc57b64d-k5stp
...
INFO:     10.224.0.5:55424 - "GET /health HTTP/1.1" 200 OK
[2025-10-31 21:08:41,929] LMCache DEBUG: Received request with 1 keys from sender sender-7902872ed1a2487792bd022552947934 (nixl_connector_v2.py:744:lmcache.experimental.storage_backend.connector.nixl_connector_v2)
[2025-10-31 21:08:42,429] LMCache DEBUG: Transfer for UUID '88dbe307e3af41e287bc7c7cd79a0678' completed on the remote side (NixlRole.SENDERcuda:0) (nixl_connector_v2.py:423:lmcache.experimental.storage_backend.connector.nixl_connector_v2)
[2025-10-31 21:08:42,429] LMCache DEBUG: Nixl Observer: clone time: 0.1898 msec, Add time: 0.0260 msec for 1 objects (nixl_backend.py:226:lmcache.experimental.storage_backend.nixl_backend)
[2025-10-31 21:08:42,430] LMCache DEBUG: Observers processing in 0.4156 ms (nixl_connector_v2.py:705:lmcache.experimental.storage_backend.connector.nixl_connector_v2)
[2025-10-31 21:08:42,430] LMCache DEBUG: Receiver acked the data with new UUID: c82b30b8dd7b4afab7ace46ba3d405a4 (nixl_connector_v2.py:436:lmcache.experimental.storage_backend.connector.nixl_connector_v2)
INFO:     10.244.1.183:54698 - "POST /v1/completions HTTP/1.1" 400 Bad Request
...
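
For comparison, the same request with max_tokens set explicitly presumably avoids this path, since the failure only shows up when max_tokens is omitted. A hedged sketch of that request using httpx (the max_tokens value is arbitrary):

# Hedged workaround sketch, equivalent to the curl call above but with an
# explicit max_tokens; omitting max_tokens is what triggers the failure.
import httpx

resp = httpx.post(
    "http://localhost:30080/v1/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "Explain the origin of Llamas?",
        "max_tokens": 100,  # arbitrary example value
    },
    timeout=120.0,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])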

To Reproduce

I had to work around #746 (comment) to get the PD setup running in the first place; that issue describes how to get the setup up and running.

Expected behavior

I see a completion response instead of the request failing.

Additional context

No response
