bug: vllm-router throws exception when using disaggregated prefill #747

@surajssd

Description

Describe the bug

When you don't provide max_tokens in the request, the router throws an exception:

➜  curl http://localhost:30080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "Explain the origin of Llamas?"
    }' | jq

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   110    0     0  100   110      0     62  0:00:01  0:00:01 --:--:--    62
curl: (18) transfer closed with outstanding read data remaining

On the router side I see this:

✗  kubectl logs -f pd-deployment-router-6589f888d6-zm9cb
...
INFO:     10.224.0.5:46048 - "GET /health HTTP/1.1" 200 OK
[2025-10-31 21:08:42,434] INFO: Prefiller time (TTFT): 0.5266 (request.py:308:vllm_router.services.request_service.request)
INFO:     127.0.0.1:40698 - "POST /v1/completions HTTP/1.1" 200 OK
ERROR:    Exception in ASGI application
  + Exception Group Traceback (most recent call last):
  |   File "/usr/local/lib/python3.12/site-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
  |     result = await app(  # type: ignore[func-returns-value]
  |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/usr/local/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
  |     return await self.app(scope, receive, send)
  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/usr/local/lib/python3.12/site-packages/fastapi/applications.py", line 1054, in __call__
  |     await super().__call__(scope, receive, send)
  |   File "/usr/local/lib/python3.12/site-packages/starlette/applications.py", line 112, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/usr/local/lib/python3.12/site-packages/starlette/middleware/errors.py", line 187, in __call__
  |     raise exc
  |   File "/usr/local/lib/python3.12/site-packages/starlette/middleware/errors.py", line 165, in __call__
  |     await self.app(scope, receive, _send)
  |   File "/usr/local/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
  |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  |   File "/usr/local/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
  |     raise exc
  |   File "/usr/local/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 715, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 735, in app
  |     await route.handle(scope, receive, send)
  |   File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 288, in handle
  |     await self.app(scope, receive, send)
  |   File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 76, in app
  |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  |   File "/usr/local/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
  |     raise exc
  |   File "/usr/local/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 74, in app
  |     await response(scope, receive, send)
  |   File "/usr/local/lib/python3.12/site-packages/starlette/responses.py", line 261, in __call__
  |     async with anyio.create_task_group() as task_group:
  |                ^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/usr/local/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 772, in __aexit__
  |     raise BaseExceptionGroup(
  | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/usr/local/lib/python3.12/site-packages/starlette/responses.py", line 264, in wrap
    |     await func()
    |   File "/usr/local/lib/python3.12/site-packages/starlette/responses.py", line 245, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/usr/local/lib/python3.12/site-packages/vllm_router/services/request_service/request.py", line 312, in generate_stream
    |     async for chunk in stream_service_response(
    |   File "/usr/local/lib/python3.12/site-packages/vllm_router/services/request_service/request.py", line 287, in stream_service_response
    |     response.raise_for_status()
    |   File "/usr/local/lib/python3.12/site-packages/httpx/_models.py", line 829, in raise_for_status
    |     raise HTTPStatusError(message, request=request, response=self)
    | httpx.HTTPStatusError: Client error '400 Bad Request' for url 'http://10.244.1.164:8000/v1/completions'
    | For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400
    +------------------------------------
INFO:     10.224.0.5:37106 - "GET /health HTTP/1.1" 200 OK
...
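
A note on the symptom: by the time stream_service_response calls response.raise_for_status(), the router has already sent its own 200 response start for the streaming reply, so the upstream 400 can only surface as a dropped stream, which is what curl reports as error (18). Below is a minimal sketch (not the actual vllm_router code, just the same Starlette streaming pattern) that reproduces this behaviour; the endpoint name and port are made up for illustration.

# Sketch: an exception raised inside a StreamingResponse body iterator after the
# 200 status has been committed closes the connection mid-stream, which curl
# reports as "(18) transfer closed with outstanding read data remaining".
import uvicorn
from starlette.applications import Starlette
from starlette.responses import StreamingResponse
from starlette.routing import Route


async def body_iterator():
    yield b"data: first chunk\n\n"  # the 200 status line is already on the wire
    # Stand-in for the httpx.HTTPStatusError raised by response.raise_for_status()
    # in stream_service_response when the decode instance answers 400.
    raise RuntimeError("upstream returned 400 Bad Request")


async def completions(request):
    # The access log records "200 OK" here, before the iterator ever fails.
    return StreamingResponse(body_iterator(), media_type="text/event-stream")


app = Starlette(routes=[Route("/v1/completions", completions, methods=["POST"])])

if __name__ == "__main__":
    # curl -X POST http://127.0.0.1:8001/v1/completions logs "200 OK" followed by
    # "Exception in ASGI application", matching the router log above.
    uvicorn.run(app, host="127.0.0.1", port=8001)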

On the prefill side:

✗  kubectl logs pd-llama-prefill-deployment-vllm-59d64d89bd-jd5zw
...
INFO:     10.224.0.5:36338 - "GET /health HTTP/1.1" 200 OK
INFO 10-31 21:08:41 [logger.py:39] Received request cmpl-82a2f7b4642147c0b2e9c6445d772794-0: prompt: 'Explain the origin of Llamas?', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.9, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: [128000, 849, 21435, 279, 6371, 315, 445, 24705, 300, 30], lora_request: None, prompt_adapter_request: None.
INFO 10-31 21:08:41 [async_llm.py:256] Added request cmpl-82a2f7b4642147c0b2e9c6445d772794-0.
[2025-10-31 21:08:41,928] LMCache INFO: Storing KV cache for 10 out of 10 tokens for request cmpl-82a2f7b4642147c0b2e9c6445d772794-0 (vllm_v1_adapter.py:634:lmcache.integration.vllm.vllm_v1_adapter)
[2025-10-31 21:08:41,928] LMCache DEBUG: Sent the request with 1 keys (nixl_connector_v2.py:532:lmcache.experimental.storage_backend.connector.nixl_connector_v2)
[2025-10-31 21:08:41,929] LMCache DEBUG: Received ACK from remote peer with UUID: 88dbe307e3af41e287bc7c7cd79a0678 (nixl_connector_v2.py:382:lmcache.experimental.storage_backend.connector.nixl_connector_v2)
[2025-10-31 21:08:41,929] LMCache DEBUG: Committing write of 1.25 MB with 1 transfers (nixl_connector_v2.py:320:lmcache.experimental.storage_backend.connector.nixl_connector_v2)
[2025-10-31 21:08:42,428] LMCache DEBUG: Transfer 88dbe307e3af41e287bc7c7cd79a0678 completed in 499.1316 ms, creating the transfer: 0.0224 ms, transfer time: 499.1092 ms, pure transfer throughput: 0.0024 GB/s (nixl_connector_v2.py:349:lmcache.experimental.storage_backend.connector.nixl_connector_v2)
[2025-10-31 21:08:42,429] LMCache INFO: Store 10 tokens takes: 500.5628 ms, throughput: 0.0024 GB/s; offload_time: 0.3823 ms, put_time: 499.8056 ms (cache_engine.py:191:lmcache.experimental.cache_engine)
[2025-10-31 21:08:42,431] LMCache WARNING: In connector.start_load_kv, but the attn_metadata is None (vllm_v1_adapter.py:424:lmcache.integration.vllm.vllm_v1_adapter)
INFO:     10.244.1.183:56036 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     10.224.0.5:34448 - "GET /health HTTP/1.1" 200 OK
...

On the decode side:

✗  kubectl logs pd-llama-decode-deployment-vllm-77fc57b64d-k5stp
...
INFO:     10.224.0.5:55424 - "GET /health HTTP/1.1" 200 OK
[2025-10-31 21:08:41,929] LMCache DEBUG: Received request with 1 keys from sender sender-7902872ed1a2487792bd022552947934 (nixl_connector_v2.py:744:lmcache.experimental.storage_backend.connector.nixl_connector_v2)
[2025-10-31 21:08:42,429] LMCache DEBUG: Transfer for UUID '88dbe307e3af41e287bc7c7cd79a0678' completed on the remote side (NixlRole.SENDERcuda:0) (nixl_connector_v2.py:423:lmcache.experimental.storage_backend.connector.nixl_connector_v2)
[2025-10-31 21:08:42,429] LMCache DEBUG: Nixl Observer: clone time: 0.1898 msec, Add time: 0.0260 msec for 1 objects (nixl_backend.py:226:lmcache.experimental.storage_backend.nixl_backend)
[2025-10-31 21:08:42,430] LMCache DEBUG: Observers processing in 0.4156 ms (nixl_connector_v2.py:705:lmcache.experimental.storage_backend.connector.nixl_connector_v2)
[2025-10-31 21:08:42,430] LMCache DEBUG: Receiver acked the data with new UUID: c82b30b8dd7b4afab7ace46ba3d405a4 (nixl_connector_v2.py:436:lmcache.experimental.storage_backend.connector.nixl_connector_v2)
INFO:     10.244.1.183:54698 - "POST /v1/completions HTTP/1.1" 400 Bad Request
...
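
For comparison, the same request with max_tokens set explicitly presumably avoids this path, since the failure only shows up when max_tokens is omitted. A hedged sketch of that request using httpx (the max_tokens value is arbitrary):

# Hedged workaround sketch, equivalent to the curl call above but with an
# explicit max_tokens; omitting max_tokens is what triggers the failure.
import httpx

resp = httpx.post(
    "http://localhost:30080/v1/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "Explain the origin of Llamas?",
        "max_tokens": 100,  # arbitrary example value
    },
    timeout=120.0,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])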

To Reproduce

I had to work around #746 (comment) to get the PD setup running in the first place; that issue describes how to get the setup up and running.

Expected behavior

I see a completion response instead of the request failing.

Additional context

No response
