### Describe the bug

When `max_tokens` is not provided in the request, the router throws an exception:
```
➜ curl http://localhost:30080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "prompt": "Explain the origin of Llamas?"
    }' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   110    0     0  100   110      0     62  0:00:01  0:00:01 --:--:--    62
curl: (18) transfer closed with outstanding read data remaining
```
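For comparison, the same request with an explicit `max_tokens` should complete normally, since the failure is tied to the missing parameter (the value 100 below is arbitrary, added only for illustration):

```
curl http://localhost:30080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "prompt": "Explain the origin of Llamas?",
      "max_tokens": 100
    }' | jq
```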
On the router side I see this:

```
✗ kubectl logs -f pd-deployment-router-6589f888d6-zm9cb
...
INFO: 10.224.0.5:46048 - "GET /health HTTP/1.1" 200 OK
[2025-10-31 21:08:42,434] INFO: Prefiller time (TTFT): 0.5266 (request.py:308:vllm_router.services.request_service.request)
INFO: 127.0.0.1:40698 - "POST /v1/completions HTTP/1.1" 200 OK
ERROR: Exception in ASGI application
+ Exception Group Traceback (most recent call last):
| File "/usr/local/lib/python3.12/site-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
| result = await app( # type: ignore[func-returns-value]
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/usr/local/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
| return await self.app(scope, receive, send)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/usr/local/lib/python3.12/site-packages/fastapi/applications.py", line 1054, in __call__
| await super().__call__(scope, receive, send)
| File "/usr/local/lib/python3.12/site-packages/starlette/applications.py", line 112, in __call__
| await self.middleware_stack(scope, receive, send)
| File "/usr/local/lib/python3.12/site-packages/starlette/middleware/errors.py", line 187, in __call__
| raise exc
| File "/usr/local/lib/python3.12/site-packages/starlette/middleware/errors.py", line 165, in __call__
| await self.app(scope, receive, _send)
| File "/usr/local/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
| await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
| File "/usr/local/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
| raise exc
| File "/usr/local/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
| await app(scope, receive, sender)
| File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 715, in __call__
| await self.middleware_stack(scope, receive, send)
| File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 735, in app
| await route.handle(scope, receive, send)
| File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 288, in handle
| await self.app(scope, receive, send)
| File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 76, in app
| await wrap_app_handling_exceptions(app, request)(scope, receive, send)
| File "/usr/local/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
| raise exc
| File "/usr/local/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
| await app(scope, receive, sender)
| File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 74, in app
| await response(scope, receive, send)
| File "/usr/local/lib/python3.12/site-packages/starlette/responses.py", line 261, in __call__
| async with anyio.create_task_group() as task_group:
| ^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/usr/local/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 772, in __aexit__
| raise BaseExceptionGroup(
| ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
+-+---------------- 1 ----------------
| Traceback (most recent call last):
| File "/usr/local/lib/python3.12/site-packages/starlette/responses.py", line 264, in wrap
| await func()
| File "/usr/local/lib/python3.12/site-packages/starlette/responses.py", line 245, in stream_response
| async for chunk in self.body_iterator:
| File "/usr/local/lib/python3.12/site-packages/vllm_router/services/request_service/request.py", line 312, in generate_stream
| async for chunk in stream_service_response(
| File "/usr/local/lib/python3.12/site-packages/vllm_router/services/request_service/request.py", line 287, in stream_service_response
| response.raise_for_status()
| File "/usr/local/lib/python3.12/site-packages/httpx/_models.py", line 829, in raise_for_status
| raise HTTPStatusError(message, request=request, response=self)
| httpx.HTTPStatusError: Client error '400 Bad Request' for url 'http://10.244.1.164:8000/v1/completions'
| For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400
+------------------------------------
INFO: 10.224.0.5:37106 - "GET /health HTTP/1.1" 200 OK
...
```
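The traceback shows the failure mode: the router has already sent `200 OK` and started streaming when `response.raise_for_status()` in `stream_service_response` hits the upstream `400`, so the chunked body is aborted mid-stream and the client sees curl error 18 instead of an error status. Below is a minimal sketch of that proxy pattern, assuming httpx and Starlette as in the stack above; it is illustrative, not the actual vllm_router source (the upstream URL is taken from the log):

```python
import httpx
from fastapi import FastAPI, Request
from starlette.responses import StreamingResponse

app = FastAPI()
client = httpx.AsyncClient()

@app.post("/v1/completions")
async def proxy(request: Request):
    payload = await request.json()

    async def body():
        # By the time this generator runs, Starlette has already sent the
        # "200 OK" response start -- an exception here cannot change the
        # status code, it can only abort the chunked body mid-stream,
        # which the client reports as curl error 18.
        async with client.stream(
            "POST", "http://10.244.1.164:8000/v1/completions", json=payload
        ) as resp:
            resp.raise_for_status()  # raises httpx.HTTPStatusError on the 400
            async for chunk in resp.aiter_bytes():
                yield chunk

    return StreamingResponse(body(), media_type="application/json")
```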
On the prefill side:

```
✗ kubectl logs pd-llama-prefill-deployment-vllm-59d64d89bd-jd5zw
...
INFO: 10.224.0.5:36338 - "GET /health HTTP/1.1" 200 OK
INFO 10-31 21:08:41 [logger.py:39] Received request cmpl-82a2f7b4642147c0b2e9c6445d772794-0: prompt: 'Explain the origin of Llamas?', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.9, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: [128000, 849, 21435, 279, 6371, 315, 445, 24705, 300, 30], lora_request: None, prompt_adapter_request: None.
INFO 10-31 21:08:41 [async_llm.py:256] Added request cmpl-82a2f7b4642147c0b2e9c6445d772794-0.
[2025-10-31 21:08:41,928] LMCache INFO: Storing KV cache for 10 out of 10 tokens for request cmpl-82a2f7b4642147c0b2e9c6445d772794-0 (vllm_v1_adapter.py:634:lmcache.integration.vllm.vllm_v1_adapter)
[2025-10-31 21:08:41,928] LMCache DEBUG: Sent the request with 1 keys (nixl_connector_v2.py:532:lmcache.experimental.storage_backend.connector.nixl_connector_v2)
[2025-10-31 21:08:41,929] LMCache DEBUG: Received ACK from remote peer with UUID: 88dbe307e3af41e287bc7c7cd79a0678 (nixl_connector_v2.py:382:lmcache.experimental.storage_backend.connector.nixl_connector_v2)
[2025-10-31 21:08:41,929] LMCache DEBUG: Committing write of 1.25 MB with 1 transfers (nixl_connector_v2.py:320:lmcache.experimental.storage_backend.connector.nixl_connector_v2)
[2025-10-31 21:08:42,428] LMCache DEBUG: Transfer 88dbe307e3af41e287bc7c7cd79a0678 completed in 499.1316 ms, creating the transfer: 0.0224 ms, transfer time: 499.1092 ms, pure transfer throughput: 0.0024 GB/s (nixl_connector_v2.py:349:lmcache.experimental.storage_backend.connector.nixl_connector_v2)
[2025-10-31 21:08:42,429] LMCache INFO: Store 10 tokens takes: 500.5628 ms, throughput: 0.0024 GB/s; offload_time: 0.3823 ms, put_time: 499.8056 ms (cache_engine.py:191:lmcache.experimental.cache_engine)
[2025-10-31 21:08:42,431] LMCache WARNING: In connector.start_load_kv, but the attn_metadata is None (vllm_v1_adapter.py:424:lmcache.integration.vllm.vllm_v1_adapter)
INFO: 10.244.1.183:56036 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 10.224.0.5:34448 - "GET /health HTTP/1.1" 200 OK
...
```
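Note the `max_tokens=1` in the prefill request above: the router rewrites the prefill copy of the request so the prefill worker only runs one step to populate the KV cache. A hypothetical sketch of that split follows (the `max_tokens=1` rewrite is visible in the log; the rest is an assumption about the router's logic, not its actual implementation):

```python
import copy

def split_pd_request(request: dict) -> tuple[dict, dict]:
    """Split one completions request into a prefill leg and a decode leg."""
    prefill_req = copy.deepcopy(request)
    prefill_req["max_tokens"] = 1        # prefill only needs to build the KV cache
    decode_req = copy.deepcopy(request)  # decode keeps the caller's sampling params,
                                         # including a *missing* max_tokens here
    return prefill_req, decode_req
```

This would explain why the prefill leg always succeeds (it always gets `max_tokens=1`) while the decode leg, forwarded without `max_tokens`, is the one that gets rejected.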
On the decode side:

```
✗ kubectl logs pd-llama-decode-deployment-vllm-77fc57b64d-k5stp
...
INFO: 10.224.0.5:55424 - "GET /health HTTP/1.1" 200 OK
[2025-10-31 21:08:41,929] LMCache DEBUG: Received request with 1 keys from sender sender-7902872ed1a2487792bd022552947934 (nixl_connector_v2.py:744:lmcache.experimental.storage_backend.connector.nixl_connector_v2)
[2025-10-31 21:08:42,429] LMCache DEBUG: Transfer for UUID '88dbe307e3af41e287bc7c7cd79a0678' completed on the remote side (NixlRole.SENDERcuda:0) (nixl_connector_v2.py:423:lmcache.experimental.storage_backend.connector.nixl_connector_v2)
[2025-10-31 21:08:42,429] LMCache DEBUG: Nixl Observer: clone time: 0.1898 msec, Add time: 0.0260 msec for 1 objects (nixl_backend.py:226:lmcache.experimental.storage_backend.nixl_backend)
[2025-10-31 21:08:42,430] LMCache DEBUG: Observers processing in 0.4156 ms (nixl_connector_v2.py:705:lmcache.experimental.storage_backend.connector.nixl_connector_v2)
[2025-10-31 21:08:42,430] LMCache DEBUG: Receiver acked the data with new UUID: c82b30b8dd7b4afab7ace46ba3d405a4 (nixl_connector_v2.py:436:lmcache.experimental.storage_backend.connector.nixl_connector_v2)
INFO: 10.244.1.183:54698 - "POST /v1/completions HTTP/1.1" 400 Bad Request
...
```
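So the `400 Bad Request` originates at the decode vLLM server, and the router only surfaces it after committing a `200` to the client. A defensive pattern for the router would be to probe the upstream status before starting the client-facing stream; the sketch below is purely illustrative (a hypothetical `forward` helper, not the vllm_router API):

```python
import httpx
from fastapi.responses import JSONResponse
from starlette.responses import StreamingResponse

async def forward(client: httpx.AsyncClient, url: str, payload: dict):
    ctx = client.stream("POST", url, json=payload)
    resp = await ctx.__aenter__()  # upstream headers are available here, body is not
    if resp.is_error:
        # Surface the upstream error as-is instead of a truncated 200 stream.
        detail = await resp.aread()
        await ctx.__aexit__(None, None, None)
        return JSONResponse(status_code=resp.status_code,
                            content={"error": detail.decode(errors="replace")})

    async def body():
        try:
            async for chunk in resp.aiter_bytes():
                yield chunk
        finally:
            await ctx.__aexit__(None, None, None)

    return StreamingResponse(body(), media_type="application/json")
```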
### To Reproduce

To get the PD setup running in the first place, I had to work around another bug: #746 (comment). That issue describes how to get the setup running.
### Expected behavior

I see a completion response instead of the connection being closed mid-transfer.
### Additional context

No response