@yuxinshan yuxinshan commented Nov 28, 2025

[RFC]: Elastic Scaling Support for P/D Instances Based on KV Pool: #3380

What this PR does / why we need it?

This file provides an elastic proxy demo that supports elastic scaling of P/D (prefill/decode) instances based on a KV pool.

Launch multiple vLLM instances (for example, 2 for prefill and 2 for decode), then
start the proxy demo with:

export ADMIN_API_KEY=YOUR_ADMIN_API_KEY
python3 examples/elastic_scaling/elastic_proxy.py  \
   --model $model_name  \
   --prefill localhost:8100 localhost:8101   \
   --decode localhost:8200 localhost:8201   \
   --port 8000

Supported API routes

  • /v1/completions: returns the response to a completions request.
  • /v1/chat/completions: returns the response to a chat completions request.
  • /status: returns the current lists of prefill nodes and decode nodes.
  • /instances/add: adds prefill nodes or decode nodes to the list.
  • /instances/remove: removes prefill nodes or decode nodes from the list.
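
As an illustration of how a client might consume the /status route, here is a minimal sketch using only the Python standard library. It assumes the proxy listens on localhost:8000 (the port from the launch command), that /status is served over GET, and that the payload carries the fields shown in the test output of this description. The helper names `fetch_status` and `summarize_status` are hypothetical and are not part of the proxy.

```python
import json
from urllib.request import urlopen


def fetch_status(base_url: str = "http://localhost:8000") -> dict:
    """Fetch and decode the /status payload (assumes a GET route)."""
    with urlopen(f"{base_url}/status", timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize_status(status: dict) -> str:
    """Render the prefill and decode node lists as a single line."""
    prefill_count = status.get("prefill_node_count", 0)
    decode_count = status.get("decode_node_count", 0)
    prefill = ", ".join(status.get("prefill_nodes", []))
    decode = ", ".join(status.get("decode_nodes", []))
    return f"prefill[{prefill_count}]: {prefill} | decode[{decode_count}]: {decode}"
```

With a running proxy, `print(summarize_status(fetch_status()))` would print one line listing the nodes currently registered on each side.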

Supported features

  • Adding prefill nodes or decode nodes at any time.
    • If a prefill or decode server is already deployed, the proxy can add it as a node when the proxy itself is deployed.
    • If a prefill or decode server is deployed after the proxy, the server can use the /instances/add API to join the proxy server. The prefill or decode server sends a signal to the proxy server, and the proxy server checks the status of the node until the node is available.
  • Removing nodes in the following two situations:
    • A node is removed when its prefill or decode server fails more than a certain number of times.
    • A node can be deleted from the proxy server through the /instances/remove API.
  • Elastic scaling.
    • When the current node is unavailable, the proxy server schedules the request to the next available node.
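
The scheduling and removal behavior described in the list above can be sketched roughly as follows. This is not the code in elastic_proxy.py. It is a minimal illustration that assumes round-robin selection, an injected availability check, and a failure threshold; the class `NodePool`, its methods, and the default `max_failures=3` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class NodePool:
    """Round-robin pool that skips unavailable nodes and evicts failing ones."""

    nodes: List[str]
    max_failures: int = 3  # hypothetical threshold, not taken from the PR
    _failures: Dict[str, int] = field(default_factory=dict)
    _cursor: int = 0

    def pick(self, is_available: Callable[[str], bool]) -> str:
        """Return the next available node, trying each node at most once."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._cursor % len(self.nodes)]
            self._cursor += 1
            if is_available(node):
                return node
        raise RuntimeError("no available node")

    def report_failure(self, node: str) -> bool:
        """Count one failure; remove the node once it exceeds max_failures.

        Returns True if the node was removed from the pool.
        """
        count = self._failures.get(node, 0) + 1
        self._failures[node] = count
        if count > self.max_failures and node in self.nodes:
            self.nodes.remove(node)
            del self._failures[node]
            return True
        return False

    def report_success(self, node: str) -> None:
        """Reset the failure counter after a successful request."""
        self._failures.pop(node, None)
```

Resetting the counter on success is a design choice made for this sketch, so that only repeated failures lead to removal; the PR description does not say whether the proxy counts consecutive or cumulative failures.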

Does this PR introduce any user-facing change?

None

How was this patch tested?

Deploy the proxy server and check the responses to the following requests:

  • /status
{"prefill_node_count":x,"decode_node_count":x,"prefill_nodes":[xx.xx.xx.xx:xxxx],"decode_nodes":[xx.xx.xx.xx:xxxx]}
  • /instances/add

Case 1: If the node is not available, the server waits for the node to become available:

{"message":"Waiting for prefill_instance xx.xx.xx.xx:xxxx to start."}

Case 2: If the node is available, try to add the node to the server:

{"message":"Added xx.xx.xx.xx:xxxx to prefill_instances."}
  • /instances/remove

Case 1: If the node is in the corresponding nodes list:

{"message":"Removed xx.xx.xx.xx:xxxx from prefill_instances."}

Case 2: If the node is not in the corresponding nodes list:

{"message":"Instance xx.xx.xx.xx:xxxx is not in the prefill_instances."}
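
The four add/remove cases above can be reproduced with a small in-memory sketch. The message strings are copied from the responses shown in this description. Everything else is an assumption made for illustration: the `InstanceRegistry` class, the injected `is_healthy` callable, the `pending` list, and the idea that decode instances get analogous messages. The actual proxy may be structured differently.

```python
from typing import Callable, Dict, List


class InstanceRegistry:
    """In-memory model of the add/remove responses (hypothetical helper)."""

    def __init__(self, is_healthy: Callable[[str], bool]):
        self.is_healthy = is_healthy
        self.instances: Dict[str, List[str]] = {"prefill": [], "decode": []}
        # Nodes that asked to join but did not pass the health check yet.
        self.pending: Dict[str, List[str]] = {"prefill": [], "decode": []}

    def add(self, kind: str, addr: str) -> dict:
        """Register addr under kind ('prefill' or 'decode') if it is healthy."""
        if not self.is_healthy(addr):
            if addr not in self.pending[kind]:
                self.pending[kind].append(addr)
            return {"message": f"Waiting for {kind}_instance {addr} to start."}
        if addr in self.pending[kind]:
            self.pending[kind].remove(addr)
        if addr not in self.instances[kind]:
            self.instances[kind].append(addr)
        return {"message": f"Added {addr} to {kind}_instances."}

    def remove(self, kind: str, addr: str) -> dict:
        """Remove addr from kind, or report that it was not registered."""
        if addr in self.instances[kind]:
            self.instances[kind].remove(addr)
            return {"message": f"Removed {addr} from {kind}_instances."}
        return {"message": f"Instance {addr} is not in the {kind}_instances."}
```

Passing the health check in as a callable keeps the sketch free of network calls, which makes each case easy to exercise in a unit test.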

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@github-actions github-actions bot added the documentation label (Improvements or additions to documentation) on Nov 28, 2025

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an elastic proxy for vLLM to support scaling of prefill and decode instances. The implementation uses FastAPI and provides several API endpoints for completions, status checks, and dynamic instance management. While this is a great feature addition, the current implementation has several critical issues that need to be addressed. These include a blocking call in the server's __init__, incorrect logic for handling instance removal and health checks, and fundamental flaws in handling streaming responses which will cause requests to fail. Additionally, there's a bug in route registration for adding new instances. These issues will prevent the proxy from functioning correctly.

@CalvinXKY

Describe the relevant RFC that was added, and include the corresponding test output in the description.

Signed-off-by: yuxinshan <[email protected]>
Signed-off-by: CalvinXKY <[email protected]>