[feat] add elastic proxy #4545
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces an elastic proxy for vLLM to support scaling of prefill and decode instances. The implementation uses FastAPI and provides several API endpoints for completions, status checks, and dynamic instance management. While this is a great feature addition, the current implementation has several critical issues that need to be addressed. These include a blocking call in the server's __init__, incorrect logic for handling instance removal and health checks, and fundamental flaws in handling streaming responses which will cause requests to fail. Additionally, there's a bug in route registration for adding new instances. These issues will prevent the proxy from functioning correctly.
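The blocking call in `__init__` called out above can be sidestepped by keeping construction cheap and performing the wait in an async task instead. Below is a minimal sketch using only stdlib `asyncio`; the `Proxy` and `wait_until_ready` names are illustrative and not the PR's actual code:

```python
import asyncio


class Proxy:
    """Illustrative proxy skeleton: __init__ stays non-blocking."""

    def __init__(self) -> None:
        self.ready = False
        # Do NOT poll backend instances here; just record initial state.

    async def wait_until_ready(self, probe, interval: float = 0.01) -> None:
        # Poll the async probe until it reports the backend is up.
        while not await probe():
            await asyncio.sleep(interval)
        self.ready = True


async def main() -> bool:
    attempts = {"n": 0}

    async def probe() -> bool:
        # Hypothetical health probe: reports healthy on the 3rd check.
        attempts["n"] += 1
        return attempts["n"] >= 3

    proxy = Proxy()
    await proxy.wait_until_ready(probe)
    return proxy.ready


print(asyncio.run(main()))  # True
```

In a FastAPI app, this kind of wait would typically be scheduled from a startup hook rather than the constructor, so the event loop is never blocked.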
Describe the relevant RFC that was added, and include the corresponding test output in the description.
[RFC]: Elastic Scaling Support for P/D Instances Based on KV Pool: #3380
What this PR does / why we need it?
This file provides an elastic proxy demo to support elastic scaling for P/D instances based on KV pool.
We can launch multiple vLLM instances (2 for prefill and 2 for decode), and then
launch this proxy demo through:
Supported API routes:

- `/v1/completions`: get completions request response.
- `/v1/chat/completions`: get chat request response.
- `/status`: get the supported prefill nodes and decode nodes list.
- `/instances/add`: add prefill nodes or decode nodes to the list.
- `/instances/remove`: remove prefill nodes or decode nodes from the list.

Supported functions:

- `/instances/add` API to join the proxy server. The prefill server or decode server sends a signal to the proxy server, and the proxy server will check the status of the node until the node is available.
- `/instances/remove` API to delete the node from the proxy server.

Does this PR introduce any user-facing change?
None
How was this patch tested?
Deploy the proxy server and check the request responses:

`/status`

`/instances/add`

Case 1: If the node is not available, the server waits for the node to become available:

Case 2: If the node is available, try to add the node to the server:
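The two add cases above can be modeled as a bounded poll against a health check. A hedged sketch; `wait_for_node`, the timeout values, and the pluggable `check` callable are illustrative and not from this PR:

```python
import time


def wait_for_node(check, timeout: float = 1.0, interval: float = 0.05) -> bool:
    """Poll check() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True  # node available: safe to add it to the list
        time.sleep(interval)
    return False  # node never became available within the timeout
```

For example, a node whose health check starts failing and later succeeds is eventually added, while a node that never responds makes `wait_for_node` give up once the timeout expires (the PR's demo, by contrast, appears to wait indefinitely).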
`/instances/remove`

Case 1: If the node is in the corresponding nodes list:

Case 2: If the node is not in the corresponding nodes list:
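The two removal cases above boil down to whether the address is present in the role's node list. A plain-Python sketch of such an in-memory registry (the names `REGISTRY`, `add_instance`, and `remove_instance` are assumptions, not the PR's actual code):

```python
from typing import Dict, List

# Hypothetical in-memory registry mirroring the proxy's prefill/decode lists.
REGISTRY: Dict[str, List[str]] = {"prefill": [], "decode": []}


def add_instance(role: str, addr: str) -> bool:
    """Add a node to the prefill or decode list (no duplicates)."""
    nodes = REGISTRY[role]
    if addr in nodes:
        return False
    nodes.append(addr)
    return True


def remove_instance(role: str, addr: str) -> bool:
    """Remove a node; returns False when it was not registered (Case 2)."""
    nodes = REGISTRY[role]
    if addr not in nodes:
        return False
    nodes.remove(addr)
    return True


def status() -> Dict[str, List[str]]:
    """Snapshot of the current prefill/decode node lists, as /status would report."""
    return {role: list(nodes) for role, nodes in REGISTRY.items()}
```

The boolean return values map directly onto the two cases: `True` when the node was in the corresponding list and was removed, `False` when it was not.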