|
| 1 | +This file provides an elastic proxy demo that supports elastic scaling of P/D (prefill/decode) instances based on a KV pool. |
| 2 | + |
| 3 | +Launch multiple vLLM instances (for example, 2 for prefill and 2 for decode), then |
| 4 | +start this proxy demo with: |
| 5 | + |
| 6 | +```shell |
| 7 | +export ADMIN_API_KEY=YOUR_ADMIN_API_KEY |
| 8 | +python3 examples/elastic_scaling/elastic_proxy.py \ |
| 9 | + --model $model_name \ |
| 10 | + --prefill localhost:8100 localhost:8101 \ |
| 11 | + --decode localhost:8200 localhost:8201 \ |
| 12 | + --port 8000 |
| 13 | +``` |
| 14 | + |
| 15 | +### Supported API routes |
| 16 | +* `/v1/completions`: send a completion request and return the response. |
| 17 | +* `/v1/chat/completions`: send a chat completion request and return the response. |
| 18 | +* `/status`: return the current lists of prefill nodes and decode nodes. |
| 19 | +* `/instances/add`: add a prefill node or a decode node to the list. |
| 20 | + |
| 21 | +Examples: |
| 22 | +```shell |
| 23 | +# /v1/completions |
| 24 | +curl -X POST http://0.0.0.0:8000/v1/completions \ |
| 25 | +-H "Content-Type: application/json" \ |
| 26 | +-d '{"model": "'$model_name'", "max_tokens": 50, "prompt": "hello"}' |
| 27 | + |
| 28 | +# /v1/chat/completions |
| 29 | +curl -X POST http://0.0.0.0:8000/v1/chat/completions \ |
| 30 | +-H "Content-Type: application/json" \ |
| 31 | +-d '{"model": "'$model_name'", "max_tokens": 50, |
| 32 | + "messages": [{ |
| 33 | + "role": "user", |
| 34 | + "content": "hello" |
| 35 | + }]}' |
| 36 | + |
| 37 | +# /status |
| 38 | +curl -X POST http://0.0.0.0:8000/status |
| 39 | + |
| 40 | +# /instances/add |
| 41 | +curl -X POST http://0.0.0.0:8000/instances/add \ |
| 42 | +-H "Content-Type: application/json" \ |
| 43 | +-H "X-Api-Key: YOUR_ADMIN_API_KEY" \ |
| 44 | +-d '{"type": "prefill", "instance": "0.0.0.0:8100"}' |
| 45 | +``` |
| 46 | + |
| 47 | +### Supported functions |
| 48 | + |
| 49 | +* Prefill nodes or decode nodes can be added at any time. |
| 50 | + - If a prefill or decode server is already deployed, the proxy can register it when the proxy itself is launched. |
| 51 | + - If a prefill or decode server is deployed after the proxy, it can join through the `/instances/add` API. The prefill or decode server sends a request to the proxy server, and the proxy server then checks the node's status until the node is available. |
| 52 | +* Nodes are removed when a prefill or decode server fails more than a certain number of times. |
| 53 | +* Elastic scaling: when the current node is unavailable, the proxy server schedules the request to the next available node. |
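
The removal and failover behavior listed above can be illustrated with a minimal sketch. This is not the code in `elastic_proxy.py`: the names `NodePool`, `max_failures`, and `dispatch` are hypothetical, and the failure threshold and round-robin order are assumptions made only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class NodePool:
    """Hypothetical pool: round-robin order, drops a node after repeated failures."""

    nodes: list[str]
    max_failures: int = 3  # assumed threshold, not taken from the proxy
    _failures: dict[str, int] = field(default_factory=dict, init=False)
    _cursor: int = field(default=0, init=False)

    def add(self, node: str) -> None:
        # Register a node (e.g. after an /instances/add request); reset its failure count.
        if node not in self.nodes:
            self.nodes.append(node)
        self._failures.pop(node, None)

    def rotation(self) -> list[str]:
        # Return a snapshot of the nodes, starting at the round-robin cursor.
        if not self.nodes:
            return []
        start = self._cursor % len(self.nodes)
        self._cursor += 1
        return self.nodes[start:] + self.nodes[:start]

    def report(self, node: str, ok: bool) -> None:
        # A success clears the failure count; too many failures remove the node.
        if ok:
            self._failures.pop(node, None)
            return
        self._failures[node] = self._failures.get(node, 0) + 1
        if self._failures[node] >= self.max_failures and node in self.nodes:
            self.nodes.remove(node)


def dispatch(pool: NodePool, send: Callable[[str], bool]) -> str:
    """Try nodes in order until `send(node)` succeeds; return the node that served it."""
    for node in pool.rotation():
        ok = send(node)
        pool.report(node, ok)
        if ok:
            return node
    raise RuntimeError("no available node")
```

In this sketch `send` stands in for forwarding a request to a prefill or decode server and reporting whether it succeeded. A node that keeps failing is skipped in favor of the next one and is dropped from the pool once it reaches `max_failures`; calling `add` registers it again.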