Commit ab51fce

mazhixin000 and mazhixin authored
[Doc]Add single node PD disaggregation instructions (#4337)
### What this PR does / why we need it?
Add single node PD disaggregation instructions for the Qwen2.5-VL model.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main: vllm-project/vllm@2918c1b

---------

Signed-off-by: mazhixin <[email protected]>
Signed-off-by: mazhixin000 <[email protected]>
Co-authored-by: mazhixin <[email protected]>
1 parent ea3372f commit ab51fce

2 files changed: +174, -0 lines changed

docs/source/tutorials/index.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -9,6 +9,7 @@ single_npu_qwen2_audio
 single_npu_qwen3_embedding
 single_npu_qwen3_quantization
 single_npu_qwen3_w4a4
+single_node_pd_disaggregation_llmdatadist
 multi_npu_qwen3_next
 multi_npu
 multi_npu_moge
```
docs/source/tutorials/single_node_pd_disaggregation_llmdatadist.md

Lines changed: 173 additions & 0 deletions
@@ -0,0 +1,173 @@
# Prefill-Decode Disaggregation LLMDataDist Verification (Qwen2.5-VL)

## Getting Started

vLLM-Ascend now supports prefill-decode (PD) disaggregation. This guide walks through the steps to verify the feature with constrained resources.

Using the Qwen2.5-VL-7B-Instruct model as an example, we deploy the "1P1D" (one prefiller, one decoder) architecture with vllm-ascend v0.11.0rc1 (with vLLM v0.11.0) on a single Atlas 800T A2 server. Assume the node's IP address is 192.0.0.1.
## Verify Communication Environment

### Verification Process

1. Single Node Verification:

Execute the following commands in sequence. The results must all be `success` and the status must be `UP`:

```bash
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g; done
# View NPU network configuration
cat /etc/hccn.conf
```
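For reference, on a healthy node the link and network-health queries typically report output like the following (the exact wording may vary across CANN/driver versions):

```bash
# Example for NPU 0; repeat for each device as in the loops above
hccn_tool -i 0 -link -g         # expected: link status: UP
hccn_tool -i 0 -net_health -g   # expected: net health status: Success
```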
2. Get NPU IP Addresses

```bash
for i in {0..7}; do hccn_tool -i $i -ip -g; done
```
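Each query prints the device's IP configuration, usually in the form below (the addresses are placeholders; record your own, since these device IPs are what the rank table refers to):

```bash
# Example for NPU 0; the output usually resembles:
#   ipaddr:192.168.100.101
#   netmask:255.255.255.0
hccn_tool -i 0 -ip -g
```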
## Generate Ranktable
The rank table is a JSON file that specifies the mapping of Ascend NPU ranks to nodes. For more details, please refer to the [vllm-ascend examples](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/README.md). Execute the following commands to generate it:

```shell
cd vllm-ascend/examples/disaggregate_prefill_v1/
bash gen_ranktable.sh --ips 192.0.0.1 \
  --npus-per-node 2 --network-card-name eth0 --prefill-device-cnt 1 --decode-device-cnt 1
```

The rank table will be generated at `/vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json`.

| Parameter | Meaning |
| --- | --- |
| --ips | Each node's local IP address (list prefiller nodes before decoder nodes) |
| --npus-per-node | Number of NPU chips on each node |
| --network-card-name | The NIC name of the physical machine |
| --prefill-device-cnt | Number of NPU chips used for prefill |
| --decode-device-cnt | Number of NPU chips used for decode |
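After generation, it can help to inspect the file and confirm it contains an entry for each prefill and decode device; the device IPs should match those returned by `hccn_tool` above. A minimal check:

```shell
# Pretty-print the generated rank table and eyeball the device entries
python3 -m json.tool /vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json
```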
## Prefiller/Decoder Deployment
Run the following scripts to launch a server on the prefiller NPU and the decoder NPU, respectively.
:::::{tab-set}
::::{tab-item} Prefiller

```shell
export ASCEND_RT_VISIBLE_DEVICES=0
export HCCL_IF_IP=192.0.0.1 # node ip
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH="/path/to/your/generated/ranktable.json"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_ASCEND_LLMDD_RPC_PORT=5959

vllm serve /model/Qwen2.5-VL-7B-Instruct \
  --host 0.0.0.0 \
  --port 13700 \
  --tensor-parallel-size 1 \
  --no-enable-prefix-caching \
  --seed 1024 \
  --served-model-name qwen25vl \
  --max-model-len 40000 \
  --max-num-batched-tokens 40000 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --kv-transfer-config \
  '{"kv_connector": "LLMDataDistCMgrConnector",
    "kv_buffer_device": "npu",
    "kv_role": "kv_producer",
    "kv_parallel_size": 1,
    "kv_port": "20001",
    "engine_id": "0",
    "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
  }'
```

::::

::::{tab-item} Decoder

```shell
export ASCEND_RT_VISIBLE_DEVICES=1
export HCCL_IF_IP=192.0.0.1 # node ip
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH="/path/to/your/generated/ranktable.json"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_ASCEND_LLMDD_RPC_PORT=5979

vllm serve /model/Qwen2.5-VL-7B-Instruct \
  --host 0.0.0.0 \
  --port 13701 \
  --no-enable-prefix-caching \
  --tensor-parallel-size 1 \
  --seed 1024 \
  --served-model-name qwen25vl \
  --max-model-len 40000 \
  --max-num-batched-tokens 40000 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --kv-transfer-config \
  '{"kv_connector": "LLMDataDistCMgrConnector",
    "kv_buffer_device": "npu",
    "kv_role": "kv_consumer",
    "kv_parallel_size": 1,
    "kv_port": "20001",
    "engine_id": "0",
    "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
  }'
```

::::
:::::
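Before wiring the two instances together, it can help to confirm that each server is up. vLLM's OpenAI-compatible server exposes a `/health` endpoint that returns HTTP 200 once the engine is ready (ports as configured in the launch commands above):

```shell
# Both commands should print 200 once the prefiller and decoder are ready
curl -s -o /dev/null -w "%{http_code}\n" http://192.0.0.1:13700/health
curl -s -o /dev/null -w "%{http_code}\n" http://192.0.0.1:13701/health
```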
## Example Proxy for Deployment
Run a proxy server on the same node as the prefiller service instance. You can get the proxy program from the repository's examples: [load_balance_proxy_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)

```shell
python load_balance_proxy_server_example.py \
  --host 192.0.0.1 \
  --port 8080 \
  --prefiller-hosts 192.0.0.1 \
  --prefiller-port 13700 \
  --decoder-hosts 192.0.0.1 \
  --decoder-ports 13701
```
## Verification
Verify the deployment by sending a request through the proxy server endpoint.

```shell
curl http://192.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen25vl",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
        {"type": "text", "text": "What is the text in the illustration?"}
      ]}
    ],
    "max_tokens": 100,
    "temperature": 0
  }'
```
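If the request succeeds, the response should describe the text in the image. As an optional check that both instances participated (the prefiller as KV producer and the decoder as KV consumer), you can query each server's Prometheus metrics endpoint; exact metric names may vary across vLLM versions:

```shell
# Request counters should increase on both the prefiller and the decoder
curl -s http://192.0.0.1:13700/metrics | grep request_success
curl -s http://192.0.0.1:13701/metrics | grep request_success
```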
