Commit e1bb6f4

[doc] Add Qwen2.5 tutorials (#4636)

### What this PR does / why we need it?

Add Qwen2.5 tutorial

- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: yangshihao6 <[email protected]>
Co-authored-by: wangxiyuan <[email protected]>

1 parent 332b547 commit e1bb6f4

File tree

2 files changed (+178 -0 lines)

docs/source/tutorials/Qwen2.5.md

Lines changed: 177 additions & 0 deletions
@@ -0,0 +1,177 @@

# Qwen2.5-7B-Instruct

## Introduction

Qwen2.5-7B-Instruct is the flagship instruction-tuned variant of Alibaba Cloud's Qwen2.5 LLM series. It supports a maximum context window of 128K tokens, generates up to 8K tokens, and delivers enhanced capabilities in multilingual processing, instruction following, programming, mathematical computation, and structured data handling.

This document details the complete deployment and verification workflow for the model, including supported features, environment preparation, single-node deployment, functional verification, accuracy and performance evaluation, and troubleshooting of common issues. It is designed to help users quickly complete model deployment and validation.

The `Qwen2.5-7B-Instruct` model has been supported since `vllm-ascend:v0.9.0`.

## Supported Features

Refer to [supported features](../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.

Refer to [feature guide](../user_guide/feature_guide/index.md) to get each feature's configuration.

## Environment Preparation

### Model Weight

- `Qwen2.5-7B-Instruct` (BF16 version): requires 1 × 910B4 card (32 GB × 1). [Qwen2.5-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct)

It is recommended to download the model weights to a local directory (e.g., `./Qwen2.5-7B-Instruct/`) for quick access during deployment.
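
For example, the weights can be pulled with the ModelScope CLI. This is a minimal sketch; it assumes the `modelscope` package is installed in your environment, and the exact flags may differ slightly across ModelScope CLI versions.

```shell
# Install the ModelScope CLI and download the BF16 weights to a local directory
pip install modelscope
modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct/
```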

### Installation

You can use our official Docker image and install the extra operators needed to support `Qwen2.5-7B-Instruct`.

:::::{tab-set}
:sync-group: install

::::{tab-item} A3 series
:sync: A3

Start the docker image on your node.

```{code-block} bash
:substitutions:

export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-a3
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```

::::

::::{tab-item} A2 series
:sync: A2

Start the docker image on your node.

```{code-block} bash
:substitutions:

export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```

::::
:::::
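
Once inside the container, you can optionally confirm that the NPU is visible before continuing. This is a quick sanity check; `npu-smi` is available because the `docker run` commands above mount it into the container.

```shell
# List the NPU devices visible inside the container
npu-smi info
```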

## Deployment

### Single-node Deployment

Qwen2.5-7B-Instruct supports single-node single-card deployment on the 910B4 platform. Follow these steps to start the inference service:

1. Prepare model weights: ensure the downloaded model weights are stored in the `./Qwen2.5-7B-Instruct/` directory.
2. Create and execute the deployment script (save as `deploy.sh`):

```shell
#!/bin/sh
# Use NPU card 0 only
export ASCEND_RT_VISIBLE_DEVICES=0
# Local directory holding the downloaded model weights
export MODEL_PATH=./Qwen2.5-7B-Instruct/

vllm serve ${MODEL_PATH} \
--host 0.0.0.0 \
--port 8000 \
--served-model-name qwen-2.5-7b-instruct \
--trust-remote-code \
--max-model-len 32768
```
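
After launching `deploy.sh`, wait for the server to finish loading before sending requests. A minimal readiness check is sketched below; it assumes the default host and port from the script above and uses vLLM's `/health` endpoint, which returns HTTP 200 once the engine is ready.

```shell
# Poll the health endpoint until the server reports ready (HTTP 200)
until curl -sf http://localhost:8000/health > /dev/null; do
  echo "Waiting for the vLLM server to start..."
  sleep 5
done
echo "vLLM server is ready."
```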

### Multi-node Deployment

Single-node deployment is recommended.

### Prefill-Decode Disaggregation

Not supported yet.

## Functional Verification

After starting the service, verify functionality using a `curl` request:

```shell
curl http://<IP>:<Port>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-2.5-7b-instruct",
    "prompt": "Beijing is a",
    "max_tokens": 5,
    "temperature": 0
  }'
```

A valid response (e.g., `"Beijing is a vibrant and historic capital city"`) indicates successful deployment.
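
You can also exercise the chat interface, which applies the model's chat template. The sketch below uses the OpenAI-compatible `/v1/chat/completions` endpoint exposed by the same server; replace `<IP>` and `<Port>` as above.

```shell
curl http://<IP>:<Port>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-2.5-7b-instruct",
    "messages": [
      {"role": "user", "content": "Give me a one-sentence introduction to Beijing."}
    ],
    "max_tokens": 64,
    "temperature": 0
  }'
```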

## Accuracy Evaluation

### Using AISBench

Refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md) for details.
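
As a rough sketch, an AISBench accuracy run against the service started above might look like the following. The CLI entry point and the task names (`vllm_api_general_chat`, `gsm8k_gen`) are assumptions here and may differ in your AISBench installation, so treat the linked guide as the authoritative reference.

```shell
# Assumed invocation: evaluate GSM8K via the OpenAI-compatible endpoint started above.
# Task names and flags may vary by AISBench version; see the guide linked above.
ais_bench --models vllm_api_general_chat --datasets gsm8k_gen --debug
```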
141+
142+
Results and logs are saved to `benchmark/outputs/default/`. A sample accuracy report is shown below:
143+
144+
| dataset | version | metric | mode | vllm-api-general-chat |
145+
|----- | ----- | ----- | ----- |--------------|
146+
| gsm8k | - | accuracy | gen | 75.00 |
147+
148+
## Performance
149+
150+
### Using AISBench
151+
152+
Refer to [Using AISBench for performance evaluation](../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
153+
154+
### Using vLLM Benchmark
155+
Run performance evaluation of `Qwen2.5-7B-Instruct` as an example.
156+
157+
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
158+
159+
There are three `vllm bench` subcommand:
160+
- `latency`: Benchmark the latency of a single batch of requests.
161+
- `serve`: Benchmark the online serving throughput.
162+
- `throughput`: Benchmark offline inference throughput.
163+
164+
Take the `serve` as an example. Run the code as follows.
165+
166+
```shell
167+
vllm bench serve \
168+
--model ./Qwen2.5-7B-Instruct/ \
169+
--dataset-name random \
170+
--random-input 200 \
171+
--num-prompt 200 \
172+
--request-rate 1 \
173+
--save-result \
174+
--result-dir ./perf_results/
175+
```
176+
177+
After about several minutes, you can get the performance evaluation result.

docs/source/tutorials/index.md

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ single_npu_qwen3_embedding
 single_npu_qwen3_quantization
 single_npu_qwen3_w4a4
 single_node_pd_disaggregation_mooncake
+Qwen2.5
 multi_npu_qwen3_next
 multi_npu
 multi_npu_kimi-k2-thinking
