
Commit 3e5ae49

[MM][Doc] Update online serving tutorials for Qwen2-Audio (#3606)
### What this PR does / why we need it?

Update online serving tutorials for `Qwen2-Audio`. Part of #3508.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: shen-shanshan <[email protected]>
1 parent d8ca7fe commit 3e5ae49

File tree

5 files changed: +245 −125 lines changed


docs/source/tutorials/index.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -4,8 +4,8 @@
 :caption: Deployment
 :maxdepth: 1
 single_npu
-single_npu_multimodal
-single_npu_audio
+single_npu_qwen2.5_vl
+single_npu_qwen2_audio
 single_npu_qwen3_embedding
 single_npu_qwen3_quantization
 multi_npu_qwen3_next
```

docs/source/tutorials/single_npu_audio.md

Lines changed: 0 additions & 123 deletions
This file was deleted.
File renamed without changes.

docs/source/tutorials/single_npu_qwen2_audio.md

Lines changed: 220 additions & 0 deletions
@@ -0,0 +1,220 @@
# Single NPU (Qwen2-Audio 7B)

## Run vllm-ascend on Single NPU

### Offline Inference on Single NPU

Run docker container:

```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```

Set up environment variables:

```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True

# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```

:::{note}
`max_split_size_mb` prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details [<u>here</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html).
:::

Install packages required for audio processing:

```bash
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install librosa soundfile
```

Run the following script to execute offline inference on a single NPU:

```python
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

# If network issues prevent AudioAsset from fetching remote audio files, retry or check your network.
audio_assets = [AudioAsset("mary_had_lamb"), AudioAsset("winning_call")]
question_per_audio_count = {
    1: "What is recited in the audio?",
    2: "What sport and what nursery rhyme are referenced?"
}


def prepare_inputs(audio_count: int):
    audio_in_prompt = "".join([
        f"Audio {idx+1}: <|audio_bos|><|AUDIO|><|audio_eos|>\n"
        for idx in range(audio_count)
    ])
    question = question_per_audio_count[audio_count]
    prompt = ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
              "<|im_start|>user\n"
              f"{audio_in_prompt}{question}<|im_end|>\n"
              "<|im_start|>assistant\n")

    mm_data = {
        "audio":
        [asset.audio_and_sample_rate for asset in audio_assets[:audio_count]]
    }

    # Merge text prompt and audio data into inputs
    inputs = {"prompt": prompt, "multi_modal_data": mm_data}
    return inputs


def main(audio_count: int):
    # NOTE: The default `max_num_seqs` and `max_model_len` may result in OOM on
    # lower-end GPUs.
    # Unless specified, these settings have been tested to work on a single L4.
    # `limit_mm_per_prompt`: the max num items for each modality per prompt.
    llm = LLM(model="Qwen/Qwen2-Audio-7B-Instruct",
              max_model_len=4096,
              max_num_seqs=5,
              limit_mm_per_prompt={"audio": audio_count})

    inputs = prepare_inputs(audio_count)

    sampling_params = SamplingParams(temperature=0.2,
                                     max_tokens=64,
                                     stop_token_ids=None)

    outputs = llm.generate(inputs, sampling_params=sampling_params)

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)


if __name__ == "__main__":
    audio_count = 2
    main(audio_count)
```

If the script runs successfully, you will see output like the following:

```bash
The sport referenced is baseball, and the nursery rhyme is 'Mary Had a Little Lamb'.
```
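
The audio assets above are downloaded by vLLM from a public bucket. If you would rather run inference on your own recording, `librosa` (installed earlier) can load it as a `(waveform, sample_rate)` tuple, which is the same format `asset.audio_and_sample_rate` produces for the `audio` entry of `multi_modal_data`. Below is a minimal sketch under that assumption; the file name `./my_audio.wav` is only a placeholder.

```python
import librosa
from vllm import LLM, SamplingParams

# Hypothetical local file; replace with your own audio path.
# sr=None keeps the file's native sample rate.
audio, sample_rate = librosa.load("./my_audio.wav", sr=None)

prompt = ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
          "<|im_start|>user\n"
          "Audio 1: <|audio_bos|><|AUDIO|><|audio_eos|>\n"
          "What is in this audio?<|im_end|>\n"
          "<|im_start|>assistant\n")

llm = LLM(model="Qwen/Qwen2-Audio-7B-Instruct",
          max_model_len=4096,
          max_num_seqs=5,
          limit_mm_per_prompt={"audio": 1})

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"audio": [(audio, sample_rate)]}},
    sampling_params=SamplingParams(temperature=0.2, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```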

### Online Serving on Single NPU

Currently, the `chat_template` for `Qwen2-Audio` has an issue that causes the audio placeholder to not be inserted; find more details [<u>here</u>](https://github.com/vllm-project/vllm/issues/19977).

As a workaround, we can use a custom template for online serving, which is shown below:

```jinja
{% set audio_count = namespace(value=0) %}
{% for message in messages %}
{% if loop.first and message['role'] != 'system' %}
<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n
{% endif %}
<|im_start|>{{ message['role'] }}\n
{% if message['content'] is string %}
{{ message['content'] }}<|im_end|>\n
{% else %}
{% for content in message['content'] %}
{% if 'audio' in content or 'audio_url' in content or message['type'] == 'audio' or content['type'] == 'audio' %}
{% set audio_count.value = audio_count.value + 1 %}
Audio {{ audio_count.value }}: <|audio_bos|><|AUDIO|><|audio_eos|>\n
{% elif 'text' in content %}
{{ content['text'] }}
{% endif %}
{% endfor %}
<|im_end|>\n
{% endif %}
{% endfor %}
{% if add_generation_prompt %}
<|im_start|>assistant\n
{% endif %}
```

:::{note}
You can find this template at `vllm-ascend/examples/chat_templates/template_qwen2_audio.jinja`.
:::
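
If you want to preview what this template produces before starting the server, you can render it directly with Jinja2. This is only a rough standalone sketch, not how vLLM itself applies chat templates; the template path and the example message below are assumptions.

```python
from jinja2 import Environment

# Render the custom chat template outside of vLLM, just to inspect the output.
# The path is an assumption; point it at your local copy of the template.
with open("examples/chat_templates/template_qwen2_audio.jinja") as f:
    template = Environment(trim_blocks=True, lstrip_blocks=True).from_string(f.read())

# One user turn with a single audio item plus a text question.
messages = [
    {"role": "user", "content": [
        {"type": "audio_url", "audio_url": {"url": "winning_call.ogg"}},
        {"type": "text", "text": "What is in this audio?"},
    ]},
]

prompt = template.render(messages=messages, add_generation_prompt=True)
# The rendered prompt should contain "Audio 1: <|audio_bos|><|AUDIO|><|audio_eos|>".
print(prompt)
```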

Run docker container to start the vLLM server on a single NPU:

```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen2-Audio-7B-Instruct \
--max-model-len 16384 \
--max-num-batched-tokens 16384 \
--limit-mm-per-prompt '{"audio":2}' \
--chat-template /path/to/your/vllm-ascend/examples/chat_templates/template_qwen2_audio.jinja
```

:::{note}
Replace `/path/to/your/vllm-ascend` with your own path.
:::

If your service starts successfully, you will see output like the following:

```bash
INFO: Started server process [2736]
INFO: Waiting for application startup.
INFO: Application startup complete.
```

Once your server is started, you can query the model with input prompts:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/root/.cache/modelscope/models/Qwen/Qwen2-Audio-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": [
        {"type": "audio_url", "audio_url": {"url": "https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/winning_call.ogg"}},
        {"type": "text", "text": "What is in this audio? How does it sound?"}
      ]}
    ],
    "max_tokens": 100
  }'
```

If the query succeeds, the client receives a response like the following:

```bash
{"id":"chatcmpl-31f5f698f6734a4297f6492a830edb3f","object":"chat.completion","created":1761097383,"model":"/root/.cache/modelscope/models/Qwen/Qwen2-Audio-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The audio contains a background of a crowd cheering, a ball bouncing, and an object being hit. A man speaks in English saying 'and the o one pitch on the way to edgar martinez swung on and lined out.' The speech has a happy mood.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":689,"total_tokens":743,"completion_tokens":54,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
```
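
You can also query the server from Python with the OpenAI-compatible client instead of `curl`. A minimal sketch, assuming the `openai` package is installed and the server is reachable at `http://localhost:8000`; the model name matches the one used in the `curl` example above.

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server ignores the API key, but the client requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="/root/.cache/modelscope/models/Qwen/Qwen2-Audio-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "audio_url",
             "audio_url": {"url": "https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/winning_call.ogg"}},
            {"type": "text", "text": "What is in this audio? How does it sound?"},
        ]},
    ],
    max_tokens=100,
)
print(response.choices[0].message.content)
```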

examples/chat_templates/template_qwen2_audio.jinja

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
{% set audio_count = namespace(value=0) %}
{% for message in messages %}
{% if loop.first and message['role'] != 'system' %}
<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n
{% endif %}
<|im_start|>{{ message['role'] }}\n
{% if message['content'] is string %}
{{ message['content'] }}<|im_end|>\n
{% else %}
{% for content in message['content'] %}
{% if 'audio' in content or 'audio_url' in content or message['type'] == 'audio' or content['type'] == 'audio' %}
{% set audio_count.value = audio_count.value + 1 %}
Audio {{ audio_count.value }}: <|audio_bos|><|AUDIO|><|audio_eos|>\n
{% elif 'text' in content %}
{{ content['text'] }}
{% endif %}
{% endfor %}
<|im_end|>\n
{% endif %}
{% endfor %}
{% if add_generation_prompt %}
<|im_start|>assistant\n
{% endif %}
