Commit da5f2cc

[Doc] Update FAQ (#3792)
Much of the FAQ content is out of date; this PR refreshes it.

- vLLM version: v0.11.0rc3
- vLLM main: vllm-project/vllm@c9461e0

Signed-off-by: wangxiyuan <[email protected]>
1 parent 00aa0bf commit da5f2cc

File changed: docs/source/faqs.md (+27 -29 lines)

@@ -3,7 +3,7 @@
 ## Version Specific FAQs
 
 - [[v0.9.1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/2643)
-- [[v0.11.0rc1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/3222)
+- [[v0.11.0rc0] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/3222)
 
 ## General FAQs
 
@@ -27,12 +27,14 @@ From a technical view, vllm-ascend support would be possible if the torch-npu is
 
 You can get our containers at `Quay.io`, e.g., [<u>vllm-ascend</u>](https://quay.io/repository/ascend/vllm-ascend?tab=tags) and [<u>cann</u>](https://quay.io/repository/ascend/cann?tab=tags).
 
-If you are in China, you can use `daocloud` to accelerate your downloading:
+If you are in China, you can use `daocloud` or other mirror sites to speed up the download:
 
 ```bash
 # Replace with tag you want to pull
-TAG=v0.7.3rc2
+TAG=v0.9.1
 docker pull m.daocloud.io/quay.io/ascend/vllm-ascend:$TAG
+# or
+docker pull quay.nju.edu.cn/ascend/vllm-ascend:$TAG
 ```
 
 #### Load Docker Images for offline environment
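The body of the offline-loading section is not shown in this diff. As a generic sketch using standard Docker commands (not necessarily the doc's exact steps), moving a pulled image to an air-gapped machine looks like:

```bash
# On a machine with internet access: export the pulled image to a tarball
TAG=v0.9.1
docker save -o vllm-ascend-$TAG.tar quay.io/ascend/vllm-ascend:$TAG

# Copy the tarball to the offline machine, then import it there
docker load -i vllm-ascend-$TAG.tar
```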
@@ -96,30 +98,22 @@ import vllm
 
 If all above steps are not working, feel free to submit a GitHub issue.
 
-### 7. How does vllm-ascend perform?
+### 7. How does vllm-ascend work with vLLM?
+vllm-ascend is a hardware plugin for vLLM. The vllm-ascend version matches the vllm version: for example, if you use vllm 0.9.1, you should use vllm-ascend 0.9.1 as well. For the main branch, we make sure `vllm-ascend` and `vllm` stay compatible at each commit.
 
-Currently, only some models are improved. Such as `Qwen2.5 VL`, `Qwen3`, `Deepseek V3`. Others are not good enough. From 0.9.0rc2, Qwen and Deepseek works with graph mode to play a good performance. What's more, you can install `mindie-turbo` with `vllm-ascend v0.7.3` to speed up the inference as well.
+### 8. Does vllm-ascend support the Prefill Disaggregation feature?
 
-### 8. How vllm-ascend work with vllm?
-vllm-ascend is a plugin for vllm. Basically, the version of vllm-ascend is the same as the version of vllm. For example, if you use vllm 0.7.3, you should use vllm-ascend 0.7.3 as well. For main branch, we will make sure `vllm-ascend` and `vllm` are compatible by each commit.
+Yes, vllm-ascend supports the Prefill Disaggregation feature with the LLMdatadist and Mooncake backends. See the [official tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node_pd_disaggregation_llmdatadist.html) for an example.
 
-### 9. Does vllm-ascend support Prefill Disaggregation feature?
+### 9. Does vllm-ascend support quantization methods?
 
-Currently, only 1P1D is supported on V0 Engine. For V1 Engine or NPND support, We will make it stable and supported by vllm-ascend in the future.
+Currently, the w8a8, w4a8 and w4a4 quantization methods are supported by vllm-ascend.
 
-### 10. Does vllm-ascend support quantization method?
+### 10. How to run the w8a8 DeepSeek model?
 
-Currently, w8a8 quantization is already supported by vllm-ascend originally on v0.8.4rc2 or higher, If you're using vllm 0.7.3 version, w8a8 quantization is supporeted with the integration of vllm-ascend and mindie-turbo, please use `pip install vllm-ascend[mindie-turbo]`.
+Please follow the [inference tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_quantization.html) and replace the model with DeepSeek.
 
-### 11. How to run w8a8 DeepSeek model?
-
-Please following the [inferencing tutorail](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html) and replace model to DeepSeek.
-
-### 12. There is no output in log when loading models using vllm-ascend, How to solve it?
-
-If you're using vllm 0.7.3 version, this is a known progress bar display issue in VLLM, which has been resolved in [this PR](https://github.com/vllm-project/vllm/pull/12428), please cherry-pick it locally by yourself. Otherwise, please fill up an issue.
-
-### 13. How vllm-ascend is tested
+### 11. How vllm-ascend is tested
 
 vllm-ascend is tested by functional test, performance test and accuracy test.
 
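To sanity-check the version pairing described in item 7, a generic pip check (not from the doc) is:

```bash
# Show the installed versions of both packages; they should match (e.g. 0.9.1)
pip show vllm vllm-ascend | grep -E '^(Name|Version)'
```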
@@ -129,21 +123,25 @@
 
 - **Accuracy test**: we're working on adding accuracy test to CI as well.
 
+- **Nightly test**: we'll run the full test suite every night to make sure the code is working.
+
 Finally, for each release, we'll publish the performance test and accuracy test report in the future.
 
-### 14. How to fix the error "InvalidVersion" when using vllm-ascend?
+### 12. How to fix the error "InvalidVersion" when using vllm-ascend?
 It's usually because you have installed a dev/editable version of the vLLM package. In this case, we provide the env variable `VLLM_VERSION` to let users specify the version of the vLLM package to use. Please set the env variable `VLLM_VERSION` to the version of the vLLM package you have installed. The format of `VLLM_VERSION` should be `X.Y.Z`.
 
-### 15. How to handle Out Of Memory?
+### 13. How to handle Out Of Memory?
 OOM errors typically occur when the model exceeds the memory capacity of a single NPU. For general guidance, you can refer to [vLLM's OOM troubleshooting documentation](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#out-of-memory).
 
 In scenarios where NPUs have limited HBM (High Bandwidth Memory) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:
 
+- **Limit `--max-model-len`**: it reduces the HBM needed for the KV cache initialization step.
+
 - **Adjust `--gpu-memory-utilization`**: If unspecified, will use the default value of `0.9`. You can decrease this param to reserve more memory to reduce fragmentation risks. See more note in: [vLLM - Inference and Serving - Engine Arguments](https://docs.vllm.ai/en/latest/serving/engine_args.html#vllm.engine.arg_utils-_engine_args_parser-cacheconfig).
 
 - **Configure `PYTORCH_NPU_ALLOC_CONF`**: Set this environment variable to optimize NPU memory management. For example, you can `export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` to enable virtual memory feature to mitigate memory fragmentation caused by frequent dynamic memory size adjustments during runtime, see more note in: [PYTORCH_NPU_ALLOC_CONF](https://www.hiascend.com/document/detail/zh/Pytorch/700/comref/Envvariables/Envir_012.html).
 
-### 16. Failed to enable NPU graph mode when running DeepSeek?
+### 14. Failed to enable NPU graph mode when running DeepSeek?
 You may encounter the following error if running DeepSeek with NPU graph mode enabled. The allowed number of queries per kv when enabling both MLA and Graph mode only support {32, 64, 128}, **Thus this is not supported for DeepSeek-V2-Lite**, as it only has 16 attention heads. The NPU graph mode support on DeepSeek-V2-Lite will be done in the future.
 
 And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure after the tensor parallel split, num_heads / num_kv_heads in {32, 64, 128}.
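To illustrate the OOM mitigations listed in item 13 together, a hedged sketch (the model name and numbers are placeholders, not recommendations from the doc):

```bash
# Enable the virtual-memory allocator to mitigate fragmentation (see item 13)
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

# Cap the context length and leave some HBM headroom; <model> is a placeholder
vllm serve <model> \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85
```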
@@ -153,10 +151,10 @@ And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure after the tensor parallel split, num_heads / num_kv_heads in {32, 64, 128}.
 [rank0]: EZ9999: [PID: 62938] 2025-05-27-06:52:12.455.807 numHeads / numKvHeads = 8, MLA only support {32, 64, 128}.[FUNC:CheckMlaAttrs][FILE:incre_flash_attention_tiling_check.cc][LINE:1218]
 ```
 
-### 17. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend?
+### 15. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend?
 You may encounter the problem of C compilation failure when reinstalling vllm-ascend from source using pip. If the installation fails, it is recommended to use `python setup.py install` to install, or use `python setup.py clean` to clear the cache.
 
-### 18. How to generate determinitic results when using vllm-ascend?
+### 16. How to generate deterministic results when using vllm-ascend?
 There are several factors that affect output certainty:
 
 1. Sampler Method: using **Greedy sample** by setting `temperature=0` in `SamplingParams`, e.g.:
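The doc's own code example after this line is not shown in the diff. As an alternative illustration (the endpoint, port, and model name are assumptions, not from the doc), greedy decoding via vLLM's OpenAI-compatible server simply pins `temperature` to 0:

```bash
# Greedy decoding: temperature=0 makes the sampler pick the argmax token at each step
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model>", "prompt": "Hello, my name is", "temperature": 0, "max_tokens": 32}'
```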
@@ -193,11 +191,11 @@ export ATB_MATMUL_SHUFFLE_K_ENABLE=0
 export ATB_LLM_LCOC_ENABLE=0
 ```
 
-### 19. How to fix the error "ImportError: Please install vllm[audio] for audio support" for Qwen2.5-Omni model?
+### 17. How to fix the error "ImportError: Please install vllm[audio] for audio support" for Qwen2.5-Omni model?
 The `Qwen2.5-Omni` model requires the `librosa` package to be installed, you need to install the `qwen-omni-utils` package to ensure all dependencies are met `pip install qwen-omni-utils`,
 this package will install `librosa` and its related dependencies, resolving the `ImportError: No module named 'librosa'` issue and ensuring audio processing functionality works correctly.
 
-### 20. How to troubleshoot and resolve size capture failures resulting from stream resource exhaustion, and what are the underlying causes?
+### 18. How to troubleshoot and resolve size capture failures resulting from stream resource exhaustion, and what are the underlying causes?
 
 ```
 error example in detail:
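Relating to item 17 above, the install plus a quick import check (the verification step is an illustrative addition, not from the doc):

```bash
pip install qwen-omni-utils
# Confirm that librosa was pulled in as a dependency
python -c "import librosa; print(librosa.__version__)"
```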
@@ -212,5 +210,5 @@ Recommended mitigation strategies:
 Root cause analysis:
 The current stream requirement calculation for size captures only accounts for measurable factors including: data parallel size, tensor parallel size, expert parallel configuration, piece graph count, multistream overlap shared expert settings, and HCCL communication mode (AIV/AICPU). However, numerous unquantifiable elements - such as operator characteristics and specific hardware features - consume additional streams outside of this calculation framework, resulting in stream resource exhaustion during size capture operations.
 
-### 21. Installing vllm-ascend will overwrite the existing torch-npu package?
-Installing vllm-ascend will overwrite the existing torch-npu package. If you need to install a specific version of torch-npu, you can manually install the specified version of torch-npu after installing vllm-ascend.
+### 19. How to install a custom version of torch_npu?
+torch-npu will be overridden when installing vllm-ascend. If you need a specific version of torch-npu, you can manually install it after vllm-ascend is installed.
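A minimal sketch of the workflow in item 19 (the version string is a placeholder; choose the torch-npu build that matches your environment):

```bash
# Installing vllm-ascend brings in its default torch-npu
pip install vllm-ascend

# Re-pin torch-npu afterwards to the specific build you need (placeholder version)
pip install torch-npu==<desired-version>
```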
