You can get our containers at `Quay.io`, e.g., [vllm-ascend](https://quay.io/repository/ascend/vllm-ascend?tab=tags) and [cann](https://quay.io/repository/ascend/cann?tab=tags).
If you are in China, you can use `daocloud` or some other mirror site to accelerate your downloading:

If none of the above steps work, feel free to submit a GitHub issue.
### 7. How does vllm-ascend work with vLLM?

vllm-ascend is a hardware plugin for vLLM. Basically, the version of vllm-ascend matches the version of vLLM: for example, if you use vLLM 0.9.1, you should use vllm-ascend 0.9.1 as well. For the main branch, we make sure `vllm-ascend` and `vllm` stay compatible on every commit.
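If you are unsure which release to pick, a quick check of the installed vLLM version is enough (the `0.9.1` below is only an example):

```python
import vllm

# vllm-ascend releases track vLLM releases one-to-one, so match this version
# when picking a vllm-ascend release, e.g. vLLM 0.9.1 -> vllm-ascend 0.9.1.
print(vllm.__version__)
```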
### 8. Does vllm-ascend support the Prefill Disaggregation feature?

Yes, vllm-ascend supports Prefill Disaggregation with the LLMdatadist and Mooncake backends. See the [official tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node_pd_disaggregation_llmdatadist.html) for an example.
### 9. Does vllm-ascend support quantization methods?

Currently, the w8a8, w4a8 and w4a4 quantization methods are supported by vllm-ascend.
### 10. How to run the w8a8 DeepSeek model?

Please follow the [inference tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_quantization.html) and replace the model with DeepSeek.
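As a rough offline-inference sketch only (the tutorial remains the reference): the checkpoint path below is a placeholder for your own w8a8-quantized DeepSeek weights, `quantization="ascend"` assumes the Ascend quantization backend registered by vllm-ascend, and the parallel size should be adjusted to your hardware.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/DeepSeek-w8a8",  # placeholder: your w8a8-quantized DeepSeek checkpoint
    tensor_parallel_size=8,          # split the model across 8 NPUs; adjust to your setup
    quantization="ascend",           # assumed name of the Ascend quantization backend
    max_model_len=4096,
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```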
### 11. How is vllm-ascend tested?

vllm-ascend is tested with functional tests, performance tests and accuracy tests.
- **Accuracy test**: we're working on adding accuracy tests to CI as well.
- **Nightly test**: we'll run the full test suite every night to make sure the code keeps working.

Finally, for each release, we'll publish the performance and accuracy test reports.
### 12. How to fix the error "InvalidVersion" when using vllm-ascend?

It's usually because you have installed a dev/editable version of the vLLM package. In this case, we provide the env variable `VLLM_VERSION` to let users specify which vLLM version to use. Please set `VLLM_VERSION` to the version of the vLLM package you have installed; the format should be `X.Y.Z`.
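The variable is normally exported in the shell before launching vLLM; purely as an illustration (with `0.9.1` as an example version), it can also be set from Python before anything from vLLM is imported:

```python
import os

# Example only: treat the editable vLLM install as version 0.9.1.
# Equivalent to `export VLLM_VERSION=0.9.1` in your shell; set it before importing vllm.
os.environ["VLLM_VERSION"] = "0.9.1"

import vllm  # imported after the variable is set
```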
### 13. How to handle Out Of Memory?
OOM errors typically occur when the model exceeds the memory capacity of a single NPU. For general guidance, you can refer to [vLLM's OOM troubleshooting documentation](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#out-of-memory).
In scenarios where NPUs have limited HBM (High Bandwidth Memory) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:
- **Limit `--max-model-len`**: This reduces the HBM needed for the KV cache at initialization.
139
+
142
140
-**Adjust `--gpu-memory-utilization`**: If unspecified, will use the default value of `0.9`. You can decrease this param to reserve more memory to reduce fragmentation risks. See more note in: [vLLM - Inference and Serving - Engine Arguments](https://docs.vllm.ai/en/latest/serving/engine_args.html#vllm.engine.arg_utils-_engine_args_parser-cacheconfig).
- **Configure `PYTORCH_NPU_ALLOC_CONF`**: Set this environment variable to optimize NPU memory management. For example, `export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` enables the virtual memory feature, which mitigates memory fragmentation caused by frequent dynamic memory size adjustments at runtime. See the notes in [PYTORCH_NPU_ALLOC_CONF](https://www.hiascend.com/document/detail/zh/Pytorch/700/comref/Envvariables/Envir_012.html). A combined sketch of these options follows the list.
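For illustration only, a minimal offline-inference sketch combining the options above; the model name is a placeholder and the concrete values are examples rather than recommendations:

```python
import os

# Set the allocator option before torch_npu / vLLM are imported
# (equivalent to `export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` in your shell).
os.environ["PYTORCH_NPU_ALLOC_CONF"] = "expandable_segments:True"

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    max_model_len=4096,                # limit context length to shrink the KV cache
    gpu_memory_utilization=0.8,        # reserve more headroom than the 0.9 default
)
```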
### 14. Failed to enable NPU graph mode when running DeepSeek?

You may encounter the following error when running DeepSeek with NPU graph mode enabled. When both MLA and graph mode are enabled, the allowed number of queries per KV head is restricted to {32, 64, 128}. **Thus DeepSeek-V2-Lite is not supported**, as it only has 16 attention heads. NPU graph mode support for DeepSeek-V2-Lite will be added in the future.
And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure that, after the tensor parallel split, `num_heads / num_kv_heads` is in {32, 64, 128} (a quick sanity check is sketched after the error message below).
```
[rank0]: EZ9999: [PID: 62938] 2025-05-27-06:52:12.455.807 numHeads / numKvHeads = 8, MLA only support {32, 64, 128}.[FUNC:CheckMlaAttrs][FILE:incre_flash_attention_tiling_check.cc][LINE:1218]
```
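As a rough, hypothetical sanity check (this helper is not part of vllm-ascend; the head counts must come from your model's config, and the single latent KV head per rank is an assumption about MLA):

```python
def mla_graph_mode_ok(num_attention_heads: int, tensor_parallel_size: int,
                      num_kv_heads_per_rank: int = 1) -> bool:
    """Return True if num_heads / num_kv_heads after the TP split is an allowed MLA value."""
    heads_per_rank = num_attention_heads // tensor_parallel_size
    return heads_per_rank // num_kv_heads_per_rank in (32, 64, 128)

# DeepSeek-V2-Lite only has 16 attention heads, so the ratio can never reach 32.
print(mla_graph_mode_ok(num_attention_heads=16, tensor_parallel_size=1))  # False
```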
### 15. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend?

You may encounter a C compilation failure when reinstalling vllm-ascend from source with pip. If the installation fails, it is recommended to install with `python setup.py install`, or to run `python setup.py clean` first to clear the build cache.
### 16. How to generate deterministic results when using vllm-ascend?
There are several factors that affect output certainty:
1. Sampler method: use **greedy sampling** by setting `temperature=0` in `SamplingParams`, e.g.:
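A minimal sketch (the model name and prompt are placeholders):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")            # placeholder model
greedy = SamplingParams(temperature=0, max_tokens=64)  # temperature=0 -> greedy decoding

outputs = llm.generate(["What is the capital of France?"], greedy)
print(outputs[0].outputs[0].text)
```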
### 17. How to fix the error "ImportError: Please install vllm[audio] for audio support" for the Qwen2.5-Omni model?

The `Qwen2.5-Omni` model requires the `librosa` package. Install the `qwen-omni-utils` package to make sure all dependencies are met: `pip install qwen-omni-utils`.
This package will install `librosa` and its related dependencies, resolving the `ImportError: No module named 'librosa'` issue and ensuring audio processing works correctly.
### 18. How to troubleshoot and resolve size capture failures resulting from stream resource exhaustion, and what are the underlying causes?
The current stream requirement calculation for size captures only accounts for measurable factors including: data parallel size, tensor parallel size, expert parallel configuration, piece graph count, multistream overlap shared expert settings, and HCCL communication mode (AIV/AICPU). However, numerous unquantifiable elements - such as operator characteristics and specific hardware features - consume additional streams outside of this calculation framework, resulting in stream resource exhaustion during size capture operations.
### 19. How to install a custom version of torch_npu?

torch-npu is overridden when installing vllm-ascend. If you need a specific version of torch-npu, you can manually install it after vllm-ascend is installed.