
Commit 3362be7

Update patch doc (#4869)
Update patch doc. After this PR is merged, every new patch PR should update this doc as well. - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: wangxiyuan <[email protected]>
1 parent 0fb1dc4 commit 3362be7

3 files changed: +144 −45 lines


vllm_ascend/patch/__init__.py

Lines changed: 142 additions & 43 deletions
@@ -30,18 +30,9 @@
 # --------------------------------
 # * Platform Patch:
 # =================
-# ** File: platform/patch_distributed.py**
+# ** 1. File: platform/patch_distributed.py**
 #    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-#    1. `vllm.config.ParallelConfig.get_next_dp_init_port`
-#       Why:
-#          vllm doesn't support get port from environment.
-#       How:
-#          Add the logic to get port from environment.
-#       Related PR (if no, explain why):
-#          Need a PR to vllm to support get port from environment.
-#       Future Plan:
-#          Remove those patch when vllm merged them
-#    2. `torch.distributed.all_reduce`, `torch.distributed.broadcast`
+#    1. `torch.distributed.all_reduce`, `torch.distributed.broadcast`
 #       Why:
 #          tensor alignment for 310p
 #       How:
@@ -51,21 +42,85 @@
 #       Future Plan:
 #          Find a better way to support tensor alignment for 310p without this patch.
 #
-# ** File: worker/patch_multimodal_merge.py**
+# ** 2. File: platform/patch_ec_connector.py**
 #    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-#    1. `vllm.model_executor.models.utils._merge_multimodal_embeddings`
+#    1. `vllm.distributed.ec_transfer.ec_connector.shared_storage_connector.ECSharedStorageConnector.start_load_caches`
 #       Why:
-#          '_merge_multimodal_embeddings' func of vllm is incompatible with Ascend.
+#          It is hard-coded to cuda.
 #       How:
-#          Replace with CPU operation that can be executed asynchronously.
+#          Change cuda to npu.
 #       Related PR (if no, explain why):
-#          This is a bug by Ascend only. It can' be fixed in vLLM.
+#          https://github.com/vllm-project/vllm/pull/30225
 #       Future Plan:
-#          Identify this pattern in torch-npu and remove this patch.
+#          Remove this patch when vllm merges the PR.
+#
+# ** 3. File: platform/patch_mamba_config.py**
+#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#    1. `vllm.model_executor.models.config.HybridAttentionMambaModelConfig.verify_and_update_config`
+#       Why:
+#          Block size is set to 16 in vLLM, which is not supported by Ascend.
+#       How:
+#          Set block size to 128 on npu.
+#       Related PR (if no, explain why):
+#          We'll fix this in vLLM soon.
+#       Future Plan:
+#          Remove this patch when vLLM merges the PR.
+#
+# ** 4. File: platform/patch_multiproc_executor.py**
+#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#    1. `vllm.v1.executor.multiproc_executor.MultiprocExecutor`
+#       Why:
+#          vLLM creates child processes with daemon=True, which doesn't work in the EPLB case, since EPLB creates
+#          a new process, which is not allowed with daemon=True.
+#       How:
+#          Set daemon=False in MultiprocExecutor.
+#       Related PR (if no, explain why):
+#          Find a way to support daemon=False in vLLM.
+#       Future Plan:
+#          Remove this patch when vLLM fixes the issue.
+#
+# ** 5. File: platform/patch_sched_yield.py**
+#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#    1. `vllm.distributed.utils.USE_SCHED_YIELD`
+#       Why:
+#          os.sched_yield() doesn't work on Arm systems.
+#       How:
+#          Avoid using os.sched_yield() on Arm systems.
+#       Related PR (if no, explain why):
+#          https://github.com/vllm-project/vllm/pull/30228
+#       Future Plan:
+#          Remove this patch when vLLM merges the PR.
+#
 #
 # * Worker Patch:
 # ===============
-# ** File: worker/patch_minicpm.py **
+#
+# ** 1. File: worker/patch_deepseek.py **
+#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#    1. `DeepseekV2Model.forward`
+#       Why:
+#          getattr(self.config, "llama_4_scaling", None) will raise AttributeError
+#          on npu with graph mode.
+#       How:
+#          Catch the AttributeError and set llama_4_scaling to None.
+#       Related PR (if no, explain why):
+#          No, this is a bug in vLLM Ascend.
+#       Future Plan:
+#          Find the root cause of this bug and fix it in vLLM Ascend.
+#
+# ** 2. File: worker/patch_distributed.py **
+#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#    1. `vllm.distributed.parallel_state.GroupCoordinator`
+#       Why:
+#          vllm doesn't support all_to_all for GroupCoordinator.
+#       How:
+#          Add an all_to_all implementation for GroupCoordinator.
+#       Related PR (if no, explain why):
+#          No, we should use the vLLM all2all manager to support all_to_all for npu.
+#       Future Plan:
+#          Remove this patch when the refactor of the all2all manager is done.
+#
+# ** 3. File: worker/patch_minicpm.py **
 #    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 #    1. `vllm.model_executor.models.minicpm.MiniCPMAttention.forward`
 #       Why:
@@ -79,32 +134,65 @@
 #       Future Plan:
 #          Keep this patch in vllm-ascend.
 #
-# ** File: worker/patch_distributed.py **
-#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-#    1. `vllm.distributed.parallel_state.GroupCoordinator`
-#       (1) __init__()
+# ** 4. File: worker/patch_multimodal_merge.py**
+#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#    1. `vllm.model_executor.models.utils._merge_multimodal_embeddings`
 #       Why:
-#          The original GroupCoordinator initialization lacks pg_options to generate new
-#          process group with customized options.
-#       How:
-#          Inject HCCL options during process group initialization.
+#          '_merge_multimodal_embeddings' func of vllm is incompatible with Ascend.
+#       How:
+#          Replace with a CPU operation that can be executed asynchronously.
 #       Related PR (if no, explain why):
-#          Need a PR to vllm to support a dictionary as input while initializing distributed
-#          environment (e.g., Dict[str, torch.distributed.ProcessGroupHCCL.Options])
-#          https://github.com/vllm-project/vllm/pull/25417
+#          This is a bug by Ascend only. It can't be fixed in vLLM.
 #       Future Plan:
-#          Remove this patch when vllm merges this PR.
-#       (2) all_to_all()
+#          Identify this pattern in torch-npu and remove this patch.
+#
+# ** 5. File: worker/patch_qwen2_5_omni.py**
+#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#    1. `vllm.model_executor.models.qwen2_5_omni_thinker.Qwen2_5OmniThinkerForConditionalGeneration`
 #       Why:
-#          vllm doesn't support all_to_all for GroupCoordinator.
+#          We have an Ascend forward context which doesn't work with upstream.
 #       How:
-#          Add all_to_all implementation for GroupCoordinator.
+#          Override forward_context in the model file.
+#       Related PR (if no, explain why):
+#          This is a bug by Ascend only. We should drop set_ascend_forward_context.
+#       Future Plan:
+#          Remove this patch once forward_context is refactored.
+#
+# ** 6. File: worker/patch_qwen2_5_vl.py**
+#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#    1. `vllm.model_executor.models.qwen2_5_vl.Qwen2_5_VLForConditionalGeneration`
+#       Why:
+#          We have an Ascend forward context which doesn't work with upstream.
+#       How:
+#          Override forward_context in the model file.
+#       Related PR (if no, explain why):
+#          This is a bug by Ascend only. We should drop set_ascend_forward_context.
+#       Future Plan:
+#          Remove this patch once forward_context is refactored.
+#
+#    2. `vllm.model_executor.models.qwen2_vl.Qwen2VisionAttention.forward`
+#       Why:
+#          The attention is not a custom op.
+#       How:
+#          Make it a custom op so it is pluggable.
+#       Related PR (if no, explain why):
+#          https://github.com/vllm-project/vllm/pull/30125
+#       Future Plan:
+#          Remove this patch once the PR is merged into vLLM.
+#
+# ** 7. File: worker/patch_qwen3_vl.py**
+#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#    1. `vllm.model_executor.models.qwen3_vl.Qwen3_VisionTransformer.forward`
+#       Why:
+#          The attention is not a custom op.
+#       How:
+#          Make it a custom op so it is pluggable.
 #       Related PR (if no, explain why):
-#          Need a PR to vllm to support all_to_all for GroupCoordinator.
+#          https://github.com/vllm-project/vllm/pull/30125
 #       Future Plan:
-#          Remove this patch when vllm merged them.
+#          Remove this patch once the PR is merged into vLLM.
 #
-# ** File: worker/patch_roberta.py **
+# ** 8. File: worker/patch_roberta.py **
 #    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 #    1. `vllm.model_executor.models.bert `
 #       Why:
@@ -116,18 +204,29 @@
 #       Future Plan:
 #          Revert this when CANN support shift aclnn operation
 #
-# ** File: worker/patch_deepseek_mtp.py**
+# ** 9. File: worker/patch_triton.py**
+#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#    1. `vllm.model_executor.layers.mamba.ops`, `vllm.model_executor.layers.fla.ops`
+#       Why:
+#          Triton ops in vLLM don't perform well on NPU, and there is no dispatch mechanism for triton ops.
+#       How:
+#          Override triton ops in vLLM with the Ascend implementation.
+#       Related PR (if no, explain why):
+#          Let vLLM support triton ops dispatch.
+#       Future Plan:
+#          Remove this patch when vLLM supports the dispatch function.
+#
+# ** 10. File: worker/patch_weight_loader.py**
 #    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-#    1. `vllm.model_executor.models.deepseek_mtp.DeepSeekMultiTokenPredictorLayer.__init__`
+#    1. `vllm.model_executor.layers.linear.UnquantizedLinearMethod`
 #       Why:
-#          '__init__' func of DeepSeekMultiTokenPredictorLayer didn't pass prefix to SharedHead.
+#          vLLM Ascend doesn't work with weight loader v2.
 #       How:
-#          Replace with a new __init__.
-#          Use a new SharedHead which passes prefix to ParallelLMHead.
+#          Patch it to fix the bug.
 #       Related PR (if no, explain why):
-#          https://github.com/vllm-project/vllm/pull/25805
+#          This is a bug by Ascend only. We should fix it soon.
 #       Future Plan:
-#          Remove this patch when adapted vllm version contains the above PR.
+#          Remove this patch when the bug is fixed.
 #
 # ** File: worker/patch_qwen3_next_mtp.py**
 #    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
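The first platform entry above (tensor alignment for 310p) is representative of how these patches work: keep a reference to the upstream callable and assign an NPU-friendly wrapper in its place at import time. The sketch below is illustrative only, assuming a synchronous all_reduce and a hypothetical ALIGN_ELEMS requirement; it is not the actual vllm-ascend implementation.

# Illustrative monkey-patch sketch, not the vllm-ascend code. ALIGN_ELEMS and
# npu_safe_all_reduce are hypothetical names; only the synchronous path is handled.
import torch
import torch.distributed as dist

_original_all_reduce = dist.all_reduce  # keep the upstream op for delegation
ALIGN_ELEMS = 16  # assumed alignment requirement, for illustration only


def npu_safe_all_reduce(tensor, op=dist.ReduceOp.SUM, group=None):
    """Pad the flattened tensor to a multiple of ALIGN_ELEMS before reducing."""
    pad = (-tensor.numel()) % ALIGN_ELEMS
    if pad == 0:
        return _original_all_reduce(tensor, op=op, group=group)
    flat = torch.cat([tensor.reshape(-1), tensor.new_zeros(pad)])
    _original_all_reduce(flat, op=op, group=group)                # reduce the padded copy
    tensor.copy_(flat[: tensor.numel()].reshape(tensor.shape))    # write the result back
    return None


# Applying the patch is a plain attribute assignment, mirroring the
# `X.forward = AscendX.forward` assignments in the two worker patch files below.
dist.all_reduce = npu_safe_all_reduce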

vllm_ascend/patch/worker/patch_qwen2_5_omni.py

Lines changed: 1 addition & 1 deletion
@@ -67,6 +67,6 @@ def _process_video_input(
         return video_embeds.split(sizes.tolist())
 
 
-# NOTE: These will be removed after https://github.com/vllm-project/vllm/pull/29388 is merged.
+# NOTE: These will be removed after ascend_forward_context is refactored.
 Qwen2_5OmniThinkerForConditionalGeneration._process_image_input = AscendQwen2_5OmniThinkerForConditionalGeneration._process_image_input
 Qwen2_5OmniThinkerForConditionalGeneration._process_video_input = AscendQwen2_5OmniThinkerForConditionalGeneration._process_video_input
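The NOTE lines above show the patch style used by these worker files: plain attribute assignment onto the upstream class. A minimal, hypothetical sketch of making that assignment reversible (so the patch can be dropped cleanly once ascend_forward_context is refactored) might look like the following; TargetModel and AscendTargetModel are placeholders, not real vLLM classes.

# Hypothetical sketch only: a reversible version of the attribute-assignment
# patches shown above. TargetModel / AscendTargetModel are placeholder names.
from contextlib import contextmanager


@contextmanager
def patched(target_cls, name, replacement):
    """Temporarily replace target_cls.<name> and restore it on exit."""
    original = getattr(target_cls, name)
    setattr(target_cls, name, replacement)
    try:
        yield
    finally:
        setattr(target_cls, name, original)  # restore the upstream method


# Usage (illustrative):
# with patched(TargetModel, "_process_video_input",
#              AscendTargetModel._process_video_input):
#     run_inference()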

vllm_ascend/patch/worker/patch_qwen2_5_vl.py

Lines changed: 1 addition & 1 deletion
@@ -170,6 +170,6 @@ def _process_video_input(
 Qwen2VisionAttention.forward = AscendQwen2_5_VisionAttention.forward
 Qwen2_5_VisionAttention.forward = AscendQwen2_5_VisionAttention.forward
 
-# NOTE: These will be removed after https://github.com/vllm-project/vllm/pull/29388 is merged.
+# NOTE: These will be removed after ascend_forward_context is refactored.
 Qwen2_5_VLForConditionalGeneration._process_image_input = AscendQwen2_5_VLForConditionalGeneration._process_image_input
 Qwen2_5_VLForConditionalGeneration._process_video_input = AscendQwen2_5_VLForConditionalGeneration._process_video_input
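Several entries in the patch doc (the vision attention patches and patch_triton) ask vLLM for a registration/dispatch mechanism so backend-specific implementations can be plugged in instead of overwriting module attributes like the assignments above. A minimal sketch of that idea follows, with hypothetical names (register_op, dispatch) that are not vLLM APIs.

# Hypothetical dispatch-table sketch; register_op/dispatch are not vLLM APIs.
from typing import Callable, Dict

_OP_REGISTRY: Dict[str, Dict[str, Callable]] = {}


def register_op(name: str, backend: str):
    """Register the decorated function as `name`'s implementation for `backend`."""
    def decorator(fn: Callable) -> Callable:
        _OP_REGISTRY.setdefault(name, {})[backend] = fn
        return fn
    return decorator


def dispatch(name: str, backend: str) -> Callable:
    """Look up a backend-specific implementation, falling back to 'default'."""
    impls = _OP_REGISTRY[name]
    return impls.get(backend, impls["default"])


@register_op("vision_attention", backend="default")
def _attention_default(q, k, v):
    return q  # stand-in for the upstream implementation


@register_op("vision_attention", backend="npu")
def _attention_npu(q, k, v):
    return q  # stand-in for the Ascend custom op


# dispatch("vision_attention", backend="npu") returns the NPU version, so no
# module attribute would need to be overwritten at import time.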
