
Commit 0d09453

[bugfix] Fixed the bug in retrieving the quantization method for mlp.… (#4797)
When retrieving the quantization method for MoE layers (e.g., when the quantization file of DeepSeek v3.2 exp does not match the model's naming convention in eager mode), a KeyError is raised: "model.layers.3.mlp.experts.weight not in self.quant_description". However, the quantization file looks like this:

```bash
"model.layers.3.mlp.experts.255.gate_proj.weight": "W8A8_DYNAMIC",
"model.layers.3.mlp.experts.255.gate_proj.weight_scale": "W8A8_DYNAMIC",
"model.layers.3.mlp.experts.255.gate_proj.weight_offset": "W8A8_DYNAMIC",
"model.layers.3.mlp.experts.255.down_proj.weight": "W8A8_DYNAMIC",
"model.layers.3.mlp.experts.255.down_proj.weight_scale": "W8A8_DYNAMIC",
"model.layers.3.mlp.experts.255.down_proj.weight_offset": "W8A8_DYNAMIC",
"model.layers.3.mlp.experts.255.up_proj.weight": "W8A8_DYNAMIC",
"model.layers.3.mlp.experts.255.up_proj.weight_scale": "W8A8_DYNAMIC",
"model.layers.3.mlp.experts.255.up_proj.weight_offset": "W8A8_DYNAMIC",
```

Co-Authored-By: yangqinghao-cmss <[email protected]>
Signed-off-by: hfadzxy <[email protected]>
Co-authored-by: yangqinghao-cmss <[email protected]>
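For context, here is a minimal, hypothetical Python sketch (not the patched vllm-ascend code; the trimmed `quant_description` dict is an assumption built from the example file above) of why the old `prefix + '.weight'` lookup fails for an experts prefix while matching keys by substring succeeds:

```python
# Hypothetical, trimmed quant description mirroring the quantization file above.
quant_description = {
    "model.layers.3.mlp.experts.255.gate_proj.weight": "W8A8_DYNAMIC",
    "model.layers.3.mlp.experts.255.down_proj.weight": "W8A8_DYNAMIC",
    "model.layers.3.mlp.experts.255.up_proj.weight": "W8A8_DYNAMIC",
}
prefix = "model.layers.3.mlp.experts"

# Old lookup: the aggregated key does not exist, so this raises KeyError.
# quant_description[prefix + ".weight"]

# New lookup: collect every entry whose key contains the experts prefix.
experts_quant_description = [
    quant_description[layer] for layer in quant_description if prefix in layer
]
print(experts_quant_description)  # ['W8A8_DYNAMIC', 'W8A8_DYNAMIC', 'W8A8_DYNAMIC']
```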
1 parent 4e728f1 commit 0d09453

File tree

2 files changed (+20, -0 lines)


vllm_ascend/quantization/quant_config.py

Lines changed: 9 additions & 0 deletions
```diff
@@ -157,6 +157,15 @@ def is_layer_skipped_ascend(
                         f"Detected some but not all shards of {prefix} "
                         "are quantized. All shards of fused layers "
                         "to have the same precision.")
+        elif "experts" in prefix:
+            # For the experts' prefix (e.g., "model.layers.3.mlp.experts")
+            # Assume all experts within the same MLP use the same quantization method
+            experts_quant_description = [
+                self.quant_description[layer]
+                for layer in self.quant_description if prefix in layer
+            ]
+            is_skipped = any(quantization == "FLOAT"
+                             for quantization in experts_quant_description)
         else:
             is_skipped = self.quant_description[prefix + '.weight'] == "FLOAT"

```
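As a standalone usage sketch (hypothetical helper and dicts that mirror, but do not import, the patched method), the new branch treats the whole experts group as skipped as soon as any matching entry is "FLOAT":

```python
def is_experts_group_skipped(quant_description: dict, prefix: str) -> bool:
    # Mirrors the new elif branch: gather every entry under the experts prefix
    # and treat the group as skipped if any entry is unquantized ("FLOAT").
    experts_quant_description = [
        quant_description[layer]
        for layer in quant_description if prefix in layer
    ]
    return any(q == "FLOAT" for q in experts_quant_description)


# Hypothetical descriptions, trimmed for illustration.
quantized = {"model.layers.3.mlp.experts.0.up_proj.weight": "W8A8_DYNAMIC"}
unquantized = {"model.layers.3.mlp.experts.0.up_proj.weight": "FLOAT"}

print(is_experts_group_skipped(quantized, "model.layers.3.mlp.experts"))    # False
print(is_experts_group_skipped(unquantized, "model.layers.3.mlp.experts"))  # True
```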

vllm_ascend/quantization/utils.py

Lines changed: 11 additions & 0 deletions
```diff
@@ -52,6 +52,17 @@ def get_linear_quant_type(quant_description: Dict[str, Any], prefix: str,
                     f"Not all shards of {prefix} are quantized with same quant type."
                     f"Shard {proj_name} uses {shard_quant_type}, but another shard"
                     f"use {quant_type}. Please check quantization config.")
+    elif "experts" in prefix:
+        # For the experts' prefix (e.g., "model.layers.3.mlp.experts")
+        # Assume all experts within the same MLP use the same quantization method
+        experts_quant_description = set(quant_description[layer]
+                                        for layer in quant_description
+                                        if prefix in layer)
+        if not len(experts_quant_description) == 1:
+            raise RuntimeError(
+                f"{prefix} has different quantization type: {experts_quant_description}."
+            )
+        quant_type = experts_quant_description.pop()
     else:
         quant_type = quant_description[prefix + '.weight']
     return quant_type
```
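Similarly, a standalone sketch (again with hypothetical dicts and a stand-in helper rather than the real `get_linear_quant_type` call sites) of how the new branch collapses a uniform experts group to a single quant type and rejects mixed types:

```python
def experts_quant_type(quant_description: dict, prefix: str) -> str:
    # Mirrors the new elif branch: deduplicate the quant types of every entry
    # under the experts prefix and require exactly one.
    experts_quant_description = set(quant_description[layer]
                                    for layer in quant_description
                                    if prefix in layer)
    if len(experts_quant_description) != 1:
        raise RuntimeError(
            f"{prefix} has different quantization type: {experts_quant_description}.")
    return experts_quant_description.pop()


uniform = {
    "model.layers.3.mlp.experts.0.up_proj.weight": "W8A8_DYNAMIC",
    "model.layers.3.mlp.experts.1.up_proj.weight": "W8A8_DYNAMIC",
}
mixed = dict(uniform, **{"model.layers.3.mlp.experts.2.up_proj.weight": "FLOAT"})

print(experts_quant_type(uniform, "model.layers.3.mlp.experts"))  # W8A8_DYNAMIC
# experts_quant_type(mixed, "model.layers.3.mlp.experts")  # raises RuntimeError
```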
