
Conversation


@wangxiyuan wangxiyuan commented Dec 12, 2025

Quantization ops rely on the NZ format unconditionally, so we should remove the NZ check for them.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request simplifies the logic for applying the ACL_FORMAT_FRACTAL_NZ format during quantization by making it unconditional in several places. While this removes a dependency on the is_enable_nz utility function, it also removes a model-specific exception for qwen3_next. This could introduce a critical regression if that model still has incompatibilities with the NZ format. The review comments highlight these locations and suggest that the model-specific check should be preserved if it is still required.
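
For context, here is a minimal sketch of what the removed gate might look like, reconstructed only from the description in this review; the real is_enable_nz() lives in vllm-ascend's utilities and may differ, and the default value of VLLM_ASCEND_ENABLE_NZ plus the exact signature are assumptions.

# Hypothetical reconstruction of the removed gate; not the actual
# vllm-ascend implementation.
import os

def is_enable_nz(model_type: str = "") -> bool:
    # NZ casting was controlled by an environment variable ...
    if os.environ.get("VLLM_ASCEND_ENABLE_NZ", "1") == "0":
        return False
    # ... and, per this review, disabled for qwen3_next for compatibility.
    if model_type == "qwen3_next":
        return False
    return True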

Comment on lines +485 to +488
layer.w13_weight.data = torch_npu.npu_format_cast(
layer.w13_weight.data, ACL_FORMAT_FRACTAL_NZ)
layer.w2_weight.data = torch_npu.npu_format_cast(
layer.w2_weight.data, ACL_FORMAT_FRACTAL_NZ)

critical

This change makes the ACL_FORMAT_FRACTAL_NZ conversion unconditional. However, the removed is_enable_nz() check contained specific logic to disable this conversion for the qwen3_next model, which could be for compatibility reasons. Applying this format conversion to qwen3_next may cause a regression if the model does not support it for W4A8 dynamic quantization. If this model-specific limitation still exists, the check should be preserved, perhaps without the dependency on the VLLM_ASCEND_ENABLE_NZ environment variable.
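
If the qwen3_next limitation still holds, one way to keep the exception while dropping the environment variable could look like the sketch below; should_cast_to_nz and the way model_type is obtained are illustrative, not existing vllm-ascend helpers.

# Illustrative only: preserve the model-specific exception without
# VLLM_ASCEND_ENABLE_NZ. model_type would need to be threaded in from the
# model config; this is not the actual vllm-ascend code.
def should_cast_to_nz(model_type: str) -> bool:
    # qwen3_next reportedly does not support ACL_FORMAT_FRACTAL_NZ weights.
    return model_type != "qwen3_next"

# Sketch of the call site with the exception kept:
# if should_cast_to_nz(model_type):
#     layer.w13_weight.data = torch_npu.npu_format_cast(
#         layer.w13_weight.data, ACL_FORMAT_FRACTAL_NZ)
#     layer.w2_weight.data = torch_npu.npu_format_cast(
#         layer.w2_weight.data, ACL_FORMAT_FRACTAL_NZ)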

 # npu_quant_grouped_matmul_dequant in eager mode does not accept
 # ACL_FORMAT_FRACTAL_NZ.
-if not is_310p() and is_enable_nz():
+if not is_310p():

critical

Removing the is_enable_nz() check from this condition makes the NZ format conversion unconditional for non-310p devices. The is_enable_nz() function contained an exception to disable NZ format for the qwen3_next model. This change will now apply the conversion to qwen3_next, which could cause a regression if this model does not support NZ format for W8A8 quantization. If the qwen3_next model still has this incompatibility, the check should be restored.

Comment on lines +273 to +274
torch_npu.npu_format_cast_(layer.w13_weight, ACL_FORMAT_FRACTAL_NZ)
torch_npu.npu_format_cast_(layer.w2_weight, ACL_FORMAT_FRACTAL_NZ)

critical

This change makes the ACL_FORMAT_FRACTAL_NZ conversion unconditional. The removed is_enable_nz() check included logic to disable this conversion for the qwen3_next model. This could lead to a regression for qwen3_next if it does not support the NZ format for W8A8 dynamic quantization. If qwen3_next still has this limitation, the model-specific check should be preserved.
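
As a quick sanity check after an unconditional cast, the resulting ACL format can be inspected with torch_npu.get_npu_format. The snippet below is a standalone sketch; the numeric value 29 for FRACTAL_NZ is assumed to match vllm-ascend's constant rather than taken from this PR.

# Sketch: confirm that an NPU weight tensor is actually in FRACTAL_NZ format.
import torch
import torch_npu

ACL_FORMAT_FRACTAL_NZ = 29  # assumed to match vllm-ascend's constant

w = torch.randn(256, 256, dtype=torch.float16).npu()
w = torch_npu.npu_format_cast(w, ACL_FORMAT_FRACTAL_NZ)
assert torch_npu.get_npu_format(w) == ACL_FORMAT_FRACTAL_NZ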

@wangxiyuan wangxiyuan added the ready (read for review) and ready-for-test (start test by label for PR) labels on Dec 12, 2025
Signed-off-by: wangxiyuan <[email protected]>
@wangxiyuan wangxiyuan merged commit 01a13a9 into vllm-project:v0.11.0-dev Dec 12, 2025
10 checks passed