fix nz for quantization #4943
Conversation
Code Review
This pull request simplifies the logic for applying the ACL_FORMAT_FRACTAL_NZ format during quantization by making it unconditional in several places. While this removes a dependency on the is_enable_nz utility function, it also removes a model-specific exception for qwen3_next. This could introduce a critical regression if that model still has incompatibilities with the NZ format. The review comments highlight these locations and suggest that the model-specific check should be preserved if it is still required.
layer.w13_weight.data = torch_npu.npu_format_cast(
    layer.w13_weight.data, ACL_FORMAT_FRACTAL_NZ)
layer.w2_weight.data = torch_npu.npu_format_cast(
    layer.w2_weight.data, ACL_FORMAT_FRACTAL_NZ)
This change makes the ACL_FORMAT_FRACTAL_NZ conversion unconditional. However, the removed is_enable_nz() check contained specific logic that disabled this conversion for the qwen3_next model, likely for compatibility reasons. Applying the format conversion to qwen3_next may cause a regression if the model does not support it for W4A8 dynamic quantization. If the model-specific limitation still exists, the check should be preserved, perhaps without the dependency on the VLLM_ASCEND_ENABLE_NZ environment variable; one possible shape for such a guard is sketched below.
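A minimal sketch of what the preserved guard could look like, dropping only the environment-variable dependency. The helper name `should_cast_to_nz`, the `vllm_config` attribute path, and the premise that only qwen3_next needs the exception are assumptions for illustration, not the repository's actual API:

```python
# Hypothetical helper: keep the qwen3_next exception while removing the
# dependency on the VLLM_ASCEND_ENABLE_NZ environment variable.
def should_cast_to_nz(vllm_config) -> bool:
    # The attribute path used to read the model type is an assumption
    # for illustration.
    model_type = vllm_config.model_config.hf_config.model_type
    # qwen3_next is assumed to still be incompatible with
    # ACL_FORMAT_FRACTAL_NZ for quantized weights.
    return model_type != "qwen3_next"

# At the call site, the cast would then stay conditional only on the
# model check rather than on the environment variable:
if should_cast_to_nz(vllm_config):
    layer.w13_weight.data = torch_npu.npu_format_cast(
        layer.w13_weight.data, ACL_FORMAT_FRACTAL_NZ)
    layer.w2_weight.data = torch_npu.npu_format_cast(
        layer.w2_weight.data, ACL_FORMAT_FRACTAL_NZ)
```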
  # npu_quant_grouped_matmul_dequant in eager mode does not accept
  # ACL_FORMAT_FRACTAL_NZ.
- if not is_310p() and is_enable_nz():
+ if not is_310p():
Removing the is_enable_nz() check from this condition makes the NZ format conversion unconditional for non-310p devices. The is_enable_nz() function contained an exception to disable NZ format for the qwen3_next model. This change will now apply the conversion to qwen3_next, which could cause a regression if this model does not support NZ format for W8A8 quantization. If the qwen3_next model still has this incompatibility, the check should be restored.
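For this W8A8 site, a restored condition could keep the 310p exclusion and re-add only the model-aware part of the old check; `should_cast_to_nz` is the hypothetical helper sketched in the previous comment, not an existing function in the repository:

```python
# Sketch: apply the NZ cast everywhere except on 310p SoCs and for
# models (assumed: qwen3_next) that do not support the format.
if not is_310p() and should_cast_to_nz(vllm_config):
    layer.weight.data = torch_npu.npu_format_cast(
        layer.weight.data, ACL_FORMAT_FRACTAL_NZ)
```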
torch_npu.npu_format_cast_(layer.w13_weight, ACL_FORMAT_FRACTAL_NZ)
torch_npu.npu_format_cast_(layer.w2_weight, ACL_FORMAT_FRACTAL_NZ)
This change makes the ACL_FORMAT_FRACTAL_NZ conversion unconditional. The removed is_enable_nz() check included logic to disable this conversion for the qwen3_next model. This could lead to a regression for qwen3_next if it does not support the NZ format for W8A8 dynamic quantization. If qwen3_next still has this limitation, the model-specific check should be preserved.
Signed-off-by: wangxiyuan <[email protected]>
Force-pushed from 3d44523 to cb596f5
The quantization ops require the NZ format unconditionally, so we should remove the NZ check for them.