
Conversation

@MekkCyber
Contributor

What does this PR do?

Adds a simple kernel for per-tensor quantization, where the matmul is done in blocks of 128x128 and the weight scales and activation scales are expected to be scalars.
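For context, here is a minimal sketch of per-tensor FP8 quantization with scalar scales; the function names, dtype choice, and shapes are illustrative assumptions, not the PR's actual implementation:

import torch

def per_tensor_quantize_fp8(x: torch.Tensor):
    # Per-tensor quantization: one scalar scale for the whole tensor.
    finfo = torch.finfo(torch.float8_e4m3fn)
    scale = x.abs().max().clamp(min=1e-12) / finfo.max
    x_q = (x / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return x_q, scale

def fp8_matmul_per_tensor(a_q, a_scale, w_q, w_scale):
    # The kernel in this PR tiles the matmul into 128x128 blocks internally;
    # with scalar scales, dequantization is a single multiply at the end.
    out = torch.matmul(a_q.to(torch.float32), w_q.to(torch.float32).t())
    return out * (a_scale * w_scale)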

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@SunMarc SunMarc left a comment


Thanks a lot for adding this! Just a minor comment.

Comment on lines 182 to 262
assert len(block_size) == 2
block_n, block_k = block_size[0], block_size[1]

# if we have per-tensor quantization, we use a 128x128 block size for the tiled matmul
if block_n == B.shape[-2] and block_k == B.shape[-1]:
    block_n = 128
    block_k = 128

Member

@SunMarc SunMarc Dec 2, 2025


It doesn't make sense, earlier in FP8Linear, to set the block size to anything other than None when doing per-tensor quantization. Can we change that so that we also fix it here?
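A rough sketch of the direction being suggested, assuming FP8Linear is changed to pass block_size=None for per-tensor quantization (variable names follow the snippet above; this is not the merged code):

if block_size is None:
    # Per-tensor quantization: no block structure in the scales, so fall back
    # to a default 128x128 tile for the matmul kernel.
    block_n, block_k = 128, 128
else:
    assert len(block_size) == 2
    block_n, block_k = block_size[0], block_size[1]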

Comment on lines 187 to 190
"""Triton-accelerated function used to perform linear operations (dot
product) on input tensors `A` and `B` with block-wise quantization, and
store the result in output tensor `C`.
"""
Member


Update this docstring too; it only mentions block-wise quantization.
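For example, the docstring could be broadened along these lines (illustrative wording, not necessarily the merged text):

"""Triton-accelerated function used to perform linear operations (dot
product) on input tensors `A` and `B` with block-wise or per-tensor
quantization, and store the result in output tensor `C`.
"""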

@SunMarc SunMarc merged commit 51c5a7a into main Dec 2, 2025
24 checks passed
@SunMarc SunMarc deleted the use-kernel-fp8 branch December 2, 2025 17:57