float8 inference: fix bmm semantics #3296
Conversation
Stack from ghstack (oldest at bottom):

🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3296
✅ No failures as of commit c97b030 with merge base a257166.
Summary:

Fixes the `Float8Tensor` `torch.bmm` override to match the semantics of the high precision op. Specifically, input 1 is of shape (B, M, K) and input 2 is of shape (B, K, N). Previously, the shape expectation was not consistent between the high precision and quantized versions of `torch.bmm`, which was confusing.

This is important for quantizing LLaMa 4 MoE variants, which use `torch.bmm` in the HF implementation.

Test Plan:

```
pytest test/quantization/quantize_/workflows/float8/test_float8_tensor.py -s -x -k bmm
```

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: 3f8887b
ghstack-comment-id: 3493356198
Pull-Request: #3296
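To make the expected semantics concrete, here is a minimal illustration of plain `torch.bmm` shape behavior (no quantization involved) that the `Float8Tensor` override is now expected to match:

```python
import torch

B, M, K, N = 4, 8, 16, 32

# torch.bmm expects input 1 of shape (B, M, K) and input 2 of shape (B, K, N)
a = torch.randn(B, M, K)
b = torch.randn(B, K, N)

out = torch.bmm(a, b)
assert out.shape == (B, M, N)
```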
```diff
 res = torch.ops.fbgemm.f8f8bf16_rowwise_batched(
     a_data,
-    b_data,
+    b_data.transpose(-2, -1),
```
will performance be a concern?
transpose is just a metadata change, no impact on performance
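A quick standalone sketch confirming this: `Tensor.transpose` returns a view that shares storage with the original tensor, so no data is copied.

```python
import torch

b = torch.randn(4, 16, 32)
bt = b.transpose(-2, -1)

# The transposed tensor is a view: same storage, only shape/strides differ.
assert bt.data_ptr() == b.data_ptr()
assert bt.shape == (4, 32, 16)
assert not bt.is_contiguous()
```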
```
m = Model(weight).eval()
original = m(input)
# we need to transpose the weight first for bmm
m.weight = torch.nn.Parameter(m.weight.transpose(1, 2).contiguous())
```
this was from llama-models I think: https://github.com/meta-llama/llama-models/blob/0e0b8c519242d5833d8c11bffc1232b77ad7f301/models/llama4/quantization/loader.py#L142, although it's not as important now
but I guess the important thing is how we implement it in a way that can be used by different implementations: would the current fp8 bmm implementation work for the different ways people use bmm?
this is overriding torch.bmm, so we definitely should match the semantics of torch.bmm in terms of input shapes. It doesn't make sense to do a bmm with shapes that aren't (B, M, K) and (B, K, N). If that breaks llama-models, then they should fix it to match bmm semantics.
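To make the layout question concrete, here is a small hypothetical sketch (shapes and variable names are made up, not taken from llama-models or this PR) contrasting a weight already stored in the bmm-native (B, K, N) layout with one stored as (B, N, K), which has to be transposed to match `torch.bmm` semantics:

```python
import torch

B, M, K, N = 2, 4, 8, 16
x = torch.randn(B, M, K)

# bmm-native layout: weight stored as (B, K, N), usable directly with torch.bmm
w_bmm = torch.randn(B, K, N)
out = torch.bmm(x, w_bmm)

# alternative layout: weight stored as (B, N, K); it must be transposed
# (and made contiguous before quantization) to match torch.bmm semantics
w_alt = torch.randn(B, N, K)
out_alt = torch.bmm(x, w_alt.transpose(1, 2).contiguous())

assert out.shape == out_alt.shape == (B, M, N)
```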
jerryzh168 left a comment:
looks good, we discussed offline that we want to do a `contiguous()` call for the weight if the model weight for bmm is not transposed (will happen in a separate PR)
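A rough sketch of that idea, not the follow-up PR's actual implementation: if the bmm weight arrives as a non-contiguous view (e.g. it was transposed into (B, K, N) from a (B, N, K) checkpoint), call `.contiguous()` before quantization so downstream kernels see a contiguous buffer. The `maybe_make_bmm_weight_contiguous` helper name below is made up for illustration.

```python
import torch

def maybe_make_bmm_weight_contiguous(weight: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: ensure a 3D bmm weight is a contiguous tensor.

    Assumes the caller has already put the weight into (B, K, N) layout,
    e.g. via weight.transpose(1, 2); here we only materialize that layout.
    """
    if not weight.is_contiguous():
        weight = weight.contiguous()
    return weight

# usage sketch
w = torch.randn(2, 16, 8).transpose(1, 2)   # non-contiguous (B, K, N) view
w = maybe_make_bmm_weight_contiguous(w)
assert w.is_contiguous()
```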