Refactor RMSNorm implementations to use torch.nn.functional.rms_norm #42461
base: main
Conversation
[For maintainers] Suggested jobs to run (before merge): run-slow: aimv2, apertus, arcee, aria, bamba, bitnet, blt, chameleon, clvp, csm, cwm, deepseek_v2, deepseek_v3, dia, diffllama, doge
Hey @mstojkovicTT, thanks for the PR! We definitely want the functions to be a drop-in replacement, so they should return exactly the same dtype as the old functions did. Also, in your tests you're initializing the weights to their default values (all ones). This may be problematic, because of the following scenario:

```python
return self.weight * hidden_states.to(input_dtype)
```

Here is the pytorch default implementation of `rms_norm`, for comparison:
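A pure-Python sketch of what `torch.nn.functional.rms_norm` computes, based on its documented formula (the real op is a fused kernel, so treat this only as a reference for the math; `eps` is shown with an explicit value rather than its documented default):

```python
import torch

def rms_norm_reference(x, normalized_shape, weight=None, eps=1e-6):
    # Normalize over the trailing dims given by normalized_shape:
    # y = x / sqrt(mean(x**2) + eps), optionally scaled by weight.
    dims = tuple(range(-len(normalized_shape), 0))
    rms = torch.sqrt(x.pow(2).mean(dims, keepdim=True) + eps)
    out = x / rms
    if weight is not None:
        out = out * weight
    return out
```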
I did the same testing, just with the additional

```python
with torch.no_grad():
    random_weight = torch.randn(hidden_size, device=device) * 0.05 + 1.0
    hf_module.weight.copy_(random_weight)
    new_module.weight.copy_(random_weight)
```

and everything still works. And also @Rocketknight1, thank you for taking the time to review this!
Yeah, I understand! My guess is that the original code was made to follow the original model implementation, even if it seems weird. Downcasting the hidden states back to the input dtype before multiplying by the weight seems to be part of that original behaviour.
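For instance, with a float32 weight (a module that has not been cast to half precision) and a float16 input, the ordering decides the output dtype purely through type promotion; a minimal standalone example:

```python
import torch

weight = torch.ones(8)                           # float32 parameter
hidden = torch.randn(2, 8, dtype=torch.float16)  # half-precision activations

# Multiplying the downcast activations by a float32 weight promotes back to float32:
print((weight * hidden).dtype)                   # torch.float32
# Casting the weight to the activation dtype instead keeps float16:
print((weight.to(hidden.dtype) * hidden).dtype)  # torch.float16
```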
Ah, that makes sense. What do you think about trying a CI run against the changes for now, and if it breaks something I will take a closer look? @Rocketknight1
Sure, let's see how it goes
run-slow: llama
This comment contains models: ["models/llama"]
CI Results
Model CI Report: ❌ Failed tests
What does this PR do?
Fixes #42398
This PR replaces custom RMSNorm / T5-style norm implementations (e.g. in Llama) that manually compute the variance and scaling with the built-in `torch.nn.functional.rms_norm`. For example, the manual upcast/variance/rsqrt sequence is simplified to a single `rms_norm` call; see the sketch below.
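Roughly, the change looks like the following sketch (the old path mirrors the Llama-style forward; the exact modules and names touched by the PR may differ):

```python
import torch
import torch.nn.functional as F

# Before: manual variance computation and scaling (Llama/T5 style).
def manual_rms_norm(hidden_states, weight, eps):
    input_dtype = hidden_states.dtype
    hidden_states = hidden_states.to(torch.float32)
    variance = hidden_states.pow(2).mean(-1, keepdim=True)
    hidden_states = hidden_states * torch.rsqrt(variance + eps)
    return weight * hidden_states.to(input_dtype)

# After: a single call to the built-in op.
def fused_rms_norm(hidden_states, weight, eps):
    return F.rms_norm(hidden_states, (hidden_states.shape[-1],), weight=weight, eps=eps)
```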
This keeps the behavior and epsilon handling the same while reducing the number of ops, which should improve performance for users without requiring any additional changes on their side.
To verify the performance and the numerical stability, I wrote a test comparing the outputs of the old and new implementations.
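A minimal sketch of that kind of comparison, using the current `LlamaRMSNorm` as the old implementation and a direct `F.rms_norm` call as the new one (shapes, tolerances, and the random-weight initialization here are illustrative, per the review discussion above):

```python
import torch
import torch.nn.functional as F
from transformers.models.llama.modeling_llama import LlamaRMSNorm

torch.manual_seed(0)
hidden_size, eps = 4096, 1e-6

# Old path: the existing module, with non-trivial weights copied in.
hf_module = LlamaRMSNorm(hidden_size, eps=eps)
with torch.no_grad():
    hf_module.weight.copy_(torch.randn(hidden_size) * 0.05 + 1.0)

x = torch.randn(2, 128, hidden_size)

y_hf = hf_module(x)
# New path: the built-in op with the same weights; cast so the dtypes match for the comparison.
y_new = F.rms_norm(x, (hidden_size,), weight=hf_module.weight, eps=eps).to(y_hf.dtype)

print(torch.allclose(y_hf, y_new, atol=1e-5), (y_hf - y_new).abs().max().item())
```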
The results show the following:
Note: I have found that when I try `dtype`s lower than `float32`, the old implementation keeps the output at `float32`, but my new one has the `dtype` of the input tensor. That's why I have to cast to `y_hf.dtype` (trying `float64`, for example, makes both implementations output `float64`). This can be changed, depending on what we want to accomplish.

Who can review?
@Rocketknight1