The Motivation
As we know, Layer Normalization (LayerNorm) is a critical component in many modern neural networks, especially in the Self-Attention structure of Transformers. While working on deploying a QWen-based model, I noticed that NCNN, a high-performance neural network inference framework, probably lacked a Vulkan shader for RMSNorm. My initial thought was to adapt the existing LayerNorm shader, since LayerNorm is such a common operation.
To my surprise, NCNN has no Vulkan implementation for LayerNorm at all! While LayerNorm itself isn't computationally intensive, the absence of a Vulkan shader means data must be transferred between the GPU and CPU during inference. This round-trip movement introduces significant latency and severely impacts performance. Therefore, I decided to implement it myself.
Analysis of the Existing CPU Implementation
To maintain consistency with the CPU-side logic, I started by analyzing src/layer/layernorm.cpp. The forward pass logic can be summarized by the following flowchart:
```mermaid
flowchart TD
Start[LayerNorm Forward Inplace] --> DimCheck{Check Input Dimensions}
DimCheck --> Dim1[dims=1]
DimCheck --> Dim2[dims=2]
DimCheck --> Dim3[dims=3]
Dim1 --> Process1D[Process 1D Tensor]
Process1D --> Layernorm1D[Call layernorm function<br>Normalize the entire vector]
Dim2 --> Process2D[Process 2D Tensor]
Process2D --> ForLoop2D[Loop over each row]
ForLoop2D --> Layernorm2D[Call layernorm function<br>Normalize each row]
Dim3 --> CheckAffineSize{Check affine_size}
CheckAffineSize -- affine_size == w --> Case1[Case 1: Row-wise Norm]
CheckAffineSize -- affine_size != w --> Case2[Case 2: Channel-wise Norm]
Case1 --> ForLoop3DCase1[Loop over each channel and row]
ForLoop3DCase1 --> Layernorm3DRow[Call layernorm function<br>Normalize each row]
Case2 --> ForLoop3DCase2[Loop over each channel]
ForLoop3DCase2 --> Layernorm3DChannel[Call layernorm function<br>Normalize the entire channel]
Layernorm1D --> ComputeMean[Compute Mean]
ComputeMean --> ComputeVar[Compute Variance]
ComputeVar --> Normalize[Normalization]
Normalize --> ApplyAffine{Apply Affine Transform?}
ApplyAffine -- Yes --> WithGammaBeta[Apply gamma and beta]
ApplyAffine -- No --> WithoutGammaBeta[Use normalized result only]
WithGammaBeta --> End[Done]
WithoutGammaBeta --> End
Layernorm2D --> ComputeMean
Layernorm3DRow --> ComputeMean
Layernorm3DChannel --> ComputeMean
```
The mathematical formula is: $$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \times \gamma + \beta$$
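In the CPU implementation below, this is evaluated in a fused multiply-add form: a per-group scale and bias are precomputed once, so each element needs only one multiply-add before the optional affine transform:

$$a = \frac{1}{\sqrt{\mathrm{Var}[x] + \epsilon}}, \qquad b = -\mathrm{E}[x] \cdot a, \qquad y = (x \cdot a + b) \times \gamma + \beta$$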
The core problem now becomes clear, and it explains why nobody has implemented a Vulkan shader for this layer: the two distinct cases for 3D tensors introduce branching logic that complicates the design of a unified shader.
A Unified Structure with group_size
To solve this, I introduced a group_size parameter. This refactoring unifies the logic by treating the input data as a hierarchical structure: channels -> num_groups_per_channel -> group_elements.
With this model, all four original cases can be handled uniformly. Here is the refactored CPU-side C++ code.
```cpp
// Copyright 2020 Tencent
// SPDX-License-Identifier: BSD-3-Clause

#include "layernorm.h"

namespace ncnn {

LayerNorm::LayerNorm()
{
    one_blob_only = true;
    support_inplace = true;
}

int LayerNorm::load_param(const ParamDict& pd)
{
    affine_size = pd.get(0, 0);
    eps = pd.get(1, 0.001f);
    affine = pd.get(2, 1);

    return 0;
}

int LayerNorm::load_model(const ModelBin& mb)
{
    if (affine == 0)
        return 0;

    gamma_data = mb.load(affine_size, 1);
    if (gamma_data.empty())
        return -100;

    beta_data = mb.load(affine_size, 1);
    if (beta_data.empty())
        return -100;

    return 0;
}

// The core normalization function that operates on a contiguous block of memory
static void layernorm(float* ptr, const float* gamma_ptr, const float* beta_ptr, float eps, int size)
{
    float sum = 0.f;
    for (int i = 0; i < size; i++)
    {
        sum += ptr[i];
    }
    float mean = sum / size;

    float sqsum = 0.f;
    for (int i = 0; i < size; i++)
    {
        float v = ptr[i] - mean;
        sqsum += v * v;
    }
    float var = sqsum / size;

    float a = 1.f / sqrtf(var + eps);
    float b = -mean * a;

    if (gamma_ptr && beta_ptr)
    {
        for (int i = 0; i < size; i++)
        {
            ptr[i] = (ptr[i] * a + b) * gamma_ptr[i] + beta_ptr[i];
        }
    }
    else
    {
        for (int i = 0; i < size; i++)
        {
            ptr[i] = ptr[i] * a + b;
        }
    }
}

int LayerNorm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const
{
    int dims = bottom_top_blob.dims;

    if (dims == 1)
    {
        // For a 1D tensor, the whole tensor is one group
        int w = bottom_top_blob.w;
        float* ptr = bottom_top_blob;
        layernorm(ptr, gamma_data, beta_data, eps, w);
    }
    else if (dims == 2)
    {
        // For a 2D tensor, each row is a group
        int w = bottom_top_blob.w;
        int h = bottom_top_blob.h;

        #pragma omp parallel for num_threads(opt.num_threads)
        for (int i = 0; i < h; i++)
        {
            float* ptr = bottom_top_blob.row(i);
            layernorm(ptr, gamma_data, beta_data, eps, w);
        }
    }
    else if (dims == 3)
    {
        int w = bottom_top_blob.w;
        int h = bottom_top_blob.h;
        int channels = bottom_top_blob.c;

        int group_size;
        int num_groups_per_channel;

        // Determine the group configuration based on affine_size
        if (affine_size == w)
        {
            // Case 1: Row-wise normalization. Each row is a group.
            group_size = w;
            num_groups_per_channel = h;
        }
        else // if (affine_size == w * h)
        {
            // Case 2: Channel-wise normalization. The entire channel is one group.
            group_size = w * h;
            num_groups_per_channel = 1;
        }

        #pragma omp parallel for num_threads(opt.num_threads)
        for (int q = 0; q < channels; q++)
        {
            // Use .channel(q) to get the correct starting pointer, which handles cstep
            float* channel_ptr = bottom_top_blob.channel(q);
            for (int i = 0; i < num_groups_per_channel; i++)
            {
                // Pointer arithmetic within a channel is safe as its data is contiguous
                float* ptr = channel_ptr + i * group_size;
                layernorm(ptr, gamma_data, beta_data, eps, group_size);
            }
        }
    }

    return 0;
}

} // namespace ncnn
```
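To summarize, the original four cases map onto the channels -> groups -> elements hierarchy as follows (this is only a restatement of the code above):

- dims=1: one group of size w
- dims=2: h groups of size w
- dims=3, affine_size == w: c channels, h groups per channel, group_size = w
- dims=3, affine_size == w * h: c channels, 1 group per channel, group_size = w * h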
Vulkan Shader Implementation Plan
With this unified logic, we can now design a set of Vulkan shaders. Following the excellent example of InstanceNorm in NCNN, the process can be broken down into multiple steps, typically implemented as a multi-pass shader reduction:
Reduce Pass 1 (Sum & SqSum): A compute shader to calculate the sum and sum of squares for each group in parallel. The results are written to an intermediate buffer.
Reduce Pass 2 (Mean & Variance): A small shader that takes the intermediate buffer and computes the final mean and variance for each group.
Normalization Pass: The main shader that applies the normalization formula using the calculated mean and variance, and then applies the affine transformation (gamma and beta).
This approach is highly parallelizable and maps well to the GPU architecture.
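To make the data flow concrete, here is a minimal scalar C++ sketch of what each pass computes. This is only a host-side reference, not the actual GLSL: the function names are illustrative, and a real shader would split the loops across workgroup invocations and handle NCNN's packing (pack4/pack8) and fp16 storage.

```cpp
#include <math.h>

// Pass 1: per-group reduction. On the GPU this would be one workgroup per group,
// accumulating partial sums in shared memory before a final reduction.
static void reduce_sum_sqsum(const float* ptr, int group_size, float& sum, float& sqsum)
{
    sum = 0.f;
    sqsum = 0.f;
    for (int i = 0; i < group_size; i++)
    {
        sum += ptr[i];
        sqsum += ptr[i] * ptr[i];
    }
}

// Pass 2: turn (sum, sqsum) into the per-group scale a and bias b,
// matching the a/b form used by the CPU layernorm function.
static void compute_coeffs(float sum, float sqsum, int group_size, float eps, float& a, float& b)
{
    float mean = sum / group_size;
    float var = sqsum / group_size - mean * mean;
    a = 1.f / sqrtf(var + eps);
    b = -mean * a;
}

// Pass 3: elementwise normalization plus the optional affine transform.
static void normalize_affine(float* ptr, const float* gamma, const float* beta, float a, float b, int group_size)
{
    for (int i = 0; i < group_size; i++)
    {
        float v = ptr[i] * a + b;
        ptr[i] = (gamma && beta) ? v * gamma[i] + beta[i] : v;
    }
}
```

Chaining the three functions over one group reproduces what the CPU layernorm function does in place. Note that pass 2 uses the one-pass variance formula E[x^2] - E[x]^2, which is what a sum/sqsum reduction naturally produces; it is mathematically equivalent to the CPU code's two-pass computation, though slightly less robust numerically when the mean is large.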