The Motivation
As we know, Layer Normalization (LayerNorm) is a critical component in many modern neural networks, especially in the Self-Attention structure of Transformers. While working on deploying a QWen-based model, I noticed that NCNN, a high-performance neural network inference framework, probably lacked a Vulkan shader for RMSNorm. My initial thought was to adapt the existing LayerNorm shader, since LayerNorm is such a common operation.
To my surprise, NCNN has no Vulkan implementation for LayerNorm at all! While LayerNorm itself isn't computationally intensive, the absence of a Vulkan shader means data must be transferred between the GPU and CPU during inference. This round-trip movement introduces significant latency and severely impacts performance. Therefore, I decided to implement it myself.
Analysis of the Existing CPU Implementation
To maintain consistency with the CPU-side logic, I started by analyzing src/layer/layernorm.cpp. The forward pass logic can be summarized by the following flowchart:
```mermaid
flowchart TD
Start[LayerNorm Forward Inplace] --> DimCheck{Check Input Dimensions}
DimCheck --> Dim1[dims=1]
DimCheck --> Dim2[dims=2]
DimCheck --> Dim3[dims=3]
Dim1 --> Process1D[Process 1D Tensor]
Process1D --> Layernorm1D[Call layernorm function<br>Normalize the entire vector]
Dim2 --> Process2D[Process 2D Tensor]
Process2D --> ForLoop2D[Loop over each row]
ForLoop2D --> Layernorm2D[Call layernorm function<br>Normalize each row]
Dim3 --> CheckAffineSize{Check affine_size}
CheckAffineSize -- affine_size == w --> Case1[Case 1: Row-wise Norm]
CheckAffineSize -- affine_size != w --> Case2[Case 2: Channel-wise Norm]
Case1 --> ForLoop3DCase1[Loop over each channel and row]
ForLoop3DCase1 --> Layernorm3DRow[Call layernorm function<br>Normalize each row]
Case2 --> ForLoop3DCase2[Loop over each channel]
ForLoop3DCase2 --> Layernorm3DChannel[Call layernorm function<br>Normalize the entire channel]
Layernorm1D --> ComputeMean[Compute Mean]
ComputeMean --> ComputeVar[Compute Variance]
ComputeVar --> Normalize[Normalization]
Normalize --> ApplyAffine{Apply Affine Transform?}
ApplyAffine -- Yes --> WithGammaBeta[Apply gamma and beta]
ApplyAffine -- No --> WithoutGammaBeta[Use normalized result only]
WithGammaBeta --> End[Done]
WithoutGammaBeta --> End
Layernorm2D --> ComputeMean
Layernorm3DRow --> ComputeMean
Layernorm3DChannel --> ComputeMean
```
The mathematical formula is: $$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \times \gamma + \beta$$
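In the CPU implementation below, this is evaluated in a fused multiply-add form: a per-group scale and bias are precomputed once, so each element needs only one multiply-add before the optional affine transform:

$$a = \frac{1}{\sqrt{\mathrm{Var}[x] + \epsilon}}, \qquad b = -\mathrm{E}[x] \cdot a, \qquad y = (x \cdot a + b) \times \gamma + \beta$$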
The core problem now becomes clear, and it explains why nobody has implemented a Vulkan shader for this layer: the two distinct cases for 3D tensors introduce branching logic that complicates the design of a unified shader.
A Unified Structure with group_size
To solve this, I introduced a group_size parameter. This refactoring unifies the logic by treating the input data as a hierarchical structure: channels -> num_groups_per_channel -> group_elements.
With this model, all four original cases can be handled uniformly. Here is the refactored CPU-side C++ code.
```cpp
// Copyright 2020 Tencent
// SPDX-License-Identifier: BSD-3-Clause

#include "layernorm.h"

namespace ncnn {

LayerNorm::LayerNorm()
{
    one_blob_only = true;
    support_inplace = true;
}

int LayerNorm::load_param(const ParamDict& pd)
{
    affine_size = pd.get(0, 0);
    eps = pd.get(1, 0.001f);
    affine = pd.get(2, 1);

    return 0;
}

int LayerNorm::load_model(const ModelBin& mb)
{
    if (affine == 0)
        return 0;

    gamma_data = mb.load(affine_size, 1);
    if (gamma_data.empty())
        return -100;

    beta_data = mb.load(affine_size, 1);
    if (beta_data.empty())
        return -100;

    return 0;
}

// The core normalization function that operates on a contiguous block of memory
static void layernorm(float* ptr, const float* gamma_ptr, const float* beta_ptr, float eps, int size)
{
    float sum = 0.f;
    for (int i = 0; i < size; i++)
    {
        sum += ptr[i];
    }
    float mean = sum / size;

    float sqsum = 0.f;
    for (int i = 0; i < size; i++)
    {
        float v = ptr[i] - mean;
        sqsum += v * v;
    }
    float var = sqsum / size;

    float a = 1.f / sqrtf(var + eps);
    float b = -mean * a;

    if (gamma_ptr && beta_ptr)
    {
        for (int i = 0; i < size; i++)
        {
            ptr[i] = (ptr[i] * a + b) * gamma_ptr[i] + beta_ptr[i];
        }
    }
    else
    {
        for (int i = 0; i < size; i++)
        {
            ptr[i] = ptr[i] * a + b;
        }
    }
}

int LayerNorm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const
{
    int dims = bottom_top_blob.dims;

    if (dims == 1)
    {
        // For a 1D tensor, the whole tensor is one group
        int w = bottom_top_blob.w;
        float* ptr = bottom_top_blob;
        layernorm(ptr, gamma_data, beta_data, eps, w);
    }
    else if (dims == 2)
    {
        // For a 2D tensor, each row is a group
        int w = bottom_top_blob.w;
        int h = bottom_top_blob.h;

        #pragma omp parallel for num_threads(opt.num_threads)
        for (int i = 0; i < h; i++)
        {
            float* ptr = bottom_top_blob.row(i);
            layernorm(ptr, gamma_data, beta_data, eps, w);
        }
    }
    else if (dims == 3)
    {
        int w = bottom_top_blob.w;
        int h = bottom_top_blob.h;
        int channels = bottom_top_blob.c;

        int group_size;
        int num_groups_per_channel;

        // Determine the group configuration based on affine_size
        if (affine_size == w)
        {
            // Case 1: Row-wise normalization. Each row is a group.
            group_size = w;
            num_groups_per_channel = h;
        }
        else // if (affine_size == w * h)
        {
            // Case 2: Channel-wise normalization. The entire channel is one group.
            group_size = w * h;
            num_groups_per_channel = 1;
        }

        #pragma omp parallel for num_threads(opt.num_threads)
        for (int q = 0; q < channels; q++)
        {
            // Use .channel(q) to get the correct starting pointer, which handles cstep
            float* channel_ptr = bottom_top_blob.channel(q);
            for (int i = 0; i < num_groups_per_channel; i++)
            {
                // Pointer arithmetic within a channel is safe as its data is contiguous
                float* ptr = channel_ptr + i * group_size;
                layernorm(ptr, gamma_data, beta_data, eps, group_size);
            }
        }
    }

    return 0;
}

} // namespace ncnn
```
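To summarize, the original four cases map onto the channels -> groups -> elements hierarchy as follows (this is only a restatement of the code above):

- dims=1: one group of size w
- dims=2: h groups of size w
- dims=3, affine_size == w: c channels, h groups per channel, group_size = w
- dims=3, affine_size == w * h: c channels, 1 group per channel, group_size = w * h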
Vulkan Shader Implementation Plan
With this unified logic, we can now design a set of Vulkan shaders. Following the excellent example of InstanceNorm in NCNN, the process can be broken down into multiple steps, typically implemented as a multi-pass shader reduction:
Reduce Pass 1 (Sum & SqSum): A compute shader to calculate the sum and sum of squares for each group in parallel. The results are written to an intermediate buffer.
Reduce Pass 2 (Mean & Variance): A small shader that takes the intermediate buffer and computes the final mean and variance for each group.
Normalization Pass: The main shader that applies the normalization formula using the calculated mean and variance, and then applies the affine transformation (gamma and beta).
This approach is highly parallelizable and maps well to the GPU architecture.
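To make the data flow concrete, here is a minimal scalar C++ sketch of what each pass computes. This is only a host-side reference, not the actual GLSL: the function names are illustrative, and a real shader would split the loops across workgroup invocations and handle NCNN's packing (pack4/pack8) and fp16 storage.

```cpp
#include <math.h>

// Pass 1: per-group reduction. On the GPU this would be one workgroup per group,
// accumulating partial sums in shared memory before a final reduction.
static void reduce_sum_sqsum(const float* ptr, int group_size, float& sum, float& sqsum)
{
    sum = 0.f;
    sqsum = 0.f;
    for (int i = 0; i < group_size; i++)
    {
        sum += ptr[i];
        sqsum += ptr[i] * ptr[i];
    }
}

// Pass 2: turn (sum, sqsum) into the per-group scale a and bias b,
// matching the a/b form used by the CPU layernorm function.
static void compute_coeffs(float sum, float sqsum, int group_size, float eps, float& a, float& b)
{
    float mean = sum / group_size;
    float var = sqsum / group_size - mean * mean;
    a = 1.f / sqrtf(var + eps);
    b = -mean * a;
}

// Pass 3: elementwise normalization plus the optional affine transform.
static void normalize_affine(float* ptr, const float* gamma, const float* beta, float a, float b, int group_size)
{
    for (int i = 0; i < group_size; i++)
    {
        float v = ptr[i] * a + b;
        ptr[i] = (gamma && beta) ? v * gamma[i] + beta[i] : v;
    }
}
```

Chaining the three functions over one group reproduces what the CPU layernorm function does in place. Note that pass 2 uses the one-pass variance formula E[x^2] - E[x]^2, which is what a sum/sqsum reduction naturally produces; it is mathematically equivalent to the CPU code's two-pass computation, though slightly less robust numerically when the mean is large.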