Exploring the Transformer Series (14) --- Residual Networks and Normalization

0x00 Overview

The Transformer can be viewed as a general-purpose differentiable computer. Its self-attention mechanism provides a message-passing-like architecture, offering a versatile and powerful set of algorithms capable of covering many real-world applications, requiring only a few computational steps to complete tasks. Residual connections and layer normalization enable efficient learning and optimization capabilities. In the Transformer architecture, the Add & Norm modules are used to add residual connections and normalization operations between the multi-head self-attention mechanism and the feedforward neural network. Specifically, they consist of two parts: Add and Norm.

Add refers to the X + MultiHeadAttention(X) operation (the residual connection), which originates from the paper Deep Residual Learning for Image Recognition. A residual connection adds a sub-layer’s input x to its output F(x), so the result passed on is F(x) + x.
Norm stands for Layer Normalization, originating from the paper Layer Normalization. It normalizes the output of the residual connection so that information and gradients propagate at a controlled scale, which mitigates vanishing and exploding gradients and improves the training efficiency and performance of the model. Other normalization methods exist, such as Batch Normalization, but the Transformer uses layer normalization.
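As a rough illustration of the two parts, here is a minimal PyTorch sketch of an Add & Norm step applied to a self-attention sub-layer. The class name `AddNorm`, the sizes, and the post-norm ordering (normalize after the addition) are assumptions made for this example.

```python
import torch
import torch.nn as nn

class AddNorm(nn.Module):
    """Hypothetical helper: LayerNorm(x + sublayer_output)."""

    def __init__(self, d_model, dropout=0.0):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer_out):
        # Add: residual connection; Norm: layer normalization over the feature dimension
        return self.norm(x + self.dropout(sublayer_out))

torch.manual_seed(0)
d_model = 8
x = torch.randn(2, 5, d_model)  # (batch, sequence length, features)
attn = nn.MultiheadAttention(d_model, num_heads=2, batch_first=True)
attn_out, _ = attn(x, x, x)     # self-attention: query = key = value = x
add_norm = AddNorm(d_model)
y = add_norm(x, attn_out)
print(y.shape)
```

The output keeps the shape of the input, and every token vector has roughly zero mean and unit variance over its features.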

The combination of residual connections and normalized layers enables Transformers to be trained more efficiently, improving model performance and stability. This combination is one of the key factors in the success of modern deep learning architectures. We will guide you through this process with a few questions, which will be answered after completing this article.

Where are Add & Norm located in the Transformer?
Why does the transformer block use LayerNorm instead of BatchNorm?
Should LayerNorm be placed at the beginning or the end?

0x01 Residual Connection

We first present an expanded view of the Transformer LM with attention and feedforward network blocks, including model weights (grey) and residual flow states (green). This demonstrates the importance of residual connections in the Transformer.

1401

1.1 Problem

According to the universal approximation theorem, a feedforward network with a single hidden layer can approximate any continuous function given enough capacity. However, that layer may need to be very large, which makes the network prone to overfitting the data. Therefore, the research community generally believes that network architectures need more layers. Theoretically, the more layers a neural network has, the more complex features it can learn, and the better its performance. However, in reality, when the number of layers increases to a certain extent, the model’s performance drops sharply. This is mainly due to three problems:

Vanishing Gradient: Vanishing gradient refers to the phenomenon in deep neural networks where the gradient gradually decreases during backpropagation, eventually approaching zero. This is caused by the chain rule used in backpropagation. Because current training primarily uses gradient-based optimizers, vanishing gradients mean we lack good signals to adjust and optimize earlier layers. In other words, earlier layers may receive almost no updates, remaining in a randomly initialized state. Only layers closer to the output are updated relatively well. However, in forward propagation, the input to these better-updated layers is the output of earlier, poorly-updated layers. Therefore, the input quality of these better-updated layers may be poor (due to a near-random transformation), resulting in poor overall performance even if later layers are updated well. The final result is that the deeper the model, the worse the performance. This is known as weight degradation, where newly added layers learn little as the model becomes deeper.
Exploding Gradient: Exploding gradient refers to the phenomenon where gradient values grow exponentially during backpropagation, leading to drastic updates to network weights, network instability, and decreased training effectiveness. In deep neural networks, due to the chain rule, gradients need to be backpropagated through multiple layers. If the gradient grows slightly in each layer, it becomes extremely large after passing through many layers, resulting in an exploding gradient. Therefore, increasing the number of network layers raises the risk of exploding gradients.
Network degradation: Theoretically, there is an optimal number of layers in a network. If the model exceeds this number of layers, the effect of the extra layers will not exceed the effect of the model with the optimal number of layers. That is, these redundant layers will lead to network degradation.

The residual connection (skip connection) proposed by Kaiming He et al. can effectively alleviate these problems. It makes deep learning models with dozens or hundreds of layers easier to train, maintaining or even improving accuracy as model depth increases.

shortcut connections

The idea of residual connections originates from centering. In neural networks, centering the input data (subtracting the mean) is widely known to accelerate learning. The paper “Accelerated gradient descent by factor-centering decomposition” extends this idea to the backpropagated gradients: not only the inputs and hidden-layer activations but also the gradient errors and weight updates can be centered. The connection between inputs and outputs introduced for this purpose is called a shortcut connection. Specifically, the paper proposes decomposing the network into two subnetworks: a biased subnetwork and a centered subnetwork. By training these two subnetworks in parallel, so that they learn the linear and nonlinear transformations respectively, the learning difficulty of each network is reduced, and the training speed of gradient descent is significantly improved.

1402

The paper “Highway Networks” proposes highway networks, which use learned gating units to regulate the flow of information and employ skip connections so that information can pass through the many layers of a deep network largely unimpeded, thus mitigating the gradient problem.

1403

Identity mapping

Identity mapping, simply put, means that the input and output are the same. The role of identity mapping in neural networks is subtle. The ResNet paper provides an example (as shown in the image below): a 56-layer deep network performs worse than a 20-layer shallow network on both the training and test sets. If the performance of a 20-layer shallow network is already good enough, then even if the subsequent 36 layers do nothing, it shouldn’t perform worse than a 20-layer network. However, this is not the case. Why does this degradation occur? The paper concludes that neural networks do not easily learn identity mapping.

1404

1.3 Network Structure

The core idea of ResNet is to introduce an “identity shortcut connection” that skips one or more layers. Superficially, a residual connection adds the network’s input and output, resulting in the output F(x) + x. Specifically, if we represent the sub-layer’s operation as F(x) and its input as x, then the output of the residual connection is x + F(x), which is also the input to the next layer. There are two papers on ResNet; let’s examine their derivations.

Paper V1

First, the paper takes a shallow network whose accuracy has saturated and appends identity-mapping layers (i.e., layers whose output equals their input) to build a deeper network. By construction, this deeper network should perform no worse than the shallow one. However, experimental results show that a deeper plain network trained in practice is not better than the shallow network; its results are actually worse.

$y=F(x)$

1405

Therefore, the paper proposes a solution: using a deep residual network, the specific structure of which is as follows.

1406

The residual network rewrites the desired mapping $\mathcal{H}(x)$ as two parts, $\mathcal{H}(x) = x + (\mathcal{H}(x) - x) = x + \mathcal{F}(x)$ , thus changing the optimization of $\mathcal{H}(x)$ into the optimization of $\mathcal{F}(x)$ . The goal is to learn the residual between the desired output and the input x, i.e., $\mathcal{F}(x) := \mathcal{H}(x) - x$ . Here, $\mathcal{F}(x)$ is learned by a sub-network consisting of two weight layers and the nonlinear activation function ReLU. The advantages of this design are:

Simulating identity mappings using a bunch of nonlinear networks is very difficult, but simulating the network F(x)=H(x)−x=0 is relatively simple.
Identity shortcuts do not add extra parameters or computational complexity, making the network easier to optimize and improving generalization performance.
If the output of $\mathcal{F}(x)$ is 0, meaning there is no residual, then the shortcut connection makes the identity mapping possible, i.e., directly outputting x, which greatly helps the stability and acceleration of network training.
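The last point can be checked with a small sketch, assuming PyTorch. The class `ResidualBlock` is a simplified, hypothetical stand-in: it uses two linear layers with one ReLU for F(x) and omits the activation that ResNet applies after the addition. When the final weight layer is zero, F(x) = 0 and the block reduces to the identity mapping.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified block computing x + F(x), with F = two weight layers and a ReLU."""

    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def residual(self, x):
        # F(x): the part the network has to learn
        return self.fc2(torch.relu(self.fc1(x)))

    def forward(self, x):
        # Identity shortcut: no extra parameters
        return x + self.residual(x)

torch.manual_seed(0)
block = ResidualBlock(16)
x = torch.randn(4, 16)

# Push the last weight layer to zero, so that F(x) = 0 for every input
nn.init.zeros_(block.fc2.weight)
nn.init.zeros_(block.fc2.bias)
y = block(x)
print(torch.equal(y, x))
```

With F(x) driven to zero the block passes its input through unchanged, which is what makes the identity mapping easy to reach in this formulation.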

Let’s look at a few details. First, as shown in the figure above, $\mathcal{F}(x)$ has two hidden layers, which ensures that the expression contains at least one non-linear activation function, ReLU. This is also explained in the paper, as shown in the figure below.

1407

Secondly, the paper points out that the weights of multiple nonlinear layers might be set close to 0, as shown in red in the figure below. Why is this? Is deepening the network simply for the sake of deepening? This is because, in reality, the identity mapping cannot always be optimal. Those skipped layers are not truly skipped, and their weights are not always 0. When the neural network can learn the residual F(x), the output is F(x) + x; when the neural network cannot learn the residual F(x), the worst output is still the identity mapping x. Through the reconstruction of the residual learning, if the identity mapping is optimal, the solver may simply push the weights of multiple nonlinear connections to zero to approximate the identity mapping.

1408

Let’s compare ResNet and highway networks. The two take opposite approaches to the shortcut.

Gated shortcuts in highway networks are data-dependent and have parameters. ResNet’s identity shortcuts, however, have no parameters.
When the gated shortcuts of a highway network are “off” (close to zero), the layers in a highway network represent non-residual functions. In contrast, ResNet always learns residual functions; its identity shortcuts are never off, all information passes through, and additional residual functions need to be learned.
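The contrast can be sketched in PyTorch as follows. The highway layer uses the form y = T(x)·H(x) + (1 − T(x))·x from the Highway Networks paper; the class names, the sizes, and the extreme gate biases are assumptions chosen only to make the two limiting cases visible.

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """y = T(x) * H(x) + (1 - T(x)) * x, with a learned, data-dependent gate T."""

    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)  # H(x)
        self.gate = nn.Linear(dim, dim)       # T(x): extra parameters for the gate

    def forward(self, x):
        h = torch.relu(self.transform(x))
        t = torch.sigmoid(self.gate(x))
        return t * h + (1.0 - t) * x

class ResidualLayer(nn.Module):
    """y = x + F(x): the shortcut is a parameter-free identity and is never closed."""

    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)

    def forward(self, x):
        return x + torch.relu(self.transform(x))

torch.manual_seed(0)
x = torch.randn(3, 4)
highway = HighwayLayer(4)
nn.init.zeros_(highway.gate.weight)

# Shortcut fully open (T close to 0): the layer passes x through
nn.init.constant_(highway.gate.bias, -30.0)
carry_out = highway(x)

# Shortcut closed (T close to 1): the layer is the plain, non-residual H(x)
nn.init.constant_(highway.gate.bias, 30.0)
plain_out = highway(x)
plain_ref = torch.relu(highway.transform(x))

res_out = ResidualLayer(4)(x)
```

In the second setting the highway layer no longer contains any trace of the identity path, whereas the residual layer always keeps it.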

Paper V2

Following the ResNet paper, the original authors published a second paper, “Identity Mappings in Deep Residual Networks,” which proposed an improved version, ResNetV2. This second paper divides ResNet into three parts: h (skip connection), f (after-addition activation), and F (residual function), and then provides further analysis, as shown in the figure below.

1409

The research team also explored various shortcut (residual connection) designs, as shown in the figure below.

1410

Ultimately, a core conclusion was reached: among the shortcut designs compared, the plain identity shortcut works best. Scaling, gating, or otherwise modifying the shortcut makes very deep networks (more than 100 layers) harder to optimize.

1.4 Function

Gradient vanishing

Let’s understand residual connections from the perspective of gradients. Here, we’ll take the vanishing gradient as an example.

The residual connection’s answer to this degradation is quite simple: worried about the gradient vanishing? Then add a term with a constant gradient. The learned result of the newly added layer is added directly to its input: if x is added to the output of each layer, the output becomes F(x) + x. This allows information in the network to propagate directly across one or more layers. Even if a sub-layer learns nothing useful (or learns something wrong), the residual connection at least ensures that the information from the previous layer is not lost, which helps preserve information during forward propagation. During backpropagation, the derivative of x with respect to x is 1, so the derivative of each layer becomes ∂F(x)/∂x + 1. In other words, one gradient path bypasses F(x) and flows straight through to the earlier layers. Thus, even if the gradient through F(x) vanishes, the gradient along the direct path x is preserved, which greatly alleviates the vanishing gradient problem.
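A small experiment makes the effect of this extra gradient path visible. This is only a sketch, assuming PyTorch: the helper `input_grad_norm`, the depth of 30, the width of 16, and the sigmoid activation are arbitrary choices for illustration. It measures how much gradient reaches the input of a deep stack with and without the skip connection.

```python
import torch
import torch.nn as nn

def input_grad_norm(depth, use_skip, width=16, seed=0):
    """Norm of d(sum of outputs)/d(input) for a stack of sigmoid layers."""
    torch.manual_seed(seed)
    layers = [nn.Linear(width, width) for _ in range(depth)]
    x = torch.randn(4, width, requires_grad=True)
    h = x
    for layer in layers:
        f = torch.sigmoid(layer(h))     # F(h)
        h = h + f if use_skip else f    # residual: h + F(h); plain: F(h)
    h.sum().backward()
    return x.grad.norm().item()

plain = input_grad_norm(depth=30, use_skip=False)
residual = input_grad_norm(depth=30, use_skip=True)
print(f"plain: {plain:.3e}  residual: {residual:.3e}")
```

In the plain stack the gradient is multiplied by a small Jacobian at every layer and shrinks towards zero, while in the residual stack each layer contributes a Jacobian of the form I + ∂F/∂h, so the gradient at the input remains many orders of magnitude larger.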

Alleviate degeneration

The paper “SKIP CONNECTIONS ELIMINATE SINGULARITIES” points out that residual networks mitigate the degeneracy of neural networks, which is the key to their effectiveness. The paper also notes that part of the difficulty in training deep networks stems from singularities caused by the non-identifiability of the model, such as the following:

Overlap singularities, caused by the permutation symmetry of nodes in a given layer.
Elimination singularities, corresponding to the elimination of nodes.
Linear dependence singularities, arising from linear dependencies among nodes.

These singularities make the model degenerate. When this happens, learning slows down significantly along the degenerate directions in parameter space, and the effective dimensionality of the model is reduced (the network is less expressive than it appears), so the model loses many of its “degrees of freedom.” Furthermore, as the number of network layers increases, the overall rank becomes even lower after the successive multiplications.

Skip connections can restore the expressive power of a network. The following shows three singularities in fully connected layers and examples of how skip connections restore the expressive power of a network: eliminating zero singularities in the input and weights, breaking symmetry, and reducing the linear dependence of nodes.

1411

Elimination singularity. When a neuron’s input weights are zero, its output weight w becomes unidentifiable (red). A skip connection (blue) ensures the neuron is at least sometimes active, making the output weight identifiable again (green). Conversely, when the output weight is zero (w=0), a skip connection restores some identifiability.
Overlap singularity. Overlapping input weights (Ja=Jb) make the output weights unidentifiable; skip connections again break the degeneracy.
Linear dependence singularity. A subset of hidden neurons becomes linearly dependent, making their output weights unidentifiable. Skip connections break the linear dependence.
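The elimination case can be reproduced with a scalar toy model. This is a simplified sketch assuming PyTorch, not the setup used in the paper: a single tanh neuron has incoming weight J fixed at zero, and we inspect the gradient with respect to its outgoing weight w. The function name and the numeric values are hypothetical.

```python
import torch

def output_weight_grad(use_skip):
    """Gradient of the output with respect to the outgoing weight w."""
    x = torch.tensor(1.5)
    J = torch.tensor(0.0)                       # incoming weight stuck at zero
    w = torch.tensor(0.7, requires_grad=True)   # outgoing weight
    h = torch.tanh(J * x)                       # the neuron is silent: h = 0
    if use_skip:
        h = h + x                               # skip connection keeps the unit active
    (w * h).backward()
    return w.grad.item()

grad_plain = output_weight_grad(use_skip=False)
grad_skip = output_weight_grad(use_skip=True)
print(grad_plain, grad_skip)  # 0.0 1.5
```

Without the skip connection the output does not depend on w at all, so w receives no learning signal; with the skip connection the gradient equals the input value and w becomes identifiable again.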

Interlayer correction

We know that the output of each layer is an aggregation of its value vector distribution, but the output of a feedforward network also includes residual information from previous feedforward layers. So what role do these residual connections play? Several analyses in the paper “Transformer Feed-Forward Layers Are Key-Value Memories” can reveal some of the mechanisms.

The part of the figure below labeled 1 shows, for each layer, how often the highest-probability word of the prediction distribution induced by that layer’s residual vector matches the highest-probability word of the whole model’s prediction. Most of the time, the top prediction of the residual is already the final prediction of the model.
The part labeled 2 shows, for each layer, the probability that the distribution induced by that layer’s residual vector assigns to the token finally output by the model.

A similar trend can be observed in labels 1 and 2: as the model progresses through layers, the final prediction becomes increasingly certain, and the probability value increases. This indicates that, from layer to layer, the residuals gradually refine the model’s output, acting as a means of inter-layer correction.

1412

The image above, labeled 3, shows 100 randomly selected examples where the residual distribution and feedforward distribution differ in the last layer of the model. The results show that the semantics of the highest predicted word in the aggregated new distribution changed drastically in 66% of cases, while the semantics of the new predicted word were close to the original predicted word only in 34% of cases. This indicates that the feedforward network continuously updates the residual distribution, even in the last layer.

Mask vs. Residual

Some readers might ask: Does the residual structure at the decoding end add subsequent unseen token information, causing information leakage? The answer is no. This is because residual connections themselves do not involve masking or attention calculations. After the decoder’s self-attention layer, residual connections are indeed added to the output after the mask is applied, but this does not introduce any additional information because the mask already ensures that the model cannot “see” subsequent words.
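This can be verified numerically. The sketch below assumes PyTorch and a single decoder-style sub-layer (masked self-attention, residual addition, LayerNorm) with arbitrary sizes; the function name `decoder_sublayer` is hypothetical. It perturbs only the last token and checks that the outputs at earlier positions do not change.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, seq_len = 8, 6
attn = nn.MultiheadAttention(d_model, num_heads=2, batch_first=True)
norm = nn.LayerNorm(d_model)

# Boolean mask: True marks positions a query is NOT allowed to attend to (the future)
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

def decoder_sublayer(x):
    out, _ = attn(x, x, x, attn_mask=causal_mask)
    return norm(x + out)  # the residual only adds each position's own input

x = torch.randn(1, seq_len, d_model)
x_changed = x.clone()
x_changed[:, -1] += 10.0  # perturb only the last ("future") token

with torch.no_grad():
    y = decoder_sublayer(x)
    y_changed = decoder_sublayer(x_changed)

earlier_unchanged = torch.allclose(y[:, :-1], y_changed[:, :-1], atol=1e-5)
last_changed = not torch.allclose(y[:, -1], y_changed[:, -1], atol=1e-5)
print(earlier_unchanged, last_changed)
```

The residual term at position t is just the input at position t, so it carries nothing from later positions; only the last position reacts to the perturbation.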

0x02 Normalization

2.1 Problem

Internal covariate shift (ICS) refers to the change in the activation distribution during training due to continuous updates to network weights. This shift causes subsequent layers to receive different input distributions, hindering stable learning. In deep neural networks, the input to a new layer is the output of the previous layers. In practical terms, given a new layer, changes in the parameters of all preceding layers can significantly alter the input distribution of that layer. The higher the layer, the more pronounced the change in input distribution. It’s analogous to a tall building where a small tilt in the lower floors can lead to a larger tilt in the higher floors.

This phenomenon can lead to the following problems:

During training, the network needs to constantly adapt to new input data distributions (i.e., the number of gradient update iterations increases), which greatly reduces the learning speed.
As the number of network layers increases, the parameters may become too large or too small after multiple layers of computation. This may cause abnormalities in the learning process, and the model may converge very slowly.
Due to the different distribution of parameters, a lot of data may fall into the gradient saturation region, causing learning to stop prematurely.
The distribution of certain parameters deviates too much, which has a huge impact on other layers or the output.

In summary, the scale of input features affects the number of iterations and the difficulty of gradient updates in the gradient descent algorithm, thus impacting training convergence. Therefore, we need to find a method to ensure that all features have a similar scale. Whitening is an effective method to alleviate the internal covariate shift (ICS) phenomenon in traditional machine learning, but it has some problems, such as high computational cost and damage to the original expressive power of the input data. Therefore, we need a simpler solution, which should:

Normalize each feature dimension to ensure that its mean is 0 and its variance is 1.
Add a linear transformation so that the data can recover its original expressive power as much as possible.

This is the normalization scheme.

Note: There are differing opinions in the industry regarding whether the normalization scheme can effectively alleviate ICS, which we will analyze later.

2.2 Definition

Normalization (sometimes also called “standardization”) is the process of shifting and scaling the input data X before it is fed to neurons, so that the distribution of X is mapped to a standard distribution with a fixed range. Simply put, it controls the distribution of the input data (activation values) at each layer, filtering out some unimportant and complex information and pulling the data back to a distribution with zero mean and unit variance. This stabilizes the data distribution (controlling the variance of the data in each dimension, minimizing variations in input distribution between different layers, and making the input data/features as close to independent and identically distributed as possible). This reduces the difficulty of fitting the model and the risk of overfitting, and accelerates model convergence. The effects of normalization are as follows:

Stable training process: The normalization layer reduces the variation in input distribution between different layers by standardizing the activation values, thereby improving the stability of training.
Accelerated convergence: By reducing the internal covariate shift, the normalization layer can accelerate the training process, enabling the model to converge faster.
Improving model performance: By providing a more stable training signal, normalization layers can improve the final performance of the model, especially in deeper networks.

2.3 Type

Based on the dimensions over which the statistics are computed, normalization can be divided into many types. The different normalization schemes differ only in the dimensions they operate on, i.e., in which subset of the activations is used to compute the statistics. As shown in the figure below, two normalization types are commonly used in Transformers: LayerNorm and RMSNorm.

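LayerNorm normalizes the activations of each layer by subtracting their mean and dividing by their standard deviation, both computed over the feature dimension of each sample. It is widely used in SLMs (small language models).
RMSNorm rescales the activations of each layer by their root mean square, without subtracting the mean, which likewise stabilizes their scale.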
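The difference between the two can be sketched in a few lines, assuming PyTorch. The functions `layer_norm` and `rms_norm` are hand-written illustrations that omit the learnable gain and bias, and the eps values are assumptions.

```python
import torch
import torch.nn as nn

def layer_norm(x, eps=1e-5):
    # Subtract the mean and divide by the standard deviation over the feature dimension
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return (x - mu) / torch.sqrt(var + eps)

def rms_norm(x, eps=1e-6):
    # Divide by the root mean square over the feature dimension; no mean subtraction
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms

torch.manual_seed(0)
x = torch.randn(2, 5, 8) * 4.0 + 3.0   # (batch, sequence length, features)
ln_out = layer_norm(x)
rms_out = rms_norm(x)

# Cross-check the hand-written LayerNorm against PyTorch's module (affine part disabled)
reference = nn.LayerNorm(8, elementwise_affine=False)(x)
print(torch.allclose(ln_out, reference, atol=1e-5))
```

RMSNorm skips the mean computation, so its output keeps a non-zero mean but has a root mean square of about 1 for every token.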

1413

As shown in the chart below, LayerNorm was the most commonly used technique over the past few years. However, the trend is now shifting towards RMSNorm.

1414

We will begin by introducing BatchNorm, one of the most common techniques in deep learning.

0x03 BatchNorm

3.1 Formula

BatchNorm is the earliest and most common normalization scheme, while Layer Normalization is frequently used by Transformers. Both BatchNorm and LayerNorm perform the operation of “subtracting the mean and dividing by the standard deviation,” meaning both normalization methods convert the data into a standard normal distribution along the corresponding dimensions to address the distribution problem of values passed from the previous layer. So what are the differences between the two? And why does the NLP field favor LayerNorm? We will begin our analysis with BatchNorm.

BatchNorm processes features of the same dimension from different samples into the same distribution. Let’s take a look at the formula for BatchNorm.

1415

The overall algorithm consists of three steps: preparation (mean and variance), normalization, and restoration (scale and shift).

1416

The following diagram illustrates the specific algorithm.

1417

We’ll use an example to illustrate the algorithm. The diagram below illustrates how BatchNorm works on a batch of samples (note: the restoration step is not shown). In this example, a batch contains 5 samples, each with 4 feature dimensions, denoted as $x^1, ..., x^5$ . We denote the i-th feature of the k-th sample as $x_i^k$ . BatchNorm treats all values of the same feature across the batch as one distribution and standardizes it.

Preparation. First, for the current batch, calculate the mean and standard deviation of each feature. For the i-th feature (i=1,…,m), the mean and standard deviation are denoted as $\mu_i$ and $\sigma_i$ , respectively. They are determined by the input and do not require learning.
Normalization. Next, subtract the mean from each value and divide by the standard deviation so that the mean becomes 0 and the variance becomes 1: $y_i^k = (x_i^k - \mu_i) / \sigma_i$ , where $y_i^k$ is the normalized i-th feature of the k-th sample. This process ensures that the same feature across different samples follows the same distribution.
Restoration. After standardization, a linear mapping ($\gamma_i y_i^k + \beta_i$) is learned, where $\gamma_i$ and $\beta_i$ are the learnable scale and shift parameters of BatchNorm. The standardized output is thus mapped once more, and the specific mapping is learned automatically during training. This step is added because:
- The input distribution and weights of each layer in a deep neural network need to be coordinated. Forcing the distribution to have zero mean and unit variance is not necessarily the best choice. Adding parameters $\gamma$ and $\beta$ to scale and shift the input helps coordinate the distribution with the weights.
- Standardization disrupts the original distribution of the data, which may have a negative effect when the data is fed into downstream nonlinear functions such as activation functions. Therefore, a restorative linear transformation is added so that the original distribution can be recovered to an appropriate degree, and standardization no longer risks harming network training. Which distribution is suitable is determined during training.

1418

Assume these 5 samples are all people, with 4 features: height, weight, salary, and age. Height values might be 175cm, 182cm, etc., and weight values might be 63kg, 72kg, etc. These four features follow four completely different distributions. After processing with BatchNorm (before the restoration step), each of them has zero mean and unit variance.
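The three steps can be written out by hand and compared with PyTorch’s own layer. This is a minimal sketch: the function `batch_norm_manual`, the eps value, and all numbers in the table (five people, four features) are made up for illustration.

```python
import torch
import torch.nn as nn

# Rows: 5 people; columns: height (cm), weight (kg), salary, age. Values are invented.
x = torch.tensor([[175., 63.,  5000., 25.],
                  [182., 72.,  8000., 31.],
                  [168., 58.,  6500., 45.],
                  [190., 85., 12000., 38.],
                  [160., 50.,  4000., 22.]])

def batch_norm_manual(x, gamma, beta, eps=1e-5):
    # Preparation: per-feature mean and (biased) variance over the batch dimension
    mu = x.mean(dim=0, keepdim=True)
    var = x.var(dim=0, unbiased=False, keepdim=True)
    # Normalization: zero mean, unit variance for every feature
    y = (x - mu) / torch.sqrt(var + eps)
    # Restoration: learnable scale and shift (here left at their initial values)
    return gamma * y + beta

gamma, beta = torch.ones(4), torch.zeros(4)
out = batch_norm_manual(x, gamma, beta)
reference = nn.BatchNorm1d(4)(x)  # a freshly created module is in training mode

print(torch.allclose(out, reference, atol=1e-4))
print(out[:, 0])  # normalized heights keep their original ordering
```

The small eps inside the square root is a numerical-stability term that the formula in the text leaves out. Because the transformation is only a shift and a positive scaling per feature, the tallest person before normalization is still the tallest afterwards.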

3.2 Function

When BatchNorm was first proposed, it claimed to alleviate Internal Covariate Shift (ICS), that is, to stabilize the distribution by normalizing along the batch dimension. This explanation is mainly based on probability distributions: by normalizing the mean and variance of the current mini-batch, the input distribution of each layer remains stable, thus reducing ICS and stabilizing or even accelerating training.

Batch Normalization (BN) assumes that features with the same dimension have the same distribution. Therefore, normalizing the same feature across different samples does not disrupt the relationship between samples of the same feature, and the result of normalization maintains the comparability between samples. After all, “subtracting the mean and dividing by the standard deviation” is a linear operation of translation and scaling. In the example above, this means that “a tall person before normalization will still be tall after normalization, and an older person before normalization will not become younger after normalization.” This property further determines that samples remain comparable after normalization.

However, the idea of “alleviating ICS” is currently highly questionable. This is because the inputs at any layer cannot strictly follow a normal distribution, so simply standardizing the mean and variance cannot achieve a standard distribution N(0, 1). Secondly, even if the inputs could be scaled to N(0, 1), this interpretation cannot further explain why other normalization methods (such as Instance Normalization and Layer Normalization) work.

The paper “How Does Batch Normalization Help Optimization?” explicitly raises the above questions, refutes some of the original views, and proposes a new understanding of BN: the main function of BN is to make the optimization landscape of the entire loss function (the loss surface composed of the high-dimensional loss function of the neural network) smoother, so that we can train more smoothly.

In other words, Batch Normalization (BN) mitigates variance shift. This shift is primarily caused by the “+” operation: when independent random variables are added, the variance of the sum is the sum of their variances. In CNNs, a convolution accumulates many terms for each channel of the feature map. Similarly, attention, commonly used in NLP, accumulates many terms for the embedding of each token. If this accumulated variance is not suppressed, the numerical divergence can be disastrous.
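The accumulation of variance under addition is easy to observe numerically. The following sketch assumes PyTorch and uses independent unit-variance Gaussian terms, an idealization chosen for illustration; real activations are neither independent nor Gaussian.

```python
import torch

torch.manual_seed(0)
num_terms, num_samples = 16, 200_000

# Independent terms, each with variance 1
terms = torch.randn(num_terms, num_samples)

# Running sum after 1, 2, ..., 16 additions
running_sum = terms.cumsum(dim=0)
variances = running_sum.var(dim=1)
print(variances[[0, 3, 15]])   # roughly 1, 4 and 16

# Normalizing the accumulated sum pulls the variance back to 1
normalized = running_sum[-1] / running_sum[-1].std()
print(normalized.var())
```

The variance grows linearly with the number of summed terms, and a single normalization step brings it back to 1.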

3.3 PyTorch Usage

The image below shows an example of BN in PyTorch.

BatchNorm1d: Because Batch Normalization calculates the statistical data of (N, L) slices in the C dimension, it is generally called Temporal Batch Normalization.
BatchNorm2d: Because Batch Normalization calculates the statistical data of (N, H, W) slices along the C dimension, it is generally called Spatial Batch Normalization. That is, assuming the feature map is $x \in R^{N\times C\times H\times W}$ , containing N samples, each sample has C channels, H height, and W width. When calculating the mean and variance, operations are performed on N, H, and W, while retaining the channel dimension C. Specifically, the mean of channel 1 is obtained by averaging the first channel of the first sample, the first channel of the second sample, and so on, up to the first channel of the Nth sample (note that this is divided by N×H×W, not simply by N; the final result is a number representing the average value of the first channel of this batch, not an H×W matrix). The variance of channel 1 is calculated similarly. By applying this operation to all channels, we obtain the mean and variance of all channels.
BatchNorm3d: Because Batch Normalization calculates the statistical data of (N, D, H, W) slices in the C dimension, it is generally called Volumetric Batch Normalization or Spatio-temporal Batch Normalization.

1419

Let’s take a look at the test code.

import torch
import torch.nn as nn

batchnorm = nn.BatchNorm1d(3)  # num_features = length of the innermost dimension of the tensor
# Treat (1, 6), (2, 4), and (3, 5) as three groups and normalize each group separately
input_data = torch.Tensor([[1,2,3],[6,4,5]])
output = batchnorm(input_data)
print(output)

batchnorm = nn.BatchNorm1d(2)  # num_features = length of the second-to-last dimension; the innermost dimension is pooled into the statistics
# Treat 1, 2 of the first sample and 5, 6 of the second sample as one group, and 3, 4 and 7, 8 as another group
input_data = torch.Tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
output = batchnorm(input_data)
print(output)

batchnorm = nn.BatchNorm2d(2)
# Input format is (batch_size, num_channels, height, width); statistics are computed per channel
input_data = torch.Tensor([[[[1,2],[3,4]],[[100,102],[4,3]]],[[[2,3],[4,5]],[[101,103],[5,3]]]])
output = batchnorm(input_data)
print(output)

The output is as follows

tensor([[-1.0000, -1.0000, -1.0000],
        [ 1.0000,  1.0000,  1.0000]], grad_fn=<NativeBatchNormBackward0>)
tensor([[[-1.2127, -0.7276],
         [-1.2127, -0.7276]],
        [[ 0.7276,  1.2127],
         [ 0.7276,  1.2127]]], grad_fn=<NativeBatchNormBackward0>)
tensor([[[[-1.6330, -0.8165],
          [ 0.0000,  0.8165]],
         [[ 0.9691,  1.0100],
          [-0.9947, -1.0151]]],
        [[[-0.8165,  0.0000],
          [ 0.8165,  1.6330]],
         [[ 0.9896,  1.0305],
          [-0.9742, -1.0151]]]], grad_fn=<NativeBatchNormBackward0>)
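To connect the last output with the description of BatchNorm2d, the per-channel statistics can be computed by hand over the (N, H, W) slices. This is a small sketch, assuming the default eps of 1e-5 and the same input as above.

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[1., 2.], [3., 4.]], [[100., 102.], [4., 3.]]],
                  [[[2., 3.], [4., 5.]], [[101., 103.], [5., 3.]]]])

# Average over N, H and W while keeping the channel dimension C:
# each channel gets one number (divided by N*H*W), not an H x W matrix
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + 1e-5)

reference = nn.BatchNorm2d(2)(x)
print(mean.flatten())                                # tensor([ 3.0000, 52.6250])
print(torch.allclose(manual, reference, atol=1e-4))  # True
```

Each channel mean is a single scalar averaged over 2 × 2 × 2 = 8 values, and the hand-computed result matches the module output shown above.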

3.4 Problems

Batch Normalization (BN) is not a panacea and has its problems. The root of these problems is that it computes first- and second-order statistics (mean and variance) from a mini-batch, so the normalization statistics depend on the samples in the batch. Furthermore, Batch Normalization places requirements on the batch size; an unsuitable setting makes the statistics unreliable. For example:

Because BatchNorm performs cross-sample normalization within a batch, the composition and size of the batch directly affect its performance. When the sample size is small, the mean and variance of these samples cannot reflect the global statistical distribution, and BatchNorm based on a small sample size will perform poorly. Therefore, BatchNorm needs to balance the relationship between the small batch statistics and the overall sample statistics, and also needs to consider methods for updating the global statistics using batch statistics.
The behavior of Batch Normalization (BN) layers can be inconsistent between training and inference. During inference, BN layers use a mean and variance derived from the training data rather than from the inference data. During training, BN normalizes with the statistics of the samples in a mini-batch, but an inference input may consist of a single instance with no access to other samples, and a single sample cannot yield a meaningful mean and variance. Therefore, in practice, moving averages of the batch statistics accumulated during training are used in place of the mean and variance of the inference data. Thus, when the distributions of the training and test data differ, the stored mean and variance may not accurately reflect the test set, leading to inconsistencies across the training, validation, and testing phases.
In sequence models, Batch Norm has an additional inherent weakness: the lengths (n) of the input sequences within the same batch are inconsistent. This makes Batch Norm unpopular in sequence models. In the NLP field, a batch consists of several independent sentences, each sentence contains a variable number of words, and each word is represented as a fixed-length vector. Each sentence is therefore one sample in the batch, and the word features of a sentence together make up the sentence features, although the feature lengths of different sentences are not necessarily the same. Because a batch represented as a tensor requires all samples to have the same length, shorter sentences must be padded.

The figure below illustrates this data organization, where the blank circles represent the padded zero vectors. In this example, a batch includes five sentences, so the number of samples in the batch is 5; the maximum sequence length is assumed to be 4; the dimension of the word features is determined by the model itself, which is $d_{model}$ in the original Transformer paper.

1420

Following the BatchNorm approach, the same-dimensional feature across different samples in a batch is treated as having the same distribution; that is, the word vectors in the same position across all sentences are subjected to a “subtract the mean, divide by the standard deviation” operation. Clearly, this is highly unreasonable in the NLP field, as detailed below:

First, the statistics are hard to compute reliably. NLP text sequences are essentially time series, and time series are of variable length. In principle, sequences of different lengths are different statistical objects, making it difficult to obtain stable statistics. Since Batch Normalization (BN) relies on moving averages of these statistics for prediction, unstable statistics make BN impractical.
Secondly, it is meaningless. The purpose of normalization is to bring data with similar properties onto a common scale (zero mean, unit variance), and the result should not destroy the comparability between data. For example, in the earlier “height and weight” example, the data being normalized were all heights or all weights, and each set of data had a clear and identical physical meaning. However, words in the same position in different sentences do not have the same properties and do not need to be compared. Taking the third position in the example above, the words in the five sentences at that position are “too,” “fragrant,” “net,” “cup,” and blank. These words do not share the same attributes or the same physical meaning, so statistics computed over them are unreliable. In addition, the five sentences are processed independently, so there is naturally no need to compare them with one another. For sentences, BatchNorm-based normalization is therefore meaningless.

The above discussion ultimately points to one point: the batch size dimension causes problems with Batch Normalization (BN) in NLP scenarios, resulting in the calculated mean and standard deviation failing to truly represent the data distribution. Therefore, while batch normalization is often used in Computer Vision (CV), it encounters issues in NLP.
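A minimal NumPy sketch shows how padding distorts position-wise batch statistics. The sentence lengths, `d_model`, and the random "embeddings" are assumptions chosen to mirror the five-sentence example; they are not taken from real data.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
lengths = [4, 3, 4, 3, 2]   # five sentences, maximum length 4 (illustrative)
max_len = max(lengths)

# Batch tensor of shape (batch, max_len, d_model); padded positions stay zero.
batch = np.zeros((len(lengths), max_len, d_model))
for i, n in enumerate(lengths):
    batch[i, :n] = rng.normal(loc=1.0, scale=1.0, size=(n, d_model))

pos = 2  # third position: the last sentence is padding there
has_token = np.array([n > pos for n in lengths])

# BatchNorm-style statistic: mean over the batch at one position.
mean_with_padding = batch[:, pos].mean(axis=0)
# The same statistic restricted to sentences that really have a token there.
mean_real_tokens = batch[has_token, pos].mean(axis=0)

print("mean including padding:", mean_with_padding.round(3))
print("mean over real tokens: ", mean_real_tokens.round(3))
```

Because the padded vector is all zeros, the batch mean at that position is pulled toward zero by a factor that depends only on how many sentences happen to be short, not on the words themselves.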

The paper “Rethinking Batch Normalization in Transformers” analyzes why Batch Normalization performs poorly in Transformers. It argues that the main problem is that the batch statistics, and the gradients flowing through them, are unstable during forward and backward propagation: when a Transformer is trained with Batch Normalization, the mean and variance of each batch oscillate and deviate from the global running statistics.

Now that the problem is clearly defined, the solution becomes simple: avoid the batch dimension during normalization. This led to the development of techniques such as layer normalization and instance normalization.

0x04 LayerNorm

Because layer normalization is not affected by sentences of different lengths, it is very suitable for the NLP field.

4.1 Solution

First, let’s look at the purpose of normalization. The purpose of normalization is to bring data with similar properties onto a common scale (zero mean, unit variance), and the result should not destroy the comparability between the data. Therefore, let’s examine which data need to be compared with each other in NLP.

As shown in the image above, the words “太” (tài), “香” (xiāng), “网” (wǎng), “杯” (bēi), and blank have no comparative meaning. We cannot determine the meaning of these words based on a single character, nor can we determine it through comparison between these five elements. What we need to do is analyze the context of each sentence to accurately determine its meaning; that is, the feature relationships within the sample are more important. Furthermore, the meaning of a word is derived from its comparison with other words in the same sentence; comparisons are only meaningful if there are connections between words in the same sentence. In addition, the self-attention mechanism used by Transformer in constructing intra-sentence relationships essentially compares a word with other words in the same sentence to determine their similarity.

Therefore, to preserve the comparability between words in a sentence, we need to perform a normalization operation on individual sentences. This is precisely the normalization method of LayerNorm—calculating the mean and standard deviation of a sentence, and then normalizing each word in the sentence. The operation performed by LayerNorm is similar to finding a “semantic center” in a sentence and then gathering all the words in the sentence around this center, without disrupting the comparative relationships between words in the sentence.

Note: The description above is a simplification and is not how LN is actually applied in NLP. In NLP, LN standardizes all features of a single token, not all tokens and features within a single sentence. It is therefore independent of sentence length and batch size. We will elaborate on this later.

4.2 Formula

The input to LN is the input to each layer of neurons, arranged as a matrix whose rows are the token embeddings of the sentences. We calculate the mean and variance row by row, then subtract the mean of each row from each of its elements and divide by the standard deviation of that row to obtain the normalized value. Two trainable parameters, α and β, are then introduced to compensate for the information lost during normalization (typically α is initialized to all 1s, and β to all 0s).
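The row-wise computation can be sketched in a few lines of NumPy. The function name `layer_norm`, the small constant `eps` (added for numerical safety), and the use of the population variance are assumptions of this sketch; individual implementations differ in these details.

```python
import numpy as np

def layer_norm(x, alpha, beta, eps=1e-5):
    """Normalize each row of x, then apply the trainable scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)   # one mean per row (per token)
    var = x.var(axis=-1, keepdims=True)     # one variance per row
    return alpha * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 6)) * 4.0 + 2.0     # 3 tokens, 6 features each
alpha = np.ones(6)                          # initialized to all 1s
beta = np.zeros(6)                          # initialized to all 0s

y = layer_norm(x, alpha, beta)
print("row means:", y.mean(axis=-1).round(6))
print("row stds: ", y.std(axis=-1).round(4))
```

With the initial values of α and β, every row of the output has mean close to 0 and standard deviation close to 1, regardless of the scale of the input.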

1421

4.3 Function

Why does Layer Normalization help Transformer train better? A common explanation is that two important properties of layer normalization are scaling invariance and translation invariance.

Scaling invariance means that the normalization absorbs any rescaling of the input data, making the network insensitive to such scaling.
Translation invariance means that if the mean of the input data shifts while the shape and range of its distribution remain unchanged, the normalized output is not affected.
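Both properties can be checked numerically. The sketch below uses a bare normalization step without the trainable parameters and without the usual small epsilon, so that the invariance is exact; the helper name `normalize` and the constants 3.0 and 5.0 are arbitrary choices for illustration.

```python
import numpy as np

def normalize(x):
    """Subtract the row mean and divide by the row standard deviation."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / std

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5))

base = normalize(x)
scaled = normalize(3.0 * x)    # positive rescaling of the input
shifted = normalize(x + 5.0)   # constant shift of the input

print("scaling invariant:    ", np.allclose(base, scaled))
print("translation invariant:", np.allclose(base, shifted))
```

With a nonzero epsilon in the denominator, as used in practice, the scaling invariance holds only approximately.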

Because of these two advantages, LN also has two impacts on model training:

Normalizing the values during forward computation can stabilize the input distribution during forward propagation.
Normalization can also make the gradient of backpropagation more stable.

The paper “On the Expressivity Role of LayerNorm in Transformers’ Attention” also offers unique insights.

Without LayerNorm, keys and queries have no obvious geometric structure. LayerNorm projects the keys onto the same hyperplane (the one orthogonal to the all-ones vector), so the model can learn to align queries orthogonally to the keys.
Without LayerNorm, there are some “unselectable” key vectors, meaning these vectors cannot be selected to obtain the highest attention score (the dark area in the upper part of the image below); while LayerNorm eliminates the “unselectable” key vectors, allowing any key vector to potentially obtain the highest attention score.

1422

The paper “On the Nonlinearity of Layer Normalization” points out that Layer Normalization (LN) and its computationally reduced version, RMSNorm, possess nonlinear expressive power, and discusses in detail LN’s universal approximation ability for classification. The paper provides a mathematical proof of LN’s nonlinearity and proposes a simple neural network, LN-Net, containing only linear layers and LN. With sufficient depth, LN-Net can theoretically classify any given set of samples with any assignment of labels. This finding challenges the conventional view of normalization as a linear transformation without fitting ability, and shows that nonlinear layers and normalization layers are not disjoint kinds of neural network modules.

To investigate further, the paper breaks LN down into two steps: centering and scaling. Centering is mathematically a linear transformation, so the nonlinearity of LN lies mainly in the scaling operation (also referred to in the paper as spherical projection, which is the operation performed by RMSNorm). The paper uses the simplest linearly inseparable data, XOR, as an example, and correctly classifies its four points using linear transformations and spherical projection.

1423

The image above shows the solution for XOR classification. First, we rotate the points by 45°, as shown in Figure (b). Then, we project them vertically onto y=0.5, as shown in Figure (c). Next, we project them spherically onto the circle $x^2+y^2=1$, as shown in Figure (d). Finally, we project them horizontally onto x=0, as shown in Figure (e). The four points are now correctly classified.
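The four steps can be traced numerically. This is a sketch under stated assumptions: the rotation is taken to be counter-clockwise about the origin, "project vertically onto y=0.5" is read as keeping x and setting y to 0.5, and the final threshold of 0.8 is an arbitrary value between the two resulting clusters. The figure itself may use a different but equivalent construction.

```python
import numpy as np

# The four XOR points and their labels (label = XOR of the two coordinates).
points = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
labels = np.array([0, 0, 1, 1])

# (b) Rotate by 45 degrees (counter-clockwise assumed): a linear map.
theta = np.pi / 4
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
b = points @ rot.T

# (c) Project vertically onto the line y = 0.5: keep x, set y to 0.5.
c = np.stack([b[:, 0], np.full(len(b), 0.5)], axis=1)

# (d) Spherical projection onto the unit circle: the nonlinear scaling step.
d = c / np.linalg.norm(c, axis=1, keepdims=True)

# (e) Project horizontally onto x = 0: only the y coordinate remains.
e = d[:, 1]
print("final coordinate per point:", e.round(4))

# One threshold on the remaining coordinate now separates the classes.
pred = (e < 0.8).astype(int)
print("predicted labels:", pred, "| true labels:", labels)
```

Under these assumptions the two points of class 0 land at 1 and the two points of class 1 land at about 0.577, so the data become linearly separable after a single spherical projection.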

4.4 Differences between LN and BN

The regularization method in deep learning is to “reduce the difficulty of fitting and the risk of overfitting by losing some non-repeating complex information, thereby accelerating the convergence of the model”. Its purpose is to stabilize the distribution (reduce the variance of data in each dimension).

Having understood why we need normalization, let’s discuss what kind of normalization to perform (and on which dimensions we can perform normalization). The importance of the data information determines the normalization method. The difference between different regularization methods lies only in the dimension of information they operate on, that is, the dimension of the loss information they choose. Therefore, choosing different normalization methods means selecting the corresponding dimension to process based on the specific problem.

Therefore, the main difference between the two normalization methods lies in the objects they act on, which follows from the different tasks they face, and in the resulting differences in usage. In the figure below, Batch Normalization (BN) normalizes the activations (feature maps) by calculating the channel mean and variance along the batch dimension. Layer Normalization (LN), on the other hand, normalizes along the channel/feature dimension.

1424

Target

Organizing multiple samples into batches and feeding them into a neural network can improve data throughput and computational efficiency, and provides more statistical information for model training. Batch Normalization considers all data within the batch, while Layer Normalization considers only the data of the sample itself and not other data.

Batch Normalization (BN) normalizes each feature dimension within a batch-size sample, and has the following characteristics:
- BatchNorm performs normalization on the same feature across samples.
- BatchNorm calculates the mean and variance in the batch direction, thus not disrupting the relationship between different samples of the same feature. After all, “subtracting the mean and dividing by the standard deviation” is a linear operation of translation and scaling. This property further ensures that samples remain comparable after normalization.
LN normalizes all features within each sample, and has the following characteristics:
- Layer normalization (LN) does not compute statistics across the batch (it does not require the batch’s mean and variance); it computes them within each sample. LN calculates the mean and variance for a single training sample, independent of other data, thus avoiding the problem in Batch Normalization (BN) of being affected by the mini-batch data distribution.
- LayerNorm calculates the mean and variance for each sample itself, eliminating the need to store global mean and variance. Because LN does not require storing the mean and variance of the mini-batch, it saves additional storage space.
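The contrast between the two targets comes down to the axis along which the statistics are taken. The following NumPy sketch uses a plain (batch, features) matrix; the helper `standardize`, the shapes, and the random values are illustrative assumptions, and the trainable scale and shift are omitted.

```python
import numpy as np

def standardize(x, axis, eps=1e-5):
    """Subtract the mean and divide by the standard deviation along `axis`."""
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(6, 4))   # (batch, features)

bn = standardize(x, axis=0)   # BatchNorm: same feature, across samples
ln = standardize(x, axis=1)   # LayerNorm: all features, within one sample

# Replace every sample except the first and normalize again.
x_changed = x.copy()
x_changed[1:] = rng.normal(loc=10.0, scale=5.0, size=(5, 4))
bn_changed = standardize(x_changed, axis=0)
ln_changed = standardize(x_changed, axis=1)

print("LN output of sample 0 unchanged:", np.allclose(ln[0], ln_changed[0]))
print("BN output of sample 0 unchanged:", np.allclose(bn[0], bn_changed[0]))
```

The LN output for the first sample is unaffected by the other samples, while its BN output changes as soon as the rest of the batch changes.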

Direction of action

Actually, this is another interpretation of the object of action, and explaining it separately here will be more helpful for everyone to understand.

Batch Normalization (BN) operates on a row-by-row basis (each row minus its mean, then divided by its standard deviation). Each row represents a relevant dimension within a batch, and BN normalizes the embeddings of tokens at the same position across all sentences within a batch. This clearly contradicts the principles of NLP. The complexity of language text is high; any word can potentially be placed in an initial position, and word order may not affect our understanding of the sentence. This results in a non-one-to-one correspondence between tokens at the same position. Such normalization can alter the semantics.
LN (Layer Normalization) works column-by-column, normalizing each sentence individually. In NLP, LN is equivalent to normalizing each word at every position within each sentence of a batch. This normalization preserves the original semantics while effectively preventing gradient problems caused by excessively large values. From a sample perspective, since each sentence is relatively different, it’s best to normalize each sentence individually, given the goal of fusing and normalizing global information.

See the diagram below for details. Note that the diagram below is only a preliminary demonstration and not the actual LN version of NLP. In reality, it standardizes all features of a single token, rather than standardizing all tokens and all features within a single sample.

1425

Business Selection

The above is a superficial look; now we will look at it from a deeper level.

CV

CV primarily uses Batch Normalization (BN), so let’s examine why CV chooses BN.

Batch Normalization (BN) essentially calculates the mean and variance of all samples for each feature dimension. A typical output shape for a CNN is (N, C, H, W). BN operates per channel, meaning the statistics for each channel are computed across the “N × H × W” values within a batch. The resulting mean and variance vectors both have shape (C), and the two learnable parameters (scale and shift) are therefore also C-dimensional vectors.

Why does a CNN compute statistics only per channel? This is because Computer Vision (CV) uses convolutional kernels to output feature maps (h1, w1). One convolutional kernel produces the feature map for one channel, and multiple convolutional kernels produce multiple feature maps. Therefore, the number of convolutional kernels equals the number of channels, and the parameters of a convolutional kernel are shared across different images and across different locations within the same image. Since we want the data distribution after convolution with a given kernel to be stable across different images and different locations, normalization needs to be performed per channel. For the feature map (B, C, H, W) of a CNN, we can therefore normalize over the three dimensions (B, H, W). Of these, (H, W) is essential, because each channel, that is, each feature map (H×W), is produced by applying the same kernel repeatedly across locations, so normalization over these two dimensions is required. Ultimately, the common image normalization operations, including Batch Normalization (BN), Instance Normalization (IN), and Group Normalization (GN), are all based on this principle.
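As a sketch of the shapes involved, the following NumPy code computes training-mode BN statistics for an (N, C, H, W) tensor. The tensor sizes, the per-channel scales, and the names `gamma` and `beta` for the two learnable parameters are assumptions made for illustration; running statistics for inference are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, H, W = 8, 3, 5, 5
# Give each channel a different scale so that normalization has a visible effect.
channel_scale = np.array([1.0, 5.0, 10.0]).reshape(1, C, 1, 1)
x = rng.normal(size=(N, C, H, W)) * channel_scale

# Statistics are taken over N, H and W, leaving one value per channel.
mean = x.mean(axis=(0, 2, 3), keepdims=True)
var = x.var(axis=(0, 2, 3), keepdims=True)

# Learnable scale and shift, one value per channel.
gamma = np.ones((1, C, 1, 1))
beta = np.zeros((1, C, 1, 1))
y = gamma * (x - mean) / np.sqrt(var + 1e-5) + beta

print("shape of the statistics:", mean.reshape(-1).shape)
print("per-channel std before:", x.std(axis=(0, 2, 3)).round(2))
print("per-channel std after: ", y.std(axis=(0, 2, 3)).round(2))
```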

NLP

Let’s look at why researchers made this choice in NLP, specifically in the use of LN.

LayerNorm assumes that features within each sample have the same distribution. Therefore, LayerNorm’s normalization operation is performed independently only within a single sample (a single sentence), so the existence of batches can be completely ignored. Because LN does not involve other samples, the behavior of the inference process is consistent with that of training.

LN is generally applied to the third dimension, that is, the statistics are calculated over the embedding size (the dimension of the word vectors) in (batch size, seq len, embedding size). Within this dimension all features share the same unit of measurement, so the scaling issues caused by features with different units do not arise.

In NLP (Transformer), the embedding of each token is likewise produced by attention operations that superimpose other embeddings. Each Transformer layer keeps adjusting the position of each word in the embedding space; for example, a token’s new representation can be seen as a weighted sum of the vectors $E_i$ of all its context words. This process carries a risk: the sum is uncontrolled. Both the context word vectors $E_i$ and the number of context words vary, which can cause large offsets. The “scale” of the sum may also change significantly, especially since the Transformer works in high dimensions, where even a small change in each dimension can cause a large change in the overall “scale”. Embedding normalization is therefore necessary. When the offset is small, LN has little effect, but when it is large, layer normalization pulls the summed vector back to the vicinity of the original vector, preventing it from drifting too far.

Why does the LN layer in the Transformer only normalize the last dimension (n_features) and not the latter two dimensions (seq_len, n_features)? The reason might be the presence of padding. A sample sequence might contain some padded tokens. For tokens, if standardization is applied to the latter two dimensions, the mean and variance will be affected by the padding max length. In contrast, standardization across the various features of the same token is more reasonable. Furthermore, statistical analysis is impossible along the token dimension because different tokens within a sentence are highly correlated, making the assumption of their independence clearly invalid.

1426

In summary, Layer Normalization (LN) in Transformer is unique because it normalizes all features of a single token, rather than normalizing all tokens and features within a single sample. Therefore, LN is independent of sentence length and batch size.
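This independence can be verified directly. In the NumPy sketch below, the same three-token sentence is normalized once on its own and once inside a larger, padded batch; the helper names, the shapes, and the random values are assumptions for illustration, and the trainable parameters are omitted.

```python
import numpy as np

def norm_last_dim(x, eps=1e-5):
    """LN as used in the Transformer: statistics over the features of each token."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def norm_last_two_dims(x, eps=1e-5):
    """Alternative for comparison: statistics over (seq_len, features) of each sample."""
    mean = x.mean(axis=(-2, -1), keepdims=True)
    var = x.var(axis=(-2, -1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
d_model = 8
sentence = rng.normal(size=(3, d_model))        # 3 real tokens

alone = sentence[None]                          # batch of 1, no padding
padded = np.zeros((4, 6, d_model))              # batch of 4, padded to length 6
padded[0, :3] = sentence                        # same sentence plus 3 padding tokens
padded[1:] = rng.normal(size=(3, 6, d_model))   # unrelated sentences

same_per_token = np.allclose(norm_last_dim(alone)[0],
                             norm_last_dim(padded)[0, :3])
same_per_sample = np.allclose(norm_last_two_dims(alone)[0],
                              norm_last_two_dims(padded)[0, :3])
print("per-token LN unaffected by batch and padding:   ", same_per_token)
print("two-dimension variant unaffected by the padding:", same_per_sample)
```

Normalizing over the feature dimension alone gives identical results in both settings, whereas statistics taken over (seq_len, features) change with the amount of padding.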

Below is an example of using LayerNorm in the NLP and CV fields provided on the PyTorch official website.

import torch
import torch.nn as nn

# NLP Example
batch, sentence_length, embedding_dim = 20, 5, 10
embedding = torch.randn(batch, sentence_length, embedding_dim)
# Normalize over the last dimension (the embedding of each token)
layer_norm = nn.LayerNorm(embedding_dim)
# Activate module
output = layer_norm(embedding)

# Image Example
N, C, H, W = 20, 5, 10, 10
input = torch.randn(N, C, H, W)
# Normalize over the last three dimensions (i.e. the channel and spatial dimensions)
# as shown in the image below
layer_norm = nn.LayerNorm([C, H, W])
output = layer_norm(input)

Actually, Batch Normalization (BN) can also be used in Transformers, but the reason it doesn’t perform as well as LN is more due to engineering reasons, such as the impact of sentence length. This point is mentioned in “Rethinking Batch Normalization in Transformers”.

Specific implementation

Several nonlinear operations, such as Softmax, LayerNorm, and GELU, require special support or off-chip computation. Compared to linear operations, these nonlinear operations constitute a relatively small proportion of the overall computation when using Transformer networks for inference. However, computing these nonlinear operations on typical hardware is more challenging than matrix computations, and they can incur significant overhead if not handled properly.

The figure outlines the Softmax, LayerNorm, and BatchNorm operations. Because they rely on runtime statistics, LayerNorm and Softmax require multiple passes of the input to compute nonlinear operations. In the case of Softmax, the denominator is computed the first time through the input. For LayerNorm, the input is passed three times: once to compute the mean; once to compute the standard deviation; and once to apply normalization. Unlike LayerNorm and Softmax, BatchNorm uses only the statistics learned during training, therefore it only requires a single pass of the input.
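The pass structure described above can be written out explicitly in plain Python. This is a didactic sketch rather than how any hardware or library implements these operations: the function names are hypothetical, the Softmax omits the usual max-subtraction for numerical stability, and the BatchNorm variant assumes the per-feature statistics were stored during training.

```python
import math

def softmax_two_pass(xs):
    denom = 0.0
    for x in xs:                      # pass 1: accumulate the denominator
        denom += math.exp(x)
    return [math.exp(x) / denom for x in xs]   # pass 2: divide

def layer_norm_three_pass(xs, eps=1e-5):
    n = len(xs)
    total = 0.0
    for x in xs:                      # pass 1: mean
        total += x
    mean = total / n
    sq = 0.0
    for x in xs:                      # pass 2: standard deviation
        sq += (x - mean) ** 2
    std = math.sqrt(sq / n + eps)
    return [(x - mean) / std for x in xs]      # pass 3: normalize

def batch_norm_inference(xs, stored_means, stored_vars, eps=1e-5):
    # Single pass: the statistics are constants fixed after training.
    return [(x - m) / math.sqrt(v + eps)
            for x, m, v in zip(xs, stored_means, stored_vars)]

probs = softmax_two_pass([1.0, 2.0, 3.0])
normed = layer_norm_three_pass([1.0, 2.0, 3.0, 4.0])
bn_out = batch_norm_inference([1.0, 2.0], [1.0, 0.0], [1.0, 4.0])
print(probs, normed, bn_out)
```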

1427

4.5 Post-Norm VS Pre-Norm

Concept

In the Transformer model structure, each layer contains a residual structure. “Pre” and “Post” refer to the location of the Norm. Considering the different placement of LayerNorms, they can be divided into PreLayerNorm and PostLayerNorm.

Post-Norm: The Norm operation is performed after the Add operation. In the original Transformer block, the Layer Norm layer is placed after the residual connection; this is also known as Post-LN.
Pre-Norm: The Add operation is performed after the Norm operation. The Layer Norm layer is placed before the multi-head self-attention layer or the fully connected layer. Studies have shown that “Pre-LN” is friendlier to gradient descent, converges faster, and has hyperparameters that are easier to tune, but its final performance is generally worse than that of “Post-LN”.

1428

The two methods are formalized as follows.

$Pre\ Norm: x_{t+1}=x_t+F_t(Norm(x_t))$
$Post\ Norm: x_{t+1}=Norm(x_t+F_t(x_t))$
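The two formulas map directly onto two small PyTorch wrappers. The class names `PostNormSublayer` and `PreNormSublayer`, the dropout rate, and the feed-forward block used as $F_t$ are assumptions of this sketch; it is meant to show only where the normalization sits relative to the residual addition.

```python
import torch
import torch.nn as nn

class PostNormSublayer(nn.Module):
    """x_{t+1} = Norm(x_t + F_t(x_t))"""
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return self.norm(x + self.dropout(sublayer(x)))

class PreNormSublayer(nn.Module):
    """x_{t+1} = x_t + F_t(Norm(x_t))"""
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))

torch.manual_seed(0)
d_model = 16
x = torch.randn(2, 5, d_model)
# A small feed-forward block standing in for F_t.
ffn = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, d_model))

post = PostNormSublayer(d_model).eval()   # eval() disables dropout
pre = PreNormSublayer(d_model).eval()
y_post = post(x, ffn)
y_pre = pre(x, ffn)
print(y_post.shape, y_pre.shape)
```

In the Post-Norm output every token vector is normalized, while the Pre-Norm output keeps the unnormalized input `x` on the identity path.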

In addition, the Sandwich-Norm shown in the figure below is based on the Pre-Norm, with an additional layer normalization after the sublayer. While prioritizing the effectiveness of residual connections, it attempts to better control the variance of the network layer output values.

1429

Implementation in the Paper

In the paper (vanilla Transformer), the Encoder architecture is as follows, which is Post-Norm.

$Attention \rightarrow Add\ \&\ Norm \rightarrow Feed\ Forward \rightarrow Add\ \&\ Norm$

Add & Norm refers to using Layer Norm after the residual connection. Sublayer represents the transformation, X represents the input of Multi-Head Attention or Feed Forward, and MultiHeadAttention(X) and FeedForward(X) represent the output of the relevant module.

$Add\ \&\ Norm: LayerNorm(X+Sublayer(X))$

The corresponding code is:

return self.norm(x + self.dropout(sublayer(x)))

The architecture diagram is as follows.

1430

Finally, after the input matrix X is processed by the Encoder, the output is:

$O=LayerNorm(X+MultiHeadAttention(X))$ $O=LayerNorm(O+FeedForward(O))$

Post-Norm

Post LayerNorm normalizes the output of each layer (that is, it normalizes after the residual addition). This keeps the outputs of all layers on a similar scale, maintains consistency across modules, and stabilizes the variance of forward propagation, which helps network stability and training performance. Its normalization effect on the parameters is therefore stronger and the model is more robust; once trained successfully, it tends to perform better.

Post LayerNorm has some issues, such as being difficult to train and requiring careful initialization, which leads to its limited use in current large-scale models.

Difficult to train

Post LayerNorm emphasizes residual branches, losing the advantage of residual networks being “easy to train”. The formula for Post LayerNorm is as follows: $x_{t+1}=Norm(x_t+F_t(x_t))$ . Assuming the variance of x is 1 and the variance of F(x) is $\sigma^2$ , the Norm operation is equivalent to dividing by $\sqrt{1+\sigma^2}$ . If σ is relatively small, the weights of the “straight path” in the residual are closer to 1, the model is closer to an identity function in the initial stage, and it is less prone to gradient vanishing, thus making it easier to optimize.

However, since all parameters are randomly initialized, we can consider x and F(x) as two independent random vectors. Assuming their respective variances are 1, the variance of x + F(x) is 2. The Norm operation is responsible for restoring the variance to 1. Therefore, during the initialization phase, the Norm operation is equivalent to dividing by $\sqrt{2}$ . Each Norm operation weakens the weight of the identity branch of the residual. The original intention of the residual was to provide a “green channel” for the preceding layers, allowing gradients to be propagated more directly. However, in Post-Norm, this “green channel” is severely weakened; the closer to the preceding layers, the smaller the weight (the closer to the input, the more severe the weakening). The residual becomes virtually useless, making training difficult.
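The weakening of the identity branch can be simulated under exactly these assumptions. In the sketch below, each residual branch output is replaced by independent unit-variance noise, the Norm step is reduced to a division by the standard deviation (centering is omitted), and `dim` and `layers` are arbitrary choices; `coeff` tracks the weight that the original input retains in the current hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, layers = 4096, 12

x = rng.normal(size=dim)   # x_0 with unit variance
coeff = 1.0                # weight of x_0 inside the current x_t

for t in range(layers):
    branch = rng.normal(size=dim)   # stand-in for F_t(x_t): independent, unit variance
    s = x + branch                  # Add: the variance becomes about 2
    scale = s.std()                 # Norm then divides by about sqrt(2)
    x = s / scale
    coeff /= scale                  # the contribution of x_0 shrinks by the same factor

print("weight of x_0 after", layers, "layers:", coeff)
print("predicted 2^(-layers/2):          ", 2.0 ** (-layers / 2))
```

After 12 layers the input retains a weight of roughly $2^{-6}$, which illustrates how quickly the direct path to the early layers fades in Post-Norm at initialization.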

Warm-up is needed

Compared to Pre-Norm, Post-Norm training is more unstable. The main reason for training instability (vanishing/exploding gradients) is that Post-Norm is very sensitive to parameters.

When performing layer normalization between residual blocks, the gradients of parameters near the output layer tend to be large. Without a warm-up phase, directly applying a large learning rate to these parameters may not lead to model improvement but rather make the optimization process unstable. Therefore, Post Norm requires careful parameter tuning to achieve good results, such as the essential warm-up learning rate strategy. Warm-up allows the model sufficient time to “warm up,” which suppresses the learning speed of later layers and gives earlier layers more optimization time, promoting synchronous optimization of each layer. Ultimately, this results in smaller updated parameters, relatively stable LN, reduced likelihood of gradient vanishing, and more stable training.

Furthermore, the mainstream optimizers in current NLP are Adam and its variants. Theoretically, as long as the absolute value of the gradient is greater than the random error, the corresponding parameters will have a constant update amount; while the update amount of SGD is proportional to the gradient. If the gradient is small, the update amount will also be small, and if the gradient is too small, the parameters will hardly be updated. Therefore, although the residuals of Post-Norm are severely weakened, in base and large-scale models, they are not weakened to the point of being less than the random error. With optimizers like Adam, they can still be effectively updated, and thus, successful training is possible. Of course, this is only a possibility; in fact, deeper Post-Norm models are indeed more difficult to train.

Vanishing gradients aren’t entirely bad; in fact, they can be beneficial during the fine-tuning stage. During fine-tuning, we typically want to prioritize adjusting parameters closer to the output layer, avoiding over-adjusting parameters closer to the input layer to prevent severely compromising pre-training performance. Vanishing gradients mean that the closer a parameter is to the input layer, the weaker its impact on the final output—exactly what fine-tuning aims for. Therefore, pre-trained Post-Norm models often exhibit better fine-tuning results than Pre-Norm models.

Pre-Norm

The paper “On Layer Normalization in the Transformer Architecture” argues that the original position of the Layer Norm in the Transformer (i.e., Post Norm) is problematic. Moving the Layer Norm to before the sub-layer, which is called Pre Norm, greatly helps training.

The main idea of Pre-Norm is to “standardize only when needed” and “process each matrix separately before nonlinearity”. Its formula is $x_{t+1}=x_t+F_t(Norm(x_t))$ . For Transformers, the main nonlinearity is in FFN (ReLU) and Self-Attention (Softmax). These stacked isomorphic sublayers are very similar to RNN units such as GRU and LSTM. Information flows sequentially through the sublayers. Setting LN at the output position of each sublayer is no longer simply about “falling into the gradient-sensitive space of sigmoid to accelerate training,” but more importantly, it makes the vectorized values of each word more balanced to eliminate the influence of extreme cases on the model and obtain a more stable deep network structure—just like when these words come out of the embedding layer, they only differ in information, not in energy. LN, like tanh in LSTM, provides nonlinearity to the model to enhance expressiveness while limiting the output within a certain range. Therefore, for Transformer, the effect of LN is no longer in the category of “how good” but “cannot be without”.

In Su Jianlin’s article, he explains why Pre Norm performs worse than Post Norm. He states that under the same training settings, Pre Norm (i.e., Norm and add) is often easier to train and its training effect is better than Post Norm (Add and Norm). However, if you adjust the settings individually (using a suitable training configuration) and train it completely, Post Norm will produce better results.

Why does this happen?

The answer given by Zhihu expert Tang Xianghao is: The depth of Pre-Norm is “watered down”. The Pre-Norm structure invisibly increases the width of the model while reducing its depth. An L-layer Pre-Norm model has fewer effective layers than an L-layer Post-Norm model. We know that depth is usually more important than width, so reducing the depth results in fewer layers and ultimately a worse performance.

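Expanding the Pre-Norm formula recursively, and assuming as before that every branch output has unit variance at initialization (so that the Norm operation divides $x_k$ by $\sqrt{k+1}$), we get: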

$x_t=x_0+F_0(x_0)+F_1\left(\frac{x_1}{\sqrt{2}}\right)+F_2\left(\frac{x_2}{\sqrt{3}}\right)+...+F_{t-1}\left(\frac{x_{t-1}}{\sqrt{t}}\right)$

This ensures that at least each residual channel is weighted equally, making the effect of residuals more pronounced than with Post Norm, thus making Pre-Norm easier to optimize. However, this results in a large variance for the final $x_t$ , so Pre-Norm also requires a Normalization step before the prediction layer.
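The growth of the variance of $x_t$ can be checked with a small simulation. As an assumption of this sketch, each $F_t$ is replaced by a fresh random linear map scaled so that its output has roughly unit variance, and Norm is reduced to a division by the standard deviation; `dim` and `layers` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, layers = 1024, 12

x = rng.normal(size=dim)    # x_0 with unit variance
variances = [x.var()]

for t in range(layers):
    normed = x / x.std()                              # Norm(x_t)
    w = rng.normal(size=(dim, dim)) / np.sqrt(dim)    # stand-in for F_t at initialization
    x = x + w @ normed                                # x_{t+1} = x_t + F_t(Norm(x_t))
    variances.append(x.var())

for t in (0, 4, 8, 12):
    print(f"variance of x_{t}: {variances[t]:.2f} (expected about {t + 1})")
```

Each layer adds about one unit of variance, so the variance of the final state grows roughly linearly with depth, which is why a normalization before the prediction layer is needed.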

However, when t is relatively large, the difference between $x_t$ and $x_{t+1}$ is small, so $F_{t+1}(Norm(x_{t+1}))$ is very close to $F_{t+1}(Norm(x_t))$. Therefore, a t-layer model followed by a (t+1)-th layer is approximately equivalent to a wider t-layer model. The statement that “the gradients of Pre-LN at bottom layers tend to be larger than at top layers” means that the Pre-Norm structure tends to excessively favor the identity branch (bottom layers), causing Pre-Norm to degrade into a “shallow and wide” model that is ultimately inferior to Post-Norm at the same depth. In Pre-Norm, stacking more layers mostly increases width rather than depth; the more layers there are, the more “hollow” the added depth becomes.

In addition, Pre-Norm does not pass everything through normalization (part of the signal is added directly to the output and is not normalized), so it is less prone to exploding or vanishing gradients, resulting in more stable training.

Summary

The figure below shows the mathematical derivation of the performance of Pre-Norm and Post-Norm models. Pre-Norm structures often over-rely on identity branches (lower layers), causing them to easily degenerate into a “shallow and wide” model. This makes Pre-Norm perform worse than Post-Norm models of the same depth. Intuitively, this manifests as the stacking of L layers failing to achieve the expected effect. Post-Norm, on the other hand, gradually weakens the influence of identity branches through each normalization operation, thus highlighting residual branches more. Therefore, Post-Norm has a more “sufficient” hierarchical structure and a more “balanced” number of layers, typically resulting in better performance after training.

1431

The diagram below illustrates the architectural differences between the two. With the same parameter configuration, Pre-Norm does not reach the same upper bound as Post-Norm, but it is easier to train. In the initial training phase, the gradient norm of each layer is approximately equal, which makes it possible to drop the warm-up phase, improves training stability, and accelerates convergence. While Post-Norm can in theory reach a better upper bound, it requires frequent monitoring of gradients during training, and convergence is more difficult. Once well trained, however, its performance is generally better.

1432

The diagram below shows some LLM configurations. Here, PE represents positional embedding, #L represents the number of layers, #H represents the number of attention heads, $d_{model}$ represents the size of the hidden states, and MCL represents the maximum context length during training. In the BERT era, due to the shallower number of layers, Post-Norm was often used. However, in the era of large models, as the number of transformer layers has increased, although Pre-LN may bring some performance loss, most LLMs still choose to use Pre-LN for training stability.

[Figure: configurations of some LLMs]

0x05 Extended Comparison

5.1 Instance Norm

For better comparison, we’ll also introduce Instance Normalization (IN). Instance Normalization (IN) was originally used for image style transfer. Its authors discovered that in generative models, the mean and variance of each channel of the feature map affect the style of the final generated image. Therefore, the image can be normalized at the channel level first, and then “de-normalized” using the mean and standard deviation of the corresponding channel of the target style image to obtain the style of the target image. The IN operation is also performed within a single sample and does not depend on the batch.

IN computes the mean and standard deviation of each sample over the H and W dimensions and keeps the N and C dimensions; in other words, statistics are computed within each single channel of each sample. For CNNs, instance norm is therefore the smallest set of features over which normalization is performed; without normalizing over at least this set, optimization becomes very difficult. Layer norm plays the same role for Transformers.

For CNNs, as long as the images in a batch are independent, instance normalization can be extended over the batch dimension N, which gives BN. However, if the samples in a batch are strongly correlated (for example, a large number of duplicate images), the feature maps are highly correlated and patterns co-occur heavily across them, so forcibly imposing the independent and identically distributed assumption will not yield good results.

5.2 GroupNorm

Group Normalization (GN) is another deep learning normalization method that can replace Batch Normalization (BN). GN is also batch-independent and is a compromise between LN and IN.

Group Normalization (GN) addresses the weakness of Batch Normalization (BN) with small mini-batches. Normalizing along the batch dimension is problematic because batch statistics become inaccurate, so the BN error rises rapidly as the batch size decreases. When training large networks and transferring features to computer vision tasks (including detection, segmentation, and video), memory constraints force BN to work with small batches. With only a few samples it is impossible to approximate the population mean and standard deviation, so Batch Normalization breaks down.

Group Normalization is more suitable for tasks that consume a lot of GPU memory. Group Normalization first divides the channels into multiple groups, then calculates the mean and variance within each group to normalize them. From a deep learning perspective, the features extracted by convolution can be considered as unstructured features or vectors. Each layer has many convolutional kernels, and the features learned by these kernels are not entirely independent. Features learned on the same image should have the same distribution; therefore, features with similar characteristics can be grouped into the same group.

The calculation of GN is independent of the batch size, and the accuracy is relatively stable for different batch sizes, thus avoiding the influence of batch size on the model.

[Figure: Group Normalization]

5.3 Comparison

We’ll use an example for comparison. This example comes from the internet, but unfortunately, the original source couldn’t be found. If any reader knows the source, please point it out.

analogy

If we compare $x \in R^{N\times C\times H\times W}$ to a stack of books: there are N books in total, each book has C pages, each page has H lines, and each line has W characters. The following are various methods for normalizing and calculating the mean; the same principle applies to calculating the standard deviation.

BN: Add up the books page by page (page 6 of book 1, page 6 of book 2, and so on), then divide by the number of characters summed for that page: N×H×W. BN can therefore be regarded as computing an “average book” (note that this “average book” has only one character per page).
LN: Add up all the characters in one book and divide by the number of characters in that book: C×H×W, which gives the “average character” of the whole book.
IN: Add up all the characters on one page of a book and divide by the number of characters on that page: H×W, which gives the “average character” of each page.
GN: Divide a book of C pages into G equal parts, each a booklet of C/G pages, and compute the “average character” of each booklet.

detail

The diagram below illustrates four normalization methods and also shows the differences between BN and LN in the fields of CV and NLP.

For CV, the input has the format (N, C, H, W), where N is the batch size, H/W are the height/width of the feature map, and C is the number of channels. Flattening H and W into a single dimension gives the three-dimensional representation shown in the figure below. Assuming each small square has side length 1, the figure represents a tensor of shape [6, 6, *, *]. A (C, H, W) slice is a plane, and all the data within this plane is called a batch.

BN normalizes over N, H, and W across the batch (the normalization dimensions are [N, H, W]) and preserves the channel dimension C.
LayerNorm avoids the batch dimension by normalizing over C, H, and W (the normalization dimensions are [C, H, W]); the resulting mean and std have shape (N, 1, 1, 1).
InstanceNorm normalizes over H and W (the normalization dimensions are [H, W]); the resulting mean and std have shape (N, C, 1, 1).
GN splits the channels into G groups and normalizes within each group (the normalization dimensions are [C//G, H, W]). LN and IN are the two extreme cases of GN, corresponding to G = 1 and G = C, respectively.
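The four reductions can be checked on a tiny tensor. The sketch below uses nested Python lists in place of an (N, C, H, W) tensor and computes only the means (the standard deviations follow the same index sets); all function names are illustrative. It also checks that the group statistics coincide with LN when G = 1 and with IN when G = C.

```python
from itertools import product

def make_tensor(N, C, H, W):
    """Nested list x[n][c][h][w] filled with distinct, deterministic values."""
    return [[[[float(1000 * n + 100 * c + 10 * h + w) for w in range(W)]
              for h in range(H)]
             for c in range(C)]
            for n in range(N)]

def mean_over(x, ns, cs, hs, ws):
    """Mean of x over every index combination in ns x cs x hs x ws."""
    vals = [x[n][c][h][w] for n, c, h, w in product(ns, cs, hs, ws)]
    return sum(vals) / len(vals)

def bn_means(x, N, C, H, W):
    # BatchNorm: one statistic per channel, reduced over (N, H, W).
    return [mean_over(x, range(N), [c], range(H), range(W)) for c in range(C)]

def ln_means(x, N, C, H, W):
    # LayerNorm: one statistic per sample, reduced over (C, H, W).
    return [mean_over(x, [n], range(C), range(H), range(W)) for n in range(N)]

def in_means(x, N, C, H, W):
    # InstanceNorm: one statistic per (sample, channel), reduced over (H, W).
    return [[mean_over(x, [n], [c], range(H), range(W)) for c in range(C)]
            for n in range(N)]

def gn_means(x, N, C, H, W, G):
    # GroupNorm: one statistic per (sample, group), reduced over (C // G, H, W).
    size = C // G
    return [[mean_over(x, [n], range(g * size, (g + 1) * size), range(H), range(W))
             for g in range(G)]
            for n in range(N)]

N, C, H, W = 2, 4, 3, 3
x = make_tensor(N, C, H, W)
bn = bn_means(x, N, C, H, W)
ln = ln_means(x, N, C, H, W)
inorm = in_means(x, N, C, H, W)
gn = gn_means(x, N, C, H, W, G=2)

print("BN:", len(bn), "means (one per channel)")
print("LN:", len(ln), "means (one per sample)")
print("IN:", len(inorm) * len(inorm[0]), "means (one per sample and channel)")
print("GN:", len(gn) * len(gn[0]), "means (one per sample and group)")
```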

Additionally, note the difference between their affine parameters γ and β: for BN, IN, and GN, both γ and β are vectors whose length equals the number of channels C. For LN, however, both γ and β are tensors with the same shape as normalized_shape.

In NLP, for a feature with input (N, L, D):

Batch Normalization (BN) fixes the nth position of each sentence, and this slice is an (N, D) dimensional matrix.
LN fixes a single sentence, giving a slice of shape (L, D). LayerNorm then reduces along D, so the mean and standard deviation have shape (N, L, 1), and the final gamma and beta have shape (1, 1, D). In NLP, the mean and variance are thus computed only over the last dimension (n_features); that is, the normalized shape is (n_features). Unlike BatchNorm (which shares a single set of scaling and translation parameters across all elements within its normalized shape), LayerNorm scales and shifts each element within its normalized shape individually. In this sense, LayerNorm in NLP is essentially the same as InstanceNorm in computer vision.

[Figure: the four normalization methods, and BN vs. LN in CV and NLP]

0x06 Implementation

6.1 LayerNorm

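Below is the code for LayerNorm. It plays the same role as torch.nn.BatchNorm2d (normalize, then apply a learned scale and shift); only the dimensions over which the statistics are computed differ. For reference, let $x \in B$ be an input to batch normalization (BN) drawn from a mini-batch $B$; BN transforms $x$ according to the following expression: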

$BN(x)=\gamma \odot \frac{x-\hat\mu_B}{\hat\sigma_B}+\beta$

$\hat\mu_B$ is the sample mean of the mini-batch B, and $\hat\sigma_B$ is its sample standard deviation. After standardization, the mini-batch has zero mean and unit variance. Since unit variance (like some other magic numbers) is an arbitrary choice, we typically include a scale parameter $\boldsymbol{\gamma}$ and a shift parameter $\boldsymbol{\beta}$, which have the same shape as $\mathbf{x}$. Note that $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ are parameters that are learned along with the other model parameters.

import torch
import torch.nn as nn

# Layer normalization (LayerNorm) module: the "Norm" part of "Add & Norm" in the
# paper's figure. It plays the same role as torch.nn.BatchNorm2d.
class LayerNorm(nn.Module):
    # features: size of the feature dimension; eps: small value that prevents division by zero
    def __init__(self, features, eps=1e-6):
        """
        features: int, the number of features, i.e. the dimension of a word vector. Usually equal to d_model.
        eps: small value in the denominator of the normalization formula that prevents division by zero; default 1e-6.
        """
        super(LayerNorm, self).__init__() # call the constructor of the parent class nn.Module
        # These two parameters correspond to BatchNorm's parameters: a_2 is gamma (γ), b_2 is beta (β).
        # nn.Parameter registers them as model parameters so that gradient descent updates them.
        self.a_2 = nn.Parameter(torch.ones(features)) # 1-D tensor of size features, initialized to ones, trainable
        self.b_2 = nn.Parameter(torch.zeros(features))  # 1-D tensor of size features, initialized to zeros, trainable
        self.eps = eps # keep eps as an instance attribute

    # Forward pass; x is the input tensor
    def forward(self, x):
        """
        x is the output of the previous layer (attention or FFN). Its shape matches the decoder input, and it does not change anywhere along the way. For example, x may have shape (1, 7, 128): batch_size 1, 7 words, each word a 128-dimensional vector.
        """

        # Mean of x over the last dimension (the feature dimension of each token); keepdim keeps the number of dimensions
        mean = x.mean(-1, keepdim=True) # shape (1, 7, 1)
        # Standard deviation of x over the last dimension, again keeping the number of dimensions
        std = x.std(-1, keepdim=True) # shape (1, 7, 1)
        # Normalize x, then scale and shift with the trainable parameters a_2 and b_2
        # Same form as BN: output = (gamma * (x - mean) / (std + eps)) + beta
        # * is element-wise multiplication
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
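One detail of the class above is worth spelling out on a tiny example. x.std(...) uses the sample standard deviation by default (dividing by n - 1) and eps is added to the standard deviation, whereas torch.nn.LayerNorm is documented as using the biased variance (dividing by n) with eps inside the square root. The plain-Python sketch below applies both conventions to one token vector; the function names are illustrative and no PyTorch is required.

```python
import math
import statistics

def layer_norm_tutorial(x, gamma, beta, eps=1e-6):
    """Same arithmetic as the forward() above, for one token vector."""
    mean = sum(x) / len(x)
    std = statistics.stdev(x)  # sample std: divides by n - 1
    return [g * (a - mean) / (std + eps) + b for a, g, b in zip(x, gamma, beta)]

def layer_norm_biased(x, gamma, beta, eps=1e-5):
    """Biased variance with eps inside the square root."""
    mean = sum(x) / len(x)
    var = statistics.pvariance(x)  # population variance: divides by n
    return [g * (a - mean) / math.sqrt(var + eps) + b for a, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]
gamma = [1.0] * len(x)  # scale initialized to ones, as in the class above
beta = [0.0] * len(x)   # shift initialized to zeros

out_tutorial = layer_norm_tutorial(x, gamma, beta)
out_biased = layer_norm_biased(x, gamma, beta)
print(out_tutorial)
print(out_biased)
```

The two outputs differ by a factor of roughly sqrt(n / (n - 1)). That is clearly visible for a 4-element vector and negligible when the feature dimension is in the hundreds.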

The Encoder is used as follows.

class Encoder(nn.Module):
    "Core encoder is a stack of N layers"

    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        "Pass the input (and mask) through each layer in turn."
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

The Decoder is used as follows.

class Decoder(nn.Module):
    "Generic N layer decoder with masking."

    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

6.2 Residual

The implementation method in the Harvard tutorial is Pre-Norm.

$Norm \rightarrow Attention \rightarrow Add \rightarrow Norm \rightarrow Feed\ Forward \rightarrow Add$

Whether it’s Self-Attention or a fully connected layer, it first connects to a LayerNorm, then Self-Attention/Dense and Dropout, and finally residual connections. There’s a lot of reusable code here, so the Harvard code encapsulates this repeatable code into the SublayerConnection class. For simplicity, the code is implemented in the order shown in the diagram below.

[Figure: order of operations in SublayerConnection]

Let’s first look at how to use the SublayerConnection class. In the forward() function of the EncoderLayer class, when SublayerConnection is called, self_attn and self.feed_forward are passed in respectively. Calling self.sublayer[0] invokes the functionality of self_attn.

We simplify x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask)) to self.sublayer[0](x, z). Because self.sublayer[0] is a callable, self.sublayer[0](x, z) invokes self.sublayer[0].__call__(x, z), which in turn calls SublayerConnection.forward(x, z). As seen in the subsequent code, SublayerConnection.forward(x, z) takes two parameters: the input Tensor and a callable that accepts a single argument. In the end it calls sublayer(self.norm(x)); since sublayer is the passed-in z, this is z(self.norm(x)).

Let’s examine how self_attn is converted to z. self_attn has four parameters, but we know that in the Encoder, the first three parameters are the input x, and the fourth parameter is the mask. Since the mask is known, we can use the lambda technique to transform it into a function with one parameter: z = lambda x: self.self_attn(x, x, x, mask), where the input is x.
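The lambda technique can be tried in isolation. In the sketch below, self_attn, norm, and sublayer_connection are hypothetical stand-ins that only record their arguments, so the call chain is visible without any tensors; the residual addition and dropout are left out on purpose.

```python
def self_attn(query, key, value, mask):
    """Hypothetical stand-in for multi-head attention: records its arguments."""
    return ("attn", query, key, value, mask)

def norm(x):
    """Hypothetical stand-in for LayerNorm."""
    return f"norm({x})"

def sublayer_connection(x, sublayer):
    # Same call shape as SublayerConnection.forward: the sublayer gets ONE argument.
    return sublayer(norm(x))

mask = "mask"
# The lambda closes over mask and turns the 4-argument function into a 1-argument one.
z = lambda x: self_attn(x, x, x, mask)

result = sublayer_connection("x", z)
print(result)
```

The printed tuple shows that query, key, and value all receive the normalized input, while mask is supplied by the closure rather than by the caller.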

As can be seen, there is a Norm operation between the vector’s Position Encoding and its entry into the Encoder; Pre-Norm plays a role here.

class EncoderLayer(nn.Module):
    "Encoder is made up of self-attn and feed forward (defined below)"

    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        "Follow Figure 1 (left) for connections."
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask)) # pass self-attention in as the sublayer
        return self.sublayer[1](x, self.feed_forward) # pass the feed-forward network in as the sublayer

The SublayerConnection code is as follows. This class constructs LayerNorm and Dropout, but Self-Attention or Dense is not constructed here; it’s placed in EncoderLayer and passed in during forward. The advantage of this approach is greater generality. For example, the Decoder also needs to have LayerNorm, Dropout, and residual connections added before and after Self-Attention, Attention, or Dense, allowing for code reuse. Here, the passed-in sublayer must be a function that can be called with one parameter (or implements __call__).

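class SublayerConnection(nn.Module):
    """
    A residual connection that follows a layer norm, i.e. LayerNorm + sublayer (Self-Attention/Dense) + dropout + residual connection. For simplicity, LayerNorm is placed first here, which differs slightly from the original paper, where LayerNorm comes last.
    """
   
    # size: dimension of the layer; dropout: dropout rate
    def __init__(self, size, dropout):
        # size is d_model, the dimension of a word vector
        super(SublayerConnection, self).__init__() # call the constructor of the parent class nn.Module
        self.norm = LayerNorm(size)   # layer normalization over a feature dimension of size `size`
        self.dropout = nn.Dropout(dropout) # dropout layer

    # Forward pass; x is the input tensor and sublayer is the sublayer operation to run
    def forward(self, x, sublayer):
        """
        Apply residual connection to any sublayer with the same size.
        x is the output of the previous layer and the input of this one.
        sublayer is the attention layer or the feed-forward layer; it is called like a function with a single argument.
        """
        
        # Apply the residual connection to any sublayer of the same size:
        # first layer-normalize the input x, then pass the result to the sublayer (e.g. self-attention or the feed-forward network),
        # then apply dropout, which randomly disables some units to reduce overfitting,
        # and finally add the result to the original input x.
        # Because of the skip connection, the output is the input x plus the sublayer output after dropout.
        return x + self.dropout(sublayer(self.norm(x)))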

0x07 Optimization and Evolution

7.1 RMSNorm

Many current LLMs replace the traditional LayerNorm with RMSNorm (Root Mean Square Layer Normalization). The change reduces computation while keeping training stable and convergence fast. RMSNorm’s efficiency comes from its authors’ analysis of LN. LN has two important properties: re-centering (translation) invariance and re-scaling invariance. The authors of RMSNorm argue that re-scaling invariance, not re-centering invariance, is the key to LN’s success. Accordingly, whereas Layer Normalization first computes the mean and variance, RMSNorm drops the re-centering step (the mean is no longer subtracted) and keeps only the re-scaling: it computes the root mean square of the input features and divides the input by it. This computational approach gives RMSNorm the following advantages:

Better numerical stability. RMSNorm’s calculation method is more stable and can avoid outlier problems that may occur in Layer Normalization, especially when training large-scale models.
Faster training speed. By omitting the mean calculation in the normalization process, the algorithm becomes simpler, requires less computation, and can accelerate the model training process.
Comparable or better model performance. In some experiments, RMSNorm performs comparably to LN, or even slightly better.

The figure below shows some of the differences between LayerNorm and Root Mean Square Normalization (RMSNorm).

[Figure: differences between LayerNorm and RMSNorm]

The Llama implementation of RMSNorm in the Hugging Face transformers library is as follows.

import torch
from torch import nn
class LlamaRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        LlamaRMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        # Weight of the RMSNorm layer, same size as the hidden dimension; multiplied in during the forward pass
        self.weight = nn.Parameter(torch.ones(hidden_size))
        # Keeps rsqrt away from zero; it should not be too small either (1e-5 to 1e-7 is recommended) to avoid precision problems
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        # Input: hidden states of shape [batch_size, seq_len, hidden_size]
        # Remember the dtype of the input hidden states;
        # the intermediate computation uses torch.float32
        # and the result is cast back to the input dtype at the end
        input_dtype = hidden_states.dtype
        # Promote the hidden states to torch.float32 to avoid numerical errors
        hidden_states = hidden_states.to(torch.float32)
        # Mean of squares over the last dimension
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        # torch.rsqrt computes the reciprocal of the square root
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        # After dividing by the root mean square, cast back to the input dtype and multiply element-wise by the weight
        return self.weight * hidden_states.to(input_dtype)

The implementation of Llama3 is as follows.

class RMSNorm(torch.nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def _norm(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x):
        output = self._norm(x.float()).type_as(x)
        return output * self.weight
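The difference between the two invariances can be checked numerically. The sketch below is a plain-Python version of both normalizations for a single vector, with the learned weight omitted; the function names and test vector are illustrative. Scaling the input leaves both outputs essentially unchanged, while shifting the input changes the RMSNorm output but not the LayerNorm output.

```python
import math

def rms_norm(x, eps=1e-6):
    """x / sqrt(mean(x^2) + eps): re-scaling only, no mean subtraction."""
    mean_square = sum(a * a for a in x) / len(x)
    return [a / math.sqrt(mean_square + eps) for a in x]

def layer_norm(x, eps=1e-6):
    """(x - mean) / sqrt(var + eps): re-centering and re-scaling."""
    mean = sum(x) / len(x)
    var = sum((a - mean) ** 2 for a in x) / len(x)
    return [(a - mean) / math.sqrt(var + eps) for a in x]

def max_abs_diff(u, v):
    """Largest element-wise absolute difference between two vectors."""
    return max(abs(a - b) for a, b in zip(u, v))

x = [1.0, 2.0, 3.0, 4.0]
scaled = [3.0 * a for a in x]    # re-scaled input
shifted = [a + 10.0 for a in x]  # re-centered (shifted) input

print("RMSNorm,   scaled :", max_abs_diff(rms_norm(x), rms_norm(scaled)))
print("RMSNorm,   shifted:", max_abs_diff(rms_norm(x), rms_norm(shifted)))
print("LayerNorm, scaled :", max_abs_diff(layer_norm(x), layer_norm(scaled)))
print("LayerNorm, shifted:", max_abs_diff(layer_norm(x), layer_norm(shifted)))
```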

7.2 Deep Norm

Deep Norm, proposed by Microsoft in 2022, is a normalization method for improving the training stability of deep Transformers. According to the paper, Deep Norm combines the good performance of Post-LN with the training stability of Pre-LN, which allowed the authors to extend the depth of Transformers to 1000 layers. Tsinghua University’s GLM and ChatGLM use Post-LN based on Deep Norm.

The characteristics of Deep Norm are as follows:

Before applying LN, the residual input is up-scaled by a constant $\alpha$ .
During Xavier initialization, the initialization range of some parameters (parameters of fully connected layers, value projection layers, and output projection layers) is scaled down by $\beta$ .

Essentially, a coefficient $\alpha$ is introduced to balance the residual branch against the main (identity) branch. See the diagram below for details.

[Figure: Deep Norm]
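A minimal sketch of the Deep Norm residual update, x_{l+1} = LN(α·x_l + G(x_l)), is given below in plain Python. The constants α = (2N)^{1/4} and β = (8N)^{-1/4} for an N-layer encoder-only or decoder-only stack are quoted from memory of the DeepNet paper and should be treated as an assumption to verify; the function names and the sample sublayer output are hypothetical.

```python
import math

def layer_norm(x, eps=1e-6):
    """Zero-mean, unit-variance normalization of one vector (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((a - mean) ** 2 for a in x) / len(x)
    return [(a - mean) / math.sqrt(var + eps) for a in x]

def deepnorm_constants(num_layers):
    """Assumed constants for an encoder-only or decoder-only stack of num_layers."""
    alpha = (2 * num_layers) ** 0.25   # up-scales the residual input
    beta = (8 * num_layers) ** -0.25   # scales down the Xavier initialization range
    return alpha, beta

def deepnorm_residual(x, sublayer_out, alpha):
    # x_{l+1} = LN(alpha * x_l + G(x_l)); Post-LN corresponds to alpha = 1.
    return layer_norm([alpha * a + g for a, g in zip(x, sublayer_out)])

alpha, beta = deepnorm_constants(8)
x = [0.5, -1.0, 2.0, 0.0]
g = [0.1, 0.2, -0.3, 0.4]  # hypothetical sublayer output G(x)
y = deepnorm_residual(x, g, alpha)
print("alpha =", alpha, "beta =", beta)
print(y)
```

A larger α gives the identity branch more weight relative to the sublayer output before normalization, while β shrinks the initial scale of the sublayer weights; β is only computed here, not applied, because the sketch has no weight initialization.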

7.3 PRepBN

LayerNorm has to compute the mean and standard deviation over the feature dimension of every sample, at inference time as well as during training, which can be costly when the feature dimension is very large; in return, it trains stably. BatchNorm, at inference, simply reuses the mean and variance statistics accumulated during training, so its inference latency is lower, but using it in Transformers may cause training to collapse and performance to degrade.

The paper “SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization” proposes a novel PRepBN method that uses the hyperparameter lambda to control the ratio of the two normalization layers, progressively replacing the LayerNorm with the reparameterized BatchNorm during training. See the figure below for details.

[Figure: PRepBN]
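The progressive replacement can be sketched as a weighted mix, PRepBN(x) = λ·LN(x) + (1 − λ)·RepBN(x), with λ decaying from 1 to 0 during training. The plain-Python sketch below is a sketch under assumptions, not the SLAB implementation: the form RepBN(x) = BN(x) + η·x, the linear schedule, the use of fixed running statistics, and all names are illustrative.

```python
import math

def layer_norm(x, eps=1e-5):
    """Per-sample normalization over the feature dimension."""
    mean = sum(x) / len(x)
    var = sum((a - mean) ** 2 for a in x) / len(x)
    return [(a - mean) / math.sqrt(var + eps) for a in x]

def rep_bn(x, running_mean, running_var, eta, eps=1e-5):
    """Assumed RepBN(x) = BN(x) + eta * x, with BN using fixed per-feature statistics."""
    return [(a - m) / math.sqrt(v + eps) + eta * a
            for a, m, v in zip(x, running_mean, running_var)]

def lambda_schedule(step, total_steps):
    """Linear decay from 1 to 0 (the schedule shape is an assumption)."""
    return max(0.0, 1.0 - step / total_steps)

def prepbn(x, lam, running_mean, running_var, eta):
    """Mix of LayerNorm and RepBN controlled by lam in [0, 1]."""
    ln_out = layer_norm(x)
    bn_out = rep_bn(x, running_mean, running_var, eta)
    return [lam * a + (1.0 - lam) * b for a, b in zip(ln_out, bn_out)]

x = [0.2, -1.3, 0.7, 2.1]
running_mean = [0.0, -1.0, 0.5, 2.0]  # hypothetical statistics gathered in training
running_var = [1.0, 0.5, 2.0, 1.5]
eta = 0.1

for step in (0, 50, 100):
    lam = lambda_schedule(step, total_steps=100)
    print(step, lam, prepbn(x, lam, running_mean, running_var, eta))
```

At the start of training (λ = 1) the layer behaves exactly like LayerNorm; at the end (λ = 0) only the BatchNorm-style branch remains, which needs no per-sample statistics at inference.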

7.4 RealFormer

The paper “RealFormer: Transformer Likes Residual Attention” proposes the RealFormer model (Residual Attention Layer Transformer), as shown in the figure below. This model applies the residual structure to the attention layer, making the model more robust to training hyperparameters while ensuring improved model performance.

Specifically, RealFormer differs from the two structures mentioned earlier (“Pre-LN” and “Post-LN”) in that it adds a residual connection to the attention scores of every head in every layer: the attention scores of the current layer are added to the attention scores of the previous layer. Adding this skip connection inside the attention computation increases the computational cost only slightly, so efficiency remains good.

RealFormer adopts the same model design as Post-LN, but adds the Attention Scores matrix of the previous layer when calculating multi-head attention in each Transformer layer. That is, when calculating the attention matrix of the nth layer, the formula (1) is changed to formula (2).

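[Figure: formula (1), standard attention, and formula (2), attention with the previous layer’s scores added]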

Implementing the above calculation method requires only minor code changes to the Transformer model code, and it is also applicable when there is more than one type of attention module in the network.
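The change can be sketched for a single head in plain Python: the pre-softmax scores of the previous layer are added to the current scores, and the sum is handed on to the next layer. The function names and toy matrices are illustrative, and masking and multiple heads are omitted.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def residual_attention(q, k, v, prev_scores=None):
    """softmax(Q K^T / sqrt(d) + Prev) V for one head.

    Returns the output and the pre-softmax scores to pass to the next layer.
    """
    d = len(q[0])
    scores = [[sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d) for k_row in k]
              for q_row in q]
    if prev_scores is not None:
        # Residual connection on the attention scores.
        scores = [[s + p for s, p in zip(s_row, p_row)]
                  for s_row, p_row in zip(scores, prev_scores)]
    weights = [softmax(row) for row in scores]
    out = [[sum(w * v_row[j] for w, v_row in zip(w_row, v)) for j in range(len(v[0]))]
           for w_row in weights]
    return out, scores

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]

out1, scores1 = residual_attention(q, k, v)                       # first layer: no Prev
out2, scores2 = residual_attention(q, k, v, prev_scores=scores1)  # next layer adds Prev
print(scores1)
print(scores2)
```

Reusing the same q, k, and v for the second call is only for brevity; in a real model each layer has its own projections, and only the score matrix is carried over.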

7.5 nGPT

The paper “nGPT: Normalized Transformer with Representation Learning on the Hypersphere” proposes a novel neural network architecture: a normalized Transformer that performs representation learning on the hypersphere.

Research Background & Motivation

The Transformer architecture is fundamental to modern language models, and researchers have proposed numerous modifications to improve its training stability, inference cost, context length, and robustness. Among these, applying various normalization techniques is considered beneficial, such as adding normalization layers like LayerNorm and RMSNorm, and controlling the norm of weights through weight decay. Meanwhile, research has shown that representation learning on a hypersphere is associated with more stable training, greater separability in the embedding space, and better performance on downstream tasks. Building on this, the authors propose a Normalized Transformer, aiming to unify various findings and observations in this field.

Core contributions

The authors propose normalizing all vectors that form the embedding dimensions of the network matrices so that they lie on the unit-norm hypersphere. Matrix-vector multiplication can then be viewed as a set of dot products, each a cosine similarity in the range [-1, 1], which makes weight decay unnecessary.

The Normalized Transformer itself performs a multi-step optimization on the hypersphere (two steps per layer), where each attention and MLP update step is controlled by eigen learning rates (the diagonal elements of a learnable variable-metric matrix). For each token in the input sequence, the optimization path starts at the point on the hypersphere corresponding to its input embedding vector and moves toward the point that best predicts the embedding vector of the next token.

In the original decoder-only Transformer, the norms of the token embedding vectors are unconstrained, which can lead to inaccurate similarity estimates. In nGPT, the authors propose normalizing the stored embedding vectors after each step of the training algorithm. Since all nGPT embeddings are normalized, the logits in the original formula become dot products in the range [-1, 1], which limits the confidence (temperature) of the probability distribution produced by softmax. The authors therefore introduce a trainable scaling parameter to adjust it.

contrast

[Figure: baseline Transformer vs. Normalized Transformer]

Layers and Blocks

Baseline Transformer: Applies a stack of layer transformations to the hidden state, alternating self-attention (ATTN) and multilayer perceptron (MLP) blocks, with RMSNorm as the normalization.
Normalized Transformer: For any two points on the hypersphere, SLERP or its approximate LERP can be used to compute interpolation along the geodesic. The authors rewrite this as an update equation in nGPT, which involves update equations for the attention and MLP blocks, and controls the update process through learnable parameters and a normalization function Norm. Unlike the baseline Transformer, nGPT does not require additional normalization after the last layer.
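The LERP-style update can be sketched in plain Python as h ← Norm(h + α·(Norm(block_out) − h)), applied once for the attention block and once for the MLP block. The exact form of the equation, the per-dimension α values, and the sample block outputs below are assumptions made for illustration; Norm is L2 normalization onto the unit hypersphere.

```python
import math

def l2_normalize(v):
    """Project a non-zero vector onto the unit hypersphere."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def ngpt_step(h, block_out, alpha):
    """h <- Norm(h + alpha * (Norm(block_out) - h)), one alpha per dimension."""
    target = l2_normalize(block_out)
    return l2_normalize([a + al * (t - a) for a, al, t in zip(h, alpha, target)])

def dot(u, v):
    """Dot product; for unit vectors this is the cosine similarity."""
    return sum(a * b for a, b in zip(u, v))

h0 = l2_normalize([1.0, 2.0, 2.0])
attn_out = [0.5, -1.0, 0.25]   # hypothetical output of the attention block
mlp_out = [-0.3, 0.8, 1.2]     # hypothetical output of the MLP block
alpha_attn = [0.1, 0.1, 0.1]   # hypothetical learnable step sizes
alpha_mlp = [0.2, 0.2, 0.2]

h1 = ngpt_step(h0, attn_out, alpha_attn)  # step 1 of the layer
h2 = ngpt_step(h1, mlp_out, alpha_mlp)    # step 2 of the layer
print("norm after the layer:", math.sqrt(dot(h2, h2)))
print("cosine(h0, h2):", dot(h0, h2))
```

With α = 0 the hidden state does not move, and with α = 1 it jumps straight to the normalized block output; values in between move it part of the way along the sphere, and the final normalization keeps it at unit norm.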

Self-attention block

Baseline Transformer: The attention mechanism is a key component of the Transformer, allowing each token to attend to other tokens in the sequence. In the baseline Transformer, the input hidden state is first normalized using RMSNorm, then projected into query, key, and value vectors, and Rotary Position Embedding (RoPE) is applied to the queries and keys. Attention weights are obtained by computing the dot product of the query and key vectors, scaling them, and applying the softmax function. Finally, a weighted sum of the value vectors is calculated.
Normalized Transformer: The authors propose normalizing $W_k$ , $W_q$ , $W_v$ , and $W_o$ along their embedding dimensions, so that the computed dot product can be interpreted as the cosine similarity between unit norm vectors. Furthermore, q and k are additionally normalized to ensure that the dot product of each query and key is within controllable limits. Simultaneously, the softmax scaling factor is adjusted.

MLP Block

Baseline Transformer: The input hidden state of the MLP block is first normalized using RMSNorm, then two intermediate vectors are generated through two separate linear projections, combined using the SwiGLU activation function, and finally the output is obtained through a final linear transformation.
Normalized Transformer: The authors propose normalizing $W_u$ and $W_v$ , and introduce scaling factors $S_u$ and $S_v$ to rescale the intermediate states of the MLP block.

Switch

The steps to convert a baseline Transformer to a normalized Transformer include: removing all normalization layers; normalizing all matrices along their embedding dimensions after each training step; replacing the update equations; changing the softmax scaling factor in attention and rescaling and normalizing the queries and keys; rescaling the intermediate states of the MLP blocks; rescaling the logits; and removing weight decay and learning rate warm-up.

7.6 DyT

The paper “Transformers without Normalization” proposes element-wise operations of dynamic Tanh (DyT), which aim to enable Transformer models without normalization layers to achieve the same or even better performance, and in most cases, without the need to adjust hyperparameters.

[Figure: Dynamic Tanh (DyT)]

motivation

The paper’s exploration begins with an observation: LN layers map their inputs to outputs via an S-shaped curve similar to tanh, scaling input activations while compressing extreme values. Inspired by this, the paper proposes an element-wise operation called Dynamic Tanh (DyT), defined as: DyT(x) = tanh(αx), where α is a learnable parameter. This operation aims to learn an appropriate scaling factor through α and compress extreme values through the bounded tanh function, thus mimicking the behavior of LN layers. Notably, unlike normalization layers, DyT achieves both effects without computing activation statistics.

The authors selected three models for analysis. For these three trained networks, a mini-batch of samples was sampled and forward-propagated through the networks. Then, the inputs and outputs of the normalization layers—the tensors before and after the normalization operation, excluding learnable affine transformations—were measured. Since LNs preserve the dimensionality of the input tensors, we can establish a one-to-one correspondence between the input and output tensor elements, thus directly visualizing the relationships between them. These mappings are plotted in the figure below.

[Figure: input-output mappings of LN layers in the three models]

The authors found that in the earlier LN layers the relationship between input and output is mostly linear, resembling a straight line in an x-y plot. In deeper LN layers, however, the curves closely resemble a complete or partial S-curve, like the tanh function.

Implementation

Inspired by the similarity in shape between normalization layers and the scaled tanh function, the authors propose Dynamic Tanh (DyT) as an alternative to normalization layers. Given an input tensor x, the definition of the DyT layer is shown in Figure 1 below.

[Figure: definition of the DyT layer]

Here, alpha is a learnable scalar parameter that allows scaling based on the input range, thus adapting to different x-scales (as shown in Figure 2). This is why we name the entire operation “Dynamic” Tanh. gamma and beta are learnable, per-channel vector parameters, the same parameters used in all normalization layers—they allow the output to scale to arbitrary scales. Sometimes this is considered a separate affine layer; however, in our design, we treat them as part of the DyT layer, just as the normalization layer contains these parameters.

DyT is not a new type of normalization layer because it operates independently on each input element in the tensor during the forward propagation, without computing statistics or other types of aggregations. However, it does retain the effect of normalization layers in nonlinearly compressing extreme values, while performing a nearly linear transformation on the central portion of the input.
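A minimal plain-Python sketch of the layer as described above, DyT(x) = γ·tanh(αx) + β, is shown below for a single feature vector. The class name, the list-based parameters, and the initial value 0.5 for α are assumptions made for illustration; in a real model α, γ, and β would be trainable tensors.

```python
import math

class DyT:
    """Element-wise Dynamic Tanh: gamma * tanh(alpha * x) + beta."""

    def __init__(self, num_features, alpha_init=0.5):
        self.alpha = alpha_init            # learnable scalar (fixed here)
        self.gamma = [1.0] * num_features  # per-channel scale, initialized to ones
        self.beta = [0.0] * num_features   # per-channel shift, initialized to zeros

    def __call__(self, x):
        # No mean or variance is computed: every element is transformed on its own.
        return [g * math.tanh(self.alpha * a) + b
                for a, g, b in zip(x, self.gamma, self.beta)]

layer = DyT(num_features=3)
out = layer([0.0, 0.1, 100.0])
print(out)
```

Small inputs pass through almost linearly (scaled by α), while the extreme value 100.0 is squashed to the tanh bound, which mirrors the behavior described in the text.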

0xFF Reference

Challenging tradition: The Transformer architecture without a normalization layer (MLSys2024)

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

A Discussion on Model Optimization: Why is the Initial Standard Deviation of BERT 0.02? (by Su Jianlin )

Group Normalization

Highway Networks

https://github.com/xinghaochen/SLAB

LLM Inference Acceleration 2: PREpBN/Turbo Sparse/MatMul-free/KIVI/Speculative Decoding akaihaoshuai

NGPT: Normalized Transformer for Representation Learning on Hyperspheres ( Jiang Xiaopi Bupi)

On Layer Normalization in the Transformer Architecture

On the Nonlinearity of Layer Normalization

PowerNorm: Rethinking Batch Normalization in Transformers

Rethinking Batch Normalization in Transformers

Root Mean Square Layer Normalization Biao Zhang , Rico Sennrich

Self-Attention and Transformer reeds

SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization

A Few Things About Normalization in Transformers (Linsight)

Why does Transformer use LayerNorm instead of BatchNorm? The Way of AI Algorithms

Transformer Upgrade Path: 15. Key Normalization Aids Length Extrapolation ( Su Jianlin)

Understanding the Difficulty of Training Transformers

RealFormer: Transformer Likes Residual Attention

[AI Confusion] The Past, Present, and Principles of Residual Networks (Part Three)

[Deep Learning Notes] Pre-Norm, Post-Norm, and DeepNorm

10,000-word line-by-line analysis and implementation of Transformer, with German-to-English translation practice (Part 2) iioSnail

Why is Pre Norm less effective than Post Norm? (Su Jianlin )

What would happen if the network used both batch normalization and layer normalization simultaneously? (12 likes, 0 comments)

How would you evaluate DeepNet, proposed by Microsoft Research Asia, which increases the Transformer layer to 1000 layers? (Tang Xianghao )

Commonly used Normalization: Overview of BN, LN, IN, and GN (ming6383 )

A Brief Discussion on Transformer Initialization, Parameterization, and Standardization (by Su Jianlin)

In-depth analysis of ResNet V1 residual network (with source code) Ruled by Math

After reading this, you might gain a better understanding of Batch Normalization . (DengBoCong)

Random Thoughts: Minor Details of Transformer

Decoding the Tech Beast of Group Normalization

Detailed Explanation of Normalization in Deep Learning: BN/LN/WN (Juliuszh)

https://github.com/meta-llama/llama3/blob/11817d47e1ba7a4959b025eb1ca308572e0e3963/llama/model.py#L195

https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html

https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html#torch.nn.LayerNorm

Yunhao Ni, Yuxin Guo, Junlong Jia, and Lei Huang. On the nonlinearity of layer normalization. arXiv preprint arXiv:2406.01255, 2024.
