Exploring the Transformer Series (13) --- FFN

0x00 Overview

The Transformer extracts and processes sequence information in two steps. Taking the encoder of the original Transformer as an example, each layer contains a multi-head self-attention (MHSA) block followed by an FFN (feed-forward network).

The FFN is a simple network consisting of two linear transformations with an activation function between them (linear + ReLU + linear). Because the attention mechanism alone may not fit complex functions well enough, the Transformer authors added these two layers to increase the model’s capacity and nonlinearity.

[Figure 1301]

0x01 Network Structure

Feed-forward networks can be divided into two main types: standard FFN and gated FFN.

Standard FFN: the structure commonly used in neural networks. It consists of two linear layers with an activation function between them.
Gated FFN: in addition to the standard structure, a gating layer is added to strengthen the network’s ability to control and regulate the flow of information.

Over time, preferences between these two types have shifted. The right side of the graph above shows the trend of FFN types used in small language models (SLMs) from 2022 to 2024, with standard FFNs gradually being replaced by gated FFNs. This article mainly introduces the standard FFN, which is the implementation used in the Transformer paper.

[Figure 1302]

1.1 Mathematical Representation

The FFN layer is a two-layer fully connected layer. The first layer uses ReLU as the activation function, while the second layer does not use an activation function. In addition to ReLU, Dropout is used between the two linear transformations.

$FFN(x)=max(0, xW_1+b_1)W_2+b_2$

The first linear layer. Its input is $X \in R^{d_{input}\times d_{model}}$, the output of multi-head attention: the attention results for each input position, stacked as $d_{input}$ rows of $d_{model}$ columns. The first linear layer typically expands the dimension of the input. For example, if the input dimension is 512, the output dimension might be 2048. This lets the model learn more complex functions and better integrate the output of the preceding multi-head attention.
ReLU activation. This is a non-linear activation function. It is simple: it returns 0 if the input is negative and the input itself otherwise. ReLU lets the model learn non-linear functions, and can be understood as a non-linear filter applied to the vector. Its mathematical expression here is $max(0, xW_1+b_1)$.
The second linear layer. It reverses the expansion of the first linear layer, projecting the dimension back to the original size. The final output matrix of the FFN therefore has the same shape as the input $X$.

The above structure performs the same transformation on each row of the input $X$ (rows do not interact, i.e., “separately and identically”). The transformation is identical at every position and only differs in its parameters from layer to layer. That is, parameters are shared across rows (tokens), but the learned parameter matrices differ across layers. We can represent the above structure as follows, where $d$ is the embedding size (512 in the Transformer) and $d_{ffn}$ is the hidden layer dimension of the FFN (2048 in the Transformer).

[Figure 1303]
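
The formula and shapes above can be sketched in a few lines of PyTorch. This is only an illustration under assumptions that are not in the original article: the weights are random rather than trained, and the sequence length of 10 is an arbitrary choice.

```python
import torch

torch.manual_seed(0)

# d_input is the number of positions; the value 10 is arbitrary.
d_input, d_model, d_ffn = 10, 512, 2048

# Randomly initialised parameters, used only to illustrate the shapes.
W1 = torch.randn(d_model, d_ffn) * 0.02
b1 = torch.zeros(d_ffn)
W2 = torch.randn(d_ffn, d_model) * 0.02
b2 = torch.zeros(d_model)

def ffn(x):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every row of x."""
    hidden = torch.relu(x @ W1 + b1)   # (rows, d_ffn): expand, then ReLU
    return hidden @ W2 + b2            # (rows, d_model): project back

X = torch.randn(d_input, d_model)      # stand-in for the attention output
Y = ffn(X)
print(Y.shape)                         # same shape as X

# Rows are transformed independently: feeding row 3 alone gives the same result.
row_alone = ffn(X[3:4])
print(torch.allclose(Y[3:4], row_alone, atol=1e-5))
```

Because the same `W1` and `W2` are used for every row, the parameter count does not depend on the sequence length.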

Ultimately, the weights of the FFN live in these two linear layers. Attention integrates different entities within the same feature space and weighs their importance relative to one another. The FFN, on the other hand, maps each entity from feature space A to feature space B. The two operate at different granularities. Furthermore, starting with T5, many models no longer use biases in the FFN layer.

1.2 Intermediate Layer Ratio

The intermediate ratio of a feed-forward network (FFN) is the ratio between the dimension of the intermediate layer and the dimension of the hidden state. Simply put, it determines how large the intermediate layer is relative to the rest of the network. Standard FFNs typically set the intermediate ratio to 4, meaning the intermediate layer is four times as wide as the hidden state. Gated FFNs are more flexible, with intermediate ratios anywhere from 2 to 8, chosen based on the characteristics of the model.

If the intermediate ratio is set too low, the model has fewer parameters and performs worse. If it is set too high, peak memory usage becomes excessive. A balance is therefore required. The figure below shows the trend of intermediate ratios for different feed-forward networks from 2022 to 2024.

[Figure 1304]
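
To make the trade-off concrete, the following sketch counts FFN weights as a function of the intermediate ratio. The helper `ffn_param_count` and the hidden size of 512 are hypothetical choices for illustration; biases are ignored.

```python
def ffn_param_count(d_model, ratio, gated=False):
    """Number of weights in one FFN block (biases ignored).

    ratio is the intermediate ratio d_ff / d_model. A gated FFN has two
    input-side matrices (gate and up) plus the output projection; a
    standard FFN has one input-side matrix plus the output projection.
    """
    d_ff = int(ratio * d_model)
    n_matrices = 3 if gated else 2
    return n_matrices * d_model * d_ff

d_model = 512  # hypothetical hidden size, chosen only for illustration
for ratio in (2, 4, 8):
    standard = ffn_param_count(d_model, ratio)
    gated = ffn_param_count(d_model, ratio, gated=True)
    print(f"ratio={ratio}: standard={standard}, gated={gated}")
```

The weight count grows linearly with the ratio, and so does the size of the intermediate activation per token, which is what drives peak memory.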

1.3 Position-wise

The paper names this FFN “Position-wise Feed-Forward Networks.” “Position-wise” means that the same transformation is applied to each element (each position) in the sequence. The authors emphasize “position-wise” because the FFN has the following characteristics (which are also compared with the attention mechanism here):

The modeling only considers individual positions. The FFN layer performs an independent nonlinear transformation (which can be understood as a transformation and translation from the perspective of matrix operations) on the information representation of a single token corresponding to each row of the input matrix (each token, i.e., each position). Because FFN performs the same operation on the token vector at each position in the sequence, the fully connected layer at each time step can be computed independently and in parallel, which can improve the speed of training and inference.
There is no information exchange between elements. The Transformer already uses attention mechanisms to consider the semantics and dependencies of words at different positions, and performs a global aggregation of information in the sequence at each position. Because when the information reaches FFN, each token already includes the information it is interested in at the token level, and the context in the sequence has already been aggregated, there is no need for further interaction at FFN (interaction between elements relies entirely on self-attention). What FFN does is, after the information exchange between elements at the attention layer, let each element digest and integrate its own information, preparing for the next layer to exchange information again through self-attention.
The granularity of computation is the dimension within the token. Attention mechanisms can capture the contextual relationships in a sequence by mixing tokens at different positions, and their computation is at the token level. In contrast, FFN only considers information from a single position when processing sequence data, mixing features from different dimensions of each token (without interaction between tokens), and completing feature mapping within the token itself.
Fine-grained reprocessing. MHA allows the model to learn information in different representation subspaces, while FFN allows the model to utilize contextual information generated by the attention mechanism and further transform this information to capture more complex relationships in the data. Therefore, in FFN, each row of the matrix is operated independently, processing the contextual information of each token into the final semantic space vector required.

[Figure 1305]

Alternatively, it can be explained from the perspective of convolution. The matrices $W_1 \in R^{d_{model}\times d_{ffn}}$ and $W_2 \in R^{d_{ffn}\times d_{model}}$ have transposed shapes. According to the Transformer authors, the FFN can be understood as “two convolutions with kernel size 1,” meaning that the position-wise FFN is equivalent to convolutions with kernel size 1, so each position (token) is computed independently. Why specify a kernel size of 1? Because if it were greater than 1, adjacent positions would depend on each other, and the layer could no longer be called position-wise.
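
The equivalence with a kernel-size-1 convolution can be checked numerically. The sketch below copies the weights of an `nn.Linear` into an `nn.Conv1d` and compares the outputs; the small dimensions are arbitrary values picked for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ffn, seq_len = 8, 32, 5   # small, arbitrary sizes

linear = nn.Linear(d_model, d_ffn)
conv = nn.Conv1d(d_model, d_ffn, kernel_size=1)

# Reuse the linear weights in the convolution:
# (d_ffn, d_model) -> (d_ffn, d_model, 1)
with torch.no_grad():
    conv.weight.copy_(linear.weight.unsqueeze(-1))
    conv.bias.copy_(linear.bias)

x = torch.randn(1, seq_len, d_model)                # (batch, positions, features)
out_linear = linear(x)                              # (1, seq_len, d_ffn)
# Conv1d expects (batch, channels, positions), so transpose in and out.
out_conv = conv(x.transpose(1, 2)).transpose(1, 2)
print(torch.allclose(out_linear, out_conv, atol=1e-5))

# Position-wise: shuffling the positions just shuffles the outputs.
perm = torch.tensor([4, 0, 3, 1, 2])
out_shuffled = linear(x[:, perm])
print(torch.allclose(out_shuffled, out_linear[:, perm], atol=1e-5))
```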

In summary, the FFN is essentially a position-wise MLP that expands the dimension, applies an activation, and projects back to the original dimension.

1.4 Activation Function

Activation functions are non-linear functions in neural networks used to introduce non-linear relationships between neurons, enabling the model to learn and represent complex data patterns. Without activation functions, neural networks, regardless of their number of layers, can only represent linear relationships between inputs and outputs, which greatly limits their ability to handle complex problems.

Common functions

In feedforward neural networks (FFNs), there are several commonly used activation functions:

ReLU (Rectified Linear Unit): ReLU acts like a switch, turning the information flow on or off, and it has a wide range of applications.
GELU (Gaussian Error Linear Unit): GELU is an activation function that smoothly transitions between zero and positive values.
SiLU (Sigmoid Linear Unit): SiLU is an activation function that combines the properties of the sigmoid function and the linear function. Essentially, it is the Swish activation function with $\beta = 1$.

These activation functions are discussed in detail in the paper “GLU Variants Improve Transformer,” which proposes using variants of GLU (replacing the original sigmoid activation function in GLU with other activation functions) to improve the FFN layer of the Transformer, and lists three variants: ReGLU, GEGLU, and SwiGLU. In the naming convention, the abbreviation of the activation function is prefixed to GLU. The paper replaces the first fully connected layer and activation function in the FFN with such a GLU variant and removes the bias terms from the GLU. The specific formula is as follows.

[Figure 1306]

The image below shows information about common large models, from which you can see how activation functions are used.

[Figure 1307]

Over time, the use of these activation functions has changed. In 2022, ReLU became the preferred activation function for many FFNs. However, in 2023, there was a transition to GELU and its variant, GELU Tanh. By 2024, SiLU had become the dominant choice for activation functions. This is illustrated in the figure below.

[Figure 1308]

ReLU

The ReLU function is the rectified linear unit, proposed by Vinod Nair and Geoffrey Hinton in their paper “Rectified Linear Units Improve Restricted Boltzmann Machines”. Its formula is:

$ReLU(x)=max(0, x)$

The ReLU function outputs the same value as the input when the input is greater than 0, and zero otherwise. The advantages of ReLU are its simple computation and fast convergence. Compared to the Sigmoid and Tanh functions, ReLU has a constant gradient of 1 in the positive interval, which helps alleviate the vanishing gradient problem and makes deep networks easier to train. However, it also has a problem: when the input is less than 0, the gradient is zero, which can cause neurons to fail to update their weights, leading to the “neuron death” problem.

GLU

The paper “GLU Variants Improve Transformer” proposes that activation functions can be improved using gated linear units (GLUs). GLU was first proposed in the 2016 paper “Language Modeling with Gated Convolutional Networks”. GLU is not actually an activation function, but rather a layer in a neural network: a linear transformation combined with a gating mechanism. The gate is a sigmoid function that controls how much information can pass through. Its formula is as follows:

$GLU(x, W, V, b, c)=(xW+b)\otimes \sigma(xV+c)$

Where ⊗ represents element-wise multiplication.

$x$ is the input.
$W$ and $V$ are weight matrices.
$b$ and $c$ are bias terms. Note that some papers place the gating of GLU on the $W$ branch, i.e.

$GLU(x, W, V, b, c)=\sigma(xW+b)\otimes (xV+c)$
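
As a quick check of the first form of the formula, the sketch below implements GLU directly and compares it with PyTorch's `torch.nn.functional.glu`, which splits its input in half along a dimension and gates the first half with the sigmoid of the second. The tensor sizes are arbitrary example values.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def glu(x, W, V, b, c):
    """GLU(x, W, V, b, c) = (xW + b) * sigmoid(xV + c), element-wise product."""
    return (x @ W + b) * torch.sigmoid(x @ V + c)

d_in, d_out = 6, 4                      # hypothetical sizes for the example
x = torch.randn(3, d_in)
W, V = torch.randn(d_in, d_out), torch.randn(d_in, d_out)
b, c = torch.randn(d_out), torch.randn(d_out)

out = glu(x, W, V, b, c)

# F.glu treats the first half of the last dimension as the linear part and
# passes the second half through the sigmoid gate.
reference = F.glu(torch.cat([x @ W + b, x @ V + c], dim=-1), dim=-1)
print(torch.allclose(out, reference, atol=1e-6))
```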

GELU

The paper “Gaussian Error Linear Units (GELUs)” proposes the GELU (Gaussian Error Linear Unit) function, a smoothed version of ReLU. GELU smooths the input using a Gaussian error function (the cumulative distribution function of a standard normal distribution), thereby improving model performance. The mathematical expression for the GELU function is as follows:

$GELU(x)=x\cdot \Phi(x)$

Where:

$x$ is the input.
$\Phi(x)$ is the cumulative distribution function of the standard normal distribution, defined as:

$\Phi(x)=\frac{1}{2}\left(1+erf\left(\frac{x}{\sqrt{2}}\right)\right)$

$erf(x)$ is the error function. Because computing it used to be expensive, the paper provided two approximations built from elementary functions, but many frameworks can now compute it exactly.

[Figure 1309]
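
The exact form and the two approximations from the GELU paper (a tanh-based one and a sigmoid-based one) can be compared numerically. This is a small sketch; the function names and the evaluation range of [-4, 4] are my own choices.

```python
import math
import torch
import torch.nn.functional as F

def gelu_exact(x):
    # x * Phi(x), with Phi written through the error function
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh-based approximation from the GELU paper
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def gelu_sigmoid(x):
    # sigmoid-based approximation from the GELU paper
    return x * torch.sigmoid(1.702 * x)

x = torch.linspace(-4.0, 4.0, 801)
exact = gelu_exact(x)

# PyTorch's default F.gelu computes the exact (erf) form.
print(torch.allclose(exact, F.gelu(x), atol=1e-5))
print("max |tanh approx - exact|   :", (gelu_tanh(x) - exact).abs().max().item())
print("max |sigmoid approx - exact|:", (gelu_sigmoid(x) - exact).abs().max().item())
```

The tanh form tracks the exact function more closely than the sigmoid form, which is why it is the variant frameworks usually expose alongside the exact one.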

SwiGLU

SwiGLU (Swish-Gated Linear Unit) combines the characteristics of Swish and GLU (Gated Linear Unit). Essentially, SwiGLU is a variant of GLU that uses Swish as the gate activation and removes the bias. Compared to ReLU, SwiGLU can improve model performance. The core difference between the two is:

The ReLU function sets all negative inputs to zero, while positive inputs remain unchanged.
The Swish gate in SwiGLU has a parameter $\beta$ that adjusts how sharply the function interpolates. As $\beta$ increases, the behavior of the Swish gate gradually approaches that of ReLU.

Swish function

The Swish function was proposed by the Google team in their 2017 paper “Searching for Activation Functions,” and its formula and effects are shown in the figure below. The Swish function has a smooth curve and is differentiable at all points. This is very helpful in the model optimization process and is considered one of the reasons why Swish outperforms ReLU.

The mathematical expression for the Swish function is:

$Swish(x)=x\cdot \sigma(\beta x)$

where $\sigma$ is the sigmoid function, defined as follows:

$\sigma(x)=\frac{1}{1+e^{-x}}$

Multiplying the input $x$ by $\sigma(\beta x)$ makes Swish similar to the gating mechanism in LSTM, so Swish is also known as a self-gated activation function: it needs only a single scalar input to perform the gating.

$\beta$ controls the shape of the function; it is either a constant or a parameter learned by the model. When $\beta=0$, Swish degenerates into the linear function $x/2$; as $\beta$ approaches infinity, Swish approaches ReLU. In most cases, $\beta$ is set to 1, which simplifies the function to $Swish(x)=x\cdot \sigma(x)$. This form is also called SiLU (Sigmoid Linear Unit).

[Figure 1310]
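
The three regimes of $\beta$ can be verified with a short sketch. The helper `swish` and the value $\beta = 50$ standing in for "large" are choices made for this example.

```python
import torch
import torch.nn.functional as F

def swish(x, beta=1.0):
    """Swish(x) = x * sigmoid(beta * x)."""
    return x * torch.sigmoid(beta * x)

x = torch.linspace(-5.0, 5.0, 101)

# beta = 0: sigmoid(0) = 1/2, so Swish is the linear function x / 2.
print(torch.allclose(swish(x, beta=0.0), x / 2))

# beta = 1: Swish coincides with SiLU.
print(torch.allclose(swish(x, beta=1.0), F.silu(x), atol=1e-6))

# Large beta: Swish gets close to ReLU (the gap shrinks as beta grows).
gap = (swish(x, beta=50.0) - torch.relu(x)).abs().max().item()
print("max |Swish_50 - ReLU| on the grid:", gap)
```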

SwiGLU activation function

The mathematical expression for SwiGLU is $f(X) = (XW + b) \otimes Swish(XV + c)$, where $\otimes$ represents the Hadamard product. Writing $a = XW + b$ for the linear branch and $g = XV + c$ for the gate branch, this can also be written as:

$SwiGLU(a,g)=a\otimes Swish(g)$

where $a$ and $g$ are the input tensors,

$\sigma(x)=\frac{1}{1+e^{-x}}$

is the sigmoid function, and

$Swish(x)=x\cdot \sigma(x)$

is the Swish activation function.

SwiGLU essentially replaces the first fully connected layer and the ReLU in the Transformer’s FFN. The original FFN uses two fully connected layers: the first increases the dimension and the second reduces it back to the input dimension, with the ReLU activation function between them. SwiGLU also uses fully connected layers and an activation function, but it transforms the input with two separate weight matrices, applies Swish to one of the results, and combines the two with a Hadamard product. Because the FFN still has its second fully connected layer, the FFN module with SwiGLU has three weight matrices: W1 and V are the two weight matrices of the SwiGLU module, and W2 is the weight matrix of the second fully connected layer of the original FFN.

[Figure 1311]
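
A minimal sketch of such a three-matrix FFN follows. The class name `SwiGLUFFN` and the sizes are hypothetical. Swish is applied to the `W1` branch here, following the bias-free form in “GLU Variants Improve Transformer”; as noted for GLU, which branch carries the gate is a matter of convention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """FFN with a SwiGLU unit: (Swish(x W1) * (x V)) W2, without biases."""

    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.W1 = nn.Linear(d_model, d_hidden, bias=False)  # gate branch, goes through Swish
        self.V = nn.Linear(d_model, d_hidden, bias=False)   # linear branch
        self.W2 = nn.Linear(d_hidden, d_model, bias=False)  # projection back to d_model

    def forward(self, x):
        # F.silu is Swish with beta = 1
        return self.W2(F.silu(self.W1(x)) * self.V(x))

torch.manual_seed(0)
ffn = SwiGLUFFN(d_model=64, d_hidden=172)   # arbitrary example sizes
x = torch.randn(2, 10, 64)                  # (batch, positions, d_model)
y = ffn(x)
n_weights = sum(p.numel() for p in ffn.parameters())
print(y.shape, n_weights)
```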

With SwiGLU, the FFN goes from two weight matrices to three. To keep the number of model parameters roughly the same, researchers typically shrink the hidden layer, for example reducing the intermediate dimension to 2/3 of its original size, so that each matrix has shape $(h, 8h/3)$. Furthermore, so that the intermediate dimension is a multiple of 256, it is rounded up to the next multiple of 256.
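
The arithmetic can be sketched as follows. The helper `swiglu_hidden_dim` is a hypothetical name; its rounding mirrors the `multiple_of` logic in the llama3 `FeedForward` code shown in section 2.2, and the model width of 4096 is an example value.

```python
def swiglu_hidden_dim(dim, multiple_of=256):
    """Intermediate size for a SwiGLU FFN: 2/3 of 4*dim, rounded up."""
    hidden = 4 * dim                      # width a standard FFN would use
    hidden = int(2 * hidden / 3)          # shrink to 2/3, i.e. about 8*dim/3
    # round up to the next multiple of `multiple_of`
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

dim = 4096                                # example model width
hidden = swiglu_hidden_dim(dim)

standard_weights = 2 * dim * (4 * dim)    # two matrices of shape (h, 4h)
swiglu_weights = 3 * dim * hidden         # three matrices of shape (h, ~8h/3)
print(hidden, standard_weights, swiglu_weights, swiglu_weights / standard_weights)
```

For a width of 4096 the three-matrix FFN ends up within about one percent of the weight count of the standard two-matrix FFN; the small excess comes from rounding up.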

Implementation

Let’s take a look at the LlamaMLP code. LLaMA uses the constant $\beta=1$, at which point Swish simplifies to $Swish(x)=x\cdot \sigma(x)$, so its SwiGLU uses nn.SiLU.

class LlamaMLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=config.mlp_bias)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
        return down_proj

As can be seen from ACT2CLS, nn.SiLU was used.

ACT2CLS = {
    "gelu": GELUActivation,
    "gelu_10": (ClippedGELUActivation, {"min": -10, "max": 10}),
    "gelu_fast": FastGELUActivation,
    "gelu_new": NewGELUActivation,
    "gelu_python": (GELUActivation, {"use_gelu_python": True}),
    "gelu_pytorch_tanh": PytorchGELUTanh,
    "gelu_accurate": AccurateGELUActivation,
    "laplace": LaplaceActivation,
    "leaky_relu": nn.LeakyReLU,
    "linear": LinearActivation,
    "mish": MishActivation,
    "quick_gelu": QuickGELUActivation,
    "relu": nn.ReLU,
    "relu2": ReLUSquaredActivation,
    "relu6": nn.ReLU6,
    "sigmoid": nn.Sigmoid,
    "silu": nn.SiLU,
    "swish": nn.SiLU,
    "tanh": nn.Tanh,
}
ACT2FN = ClassInstantier(ACT2CLS)

dReLU

Research on optimizing activation functions continues. For example, because activation sparsity can significantly accelerate LLM inference without hurting performance, the paper “Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters” proposes a new dReLU function designed to improve LLM activation sparsity (achieving sparsity close to 90%). The dReLU formula and its effect are as follows.

[Figure 1312]
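
Since the formula itself is only in the figure, the sketch below assumes the form of dReLU as I understand it from the Turbo Sparse paper: ReLU applied to both the gate projection and the up projection before they are multiplied. Treat that form, the names `W_gate` and `W_up`, and all sizes as assumptions. With random, untrained weights the measured sparsity will not match the roughly 90% reported for trained models; the point is only the contrast with a SiLU gate.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_hidden, n_tokens = 64, 256, 128      # arbitrary example sizes
x = torch.randn(n_tokens, d_model)
W_gate = torch.randn(d_model, d_hidden) / d_model ** 0.5
W_up = torch.randn(d_model, d_hidden) / d_model ** 0.5

def swiglu_hidden(x):
    # SwiGLU-style hidden activation: SiLU gate times linear branch
    return F.silu(x @ W_gate) * (x @ W_up)

def drelu_hidden(x):
    # Assumed form of dReLU: ReLU on both the gate and the up projection
    return torch.relu(x @ W_gate) * torch.relu(x @ W_up)

def sparsity(h):
    """Fraction of entries that are exactly zero."""
    return (h == 0).float().mean().item()

print("SwiGLU hidden sparsity:", sparsity(swiglu_hidden(x)))
print("dReLU hidden sparsity :", sparsity(drelu_hidden(x)))
```

An entry survives dReLU only when both projections are positive, so many more hidden units are exactly zero and the corresponding rows of the down projection can be skipped.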

0x02 Implementation

2.1 Harvard Code

The characteristics of the two linear layers are as follows, where B is the batch size, L is the seq length, and D is the feature dimension.

name	Operator type	Input shape	Weight Shape	Output shape	Other notes
FFN expansion	dense	(B, L, D)	(D, 4D)	(B, L, 4D)	Dimensional expansion to 4D
FFN contraction	dense	(B, L, 4D)	(4D, D)	(B, L, D)	Dimensions reduced back to D

The code implementation is very simple:

import torch.nn as nn

# Feed-forward layer implemented as an nn.Module subclass named PositionwiseFeedForward
class PositionwiseFeedForward(nn.Module):

    def __init__(self, d_model, d_ff, dropout=0.1):
        """
        d_model: input dimension of the first linear layer and output dimension of the second
        d_ff: number of hidden units; output dimension of the first linear layer and input dimension of the second
        dropout: probability of zeroing an element
        """
        super(PositionwiseFeedForward, self).__init__()
        # Instantiate the two linear layers, self.w_1 and self.w_2
        self.w_1 = nn.Linear(d_model, d_ff)  # first fully connected layer: d_model -> d_ff
        self.w_2 = nn.Linear(d_ff, d_model)  # second fully connected layer: d_ff -> d_model
        self.dropout = nn.Dropout(dropout)   # dropout layer with the given probability

    # Forward pass
    def forward(self, x):
        """
        x is the output of the previous layer.

        Steps:
        1. Apply the first linear layer.
        2. Apply ReLU; max(0, xW+b) in the formula is exactly ReLU.
        3. Apply dropout to randomly zero elements.
        4. Apply the second linear layer w_2 and return the result.
        """
        return self.w_2(self.dropout(self.w_1(x).relu()))

2.2 llama3

The implementation of llama3 is as follows, which uses distributed linear layers such as ColumnParallelLinear and RowParallelLinear. From the llama source code, it can be seen that it has three w parameters that need to be trained.

class FeedForward(nn.Module):
    def __init__(
        self,
        dim: int,
        hidden_dim: int,
        multiple_of: int,
        ffn_dim_multiplier: Optional[float],
    ):
        super().__init__()
        hidden_dim = int(2 * hidden_dim / 3)
        # custom dim factor multiplier
        if ffn_dim_multiplier is not None:
            hidden_dim = int(ffn_dim_multiplier * hidden_dim)
        hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)

        self.w1 = ColumnParallelLinear(
            dim, hidden_dim, bias=False, gather_output=False, init_method=lambda x: x
        )
        self.w2 = RowParallelLinear(
            hidden_dim, dim, bias=False, input_is_parallel=True, init_method=lambda x: x
        )
        self.w3 = ColumnParallelLinear(
            dim, hidden_dim, bias=False, gather_output=False, init_method=lambda x: x
        )

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

0x03 The function of FFN

The paper “Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth” found that without residual connections and FFNs, stacking multiple self-attention layers quickly causes the rank of the model’s output to collapse: all representations converge to a single vector, rendering the model unusable. Therefore, attention, FFN, and residual connections can be considered the three pillars of the Transformer architecture, each with its own indispensable function. Having understood the importance of the FFN, let’s look at some of its roles:

Extract more semantic information.
Increase expressive ability.
Store knowledge.

We will analyze them one by one next.

3.1 Extracting more semantic information

Let’s look at another question: Why does FFN first increase the dimension and then decrease it? The specific analysis is as follows:

LLM, through pre-training in its self-constructed high-dimensional language space, records a massive amount of human language instances, extracting countless structural and relational information. We can understand this high-dimensional language space, along with the structural and relational information extracted during training, as the “brain” of LLM. FFN is the key module for information extraction. In the process of mapping input word vectors to output word vectors, FFN performs a mixing operation on the information learned from multi-head attention to extract richer semantic information. This mixing operation specifically consists of two steps:

Dimensionality increase. Its main function is to fit a higher-dimensional mapping space, thereby improving the model’s expressive power and fitting accuracy.
- The first linear layer and activation function combination can be viewed as learning a set of basis functions. Each neuron can be considered a simple classifier used to approximate a high-dimensional mapping of the input data. Dimensionality increase effectively expands the network’s degrees of freedom, enabling the model to learn more feature representations and thereby improving its fitting ability. From the perspective of one-dimensional convolution, dimensionality increase extracts more features. For example, the Harvard code effectively uses 2048 convolution kernels of shape [1, 512] to extract features.
- Dimensionality increase maps the input word vectors to a higher-dimensional feature space. FFN doesn’t simply model directly on the embedding space of the input dimension; instead, it fits a high-dimensional mapping space through a series of linear transformations. Theoretically, using only linear bases, we only need the number of bases equal to the input dimension. However, the space composed of all possible smooth mappings is infinite-dimensional, thus requiring dimensionality increase to fully represent this space.
Dimensionality reduction. Its main function is to restore dimensionality and limit computational complexity.
- Dimensionality reduction can restore the original dimensions, allowing the next layer to continue computation, thus ensuring that the encoder layer and decoder layer can be stacked.
- Dimensionality reduction can condense features and prevent overfitting. Although increasing dimensionality brings more feature representations, a larger hidden dimension (or number of key-value pairs) is not necessarily better. Too many hidden dimensions can lead to information bottlenecks and overfitting, and may even make the model unable to effectively convey information.
- Dimensionality reduction can limit computational complexity. While increasing dimensionality helps capture more information, theoretically, it requires an infinite number of degrees of freedom to represent a complete, smooth mapping. However, in practice, we cannot have unlimited computational resources, so dimensionality reduction must be used to control the size and computational complexity of the network. Dimensionality reduction effectively controls the complexity of the model by mapping high-dimensional representations back to lower-dimensional spaces.

3.2 Enhance expressive ability

Nonlinear features in the Transformer architecture have a significant impact on the capabilities of the Transformer model. Enhancing nonlinearity can effectively mitigate the feature collapse problem and improve the expressive power of the Transformer model.

Attention is essentially a linear transformation of the values. Although the attention weights are computed with a non-linear softmax, no non-linear transformation is applied to the values themselves. Each attention calculation is a weighted average of the value vectors. Even with multiple self-attention layers stacked, each layer still only takes weighted averages of value vectors, so non-linear features of the values cannot be produced. For given attention weights, the stacked operation therefore remains linear in the initial input x and is not fundamentally different from a single-layer transformation. This limits the hypothesis space and prevents the model from fully exploiting the advantages of multi-layer representations.
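
The linearity in the values can be illustrated with a tiny sketch. It holds the attention weights fixed, which is an assumption of this example (in a real model the weights themselves depend on the input), and contrasts the result with ReLU, which is not additive.

```python
import torch

torch.manual_seed(0)
seq_len, d = 4, 8                                     # arbitrary example sizes

# Fixed attention weights: each row is a probability distribution.
A = torch.softmax(torch.randn(seq_len, seq_len), dim=-1)
V1, V2 = torch.randn(seq_len, d), torch.randn(seq_len, d)

# A weighted average is linear in the values: A(V1 + V2) = A V1 + A V2.
attn_is_additive = torch.allclose(A @ (V1 + V2), A @ V1 + A @ V2, atol=1e-6)

# An activation function such as ReLU breaks this additivity.
relu_is_additive = torch.allclose(torch.relu(V1 + V2), torch.relu(V1) + torch.relu(V2))

print(attn_is_additive, relu_is_additive)
```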

The activation function in FFN is a key unit that provides nonlinear transformations. It enhances feature learning capabilities. The introduction of nonlinear activation functions breaks the limitations of linear models, allowing for more complex transformations of the data. Dimensionality reduction maps the increased dimensionality back to the original dimension, combining these nonlinear features into the final output. This operation enhances the model’s expressive power, enabling it to represent more complex functional relationships. This is why FFN is essential; in other words, FFN provides the simplest nonlinear transformations.

3.3 Storing Knowledge

The powerful capabilities of large language models rely heavily on their knowledge retention: for example, if a model wants to answer “What is the capital of China?”, it must, in some sense, remember that “the capital of China is Beijing.” Transformers do not have an external explicit database; the knowledge retention is implicitly expressed in the parameters. This knowledge retention can be universally computed through two fundamental capabilities: recursive state maintenance and reliable history access.

Most of the knowledge or information actually learned is stored in the FFN. From a certain perspective, the FFN can be compared to a key-value pair storage structure. The first linear layer generates the “keys,” that is, it calculates a set of recall weights for each token. The second linear layer calculates the “values” and performs a weighted sum with the recall weights. This approach is similar to improving the network’s long-term memory capacity through large-scale memory storage (dimensionality increase).
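
The key-value reading can be written out explicitly. In the sketch below, each hidden neuron i has a key vector k_i (a column of the first weight matrix) and a value vector v_i (a row of the second), and the FFN output is the sum of values weighted by ReLU(x · k_i). Biases are omitted for simplicity, and the matrix names `K` and `V` are my own labels.

```python
import torch

torch.manual_seed(0)
d_model, d_ffn = 16, 64                 # arbitrary example sizes

K = torch.randn(d_ffn, d_model)         # one key per hidden neuron
V = torch.randn(d_ffn, d_model)         # one value per hidden neuron
x = torch.randn(d_model)                # representation of a single token

# Matrix form: FFN(x) = ReLU(x K^T) V
out_matrix = torch.relu(x @ K.T) @ V

# Memory form: recall weight for each key, then a weighted sum of values.
recall = torch.relu(K @ x)              # (d_ffn,), zero for keys that do not match
out_memory = sum(w * v for w, v in zip(recall, V))

print(torch.allclose(out_matrix, out_memory, atol=1e-4))
print("active memory slots:", int((recall > 0).sum()), "of", d_ffn)
```

Increasing `d_ffn` adds more key-value slots, which is the sense in which dimensionality increase enlarges the memory.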

However, the FFN’s storage is distributed, or rather polysemantic: a single neuron can respond to seemingly unrelated inputs. Features are causally related to the output, but features do not correspond one-to-one to neurons. One theory for the cause of this polysemanticity is the superposition hypothesis: neural networks represent more independent features than they have neurons by storing features as linear combinations of neurons. If we consider each feature as a vector over the neurons, then these features form an overcomplete basis in the activation space. Features that contribute to model performance, if they occur sparsely in the training data, will naturally end up in superposition during training. Similar to compressed sensing, given any vector in the activation space, sparsity allows the model to resolve the ambiguity caused by superposition. Furthermore, models trained with cross-entropy loss tend to represent more features polysemantically rather than fewer “true features” monosemantically, even when sparsity constraints make superposition impossible.

Since FFN is a knowledge storage module, it means that it is difficult to compress and speed up because: if the FFN becomes smaller, it means that the model capacity becomes smaller, which leads to a deterioration in model performance. Moreover, the low-rank activations in the middle of the FFN are difficult to detect, making it impossible to speed up.

3.4 Increase the number of parameters

Emergence in large models is a complex and fascinating topic. Its occurrence is primarily related to the number of parameters. When the training parameters of a large model reach a certain scale, the interactions between the various components within the model begin to emerge. These interactions gradually strengthen with the increase in the number of parameters, potentially leading to a significant improvement in the overall performance of the model—this is the emergence phenomenon. Therefore, the number of parameters is crucial for large models. The number of parameters in a language model determines its ability to learn and store information during training. More parameters typically allow the model to cover more dimensions of knowledge, capture more complex patterns and nuances, thereby improving performance on language tasks.

One advantage of using an FFN instead of an RNN is that it can avoid parameter sparsity. CNNs and RNNs both share parameters. This parameter sharing may have certain benefits on simple tasks, but it may not bring advantages on complex tasks. By contrast, a densely connected FFN has a greater advantage. Dense connections mean more parameters, and more parameters at least increase the amount of information the model can carry.

Ignoring word embedding layers, in a transformer architecture model, FFN and Attention parameters account for the vast majority of the model’s parameters, exceeding 90%. The ratio of FFN to Attention parameters is approximately 2:1. Alternatively, it can be said that the feedforward layer comprises about two-thirds of the model’s parameters. We can quickly obtain the answer using PyTorch.

import torch.nn as nn

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

d_model = 512
n_heads = 8
multi_head_attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads)
print(count_parameters(multi_head_attention))  # 1050624
print(4 * (d_model * d_model + d_model))  # 1050624
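
The same counting can be extended to the FFN to check the roughly 2:1 ratio. This sketch assumes the original Transformer sizes (d_model = 512, d_ff = 2048) and includes biases; the `nn.Sequential` stand-in for the FFN is my own construction.

```python
import torch.nn as nn

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

d_model, d_ff, n_heads = 512, 2048, 8

# Stand-in for the standard FFN: linear + ReLU + linear
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads)

ffn_params = count_parameters(ffn)          # 2 * d_model * d_ff + d_ff + d_model
attn_params = count_parameters(attention)   # 4 * (d_model * d_model + d_model)

print(ffn_params, attn_params, ffn_params / attn_params)
```

With an intermediate ratio of 4, the FFN weights amount to 8·d_model² against 4·d_model² for attention, which is where the 2:1 ratio comes from.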

3.5 Summary

Finally, let’s summarize why we need to distinguish between MHA and MLP in the Transformer model. The reason is that these two core components each have their own roles while also working together. The Transformer uses embedding to resolve undefined concepts, and uses MHA+FFN to handle computations and changes that cannot be expressed using existing operators.

MHA considers the semantics and dependencies of words in different positions and uses this information to capture the internal structure and representation of sentences. MHA is the most outstanding one in Transformer.
FFN allows models to leverage contextual information generated by attention mechanisms and further integrate and transform this information to capture more complex relationships in the data, adding depth and complexity to the learning process. FFN also provides a place to store knowledge. FFN is the unsung hero of the Transformer model. These two work together to improve the model’s performance.

0x04 Knowledge Utilization

Since we’ve mentioned that FFN is used to store knowledge, let’s do some further analysis.

Knowledge is defined as the cognition and understanding of facts, concepts, etc. Mastering knowledge has always been a core pursuit in the development of artificial intelligence systems. In today’s rapidly developing AI landscape, LLMs (Large Language Models) have demonstrated astonishing capabilities and are often regarded as virtual knowledge bases supporting knowledge-oriented tasks. In other words, the outstanding performance of the Transformer is partly attributed to the rich information stored in its massive parameters, including but not limited to linguistic knowledge, common sense, arithmetic knowledge, and world knowledge. However, behind this apparent performance, the laws governing how LLMs learn, store, and utilize knowledge, as well as the dynamic evolution of that knowledge, remain unsolved mysteries. For example, for a question like “In which city was Liu Xiang born?”, we cannot determine whether the model truly understands the concept it is dealing with and arrives at the answer through internal knowledge and logical reasoning, or simply outputs an answer by superficial statistical pattern matching because the question appeared in the training set. Therefore, we need to explore the inherent laws governing concept formation, alignment, and cognitive mechanisms in language models, as well as the mechanisms by which LLMs store and manage factual knowledge.

Furthermore, while LLMs possess immense potential, directly viewing them as next-generation knowledge bases still presents certain limitations, often manifesting as inaccurate or erroneous outputs in practical applications. An ideal knowledge base not only stores vast amounts of information but also allows for efficient and targeted updates to correct these errors and improve accuracy. To bridge this gap, the domain of knowledge editing for LLMs has emerged. This approach aims to efficiently improve the performance of LLMs in specific domains while maintaining their overall performance in handling general inputs.

Next, we will look from several perspectives at how the model can effectively retrieve, process, and utilize learned information within the complex architecture of the Transformer, that is, how to better leverage knowledge, specifically including:

Memory refers to how a model stores knowledge.
Location refers to how the model recalls basic knowledge.
Modification refers to how the model modifies certain stored knowledge.

4.1 Extraction Steps

Let’s first look at the steps of knowledge extraction, and different papers have proposed different ideas and solutions.

The paper “Dissecting Recall of Factual Associations in Auto-Regressive Language Models” reveals the internal mechanism of attribute extraction through information flow analysis. Let’s illustrate this with an example: suppose the input prompt is “Beats Music is owned by”; the correct answer returned by the LLM should be “Apple”. Like many approaches, this paper abstracts knowledge into triples (s, r, o), where s is the head entity (subject), o is the tail entity (object), and r is the relationship between them. We first identify the key points: relations and entities. In this example, “Beats Music” is the entity, “is owned by” is the relation, and “Apple” is the attribute corresponding to this entity. Then, by analyzing the information flowing from these points, we can identify the three steps of attribute extraction as follows:

Incorporating information. After processing by the MLPs of the early layers, the representation at the last subject position (Music) incorporates many subject-related attributes, such as information about Beats.
Relationship propagation. The first few layers of the model propagate the information of the queried relation r to the last token position (by) of the entire input.
Attribute extraction. The last position (by) already integrates the information of the word “own”. At this point, the attribute “Apple” corresponding to “Beats Music” is extracted using an attention mechanism (using relation r).

1313

The paper “A mechanism for solving relational tasks in transformer language models” divides the process of a language model completing a factual recall task into two stages:

Parameter Formation: When we ask the model “What is the capital of France?”, the answer decoded from the residual stream will first form the queried country, i.e., “France”. This process can be likened to the model forming an implicit function similar to “get_capital(x)”, and the process of the residual stream gradually forming the information of “France” can be likened to the model forming the parameters of the implicit function.
Function application: As the number of layers continues to increase, the high-probability tokens decoded from the residual stream transition from the country to the name of its capital city, i.e., “Paris”. This change can be likened to the model applying the implicit function “get_capital(x)”.

These observations provide a valuable foundation for research, but many questions remain unanswered, which are crucial for further understanding the mechanisms of factual recall in language models:

How does the model perform “parameter passing”?
How exactly are implicit functions applied? How does MLP work in this process?
The paper mainly focuses on the one-shot setting; how does the model work in zero-shot or few-shot scenarios?

The paper “Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models” further explores this topic, summarizing the process by which language models complete factual recall tasks in zero-shot scenarios as follows:

Attention heads pass parameters to an “implicit function.” First, the task semantics formed at the shallow level activate some mid- to deep task-specific attention heads. These attention heads have QK matrices that are sensitive to tokens associated with a specific subject (e.g., country name). They focus on these subject tokens and move them to the end of the residual stream. This mechanism allows the model to extract “parameters” from the context and pass them to the “implicit function.” Furthermore, the OV matrices of some attention heads can directly map the “parameters” to the desired output without further processing by subsequent MLPs. This mapping can be seen as completing a partial “function application.”
The MLP is the “activation function” for the attention head outputs. The MLP following the attention head acts as the “activation function” for each head output, causing the “parameters” passed to the task-specific head to stand out in the residual stream. Since the outputs of all attention heads are summed together with equal weight before being added to the residual stream, the MLP can erase or amplify the outputs of individual heads by generating vectors that align with or oppose the head output directions.
Parameter application is " $b+r_{mid}$ ". In the output of an MLP, a “task-dependent” component, namely the intercept term, performs “function application” when added to the residual stream, steering the residual stream in the direction that the MLP deems the correct output. We can decompose a task-aware component from the MLP output; this component, when added to the residual stream, changes the direction of the residual stream. The MLP can use this component to steer the residual stream towards the unembedding vector of the target answer. Adding this component to the residual stream can be considered the basic implementation of “function application.”
In addition, the last layer of the model generally has an anti-overconfidence mechanism. Whether it is the attention head in the last layer of the model or the MLP, it guides the output of the model in the direction of “high frequency” or in other words “safe”. In this way, even if the model makes a wrong prediction, the expected loss obtained from the perspective of the entire training set is still relatively small.

The figure below illustrates the key mechanism of fact recall used in Transformer-based language models.

Figure (1) shows the attention heads related to a specific task. $A_{l,1}$ moves the subject entity (i.e., “France”) to the final position in the residual stream.
As shown in Figure (2), the MLP takes “France” as the parameter of the implicit function “get_capital(X)”. Its output redirects the residual stream in the direction of its expected answer, which is “Paris” in this case.
As shown in Figure (3), the output of the MLP can erase or amplify the output of a single head in the residual stream. In this case, the output of $A_{l,1}$ is amplified, while the outputs of other heads are erased.

1314

4.2 Knowledge memorization

The purpose of knowledge memorization is to remember and recall knowledge, such as specific terms, grammar, and concepts. Academician Wang Jian stated, “Memory is about reshaping the connections between neurons; memory preferences are changes in the relationships between neurons. For today’s large language models, this means a change in weights.” Existing research is dedicated to revealing the behavioral mechanisms of LLMs, particularly the knowledge storage patterns within them. The following figure provides a brief summary.

1315

Let’s first look at some typical ideas and research. In current work, two directions are particularly important:

Key-value pairs. This approach posits that facts are stored in MLPs as key-value pairs. Based on this, methods such as knowledge editing, machine unlearning, and detoxification are used to modify the MLP layers of the model to mitigate and correct its defects.
Knowledge circuits. This approach posits that knowledge is not stored in a single, isolated area, but rather is composed of various components working together.

Key-value pair form

The image below illustrates the concept of key-value pairs. Next, we’ll look at some important papers in this field.

1316

Memory Network

In 2015, the paper “End-To-End Memory Networks” presented an end-to-end trainable memory network, a key-value memory structure that allows memory modules to be added to neural networks. The paper maps the information to be stored into key and value vectors, then models the conditional probability of a key given an input x using the exponential of the vector dot product. The output of the memory network is thus a weighted sum of the values, weighted by how well each key matches the input. The model architecture is shown in the figure below. Each text is mapped to two vectors, $m_i$ and $c_i$ , and the query is encoded as an internal state u. In the embedding space, the model computes the match $p_i$ between u and each $m_i$ using an inner product followed by a softmax. The final output vector o is the sum of the $c_i$ weighted by $p_i$ .

1317

We further abstract the model structure; given input x and key k, we have $x, k_i \in R^d$ . The structure of the memory network is then:

$MemoryNet(x)=softmax(x\cdot K^{\top})\cdot V$

The specific details are shown in the image below.

1318

We already know that the formula for FFN is

$FFN(H)=f(H\cdot W_1)W_2$

Here $f$ is the ReLU activation function. It can be observed that memory networks and attention mechanisms are very similar, and that the FFN, viewed as a key-value memory, is almost identical to a memory network: $W_1$ plays the role of $K^{\top}$ and $W_2$ plays the role of $V$. The only difference is that memory networks use softmax for normalization, while FFN uses ReLU for filtering and does not require normalization. Assuming an FFN layer is a key-value memory, each key vector $k_i$ can capture a pattern of the input sequence, and the corresponding value vector $v_i$ can represent the distribution of tokens that follow this pattern.
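
The correspondence between the two formulas can be checked numerically. The following is a minimal NumPy sketch with toy sizes and random matrices (all names and shapes are illustrative assumptions, and biases are omitted as in the formula above): it computes the memory network read, the FFN output, and the same FFN output written as a per-slot weighted sum of values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_m = 8, 32                    # hidden size and number of memory slots (toy values)
x = rng.normal(size=d)            # one input vector
K = rng.normal(size=(d_m, d))     # keys, one per row
V = rng.normal(size=(d_m, d))     # values, one per row

def softmax(z):
    z = z - z.max()               # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Memory network: softmax-normalized coefficients over all slots.
mem_out = softmax(x @ K.T) @ V

# FFN without bias: W_1 = K^T, W_2 = V, with ReLU instead of softmax.
W1, W2 = K.T, V
ffn_out = np.maximum(0.0, x @ W1) @ W2

# The same FFN output written slot by slot: sum_i m_i * v_i.
m = np.maximum(0.0, x @ K.T)      # unnormalized memory coefficients
slot_sum = sum(m[i] * V[i] for i in range(d_m))

print(np.allclose(ffn_out, slot_sum))  # True
```

Slots whose key has a negative dot product with the input get a coefficient of exactly zero under ReLU, which is the "filtering" behavior described above; softmax, by contrast, assigns every slot a positive weight.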

Key-Value

Based on the above information, the papers “Transformer Feed-Forward Layers Are Key-Value Memories” and “Knowledge Neurons in Pretrained Transformers” also conducted in-depth research and found that FFN does indeed memorize and store some patterns or knowledge. Some of the relevant arguments are as follows:

The FFN in the Transformer architecture is highly similar in form to a memory network; both are two-layer key-value memories. The weights of the first FFN layer, $W_{fc}^{(l)}$ , correspond to the KEY in the key-value pairs (KEY-VALUE) of the memory network, while the weights of the second layer, $W_{proj}^{(l)}$ , correspond to the VALUE. The intermediate layer dimension corresponds to the number of memory slots (which may explain why the intermediate layer dimension needs to be large).
The memories learned by FFN have a certain degree of interpretability. The keys of the feedforward network capture a certain pattern of input, or in other words, each key is highly correlated with at least one human-understandable input pattern. The stored patterns are derived from the training data.
Each KEY neuron triggers a shallow input pattern that is understandable to humans, and the corresponding VALUE neuron stores the output probability of the next word.
VALUE can predict the distribution of the next output word based on the pattern captured by the KEY. In other words, the word that follows the pattern associated with a KEY will appear with high probability in the distribution of the corresponding VALUE.
The output of each layer is equivalent to merging hundreds or thousands of activated memory distributions, ultimately forming a completely new distribution. The predictions of this distribution are continuously corrected and refined through the residual connections in each layer, until the last layer. This process ultimately produces the model’s prediction results. The final output of FFN can be understood as a sum of value vectors weighted by their activations.
Shallow layers tend to detect shallow patterns, while higher layers tend to detect semantic patterns.

Below is the detailed KV structure of FFN derived in the paper. The first layer of FFN can be considered as the KEY, and the second layer can be considered as the VALUE.

1319

The diagram below shows the process from pattern to value.

1320

We will now analyze these arguments in detail.

Key patterns

Regarding key patterns, the paper “Transformer Feed-Forward Layers Are Key-Value Memories” also conducted research. The authors annotated a batch of sentences that trigger each sampled key, requiring a pattern to meet the following conditions: it repeats at least three times, it is describable, and it is either a shallow pattern (repeated word/phrase n-grams) or a semantic pattern (a recurring topic). Through experiments, the authors found that each key vector corresponds to at least one human-interpretable pattern. Low-level key vectors tend to capture shallow patterns (e.g., ending with a certain word), while high-level keys tend to capture abstract semantic patterns (e.g., sentence classification). This finding is similar to that in CNNs, where lower layers tend to capture surface-level image features, while higher layers tend to capture abstract features. It is also similar to findings from papers like ELMo in the NLP community.

Furthermore, the authors conducted experiments on the effects of removing tail words and removing head words. Compared to high-level keys, the memory coefficients of the shallow patterns at the lower levels were more sensitive to “removing tail words.” This corroborates the conclusion that high-level and low-level keys focus on different levels of pattern abstraction.

1321

Value vectors represent distributions

The value vector of the memory network represents the distribution of the output vocabulary, and it tends to complete the next word corresponding to the prefix of the key. Its specific characteristics are as follows:

For each key $k_i^l$ , the corresponding value $v_i^l$ , that is, the i-th row of the second parameter matrix of the FF layer, can be regarded as a distribution over the output vocabulary and can also be seen as a complement to the pattern captured by $k_i^l$ .
Directly multiply $v_i^l$ by the embedding matrix E of the output vocabulary (assuming the same vocabulary matrix is used in each layer of the model), and then apply softmax:

$p_i^l = softmax(v_i^l \cdot E)$

This converts the values into distributions over the output vocabulary. Note that $p_i^l$ is not calibrated and is not a true vocabulary distribution. This is because FF contains two parameter matrices; the first parameter matrix yields the memory coefficients:

$m_i^l = f(x^l \cdot k_i^l)$

In the actual forward pass this coefficient scales the value, giving $m_i^l \cdot v_i^l$ , whereas here we use $v_i^l$ directly to obtain the vocabulary distribution.

For each layer, take the top-ranked token $argmax(p_i^l)$ and compare it with $w_i^l$ , the next token of the prefix sequence that triggers the highest $m_i^l$ . When the two agree, the vocabulary distribution obtained from the value follows the pattern captured by the key.
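
The projection and comparison described above can be sketched as follows. This is a toy NumPy illustration under stated assumptions: `E`, `V`, and the "next tokens" `w` are random placeholders rather than weights or data from a real model, so the resulting agreement value carries no meaning beyond showing the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, d_m = 16, 50, 64        # toy sizes
E = rng.normal(size=(d, vocab))   # output embedding matrix, assumed shared across layers
V = rng.normal(size=(d_m, d))     # value vectors v_i of one layer, one per row

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

P = softmax(V @ E)                # p_i = softmax(v_i . E), shape (d_m, vocab)
top_token = P.argmax(axis=-1)     # argmax(p_i) for every memory slot

# Hypothetical next tokens w_i of each key's top trigger prefix (random placeholders).
w = rng.integers(0, vocab, size=d_m)

# Fraction of slots whose top-ranked value token matches the key's next token.
agreement = float((top_token == w).mean())
```

Because the memory coefficient is ignored here, each row of `P` is only a rough, uncalibrated view of what a value promotes, as noted above.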

Distributed storage and memory aggregation

So far, we’ve been discussing specific key-value pairs. But we know that a memory network is a weighted sum (with memory coefficients) of all value vectors (plus a bias term). If the value vectors represent distributions in the word space, how is this information aggregated into a final distribution?

Research indicates that knowledge in the brain is not stored in a single place, nor is it stored everywhere like a hologram. Knowledge about an object is distributed across thousands of cortical columns. For example, Karl Lashley proposed a nonlocalization conclusion in the early 20th century: there are no dedicated memory organs in the brain; information is not stored in specific filing cabinets, but rather distributed among neurons. This conclusion has been proven largely correct by later improved experimental protocols.

Similar to the human brain, knowledge in a feed-forward network (FFN) is also stored in a distributed manner, with the weights that store a specific pattern distributed across different layers. An FFN does not simply activate one key and its value; its output is a weighted sum of multiple values. The output of each layer is in turn the sum of the FFN output and the residual.

Knowledge circuits

The paper “Knowledge Circuits in Pretrained Transformers” discovered knowledge circuits in the Transformer architecture. Knowledge circuits view the language model as a computational graph whose nodes are components (input, output, attention heads, MLPs), connected by edges (the residual stream) along which information flows between these components. Compared to knowledge editing, which focuses on the storage of knowledge, knowledge circuits focus more on the flow of information.

The diagram below shows the circuit the model traverses when completing the prompt “The official language of France is”. For this circuit, based on the fact triple “(France, official language, French)”, the model completes the sentence “The official language of France is”, thus predicting that the object is French. In the circuit, a node like MLP14 represents the MLP of the 14th layer; L18H14 represents the 14th attention head in the 18th layer, and the brown lines connecting the nodes represent the information flow between them. By ablating the edges between nodes (i.e., setting them to 0), we can determine whether an edge is a key edge for the knowledge; by retaining the important edges, we can construct the circuit for this fact.

1322
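
The edge-ablation idea can be illustrated on a toy graph. The sketch below is not the paper's procedure: the node names, edge weights, scalar "signal", and threshold are all hypothetical, and a real experiment would measure the model's performance on the target token instead of a scalar passed through weighted edges.

```python
def run_graph(edges, order, source_value=1.0, ablated=frozenset()):
    """Propagate a scalar signal through a toy DAG; ablated edges contribute 0."""
    values = {order[0]: source_value}
    for node in order[1:]:
        values[node] = sum(
            weight * values[src]
            for (src, dst), weight in edges.items()
            if dst == node and (src, dst) not in ablated
        )
    return values[order[-1]]

# Nodes in topological order; names are placeholders for heads and MLPs.
order = ["input", "head_a", "head_b", "head_c", "mlp", "output"]
edges = {
    ("input", "head_a"): 1.0,
    ("input", "head_b"): 1.0,
    ("input", "head_c"): 0.01,
    ("head_a", "mlp"): 0.8,
    ("head_b", "mlp"): 0.6,
    ("head_c", "mlp"): 0.5,
    ("mlp", "output"): 1.0,
    ("input", "output"): 0.02,
}

baseline = run_graph(edges, order)
threshold = 0.1  # assumed minimum score drop for an edge to count as important

# Keep an edge if zeroing it alone lowers the output score by more than the threshold.
circuit = {
    edge for edge in edges
    if baseline - run_graph(edges, order, ablated=frozenset([edge])) > threshold
}
print(sorted(circuit))
```

In this toy graph the edges through `head_c` and the direct `input -> output` edge barely change the score when removed, so they are left out of the retained circuit.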

The paper conducted experiments, decoding the intermediate outputs of each layer and observing the prediction results. Regarding the fact that “The official language of France is French,” the following figure shows the rank and probability of the target entity at the last subject token position and the last token position. The explanations of the symbols in the figure are as follows:

The Target Entity at Last Position indicates the predicted rank of the word “French” at the “is” position. The lower the value, the higher the rank.
The Target Entity at Subject Position indicates the rank of the word “French” in the logits decoded at the “France” position. The lower the value, the higher the rank.
The Prob. of Entity indicates the probability assigned to the target entity; the higher the value, the more likely the entity.

1323
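
Decoding an intermediate output into a rank and a probability can be sketched as follows. This is a simplified NumPy illustration with random stand-ins: `W_U`, `hidden`, and the token id `target` are hypothetical, and the final layer normalization that real models apply before unembedding is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d, vocab = 6, 16, 100
W_U = rng.normal(size=(d, vocab))         # unembedding matrix (toy)
hidden = rng.normal(size=(n_layers, d))   # residual stream at one position, one row per layer (toy)
target = 42                               # hypothetical id of the target token, e.g. "French"

def rank_and_prob(h, target):
    """Decode one hidden state and return the target token's rank and probability."""
    logits = h @ W_U
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    rank = int((logits > logits[target]).sum()) + 1   # 1 means top prediction
    return rank, float(probs[target])

results = [rank_and_prob(h, target) for h in hidden]
for layer, (rank, prob) in enumerate(results):
    print(layer, rank, round(prob, 4))
```

Running this per layer at the last subject position and at the last token position would give the two rank curves and the probability curve described for the figure.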

As can be seen from the above figure, the probability of the target entity begins to gradually increase after the MLP17 layer. From the network diagram above, we can see that the edges connecting MLP17 are (L14H13 → MLP17), (L14H7 → MLP17) and (L15H0 → MLP17). Therefore, this paper concludes that different attention heads play different roles.

Attention head L14H13 is a relation head that focuses on relation tokens within a context. The output of this head is tokens related to the relation, such as “language” and “Language”.
Attention head L14H7 is a mover head that moves information from the subject position “France” to the last token.
The MLP17 layer combines the information provided by the previous components to raise the rank of the target token.

Attention module

Furthermore, the attention module plays a crucial role in storing relational knowledge. This indicates that when analyzing and modifying knowledge in LLMs, it’s essential to consider not only the MLP layers but also the role of the attention mechanism. For instance, the paper “EXBERT: A Visual Analysis Tool to Explore Learned Representations in Transformer Models” explains the knowledge learned by each attention head. Specifically, attention heads store explicit linguistic features, positional information, and more. Additionally, factual information and biases are also transmitted through the attention heads.

1324

The figure above shows the effects of different attention heads on the pre-trained model BERTbase and different corpora.

(a) shows that attention head 5-3 predicts that the masked word should be a verb by using the auxiliary verb (AUX) “to”.
(b) shows that attention head 7-5 found the relationship between the preposition (PREP) and its object (POBJ) in the input sentence.
(c) shows that attention head 5-5 learned about the co-reference of entity relationships because both “she” and “her” explicitly point to “Kim”.

4.3 Localization of Knowledge

Besides knowledge storage, some research has begun to explore knowledge retrieval and utilization from the perspective of network architecture or attention mechanisms.

Localization of facts

The localization of factual knowledge can be divided into two steps: knowledge attribution and knowledge neuron refining.

Knowledge Attribution

The paper “Axiomatic Attribution for Deep Networks” proposes using integrated gradients to calculate the attribution of each feature to the output, thereby explaining the relationship between model predictions and input features.

An important property of the integrated gradients method is that the sum of all attributions equals the difference between F(x) and F(x’). The formula is as follows, where the function F represents the neural network. If F(x’) = 0 and F(x) = 1, then the attribution of each feature can be considered as its contribution to the sample belonging to label = 1.

1325

The attribution definition for deep networks is shown in the figure below: Assume function F represents a deep network with input x and a baseline input x’. Then the attribution of x relative to x’ is a vector $A_F(x, x')$ , and $a_i$ is the contribution of input $x_i$ to the prediction result F(x).

1326
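
The property that attributions sum to F(x) - F(x’) can be checked on a small example. The sketch below assumes a toy network F (a single sigmoid unit with made-up weights `w` and `b`) so that the gradient is available in closed form, and approximates the path integral with a midpoint Riemann sum; it is an illustration, not the implementation used in any of the papers.

```python
import numpy as np

w = np.array([1.5, -2.0, 0.5])   # toy weights (assumed)
b = 0.1

def F(x):
    # Toy "network": a single sigmoid unit.
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def grad_F(x):
    # Closed-form gradient of the sigmoid unit with respect to x.
    s = F(x)
    return s * (1.0 - s) * w

def integrated_gradients(x, baseline, steps=200):
    # Midpoint Riemann sum along the straight path from baseline to x.
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)
    grads = np.array([grad_F(p) for p in path])
    return (x - baseline) * grads.mean(axis=0)

x = np.array([1.0, 0.5, 1.0])
baseline = np.zeros(3)
attr = integrated_gradients(x, baseline)

# Completeness: the attributions sum (approximately) to F(x) - F(baseline).
print(attr.sum(), F(x) - F(baseline))
```

Each entry of `attr` is the contribution $a_i$ of input $x_i$ relative to the baseline; for knowledge attribution the same idea is applied to neuron activations instead of input features.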

Knowledge Neuron Refining

We can use a refinement strategy to more accurately locate factual knowledge. While many “true-positive” knowledge neurons in the initial set of neurons contribute significantly to the final output, there are also many “false-positive” knowledge neurons that represent other knowledge (such as syntactic and lexical information, i.e., they represent supplementary or contextual information). Therefore, we need to filter out these “false-positive” knowledge neurons to improve the localization effect.

1327

How do we filter them? Let’s first look at the characteristics of “false-positive” neurons. For example, given several prompts describing Li Shimin, because the prompts differ in syntactic and lexical information, their “false-positive” knowledge neurons will differ. However, they all share the same factual information: Li Shimin. Therefore, we can extract the neurons shared across different prompts to locate that common factual information. Specifically, given a relational fact, the complete process of identifying its knowledge neurons is described as follows:

Construct n different prompts to express this fact.
For each prompt, calculate the knowledge attribution score of each neuron.
For each prompt, retain the neurons whose attribution scores are greater than the attribution threshold t to obtain a coarse set of knowledge neurons.
Set a sharing threshold p% (the fraction of prompts that must share a neuron).
Aggregate all the coarse knowledge neurons, and retain only those neurons that reach this sharing threshold.
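
The steps above can be sketched as a small filtering function. This is a minimal illustration under stated assumptions: the attribution scores are made-up numbers, `t` is treated as an absolute threshold, and `p` is given as a fraction in [0, 1]; the function name `refine_knowledge_neurons` is hypothetical.

```python
import numpy as np

def refine_knowledge_neurons(scores, t, p):
    """scores: (n_prompts, n_neurons) attribution scores.
    t: attribution threshold; p: sharing threshold as a fraction of prompts."""
    coarse = scores > t                      # coarse knowledge-neuron set for each prompt
    shared_fraction = coarse.mean(axis=0)    # fraction of prompts that retain each neuron
    return np.flatnonzero(shared_fraction >= p)

# Three prompts expressing the same fact, four candidate neurons (made-up scores).
scores = np.array([
    [0.9, 0.1, 0.7, 0.0],
    [0.8, 0.6, 0.1, 0.0],
    [0.7, 0.2, 0.8, 0.1],
])
neurons = refine_knowledge_neurons(scores, t=0.5, p=0.6)
print(neurons.tolist())  # [0, 2]
```

Neuron 1 passes the attribution threshold for only one prompt, so it is treated as a “false-positive” tied to that prompt’s wording and filtered out.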

Taking the following figure as an example, for a relation and its activated neuron, we input 10 prompts (containing head and tail entities) to obtain the average activation of the knowledge neuron. Then, we sort these prompts, keeping the top-2 (highest activation) and bottom-2 (lowest activation). We find that the top-2 prompts always express the corresponding relational fact, while the bottom-2 prompts, despite containing the same head and tail entities, do not express the corresponding relation. This finding indicates that knowledge neurons can capture the semantic patterns of relational facts and further validates that knowledge neurons are activated by knowledge probe prompts.

1328

Localization of relationships

The preceding discussion primarily examined knowledge in LLMs from the perspective of entities. However, if we approach the same knowledge from the perspective of relationships, we might obtain entirely different observations. Theoretically, a piece of knowledge includes entities and the relationships between them; the absence of either renders the knowledge incomplete. Therefore, in this case, entities and relationships should be equivalent, which is the premise of many current model editing efforts, as it requires modifying knowledge within the model parameters.

The paper “Does Knowledge Localization Hold True? Surprising Differences Between Entity and Relation Perspectives in Language Models” investigates the differences between entities and relations, specifically by modifying entity or relation knowledge to determine whether these changes produce consistent results, and observes the effects from two perspectives. Ideally, these effects should be identical because the edited knowledge involves the same information.

1329

The researchers raised the following research questions:

Where is relational knowledge stored? Is it stored in MLPs like entity knowledge?
Regardless of storage location, are relational and entity knowledge equally important in knowledge triples?

The paper answers the two questions as follows:

Entity and relation knowledge may be stored and represented in different ways.
- Entity and relation knowledge should not be simply stored in the same location or represented in the same way, but should be stored separately.
- The attention module also plays a crucial role in storing relational knowledge. This relational knowledge is closely related to higher-level MLP layers and mid-to-high-level attention layers.
Entity knowledge and relational knowledge are not fully interchangeable. If they were, it would be theoretically possible to modify entity knowledge by changing relational knowledge. In practice, however, editing entity knowledge and editing relational knowledge are not entirely equivalent.

Dictionary learning and sparse autoencoders

Let’s first look at a few concepts.

The linear representation hypothesis states that neural networks represent meaningful concepts (called features) as directions in their activation space. Simply put, the model’s understanding and representation of a concept can be viewed as a direction in a multi-dimensional space. Changing this direction, i.e., changing the value of the feature, alters the model’s understanding and processing of that concept.
The superposition hypothesis: Extending from the above hypothesis, neural networks utilize the existence of nearly orthogonal directions in high-dimensional space to represent more features than the dimension itself. This means that even if our model has only a finite number of dimensions, by superimposing and combining these dimensions in different directions, we can represent and understand more features and concepts.
Dictionary learning: Dictionary learning is a commonly used feature extraction method that, by learning a dictionary, represents high-dimensional data as a linear combination of elements in the dictionary. This technique draws inspiration from classical machine learning, separating recurring neuronal activation patterns across many different contexts and matching these patterns (called features) with human-interpretable concepts. In this context, the goal of dictionary learning is to unpack the neuron activations within an LLM into a small set of interpretable features. These features can then be examined to check what is happening internally when the model processes a given context.
Sparse autoencoders are a special dictionary learning method that can effectively extract key features from data by limiting the number of dictionary elements and the sparsity of their linear combinations.
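
The geometric fact behind the superposition hypothesis, that a high-dimensional space holds many more nearly orthogonal directions than it has dimensions, can be checked numerically. The sketch below uses random unit vectors as stand-ins for feature directions; the sizes and the random seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 1000                        # 1000 candidate directions in a 512-dimensional space
dirs = rng.normal(size=(n, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # normalize to unit length

cos = dirs @ dirs.T                     # pairwise cosine similarities
np.fill_diagonal(cos, 0.0)              # ignore each vector's similarity with itself
max_abs_cos = float(np.abs(cos).max())
print(max_abs_cos)
```

Even though there are about twice as many directions as dimensions, every pair stays far from parallel, so each direction can carry its own feature with only limited interference from the others.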

Based on these concepts, the paper “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning” expands the interpretability of knowledge in LLM from another perspective. Its core idea is that sparse autoencoders can extract a large number of interpretable features from a single-layer transformer model.

In a sense, the image below shows one of the simplest language models that we still do not fully understand. The paper aims to decompose its MLP activation vectors into individual features. This is achieved by appending an overcomplete autoencoder to the MLP activation; the autoencoder is used to interpret the model’s intrinsic activations (activations after the MLP layer). The number of features after autoencoder decomposition exceeds the number of neurons, and each dimension of the autoencoder’s hidden state can be considered an abstract feature with strong interpretability. This is because we believe that the MLP layer is likely to use superposition to represent more features than it has neurons (of course, not only superposition occurs, but also a non-linear mapping of features).

1330

This allows us to understand and process complex data and concepts by finding “directions” representing different concepts in a multidimensional space and by superimposing and combining these directions. Through dictionary learning and sparse autoencoders, we can effectively extract these directions, thereby better understanding and controlling the behavior of the model.
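
The shape of such an overcomplete sparse autoencoder can be sketched as follows. This is a forward pass and loss only, in NumPy, with random untrained weights and toy sizes; the parameter names, the initialization scale, and the L1 coefficient are assumptions for illustration rather than the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_features = 64, 512               # overcomplete: more features than neurons (toy sizes)
acts = rng.normal(size=(4, n_neurons))        # stand-in for a batch of MLP activations

W_enc = rng.normal(scale=0.02, size=(n_neurons, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.02, size=(n_features, n_neurons))   # rows act as dictionary directions
b_dec = np.zeros(n_neurons)

def sae_forward(x):
    # Encode into non-negative feature activations, then decode linearly.
    features = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    recon = features @ W_dec + b_dec
    return features, recon

def sae_loss(x, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes feature activations toward sparsity.
    features, recon = sae_forward(x)
    mse = ((recon - x) ** 2).mean()
    l1 = np.abs(features).sum(axis=-1).mean()
    return mse + l1_coeff * l1

features, recon = sae_forward(acts)
loss = sae_loss(acts)
```

After training, each row of `W_dec` would be one learned "direction", and the corresponding entry of `features` says how strongly that direction is present in a given activation vector.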

The figure below shows a comparison between Transformer and sparse autoencoder.

1331 1332

Some interesting conclusions from the paper are as follows.

Sparse autoencoders can extract relatively simple semantic features.
Sparse autoencoders can produce interpretable features that are effectively invisible when looking at individual neurons.
Sparse autoencoder features can be used to intervene in and guide the content generation of a transformer.
Sparse autoencoders can produce relatively universal features.
When the size of the autoencoder is increased, the features “split”.
Only 512 neurons are needed to represent tens of thousands of features. Despite the very small size of the MLP layer, we can still discover new features as the scale of the sparse autoencoder increases.
These features are interconnected in a system similar to a “finite state automaton” to enable complex behaviors. For example, we can find features that collectively generate valid HTML.

4.3 Modify Knowledge

LLMs have demonstrated remarkable capabilities in understanding and generating natural language. However, due to the massive number of parameters, their training requires significant computational power. The real world is constantly evolving, necessitating frequent updates to LLMs to remove outdated information or integrate new knowledge. This further exacerbates the computational challenge. Besides the need for frequent parameter updates to ensure continuous learning, many applications also require ongoing post-training model adjustments to address shortcomings or undesirable behaviors in pre-trained models.

Therefore, an increasing number of studies are attempting to propose efficient and lightweight methods that can modify models in real time. In recent years, knowledge editing technology, a representative approach of this type, has made groundbreaking progress in the LLM field. This technology enables LLMs to generate more accurate and relevant outputs through rapid and accurate modifications. This promises to address the current shortcomings of LLMs, thereby fully realizing their potential as dynamic and accurate knowledge bases in various downstream applications.

The diagram below illustrates several technical approaches related to knowledge editing, including parameter-efficient fine-tuning, knowledge augmentation, continual learning, and machine unlearning.

1333

The symbol ✔ indicates the presence of a specific feature in the technology, while ✗ indicates its absence. + indicates an enhancement of LLM capabilities, and - indicates a reduction or deletion of certain capabilities in the model.

As shown in the diagram above, knowledge editing intersects with other technologies, drawing on the strengths of various approaches. Knowledge editing techniques specifically target the knowledge embedded within LLMs and leverage the inherent knowledge mechanisms within these models. This is not merely about applying known techniques to new models, but rather about understanding and manipulating the subtle knowledge storage and processing capabilities of LLMs. Furthermore, knowledge editing represents a more precise and fine-grained form of model manipulation because it involves selectively altering or enhancing specific aspects of the model’s knowledge base, rather than retraining or fine-tuning the entire model. Therefore, unlike simply modifying existing methods, knowledge editing requires a deeper understanding of the functionality of LLMs. These characteristics make knowledge editing a potentially more effective and efficient technical approach for updating and optimizing LLMs to suit specific tasks or applications.

Function

For an LLM to serve as an ideal knowledge base, knowledge editing must support the following three basic functions: knowledge insertion, knowledge modification, and knowledge erasure.

Knowledge insertion. As new fields and entities emerge and develop, it is crucial to empower LLMs with the ability to absorb new knowledge. Knowledge insertion achieves this by endowing LLMs with new knowledge beyond their existing scope: i.e., 𝜃′=𝐹(𝜃,{∅}→{𝑘}).
Knowledge modification. Knowledge modification refers to changing the knowledge already stored in LLMs: 𝜃′=𝐹(𝜃,{𝑘}→{𝑘′}), which can be divided into two categories:
- Knowledge correction: fixing inaccurate information in LLMs. As vast knowledge bases, LLMs are prone to containing outdated or incorrect information; knowledge correction removes these errors so that the model produces accurate and up-to-date information.
- Knowledge interference: modifying LLMs so that they answer counterfactual or erroneous prompts. This is a more difficult task. Existing work shows that counterfactual statements score much lower in LLMs than factual knowledge and are therefore far less likely to be generated, so more targeted modifications are required.
Knowledge erasure. Knowledge erasure is the removal of existing knowledge from a model, primarily to reset facts, relationships, or attributes: 𝜃′=𝐹(𝜃,{𝑘}→{∅}). Implementing knowledge erasure is crucial for eliminating biased and harmful knowledge and helps limit the replay of confidential or private data, thus enabling responsible and trustworthy AI applications.

In summary, the interaction between knowledge insertion, modification, and erasure forms the basic framework for knowledge editing techniques for LLMs. When these techniques are combined, they enable LLMs to self-transform, self-correct, and ethically adapt as needed.

1334

Classification

Knowledge editing for LLMs can be mainly divided into the following categories, which correspond to the three different stages of human knowledge acquisition: identification, association, and mastery.

External knowledge dependence. Representative approaches are prompt engineering and knowledge retrieval, which correspond to the knowledge identification phase. This is analogous to the identification phase in human cognition: the model is exposed to new knowledge within a relevant context, much like how people encounter new information for the first time. For example, a large model can be given demonstration statements that contain the updated facts, enabling it to initially identify the knowledge to be edited. Alternatively, the LLM’s response can be validated through retrieval: if the retrieved facts conflict with the LLM’s output, the response is updated; otherwise, the LLM’s output is used as the final answer.
External knowledge injection. Representative approaches add parameters or replace outputs, and correspond to the knowledge association stage. This closely resembles the association stage in human cognition, in which a connection is established between new knowledge and the knowledge already in the model. Such methods typically use a learned knowledge representation to augment or replace the output or intermediate results of a large model. In general, these methods can be written collectively as $h_{final}=h+h_{know}$ . Because they combine new knowledge with the original model, the weighting of knowledge from different sources becomes a key parameter to consider. External knowledge dependence and external knowledge injection are both regarded as weight-preserving methods: they leave the original weights untouched by introducing external models, using in-context learning, or changing the representation space of the LLM. They are also called memory-based methods.
Internal knowledge editing. This approach is similar to the mastery stage in human cognitive processes, where the large model’s weights are modified and utilized to fully integrate knowledge.

The table below summarizes representative methods in the field of LLM knowledge editing. “No Training” indicates methods that do not require additional training; “Batch Edit” indicates whether a method supports editing multiple cases simultaneously. “Edit Area” refers to the model components that are edited; “Editor #Params” indicates the number of parameters that need to be updated during editing. “𝐿” is the number of layers that need to be updated, “ $𝑑_ℎ$ ” is the dimension of the hidden layer in the Transformer, and “ $𝑑_𝑚$ ” is the intermediate dimension between the up and down projections. 𝑁 is the total number of neurons updated in each individual layer. For the papers corresponding to the methods in the table, please refer to “A Comprehensive Study of Knowledge Editing for Large Language Models”.

1335

Internal knowledge editing

While external knowledge dependence and external knowledge injection methods perform well across various tasks, they leave open the question of how models store, use, and express knowledge. This brings us to internal knowledge editing (parameter updating), which corresponds to the mastery phase. In the mastery phase, the model has to learn the knowledge into its own parameters and master it autonomously.

Fine-tuning a model is the most direct way to edit its internal knowledge. However, as mentioned earlier, training the entire model requires significant computational resources and is time-consuming. Furthermore, fine-tuning is prone to catastrophic forgetting and overfitting. Currently, most research in the mastery stage updates model parameters with methods designed for specific pieces of knowledge. These methods fall into two types: meta-learning and locate-then-edit.

Meta-learning does not directly update model weights. Instead, it trains a hypernetwork to learn the change in model weights Δt. For example, the hypernetwork can be trained directly on the representation of the new knowledge. Alternatively, a new training objective can be introduced that accounts for sequential, local, and generalized model updates, so that when the hypernetwork updates the relevant internal knowledge, other knowledge remains unchanged.

Locate-then-edit first locates where the knowledge is stored within the large model, and then edits the knowledge by modifying those specific areas.

FFN

The paper “Knowledge Neurons in Pretrained Transformers” proposes a knowledge attribution method that calculates gradient change sensitivity to locate knowledge storage. Since neurons that significantly influence certain facts or knowledge can be located, the authors directly use the embedding of the target knowledge to modify the corresponding value slots, specifically including the following:

By enhancing or suppressing the values within these neurons, Transformers can improve or worsen their responses to these facts or pieces of knowledge.
Deleting these neurons makes the Transformer forget the knowledge entirely. For example, after identifying all knowledge neurons related to a piece of knowledge, a threshold m=5 is set, and these m neurons are then deleted by setting them to [UNK].
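The two interventions above can be sketched on a toy FFN. This is a minimal illustration under stated assumptions: the weights are random, the "knowledge neuron" indices are hypothetical, and suppression is modelled as zeroing an activation while enhancement doubles it. It is not the paper's attribution procedure, which is what actually finds the neurons.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 4, 16
W1 = rng.normal(size=(d_model, d_ff))  # first projection ("keys")
W2 = rng.normal(size=(d_ff, d_model))  # second projection (one "value slot" per neuron)

def ffn(x, neuron_scale=None):
    """Toy FFN whose hidden activations can be rescaled per neuron."""
    h = np.maximum(0.0, x @ W1)
    if neuron_scale is not None:
        h = h * neuron_scale
    return h @ W2

x = rng.normal(size=d_model)
knowledge_neurons = [2, 7]  # hypothetical indices, normally found by attribution

suppress = np.ones(d_ff)
suppress[knowledge_neurons] = 0.0  # suppress: zero the activations
amplify = np.ones(d_ff)
amplify[knowledge_neurons] = 2.0   # enhance: double the activations

base = ffn(x)
suppressed = ffn(x, suppress)
amplified = ffn(x, amplify)
```

Both interventions move the output along the same direction, the weighted sum of the selected value slots, only with opposite signs.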

The top of the image below shows the number of neurons required to change the second column to the third column. The bottom explains that by modifying several slots corresponding to the knowledge neurons, some knowledge can be erased. It also shows the accuracy of missing entity predictions for four different relationships before and after knowledge erasure.

1336

This approach of editing the value matrix of the FFN layer can lead to forgetting and other side effects. The paper “WISE: Rethinking the Knowledge Memory for Lifelong Model Editing of Large Language Models” therefore draws on human learning (humans acquire new knowledge continuously and progressively, and then forget old knowledge) and designs a lifelong model editing method. The method aims to update the model efficiently while avoiding catastrophic forgetting and unintended changes to other knowledge. The goal of lifelong editing is to ensure that large models, after hundreds or thousands of edits, still align with human expectations and retain their previous knowledge and capabilities. To achieve this, the paper introduces two components: an auxiliary memory module and a knowledge sharding and merging mechanism.

Auxiliary memory design. This component copies the value matrix in the model as auxiliary memory, and then edits on the auxiliary memory, thus circumventing these defects. During inference, a routing mechanism determines whether to use the auxiliary memory. If the given query falls within the scope of the previous edits, the auxiliary memory is used; otherwise, the primary memory is used.
Knowledge sharding and merging. Lifelong editing requires hundreds or even thousands of edits in the parameter space, which inevitably leads to knowledge conflicts and ultimately catastrophic forgetting. To avoid piling many edits into a single parameter space, the paper proposes copying the auxiliary memory k times and distributing n edits across the k segments, thus achieving continuous editing. The resulting memory segments contain both overlapping and disjoint elements. The paper employs the Ties-Merge merging method, using the overlapping parts as anchor points to merge the segments into a single memory.
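The routing idea behind the auxiliary memory can be sketched as follows. The routing score used here (the norm of the difference between what the auxiliary and the primary value matrices produce for a given activation), the threshold `eps`, and the toy edit are assumptions made for illustration; the paper's exact criterion and training procedure may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d_ff, d_model = 16, 4
W_main = rng.normal(size=(d_ff, d_model))  # original FFN value matrix (primary memory)
W_side = W_main.copy()                     # auxiliary memory starts as a copy
W_side[3] += 1.0                           # hypothetical edit touching value slot 3

def routed_ffn_output(a, eps=0.5):
    """Use the auxiliary memory only when the query activates the edited region."""
    score = np.linalg.norm(a @ (W_side - W_main))  # assumed routing score
    use_side = bool(score > eps)
    W = W_side if use_side else W_main
    return a @ W, use_side

a_edited = np.zeros(d_ff)
a_edited[3] = 1.0  # activation pattern that hits the edited slot
a_other = np.zeros(d_ff)
a_other[5] = 1.0   # activation pattern unrelated to the edit

out_edit, used_side_edit = routed_ffn_output(a_edited)
out_other, used_side_other = routed_ffn_output(a_other)
```

Queries outside the scope of the edit are answered from `W_main` unchanged, which is what keeps earlier knowledge intact.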

1337

Attention

The above describes knowledge editing of the FFN. In addition to knowledge editing in the FFN region, the paper “PMET: Precise Model Editing in a Transformer” also edits the attention heads, as shown in the figure below:

1338

The paper processes the outputs of MHSA and FFN separately, but during the update it only uses the output corresponding to the FFN.

Researchers observed that the MHSA component contains more variation and dynamism in its knowledge than the FFN component. This observation may mean that the internal representations and weights of MHSA need more frequent adjustments when capturing and encoding certain patterns or relationships in the input data. Based on this observation and a review of existing research, the researchers further proposed a new hypothesis: MHSA can be viewed as a “knowledge extractor.” It can not only identify patterns and relationships in the input data but also store general knowledge extraction patterns that can help models better extract and understand valuable information or knowledge from the data.

Based on this new understanding and assumption, researchers have proposed a novel optimization strategy. They argue that the functional space of the MHSA (or the hidden states of the Transformer components) can be expanded by optimizing its hidden states, thereby enabling it to better extract and store knowledge. Moreover, this optimization can be achieved without updating the MHSA weights.

ROME

Finally, let’s look at the paper “Locating and Editing Factual Associations in GPT”. This paper proposes an editing method for LLMs in which the authors use knowledge triples (s,r,o) to carry out model editing. First, through experiments, the authors found that the MLP plays a major mediating role at the last token of the subject. They therefore hypothesize that the mid-layer MLP at this position stores the associations between facts: $W_{fc}$ stores subject information, and $W_{proj}$ stores information about the relationships between facts. The authors treat $W_{fc}$ as a key and $W_{proj}$ as a value; by editing key-value pairs, they modify factual information in the LLM, thereby achieving model editing and improving generalization and portability.

1339

The paper represents each fact as a knowledge triple 𝑡=(s,r,𝑜), containing the subject s, the object o, and the relation r connecting the two. A natural-language prompt 𝑝 describing (s,r) is then provided, and the model’s prediction for 𝑜 is examined. The paper treats $W_{proj}^{(l)}$ as a linear associative memory. From this perspective, any linear operation 𝑊 can act as a key-value store for a set of vector keys K=[k1|k2|…] and corresponding vector values 𝑉=[v1|v2|…] by solving WK≈V. By solving a constrained least-squares problem, the paper derives a closed-form solution for the fully connected layer, as shown in Figure 1 above. Once (k∗,v∗) has been computed, any fact can be inserted directly. The question, then, is how to find suitable k∗ and v∗. The specific steps are as follows:

Step 1: Select k* to choose the subject. Given the decisive role of the MLP input at the final subject token, we select the input at the last token of the subject as the lookup key k*. Specifically, we compute k* by collecting activations: the text 𝑥 containing the subject s is passed through the language model 𝐺; then, at layer 𝑙* and at the index 𝑖 of the last subject token, we read the value after the non-linearity inside the MLP. Because this state varies with the tokens that precede s in the text, we set k* to the average over a small set of texts ending with the subject s. See label 2 in the diagram above.
Step 2: Select v* to recall facts. Next, we want to select a vector value v* that encodes the new relation (r, 𝑜*) as an attribute of s. The procedure is as follows, as shown at label 3 above.

The first term (Equation a) seeks a vector z such that, when z is substituted for the MLP output at token i (the end of the subject), the model predicts the target object o* for the prompt p.

The second term (Equation b) minimizes the KL divergence between the predictions for the prompt p’ (of the form “{subject} is a”) and those of the unchanged model. This optimization does not directly change the model weights, and it helps preserve the model’s understanding of the subject’s essence. The optimization yields the vector v*: if the target MLP module outputs v*, then v* represents the new attribute (r, o*) of the subject s.

Step 3: Insert Facts. Once we have computed the pair (k*, v*) representing the complete fact (s, r, 𝑜∗), we apply equation 1 in the diagram above, which updates the MLP weights $W_{proj}^{(l)}$ with a rank-one update that directly inserts the new key-value association.
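Step 3 can be checked numerically with a small NumPy sketch. Since the update equation itself lives in the figure, the closed form below, $\hat{W}=W+\Lambda (C^{-1}k_*)^{\top}$ with $C=KK^{\top}$ and $\Lambda=(v_*-Wk_*)/((C^{-1}k_*)^{\top}k_*)$, is taken as an assumption about what that equation states. All matrices are random stand-ins, and `k_star` and `v_star` are simply sampled rather than computed as in Steps 1 and 2.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_keys = 6, 4, 50

W = rng.normal(size=(d_out, d_in))  # stand-in for W_proj
K = rng.normal(size=(d_in, n_keys)) # previously stored keys (columns)
C = K @ K.T                         # uncentred key covariance

k_star = rng.normal(size=d_in)      # lookup key for the subject (Step 1)
v_star = rng.normal(size=d_out)     # value encoding the new fact (Step 2)

c_inv_k = np.linalg.solve(C, k_star)                 # C^{-1} k*
lam = (v_star - W @ k_star) / (c_inv_k @ k_star)     # residual, rescaled
W_hat = W + np.outer(lam, c_inv_k)                   # rank-one update
```

After the update, `W_hat @ k_star` returns `v_star` exactly, while the change to `W` stays confined to a single rank-one direction.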

The complete process is shown in the diagram below.

1340

4.4 Learning Knowledge

Next, let’s look back at how the Transformer was adjusted or modified internally during the learning process.

Forward propagation

Existing methods primarily focus on the mapping between hidden states and weights in forward propagation. For example, Logit Lens is a method for analyzing and interpreting the internal mechanisms of large language models: it exposes the model’s behavior during generation by transforming the hidden states of the LM into vocabulary probabilities. This projection helps to understand how the LM gradually builds up its output during generation.

The principle behind Logit Lens is quite simple. Decoding a new token involves first transforming the latent vector with a linear layer and then converting it to a probability distribution over the vocabulary using softmax. By applying this same process at each intermediate layer, the tokens that layer would predict can be obtained. Specifically, for a given layer, Logit Lens directly multiplies the output of that neuron or layer by the unembedding matrix and then inspects the top tokens to locate the information stored in the model. The top-ranked tokens indicate that the neuron/layer output stores information about those tokens.

The image below is an example.

1341
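The Logit Lens projection described above can be sketched in a few lines. The five-word vocabulary, the random unembedding matrix `W_U`, and the hand-built hidden state are all hypothetical; real models usually also apply the final layer normalization before unembedding, which is omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["Paris", "London", "cat", "dog", "the"]  # toy vocabulary
d_model = 8

W_U = rng.normal(size=(d_model, len(vocab)))  # hypothetical unembedding matrix
W_U /= np.linalg.norm(W_U, axis=0)            # unit-norm columns for a clean demo

def logit_lens(hidden, top_k=2):
    """Project an intermediate hidden state onto the vocabulary."""
    logits = hidden @ W_U
    probs = np.exp(logits - logits.max())     # numerically stable softmax
    probs /= probs.sum()
    top = np.argsort(probs)[::-1][:top_k]
    return [(vocab[i], float(probs[i])) for i in top]

# A hidden state constructed to point along the "Paris" unembedding direction.
hidden = 3.0 * W_U[:, 0]
top_tokens = logit_lens(hidden)
```

Running the same function on the hidden state of every layer gives the layer-by-layer view that the figure illustrates.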

Furthermore, the paper “Physics of Language Models” points out that LLM’s knowledge storage capacity follows a linear scaling law of 2 bits per param, provided that the knowledge is sufficiently trained during the pre-training phase. Sufficient training is roughly defined as a piece of knowledge appearing more than 1000 times in the training corpus (different expressions of the same semantics can be considered multiple occurrences). This capacity is only related to the number of model parameters and is independent of model structure, depth, training hyperparameters, etc., even after removing the MLP layers.

However, if the knowledge is not trained sufficiently (e.g., the number of occurrences is reduced to 100), its storage capacity will decrease to approximately 1 bit/param. In this case, the differences between different model architectures begin to emerge: the architectures of Llama and Mistral perform about 1.3 times worse than GPT-2.

Reducing the MLP layer of GPT-2 to 1/4 does not result in a significant loss of storage capacity, but completely removing the MLP layer would result in a significant loss.
If the GatedMLP in Llama’s structure is replaced with a standard MLP, its storage capacity will be restored to the same level as GPT-2.

Here, “bit” is measured semantically: dataset entries that share the same meaning but differ in wording (for example, different templates) count as the same information. Information is therefore measured by the distinct values/content filled in, i.e., only semantically different information is counted. This experiment was trained and tested on a single-question dataset.
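As a back-of-the-envelope illustration of the scaling law quoted above, the snippet below applies the 2 bit/param and 1 bit/param figures to a hypothetical 7B-parameter model. The model size is an assumption chosen only for the arithmetic.

```python
def knowledge_capacity_bits(n_params, bits_per_param=2.0):
    """Estimated knowledge capacity under a linear bits-per-parameter law."""
    return n_params * bits_per_param

n_params = 7_000_000_000  # hypothetical 7B-parameter model

well_trained_bits = knowledge_capacity_bits(n_params)        # ~2 bit/param regime
under_trained_bits = knowledge_capacity_bits(n_params, 1.0)  # ~1 bit/param regime

well_trained_gb = well_trained_bits / 8 / 1e9  # bits -> bytes -> gigabytes
```

Under these assumptions a sufficiently trained 7B model would hold roughly 14 billion bits, or about 1.75 GB, of knowledge, and half of that when the knowledge is seen too rarely.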

Backpropagation

The paper “Backward Lens: Projecting Language Model Gradients into the Vocabulary Space” extends existing interpretability methods by applying them to the backpropagation process of language models (LMs). Analyzing the gradient matrices produced during backpropagation gives a more comprehensive picture of how information flows within the model. The paper also proposes a novel approach that reveals how an LM learns new knowledge by mapping the gradient matrix into the vocabulary space. With this method, the researchers hope to gain a clear understanding of how the model stores and remembers information at multiple levels.

The backpropagation algorithm updates the weights in the model by calculating the gradient at each layer. This mechanism not only enables the model to learn new information but also provides researchers with opportunities to interpret the model’s behavior. Recent interpretability research has proposed various methods to attempt to interpret the inner workings of language models by visualizing weights and hidden states, especially in the forward propagation stage. However, discussions on how the gradients of backpropagation affect model learning and knowledge storage remain relatively scarce.

The figure below illustrates the impact of gradients on model updates during the forward and backward propagation of an MLP layer, specifically the interaction between gradients (represented in green) and weights (represented in blue). This paper primarily focuses on how to effectively apply this gradient information to the model’s knowledge updates and editing.

1342

By applying Logit Lens to the gradient matrix, the authors identify a mechanism they call “imprint and shift”, which reveals how information is stored in the MLP.

The gradient of each MLP layer can be represented as a combination of the forward-propagating input vector and the backward-propagating VJP (vector Jacobian product). Specifically, the gradient’s behavior during the update process can be expressed as:

$\frac{\partial L}{\partial W}=x_i^{\top}\cdot \delta_i$

In this expression, $x_i$ is the input for forward propagation, and $\delta_i$ is the corresponding VJP. When backpropagation is used to update the MLP layer of the LM, the following two main phases of change occur:

Imprint stage: the input $x_i$ is added to or subtracted from the neurons of $FF_1$ , which adjusts how strongly the corresponding neurons of $FF_2$ are activated. This leaves an “imprint” of the given input on the MLP layer, and is equivalent to reinforcing the most likely words.
Shift stage: this stage adjusts the output of $FF_2$ , hence the name. Specifically, the VJP $\delta_i$ is subtracted from the neurons of $FF_2$ , which amplifies its effect on the output once those neurons are activated. This is equivalent to promoting previously low-probability words to higher-probability targets.

This “imprinting and shifting” mechanism can be used in the knowledge update process: given the original input and new target of a layer, the process updates $FF_1$ to reinforce similar inputs, and then shifts $FF_2$ output toward the new target. The advantage of this method is that information can be efficiently stored and adjusted in the MLP layer using only a single forward propagation.
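The outer-product form of the gradient can be verified on a single linear layer. This is a simplified sketch, not the paper's full analysis of $FF_1$ and $FF_2$: it assumes one linear map `y = x @ W`, a squared-error loss, and a step size chosen so that one update halves the residual.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 5, 3
W = rng.normal(size=(d_in, d_out))
x = rng.normal(size=d_in)         # forward-pass input x_i
target = rng.normal(size=d_out)

def loss(weights):
    return 0.5 * np.sum((x @ weights - target) ** 2)

delta = x @ W - target            # VJP: gradient of the loss w.r.t. the layer output
grad = np.outer(x, delta)         # dL/dW = x^T . delta, a rank-one matrix

# Finite-difference check of the analytic gradient.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(d_in):
    for j in range(d_out):
        step = np.zeros_like(W)
        step[i, j] = eps
        numeric[i, j] = (loss(W + step) - loss(W - step)) / (2 * eps)

lr = 0.5 / (x @ x)                # assumed step size: halves the residual for this x
W_new = W - lr * grad             # one gradient step writes x and delta into W
```

Because the update is `-lr * outer(x, delta)`, the input `x` is written into the weights along one axis and `delta` along the other, which is the structure the imprint and shift description relies on.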

1343

0x05 Optimization and Evolution

Next, let’s look at the optimization and evolution schemes for FFN.

5.1 MoE

In this field, much research has focused on integrating Mixture of Experts (MoE) techniques into LLMs to improve their performance while keeping computational cost under control. The core idea of MoE is to dynamically allocate different computational budgets to different input tokens. In MoE-based Transformers, multiple FFNs (i.e., experts) are used in conjunction with a trainable routing module. During inference, the routing module selects which experts are activated for each token.
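A minimal sketch of this routing scheme is given below. The top-2 selection, the softmax over the selected logits, and all sizes and weights are illustrative assumptions; production MoE layers add load balancing, capacity limits, and batched dispatch.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 4, 8, 4, 2

W_router = rng.normal(size=(d_model, n_experts))  # trainable routing module
experts = [
    (rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model)))
    for _ in range(n_experts)
]  # each expert is an ordinary two-layer FFN

def moe_ffn(x):
    """Route one token to its top-k experts and mix their outputs."""
    logits = x @ W_router
    chosen = np.argsort(logits)[::-1][:top_k]          # indices of selected experts
    gate = np.exp(logits[chosen] - logits[chosen].max())
    gate /= gate.sum()                                  # softmax over selected experts
    out = np.zeros(d_model)
    for g, e in zip(gate, chosen):
        W1, W2 = experts[e]
        out += g * (np.maximum(0.0, x @ W1) @ W2)       # only chosen experts run
    return out, chosen, gate

token = rng.normal(size=d_model)
out, chosen, gate = moe_ffn(token)
```

Only `top_k` of the `n_experts` FFNs are evaluated per token, so parameter count grows with the number of experts while per-token compute does not.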

The figure below shows the method for efficient FFN design presented in the paper “A Survey on Efficient Inference for Large”. As can be seen, most of the solutions are related to MoE.

1344

We will study MoE in subsequent articles.

5.2 MemoryFormer

Large language models possess exceptional contextual understanding and the ability to incorporate novel information. However, the potential of this approach is often limited by constraints on the effective context length. One approach to address this issue is to allow the attention layer to access external memory containing (key, value) pairs.

The paper “MemoryFormer: Minimize Transformer Computation by Removing Fully-Connected Layers” proposes a novel Transformer architecture called MemoryFormer. This architecture replaces the computationally expensive fully connected layers in traditional Transformers with an innovative memory layer design, significantly reducing computational complexity and resource requirements while maintaining model performance and flexibility.

Motivation and Challenges

While multi-head attention mechanisms excel at capturing the intrinsic relationships in sequential data, fully connected layers dominate the computational load. As model size increases, the computational complexity and memory requirements of fully connected layers grow rapidly, drastically increasing the cost of training and inference. Although existing methods have attempted to optimize the computational efficiency of Transformers, such as model pruning, weight quantization, and redesigned attention mechanisms (e.g., linear attention and FlashAttention), most of them ignore the computational bottleneck of fully connected layers, so the overall gains are limited. To address these challenges, MemoryFormer proposes a novel solution: replacing the fully connected layer with a memory layer, which fundamentally reduces computational complexity and resource consumption.

Principles and Innovations

The left side of the image below shows a schematic diagram of the Memory Layer, and the right side shows a component of the MemoryFormer.

1345

The core of MemoryFormer lies in its Memory Layer design, which replaces the traditional fully connected layer with memory lookup tables and the Locality Sensitive Hash (LSH) algorithm. The following are its key technical details:

Memory layer design and working principle

The primary function of the Memory Layer is to replace traditional matrix multiplication by retrieving pre-computed vector representations from memory. Specifically, the input embeddings are first hashed using a locality-sensitive hashing algorithm, mapping similar embeddings to the same memory location. Then, the model retrieves pre-stored vectors from memory that approximate the results of matrix multiplication.

The advantages of this design are:

Reduced computational complexity: Pre-computation and memory lookup avoid the expensive matrix operations of traditional fully connected layers.
Reduced memory requirements: Input embeddings are divided into smaller blocks and processed independently, which significantly reduces memory usage.
End-to-end training: The hash table in the memory layer stores learnable vectors, allowing the model to be optimized through backpropagation.

Applications of the Locality Sensitive Hashing (LSH) algorithm

Locality Sensitive Hashing (LSH) is an efficient approximate nearest neighbor search algorithm. Its core idea is to project high-dimensional data into a low-dimensional space using a hash function, so that similar data can be located quickly. In MemoryFormer, the LSH algorithm maps the input embedding to a specific location in memory. This mapping ensures that the features stored in the hash table continuously adapt to the input data, and during the inference phase it efficiently retrieves approximate outputs based on the similarity of input features, thus providing the feature transformation that fully connected layers would otherwise perform.
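The hash-then-lookup idea can be sketched as follows. The sign-bit hash, the chunk size, the plain summation of retrieved vectors, and the random tables are simplifying assumptions for illustration; the paper's actual hash function and its weighting of the retrieved vectors may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, chunk = 8, 6, 4  # two chunks of 4 dims -> 2**4 = 16 buckets per table
n_chunks = d_in // chunk

# One lookup table per chunk; in a real model these vectors would be learned.
tables = rng.normal(size=(n_chunks, 2 ** chunk, d_out))

def sign_hash(z):
    """Locality-sensitive hash: the sign pattern of z read as a binary number."""
    bits = (z > 0).astype(int)
    return int(bits @ (2 ** np.arange(len(z))))

def memory_layer(x):
    """Replace a matrix multiply with one table lookup per input chunk."""
    out = np.zeros(d_out)
    for k in range(n_chunks):
        z = x[k * chunk:(k + 1) * chunk]
        out += tables[k, sign_hash(z)]
    return out

x = rng.normal(size=d_in)
y = memory_layer(x)
y_near = memory_layer(1.01 * x)  # a nearby input falls into the same buckets
```

Splitting the input into chunks keeps each table at `2**chunk` rows instead of `2**d_in`, which is the kind of storage control the section refers to as vector segmentation.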

Scalable memory lookup table

MemoryFormer’s in-memory lookup table design supports dynamic expansion, allowing for flexible adjustment of storage capacity and retrieval accuracy based on task requirements. Furthermore, by introducing learnable vectors, the lookup table can be continuously optimized during training, thereby improving the overall performance of the model. Additionally, MemoryFormer controls the storage size of the hash table through multi-table partitioning and vector segmentation, preventing a surge in memory requirements due to the introduction of the hash table. The derivation process is as follows.

1346

5.3 Memory Layers at Scale

Pre-trained language models typically encode a large amount of information in their parameters, and as their size increases, they can recall and use this information more accurately. For dense deep neural networks that primarily encode information as weights of linear matrix transformations, scaling the parameter size directly correlates with increased computational and energy requirements. A crucial subset of information that language models need to learn is simple association. While feedforward networks can, in principle (given sufficient size), learn any function, using associative memory is more efficient.

Memory layers add extra parameters to the model using a trainable key-value lookup mechanism without increasing FLOPs. Conceptually, sparsely activated memory layers complement computationally intensive dense feedforward layers, providing dedicated capacity for inexpensive storage and retrieval of information.

The paper “Memory Layers at Scale” takes memory layers beyond the proof-of-concept stage by replacing the feedforward network (FFN) of one or more Transformer layers with memory layers (keeping other layers unchanged). This demonstrates the practicality of memory layers in scaling large language models (LLMs). The research extends the number of key-value pairs to millions.

A trainable memory layer is similar to an attention mechanism. Given a query, a set of keys, and values, the trainable memory layer outputs a soft combination of values that is weighted according to the similarity between q and the corresponding keys. There are two key differences between memory layers and attention layers in their use.

First, the keys and values in the memory layer are trainable parameters, not activations.
Second, the memory layer typically has far more keys and values, so sparse queries and updates are necessary.

A simple memory layer can be described by the following equation:

1347
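A sparse key-value lookup of this kind can be sketched as below. The formulation used here (select the top-k keys by dot product with the query, apply softmax over the selected scores, and return the weighted sum of the matching values) is an assumption about what the equation in the figure expresses; sizes and weights are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
n_keys, d_key, d_val, top_k = 1000, 16, 8, 4

K = rng.normal(size=(n_keys, d_key))  # trainable keys
V = rng.normal(size=(n_keys, d_val))  # trainable values

def memory_lookup(q):
    """Sparse lookup: only the top-k most similar keys contribute."""
    scores = K @ q
    idx = np.argpartition(scores, -top_k)[-top_k:]  # indices of the top-k keys
    s = np.exp(scores[idx] - scores[idx].max())     # softmax over selected scores
    s /= s.sum()
    return s @ V[idx], idx

q = rng.normal(size=d_key)
y, idx = memory_lookup(q)
```

Only `top_k` rows of `V` are read per query, which is why the number of stored pairs can grow without a matching growth in FLOPs.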

Scaling the memory layer

One bottleneck encountered when expanding the memory layer is the query-key retrieval mechanism. Simple nearest neighbor search requires comparing every query-key pair, which quickly becomes impractical for large memory layers. While approximate vector similarity techniques can be used, integrating the keys is a challenge as they are continuously trained and need to be reindexed. Instead, this paper employs trainable product-quantized keys.
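The product-key trick can be sketched as follows, assuming the usual construction in which each full key is the concatenation of one sub-key from each of two small codebooks, so the score of key (i, j) is the sum of two half-scores. Codebook sizes and weights are illustrative, and the brute-force search is included only to check the result.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sub, d_half, top_k = 32, 8, 4  # 32 * 32 = 1024 implicit full keys

C1 = rng.normal(size=(n_sub, d_half))  # sub-keys for the first half of the query
C2 = rng.normal(size=(n_sub, d_half))  # sub-keys for the second half

def product_key_topk(q):
    """Find the top-k (i, j) key pairs without scoring all n_sub**2 keys."""
    q1, q2 = q[:d_half], q[d_half:]
    s1, s2 = C1 @ q1, C2 @ q2
    i1 = np.argsort(s1)[-top_k:]                  # best sub-keys per half
    i2 = np.argsort(s2)[-top_k:]
    cand = s1[i1][:, None] + s2[i2][None, :]      # scores of k * k candidates
    flat = np.argsort(cand.ravel())[-top_k:]
    rows, cols = np.unravel_index(flat, cand.shape)
    return {(int(i1[r]), int(i2[c])) for r, c in zip(rows, cols)}

q = rng.normal(size=2 * d_half)
fast = product_key_topk(q)

# Brute force over all 1024 keys, for comparison only.
full = (C1 @ q[:d_half])[:, None] + (C2 @ q[d_half:])[None, :]
best = np.argsort(full.ravel())[-top_k:]
brute = {(int(i), int(j)) for i, j in zip(*np.unravel_index(best, full.shape))}
```

The fast path scores `2 * n_sub` sub-keys plus `top_k**2` candidates instead of `n_sub**2` full keys, and a pair `(i, j)` would index row `i * n_sub + j` of the value table.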

Parallel memory

The memory layer is memory-intensive, primarily because of its large number of trainable parameters and the associated optimizer states. The paper parallelizes the embedding lookup and aggregation across multiple GPUs, with the memory values sharded along the embedding dimension. At each step, the indices are gathered from the process group, and each worker performs the lookup and aggregates the portion of the embeddings that lives in its own shard. Each worker then gathers the partial embeddings that correspond to its own portion of the indices. This process is illustrated in the figure.
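The effect of sharding along the embedding dimension can be illustrated with a single-process simulation. This is only a sketch under simplifying assumptions: the "workers" are plain Python lists, no real collective communication takes place, and the helper names `shard_columns` and `sharded_lookup` are hypothetical.

```python
def shard_columns(values, num_workers):
    """Split every value vector into contiguous column slices, one per worker."""
    dim = len(values[0])
    step = dim // num_workers  # assumes dim is divisible by num_workers
    return [[row[w * step:(w + 1) * step] for row in values]
            for w in range(num_workers)]

def sharded_lookup(shards, indices, weights):
    """Each worker aggregates only its own columns; the parts are then joined."""
    partials = []
    for shard in shards:  # one iteration stands in for one worker
        width = len(shard[0])
        partials.append([sum(w * shard[i][d] for w, i in zip(weights, indices))
                         for d in range(width)])
    # Concatenating the partial embeddings restores the full output vector.
    return [x for part in partials for x in part]

values = [[1.0, 2.0, 3.0, 4.0],
          [5.0, 6.0, 7.0, 8.0],
          [9.0, 10.0, 11.0, 12.0]]
indices, weights = [2, 0], [0.75, 0.25]

# Reference result without sharding.
full = [sum(w * values[i][d] for w, i in zip(weights, indices)) for d in range(4)]
sharded = sharded_lookup(shard_columns(values, num_workers=2), indices, weights)
print(full, sharded)
```

Because every worker needs the same indices and weights but only a slice of each value vector, the value table (and its optimizer state) is split evenly across devices while the result stays identical to the unsharded lookup.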

1348

Shared Memory

Deep networks encode information at different levels of abstraction in different layers. Adding memory to multiple layers can help the model use its memory in a more general way. Unlike previous work, this study shares a single pool of memory parameters across all memory layers, which keeps the parameter count unchanged while maximizing parameter sharing.

1349

The paper also improves the training performance of the memory layer by introducing input-dependent gating with a SiLU nonlinearity.
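A sketch of such a gate, assuming a gated-linear-unit style form in which $x$ is the layer input, $y$ is the output of the memory lookup, and $W_1$, $W_2$ are additional trainable projection matrices (these symbols are illustrative assumptions, written in the same row-vector convention as the FFN formula earlier in this article):

$output=(y \odot silu(xW_1))W_2, \quad silu(z)=z \cdot \sigma(z)$

Here $\odot$ is element-wise multiplication and $\sigma$ is the logistic sigmoid, so the input decides how much of each retrieved dimension passes through before the final projection.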

5.4 KAN

The authors of the paper “KAN: Kolmogorov–Arnold Networks” argue that while Multilayer Perceptrons (MLPs) are the foundational building blocks of modern neural networks, they are not optimal and have some drawbacks. They therefore propose a new neural network architecture, KAN (Kolmogorov–Arnold Networks). The authors replace the combination of linear weights and fixed activation functions with parametric spline functions, and claim that KAN is a powerful alternative to MLPs that surpasses them in accuracy and interpretability and exhibits faster neural scaling laws. This offers new possibilities for improving current deep learning models, which rely heavily on MLPs.

The main points of the paper are as follows:

KANs are inspired by the Kolmogorov-Arnold representation theorem rather than the universal approximation theorem on which MLPs are based. Through their structural design and weight representation (learnable activation functions parameterized as splines), KANs are claimed to achieve higher model performance than traditional MLPs while keeping computation efficient, which suggests potential as efficient nonlinear approximators in resource-constrained environments.
The main feature of KANs is that they remove linear weights and fixed activation functions, replacing each weight with a learnable activation function. These activation functions are represented by univariate spline functions (a single input variable and multiple parameters, which let the function take a different shape in each interval while remaining a smooth curve).
The authors argue that KANs combine splines and multilayer perceptrons (MLPs), leveraging the strengths of each while avoiding their weaknesses (splines are highly accurate for low-dimensional functions but suffer from the curse of dimensionality; MLPs are less affected by the curse of dimensionality thanks to their feature learning, but are less accurate than splines in low dimensions). With an MLP-like structure on the outside learning features and splines on the inside refining those learned features to high accuracy, KANs can both learn compositional structure and approximate univariate functions well when dealing with high-dimensional functions.
The authors highlight the potential of using the Kolmogorov-Arnold representation theorem to construct neural networks (KANs). While previous research has utilized the Kolmogorov-Arnold representation theorem, most work has been limited to the original representation with a depth of 2 and a width of (2n + 1), failing to fully leverage modern techniques such as backpropagation for network training. The authors extend the original Kolmogorov-Arnold representation theorem to arbitrary widths and depths, making it more suitable for today’s deep learning environment.
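The idea of putting a learnable univariate function on every edge can be sketched as follows. This is a simplified illustration, not the paper's implementation: it swaps the B-spline parameterization for piecewise-linear interpolation on a fixed grid, and the names `SplineEdge` and `kan_layer` are hypothetical. Each output is simply the sum of the edge functions applied to the inputs, with no separate weight matrix.

```python
import bisect

class SplineEdge:
    """Learnable univariate function on one edge.

    Piecewise-linear interpolation on a fixed grid, used here as a stand-in
    for B-splines; `coeffs` would be the trainable parameters.
    """
    def __init__(self, grid, coeffs):
        assert len(grid) == len(coeffs) and len(grid) >= 2
        self.grid, self.coeffs = grid, coeffs

    def __call__(self, x):
        # Clamp to the grid range, then interpolate between neighbouring knots.
        x = min(max(x, self.grid[0]), self.grid[-1])
        hi = min(max(bisect.bisect_right(self.grid, x), 1), len(self.grid) - 1)
        lo = hi - 1
        t = (x - self.grid[lo]) / (self.grid[hi] - self.grid[lo])
        return (1 - t) * self.coeffs[lo] + t * self.coeffs[hi]

def kan_layer(x, edges):
    """edges[j][i] is the function on the edge from input i to output j."""
    return [sum(phi(xi) for phi, xi in zip(row, x)) for row in edges]

grid = [-1.0, 0.0, 1.0]
abs_edge = SplineEdge(grid, [1.0, 0.0, 1.0])   # behaves like |x| on [-1, 1]
identity = SplineEdge(grid, [-1.0, 0.0, 1.0])  # behaves like x on [-1, 1]
edges = [[abs_edge, identity],
         [identity, abs_edge]]
out = kan_layer([0.5, -0.5], edges)
print(out)
```

Stacking several such layers gives the arbitrary-width, arbitrary-depth generalization described above; training would adjust the `coeffs` of every edge by backpropagation.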

1350

0xFF Reference

Axiomatic Attribution for Deep Networks

[NeurIPS 2024] MemoryFormer: Huawei proposes a new Transformer architecture that replaces computing with storage, reducing inference computation by 10 times. (Wang Yunhe)

The Physics of Language Models 3.1: Knowledge Storage and Retrieval Digital Garden | Wang Banxian

A Comprehensive Study of Knowledge Editing for Large Language Models. arXiv preprint arXiv:2401.01286 (2024)

ACL 2020 | Best Theme Paper Award: “Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data” - Academic Headline

Analyzing Memorization in Large Language Models through the Lens of Model Attribution

Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data

Does Knowledge Localization Hold True? Surprising Differences Between Entity and Relation Perspectives in Language Models

Efficient softmax approximation for GPUs

EMNLP 2024 | Knowledge Mechanisms of Large Language Models: Overview and Perspectives [ZJUKG]

EMNLP 2024 Best Paper: Understanding the Transformer’s Operation Mechanism Through the Backpropagation Matrix [PaperWeekly]

End-to-End Memory Networks

Focused Transformer: Contrastive Training for Context Scaling

Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small (Hao Bai)

Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models

KAN: Kolmogorov–Arnold Networks

Knowledge Circuits in Pretrained Transformers

Knowledge Neurons in Pretrained Transformers

Memory Layers at Scale

Memory-Based Model Editing at Scale Fred

MemoryFormer: A Novel, Efficient, and Scalable Architecture for Large-Scale Language Models (Yuan Yan [Dunshu AI])

Meta explores large model memory layers, expanding to 128 billion parameters, outperforming MoE (Machine Learning)

PMET: Precise Model Editing in a Transformer

ROME: Locating and Editing Factual Associations in GPT Hao Bai

A Mathematical Framework for Transformer Circuits (Hao Bai)

Transformer Feed-Forward Layers Are Key-Value Memories

Transformer Feed-Forward Layers Are Key-Value Memories pureDemon

Does the Transformer truly understand the semantic information of natural language, or is it merely a pattern recognition function?

[Model Editing Technology] Paper Reading Notes (Part 1): PMET: Precise Model Editing in a Transformer

From Mathematics to Neural Networks (Part 2): Computation - From Computation to Construction : Alpha

Discussing Natural Language Understanding from the Perspective of Cognitive and Logical Thinking

Peking University & Microsoft: Knowledge Neurons in Pretrained Transformers [Machine Learning Community]

Interpretability of Integrated Gradients by Shepherd

Large-Scale Language Model Series Interpretation (Part 2): The Memory Function of FFN in Transformer (by Ding Jiayu)

A Review of Research on Memory Mechanism Analysis and Intervention in Large-Scale Language Models (Coco [Dunshu AI])

What exactly is knowledge storage in large models? (Cheese AI eats fish)

Large models also have lateral hemispheres? Unveiling how WISE brings a new breakthrough in lifelong learning.

The large model’s load-bearing walls, once removed, meant it would stagnate! Apple gave it “super weight.” [Machine Heart]

A New Perspective on the AI Black Box: LLM Concept Alignment: Revealing the Cognitive Mechanisms of LLM | Princeton University AI Cat Repair Prompt

Machine Reading Comprehension: Reasoning Networks (Part 1) - End-To-End Memory Networks (Full Text Translation - Low-Level Alchemist)

Model interpretability: Axiomatic Attribution for Deep Networks knight

A New Direction in Model Interpretation! Zhejiang University Unveils the Knowledge Flow Between Hidden Layers in LLM Models! (bhn [Deep Learning Natural Language Processing])

Learn the Big Model Through Visuals: The Past and Present of Transformers (Part 2)

Algorithm Trivia Episode 1 - What are the changes in FFN for large models? Sam, eat more vegetables.

Let’s talk about FFN in Transformer. (Pan Zizheng [Qingke AI])

Paper Notes: Dissecting Recall of Factual Associations in Auto-Regressive Language Models Vicle

Paper Interpretation: Physics of Language Models (For Application-Oriented Readers) [July 2024] Original by Kong [Kong’s Low-Dimensional Cognition]

Several important mechanisms used by language models to complete factual recall tasks: GSAI-ALOHA

Read the paper LINEARITY OF RELATION DECODING IN TRANSFORMER LANGUAGE MODELS Fred

Read the paper “Locating and Editing Factual Associations in GPT” Fred

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (Hao Bai)

Knowledge Editing for Large Language Models: (I) Introduction and Background Knowledge (A giraffe riding a shark)

Knowledge Editing for Large Language Models: (Part 3) Definition and Method Classification of Knowledge Editing Tasks (Giraffe Riding a Shark)

https://arxiv.org/abs/2411.12992

https://zhuanlan.zhihu.com/p/409287967

https://zhuanlan.zhihu.com/p/432553711

https://zhuanlan.zhihu.com/p/558937247

https://zhuanlan.zhihu.com/p/604739354

GitHub: LLMForEverybody

Rectified Linear Units Improve Restricted Boltzmann Machines

Rectifier Nonlinearities Improve Neural Network Acoustic Models

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

Language Modeling with Gated Convolutional Networks

Searching for Activation Functions

Scaling Vision Transformers

Self-Normalizing Neural Networks

Gaussian Error Linear Units (GELUs)

Mish: A Self Regularized Non-Monotonic Neural Activation Function