Exploring the Transformer Series (2) --- Overall Architecture

0x00 Overview

0.1 Process

Using a Transformer for text generation essentially means using a model to predict the next token. The complete process includes several stages, such as tokenization, vectorization, attention computation, and sampling. The workflow is as follows:

Tokenization. The user’s input text (let’s assume it is “Data visualization empowers users to”) is broken down into independent lexical units, i.e., tokens.
Encoding. Tokens are mapped to numbers using a vocabulary, with each token represented by a unique number.
Embedding. The token numbers are converted into embedding vectors, essentially mapping words into a vector space so that the LLM can process them. Positional encoding is also added at this stage, because understanding language involves not only individual words but also their order; positional encoding ensures that the word order is not lost. All embedding vectors are combined to form an embedding matrix.
Attention computation. This is a contextualizing operation in which stacked Transformer blocks use the attention mechanism to transform embedding vectors into feature vectors, building relationships between words. During attention computation, each token is weighted by its relevance to the other tokens. After passing through the last Transformer layer, each token ends up with a feature vector representing its semantics in context.
Probability calculation. The feature vector corresponding to the last token (“to”) is mapped to a probability distribution over the next token. Specifically, a linear layer projects the feature vector to the vocabulary dimension (i.e., it transforms the decoder output into a vector of the same size as the vocabulary), producing one raw score (logit) per vocabulary entry, and softmax then normalizes these logits into a probability distribution. This distribution gives the probability of each word in the vocabulary being the next word.
Sampling. Based on these probabilities, the next token is chosen according to a sampling rule, for example selecting “visualize”, the token with the highest probability.
Decoding. The vocabulary is used again to convert the integer corresponding to “visualize” back into text, which is appended to the result sentence.
Repeat the above process until the LLM outputs the end-of-sequence (EOS) marker, which indicates that decoding is complete, or until the required number of tokens has been generated.
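The probability calculation and sampling steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the five-word vocabulary and the logit values are made up, and a real model would produce the logits with its linear output layer.

```python
import math
import random

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and made-up logits for the last position ("to").
vocab = ["<eos>", "visualize", "explore", "see", "data"]
logits = [0.1, 2.5, 1.2, 0.7, -0.3]

probs = softmax(logits)

# Greedy rule: always take the most probable token.
greedy_id = max(range(len(probs)), key=lambda i: probs[i])

# Random sampling: draw a token id in proportion to its probability.
rng = random.Random(0)
sampled_id = rng.choices(range(len(vocab)), weights=probs, k=1)[0]

print("greedy :", vocab[greedy_id])
print("sampled:", vocab[sampled_id])
```

The greedy rule always returns the same token for the same logits, while random sampling can return any token with non-zero probability, which is why the two rules can disagree.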

The diagram below visualizes the core of the above process and forms the basis of this explanation. The model structure and execution flow will be further refined in subsequent sections.

101

0.2 Explanation

This series is based primarily on the following:

Transformer paper: Attention Is All You Need https://arxiv.org/abs/1706.03762v7.
Other relevant classic papers and excellent blog posts will be provided in specific sections of each article.
The blog “The Annotated Transformer” and its source code (hereinafter referred to as the Harvard source code). “The Annotated Transformer” is a collection of reading notes on the Transformer paper, and the blog author has implemented the model from the paper in code, providing a detailed interpretation of the original paper in conjunction with the implemented model. Compared to other Transformer model implementations available online, “The Annotated Transformer” is more suitable for learning and interpretation. Its address is:
- http://nlp.seas.harvard.edu/2018/04/03/attention.html
- The code was pulled from https://github.com/harvardnlp/annotated-transformer.git on February 4, 2025.

In addition, this article will use the text translation function as an example for explanation.

0x01 Overall Architecture

1.1 Design Motivation

The novelty of Transformer lies in the fact that it is a sequence transformation architecture implemented entirely based on the attention mechanism. Our analysis of the main design motivations of Transformer is as follows:

Addressing long-distance dependencies. The paper aims to overcome the limitations of RNNs on long sequences. The attention mechanism reduces the path length between any two positions in a sequence to a constant, which makes it easier to capture semantic relationships across long texts.
Improving training parallelism. The paper aims to overcome the inability of RNNs to process a sequence in parallel. The attention mechanism relates all positions to one another without stepping through the sequence in order, so it parallelizes well. It fits existing GPU frameworks, enables distributed training, and improves training efficiency.

Therefore, Jakob Uszkoreit (one of the authors of Transformer) proposed replacing the encoding and decoding process of sequences by RNNs with a self-attention mechanism. Noam Shazeer (another author of Transformer) then built upon this by proposing scaled dot-product attention, multi-head attention, and positional representation.

1.2 Model Structure

First, let’s look at the architecture diagram in the original paper, and then we will use it as the starting point for our analysis.

102

Main modules

From a network structure perspective, Transformer consists of four main modules.

The input module corresponds to the green circle in the image below.
The encoder corresponds to the blue circle in the image below.
The decoder corresponds to the red circle in the image below. Both the encoder and decoder have their own inputs and outputs. The encoder’s output is used as part of the decoder’s input (located in the orange circle in the middle of the decoder).
The output module corresponds to the purple circle in the image below.

To be precise, the blue circle marks an encoder layer and the red circle marks a decoder layer. The $N\times$ in the diagram indicates a stack: several layers with the same structure placed one after another. To avoid confusion, we will refer to a single layer as an encoder layer or a decoder layer, and to the result of stacking as an encoder or a decoder. In the Transformer paper, each stack consists of 6 layers.

103

Multiple layers

In the Transformer, the input to the first layer is an embedding matrix. The output of the first layer is then used as the input to the second layer, and so on. Each layer generates a set of embeddings, but these embeddings are no longer tied directly to individual tokens; instead, they reflect a more complex understanding of the relationships between tokens. For example, the figure below shows the relationship between layers 6 and 7 in a model with 12 attention heads per layer.

104

There are currently different explanations for the role of multi-layered architecture. A common explanation is that the essence of layering is to gradually build different levels of features from different contexts from the bottom up. For example, the bottom layer learns word features, the middle layer learns syntactic features, and the top layer learns semantic features, etc. Each layer does its own thing and does not affect the others.

The embedding corresponding to the input text evolves continuously as it flows through the layers of the Transformer. This process is similar to refining and abstracting the input information layer by layer. Each layer transforms and abstracts the input at different levels, absorbing more contextual information to enrich its representation and extracting higher-level features layer by layer. This results in a more powerful expressive capability after integrating multiple layers. As the deep learning model is trained, these network layers gradually learn the relationships and similarities between various categories, enabling them to utilize this knowledge when reasoning and answering questions. When the embedding reaches the last layer, it not only represents the independent meaning of the corresponding token but also possesses profound contextual information, reflecting the comprehensive relationship between the token and other tokens in the sequence. We can understand multi-layer processing as a factory assembly line. For example, to produce a piece of porcelain, we first determine the shape through molding and trimming, then carve various exquisite patterns or designs onto the dried body. Next, glazing is performed, applying glaze to the surface of the formed ceramic body; finally, the porcelain body is placed in a saggar and fired at high temperature in a kiln. Only then can you finally obtain a beautiful piece of porcelain.

Researchers have studied in depth the possibility that each layer in the stack plays a different role.

The paper “What Does BERT Learn about the Structure of Language?” dissects the complexity of English structure as understood by BERT. Their research found that BERT’s phrase representations primarily capture phrase-level information in the lower layers of the neural network, while encoding a complex hierarchical structure of language elements in the middle layers. This hierarchical structure is based on surface features, with the middle layers extracting grammatical features and the top layer presenting semantic features.

The paper “Analyzing Memorization in Large Language Models through the Lens of Model Attribution” states:

The deeper attention modules (the last 25% of the layers) are primarily responsible for memorization.
Shallow attention modules are crucial for the model’s generalization and reasoning abilities.
Applying short-circuit intervention to the deep attention modules can significantly reduce memorization while maintaining model performance.

The paper “Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models” finds a common mechanism in language models: anti-overconfidence. In the last few layers of the model, the language model consistently suppresses the output of correct answers. This suppression can be further divided into two types:

By copying the information from the starting position of the input to the last position using the attention head, we found that the information at the starting position seems to contain many high-frequency tokens. The model can use this method to dilute the correct answer in the residual stream with high-frequency tokens, thereby reducing the confidence of the answer.
The final layer of the MLP seems to be guiding the residual flow toward an “average” token (the average token is the result of weighted averaging of token embeddings based on the word frequencies of the training data).

In addition, the performance of a model often improves as the number of its parameters grows. The Transformer increases the number of learnable parameters by increasing the number of layers in the model, allowing more parameters to carry deeper information in the text.

1.3 Attention Module

Attention mechanism is the heart of the Transformer model, giving it the superpower to see the intricate relationships between each word in a sentence and other words.

Classification

The Transformer has three kinds of attention: global self-attention, masked self-attention, and cross-attention, as shown in the figure below. Cross-attention mainly handles the relationship between two different sequences; global self-attention mainly handles the relationship between elements within a single sequence; masked self-attention (also known as causal self-attention) uses a mask to restrict what the model can attend to when the attention scores are computed, which ensures that decoding is not affected by future information.

105

Location

The corresponding positions of the three attention modules in the Transformer network are shown in the figure below.

106

Role

The Transformer essentially establishes global connections within and between sequences through these three attention mechanisms. The paper’s explanation of the three mechanisms is shown in the figure below.

107

Let’s analyze these three types of attention in detail.

Global self-attention layer

The global self-attention layer, located in the encoder, is responsible for processing the entire input sequence. In this mechanism, each element in the sequence can directly access other elements, establishing dynamic relationships and allowing the model to better capture important information. Self-attention, in essence, focuses on the relationships within the sequence. How does this happen? Self-attention sets the query, key, and value to be the same thing—the input sequence—allowing the attention mechanism to search for relationships within the sequence itself and notice the correlations between different parts.

For global self-attention, Q, K, and V come from the following:

In the first encoder layer, Q, K, and V are all derived from the input sequence.
In later layers, Q, K, and V all come from the output of the previous layer in the encoder. Each position in the encoder can attend to all positions of the output of the previous layer.

To elaborate further, Q is the word vector at the current position in the sequence, while K and V are the word vectors at all positions in the sequence.
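As a minimal sketch of the idea that the same sequence serves as query, key, and value, the following computes scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, on three made-up token vectors. It deliberately omits the learned linear projections and the multiple heads of the real model, and the numbers in X are illustrative assumptions.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out, weights

# Three token vectors of dimension 4 (made-up numbers).
X = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 2.0],
     [1.0, 1.0, 1.0, 1.0]]

# Self-attention: the same sequence plays all three roles.
out, weights = attention(X, X, X)

for i, w in enumerate(weights):
    print("position", i, "attends with weights", [round(x, 3) for x in w])
```

Each row of weights sums to 1 and covers every position in the sequence, which is what "global" means here.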

Masked self-attention layer

A masked self-attention layer, or causal self-attention layer, captures the association between the current word and already decoded words during the decoding phase. It performs a similar function to a global self-attention layer on the decoder’s input sequence, but with some differences.

The Transformer is an autoregressive model that generates text sequentially, then appends the current output text to the previous input to create a new input. Subsequent outputs depend on preceding output words, exhibiting a causal relationship. This sequential operation significantly impacts the training time. To achieve parallel processing and speed up the process, masks are introduced. During attention calculations, masks ensure that later words do not participate in the computation of earlier words.

For masked self-attention, Q, K, and V come from the following:

In the first decoder layer, Q, K, and V are all derived from the decoder’s input sequence.
In later layers, Q, K, and V all come from the output of the previous layer in the decoder. Each position in the decoder can attend to all positions up to and including itself.

To elaborate further, Q is the word vector at the current position in the sequence, while K and V are the word vectors at the current and earlier positions; later positions are masked out.
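The effect of the mask can be sketched as follows. The score matrix is made up for illustration, and excluding masked entries from the softmax is used here as a simple stand-in for the common practice of filling masked scores with a very large negative number before softmax.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over the allowed entries only; masked entries get weight 0."""
    m = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up attention scores for a sequence of 4 tokens.
scores = [[0.5, 1.0, -0.2, 0.3],
          [0.1, 0.4, 0.9, -1.0],
          [1.2, 0.0, 0.3, 0.8],
          [0.2, 0.2, 0.2, 0.2]]

mask = causal_mask(4)
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]

for i, w in enumerate(weights):
    print("position", i, [round(x, 3) for x in w])
```

Because all rows are computed at once from a single score matrix, training can process every position in parallel while each position still sees only its own past.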

Cross attention layer

A cross-attention layer is essentially a traditional attention mechanism. Located within the decoder, it connects the encoder and decoder, allowing it to characterize the global dependencies between the input and output sequences and align them. Therefore, it uses the target sequence as Q and the context sequences as K and V.

For cross-attention, Q, K, and V come from the following:

Q comes from the previous decoder layer and is the output vector of the causal attention layer.
K and V come from the encoder output.

This allows each position in the decoder to focus on all positions in the input sequence. Furthermore, instead of just passing the hidden state from the last step, the encoder passes all hidden states generated at all times (corresponding to each position) to the decoder, thus solving the problem of the fixed length of the intermediate semantic encoding context.
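A small sketch can make the shapes concrete. Here the decoder side has 2 positions and the encoder output has 3 positions; all numbers are made up, and the learned projections of the real model are left out.

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out

# Encoder output (memory): 3 source positions, dimension 2.
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Output of the masked self-attention sub-layer: 2 target positions.
dec_state = [[0.5, 0.5], [2.0, -1.0]]

# Cross-attention: queries from the decoder, keys and values from the encoder.
out = attention(dec_state, memory, memory)
print(len(out), "target positions, each a mix of", len(memory), "source positions")
```

The output has one row per decoder position, and each row is a weighted average over all encoder positions, so the number of source tokens does not have to match the number of target tokens.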

108

Alternatively, from another perspective, cross-attention is a sequence-to-sequence pattern; bidirectional self-attention is an autoencoding pattern; and unidirectional self-attention is an autoregressive pattern.

1.4 Execution Process

Let’s briefly describe the computation process of the inference stage using the model structure diagram, as shown in the figure below.

109

Suppose we are performing machine translation, translating the Chinese phrase “我吃一个苹果” into the English phrase “I ate an apple”. Assuming the model has only one layer, the execution steps are as follows:

Input processing. The user inputs the natural language sentence “我吃一个苹果”; the tokenizer first converts the sentence into a token sequence; the Input Embedding layer then embeds each token and adds the Positional Encoding, forming an embedding matrix that carries position information. This matrix is denoted $X \in \mathbb{R}^{n \times d}$ , where n is the number of tokens in the sentence and d is the dimension of the vectors ( $d=512$ in the paper). Note: The input in the original paper’s diagram is a token sequence; for better illustration, this article uses a natural language sentence as the input.
The encoder performs encoding. The encoding matrix first enters the MHA (Multi-Head Attention) module, where each token exchanges and merges its information with that of other tokens according to certain weights. The fusion result then enters the FFN (Feed Forward Network) module for further processing, ultimately obtaining the mathematical representation of the entire sentence, where each word in the sentence carries information from the other words. The mathematical representation of the entire sentence is the output of the encoder.
Enter the translation start character to start the decoder.
The decoder performs decoding. First, the decoder input enters the Masked Multi-Head Attention module, where the positions of the input sequence exchange information with one another. Then, in the Multi-Head Attention module, the decoder fuses its input sequence with the encoder’s output. After the FFN module, the output module produces a probability distribution representing the probability of each word in the vocabulary being the next output word. Finally, a word is chosen according to some strategy, for example the most probable one. Here, the first word “I” is predicted.
The first word predicted, “I”, and the start symbol together are used as input to the decoder for further decoding.
The decoder predicted the second word “ate”.
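The loop described in these steps can be sketched as below. The function toy_decoder_step is a hypothetical stand-in for the real decoder: it looks the next word up in a fixed table instead of computing attention, and memory is only a placeholder for the encoder output.

```python
BOS, EOS = "<bos>", "<eos>"

def toy_decoder_step(memory, decoder_input):
    """Stand-in for the real decoder: returns the next word from a fixed table."""
    table = {
        (BOS,): "I",
        (BOS, "I"): "ate",
        (BOS, "I", "ate"): "an",
        (BOS, "I", "ate", "an"): "apple",
        (BOS, "I", "ate", "an", "apple"): EOS,
    }
    return table[tuple(decoder_input)]

memory = "encoder output for 我吃一个苹果"  # placeholder; a real model passes a matrix
decoder_input = [BOS]                       # step 3: start with the start symbol
steps = []

while True:
    word = toy_decoder_step(memory, decoder_input)
    steps.append((list(decoder_input), word))
    if word == EOS or len(decoder_input) >= 10:  # stop at EOS or a length limit
        break
    decoder_input.append(word)                   # feed the prediction back in

for i, (inp, out) in enumerate(steps, 1):
    print(i, inp, "->", out)
```

Note that the encoder output is computed once and reused at every step; only the decoder input grows.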

For this example, the specific inputs and outputs of the decoder at each step are shown in the table below.

110

1.6 Summary

The overall architecture of Transformer is an organic whole that is difficult to divide. The significance of its composition lies not in the basic units that constitute it, but in the complex relationships and emergent behaviors formed between these units. For example, collective intelligence originates from the combination of individuals, yet it produces higher-order capabilities that none of the individuals possess.

Some work has shifted its focus to the high-level architecture of the Transformer block, arguing that its complete structure, rather than just the attention operation that mixes tokens, is crucial for the Transformer to achieve competitive performance.

The paper “Attention is not all you need” points out that without skip connections (residual connections) and MLPs, the output of a self-attention network will shrink towards a rank-1 matrix. That is, skip connections and MLPs can effectively prevent this “rank collapse” degradation of self-attention networks. This reveals the indispensable role of skip connections and MLPs in self-attention.

The paper “MetaFormer is Actually What You Need for Vision” describes a general architecture in which the input is first embedded to obtain X. The embedding is then fed into repeated blocks, each made of two sub-blocks. The first sub-block primarily contains a token mixer, which lets different tokens communicate with each other ( $Y=\mathrm{TokenMixer}(\mathrm{Norm}(X))+X$ ); the second sub-block contains a two-layer MLP. Different models are obtained by choosing a specific design for the token mixer. If the token mixer is attention or a spatial MLP, MetaFormer becomes a Transformer or an MLP-like model, respectively.

111

0x02 Construction

We will now analyze and study the Harvard source code. The make_model() function in the Harvard code is the function for building the Transformer model.

2.1 Parameters

The make_model() function has the following 7 parameters:

src_vocab: The number of words in the source language vocabulary, i.e., the size of the source dictionary.
tgt_vocab: The number of words in the target language vocabulary, i.e., the size of the target dictionary.
N=6: The number of layers in the encoder stack and in the decoder stack.
d_model=512: The dimension of the data processed by the model, i.e., the size of the word embeddings.
d_ff=2048: The inner dimension of the FFN (position-wise feed-forward network), i.e., the number of hidden neurons.
h=8: The number of attention heads in the multi-head attention layer.
dropout=0.1: The dropout rate, used to reduce overfitting.
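To get a feeling for what these defaults imply, the short calculation below derives the per-head dimension and a rough count of weight-matrix entries per layer. The counts are an approximation under stated assumptions: they include only the four d_model × d_model attention projections and the two FFN matrices, and ignore biases, layer normalization, embeddings, and the Generator.

```python
N, d_model, d_ff, h = 6, 512, 2048, 8

# Each head works on an equal slice of the model dimension.
assert d_model % h == 0
d_k = d_model // h

# Weight matrices only (biases and LayerNorm parameters are ignored).
attn_weights = 4 * d_model * d_model   # query, key, value and output projections
ffn_weights = 2 * d_model * d_ff       # d_model -> d_ff -> d_model

encoder_layer_weights = attn_weights + ffn_weights        # one attention block
decoder_layer_weights = 2 * attn_weights + ffn_weights    # two attention blocks

print("d_k per head:", d_k)
print("encoder layer:", encoder_layer_weights)
print("decoder layer:", decoder_layer_weights)
print("all layers   :", N * (encoder_layer_weights + decoder_layer_weights))
```

The decoder layer is larger than the encoder layer because it carries both a masked self-attention block and a cross-attention block.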

2.2 Construction Logic

The main idea behind the make_model() function is to build a Transformer by constructing blocks from smallest to largest. Let’s first think about it without looking at the code, and see which modules in the architecture diagram can be used as building blocks.

Input modules: Input Embedding and Positional Encoding can each be used as individual building blocks, and they can be combined to form a new building block.
The encoder and decoder layers can be considered as two separate large building blocks. The Masked Multi-Head Attention, Multi-Head Attention, Feed Forward, and Add & Norm functions within them can also be considered as separate small building blocks.
Output modules: Linear and Softmax can each be used as individual building blocks, and when combined, they can form a new building block.

With these building blocks, we can easily construct a Transformer. Of course, the make_model() function performs some abstraction; some smaller building blocks are encapsulated in classes (for example, Linear and Softmax are encapsulated in the Generator class; details are not shown in the make_model() function). Let’s look at the correspondence between these modules and the code. The numbers in the image below represent the order in which a module appears in the code, and these numbers are also indicated in the comments below.

112

The specific code logic is as follows.

Rename the copy.deepcopy() function to c for better code conciseness. The deepcopy() function allocates new memory and completely copies the source instance; the copied object is completely independent of the source object. Subsequent constructors of various classes will call copy.deepcopy() to recreate the corresponding instance. For example, in the diagram above, instances labeled 1, 2, and 3 are each deeply copied at the decoder. This ensures that the two instances corresponding to label 1 are independent and unaffected by each other.
Construct MultiHeadedAttention, PositionwiseFeedForward, and PositionalEncoding objects.
Construct an EncoderDecoder object, which is the main Transformer class. Its parameters are Encoder, Decoder, src_embed, tgt_embed, and Generator. Let’s analyze the last three parameters first.
- src_embed is the return value of nn.Sequential(Embeddings(d_model, src_vocab), c(position)), representing the input encoding, corresponding to number 7 in the diagram above (composed of numbers 3 and 6). nn.Sequential() constructs a sequential container; the order of the modules within the container determines the order in which the model processes the data.
- tgt_embed is the return value of nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)), representing the output encoding, corresponding to number 9 in the diagram above (composed of numbers 3 and 8).
- The Generator corresponds to number 10 in the diagram above, and it includes Linear and Softmax. The Generator transforms the Decoder’s output into the probabilities of the output words.
The Encoder and Decoder classes are very similar, so we’ll use the Encoder class as an example. The Encoder class consists of N EncoderLayers, whose parameters are d_model, c(attn), c(ff), and dropout, which represent the word embedding dimension, multi-head attention, FFN layer, and Dropout, respectively. As you can see, the attention mechanisms in both the Encoder and Decoder classes are instances of MultiHeadedAttention; the difference lies in the parameters passed, which determines whether an attention mechanism is cross-attention or masked multi-head attention.
Initialize the model parameters. For Xavier initialization, please refer to the paper “Understanding the difficulty of training deep feedforward neural networks”.
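Why the deep copies matter can be shown with a small stand-in. ToyLayer is a hypothetical class used only for this illustration; it plays the role of a module such as MultiHeadedAttention, whose weights must not be shared between the encoder and the decoder.

```python
import copy

class ToyLayer:
    """Hypothetical stand-in for a module that owns some weights."""
    def __init__(self, weights):
        self.weights = weights

attn = ToyLayer([0.1, 0.2, 0.3])
c = copy.deepcopy  # same alias as in make_model()

enc_attn = c(attn)   # independent copy for the encoder
dec_attn = c(attn)   # independent copy for the decoder
shared = attn        # plain assignment: the very same object

enc_attn.weights[0] = 9.9   # "training" the encoder copy
shared.weights[1] = 7.7     # changing the alias changes the original too

print("original :", attn.weights)
print("encoder  :", enc_attn.weights)
print("decoder  :", dec_attn.weights)
```

Only the plain assignment leaks changes back to the original; the two deep copies evolve independently, which is the behavior make_model() relies on.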

The specific code is as follows.

import copy

import torch.nn as nn

# MultiHeadedAttention, PositionwiseFeedForward, PositionalEncoding, Embeddings,
# Encoder, Decoder, EncoderLayer, DecoderLayer, Generator and EncoderDecoder are
# defined elsewhere in the Harvard source code.


def make_model(src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1):
    "Helper: Construct a model from hyperparameters."
    # copy.deepcopy creates a new, independent instance; the short alias keeps the code below concise
    c = copy.deepcopy
    # Multi-head attention instance, number 1 in the figure above
    attn = MultiHeadedAttention(h, d_model)
    # Feed-forward network instance, number 2 in the figure above
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    # Positional encoding instance, number 3 in the figure above
    position = PositionalEncoding(d_model, dropout)
    # The complete Transformer model
    model = EncoderDecoder(
        # EncoderLayer contains one attention layer (number 4); Encoder stacks N such layers
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        # DecoderLayer contains two attention layers (number 5); Decoder stacks N such layers
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        # Input embedding plus positional encoding: Embeddings is number 6, the Sequential that combines both is number 7
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        # Output embedding plus positional encoding: Embeddings is number 8, the Sequential that combines both is number 9
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        # Generator (Linear + Softmax, number 10) predicts the next token from the Decoder output
        Generator(d_model, tgt_vocab),
    )

    # This was important from their code.
    # Initialize parameters with Glorot / fan_avg.
    # Xavier initialization: every parameter with more than one dimension is drawn from a uniform distribution
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model

2.3 Main Class

The EncoderDecoder class is an encoder-decoder implementation based on the Transformer architecture, and its member variables are as follows:

encoder: An instance of the Encoder class, which is the implementation of the encoder.
decoder: An instance of the Decoder class, which is the implementation of the decoder.
src_embed: The word embedding module for the source language, an nn.Sequential object that includes Embeddings and PositionalEncoding. src_embed performs embedding and positional encoding on the input.
tgt_embed: The word embedding module for the target language, an nn.Sequential object that includes Embeddings and PositionalEncoding. tgt_embed performs embedding and positional encoding on the target sequence.
generator: An instance of the Generator class, consisting of a Linear layer and a Softmax layer. It turns the Decoder’s output into a prediction, that is, it predicts the word at the current time step from the Decoder’s hidden state. The hidden state is fed to a fully connected layer (whose output size is the size of the dictionary), and softmax is then applied to obtain the probability of each candidate word.

The forward() function of the EncoderDecoder class performs the encoding and decoding tasks, and it accepts four parameters:

src: The source sequence, which contains each token’s number in the vocabulary. The shape of src is [batch_size, seq_len], for example [[ 0, 3, 4, 8, 1, 2, 2 ]], that is, the batch size is 1 and the sentence length is 7, where 0 is bos, 1 is eos, and 2 is pad.
tgt: The target sequence, with a similar meaning to src.
src_mask: The source sequence mask, used to mask the padding symbols. For example, [[ 0, 3, 4, 8, 1, 2, 2 ]] has a mask of [[True,True,True,True,True,False,False]], which masks the two padding symbols.
tgt_mask: The target sequence mask, serving two purposes: preventing the attention computation from seeing future words; and masking padding symbols. Its shape is [batch_size, seq_len, seq_len], and the corresponding mask is as follows:

[True,False,False,False,False,False,False],
[True,True,False,False,False,False,False],
[True,True,True,False,False,False,False],
[True,True,True,True,False,False,False],
[True,True,True,True,True,False,False], # the example sentence has 5 real tokens; the two trailing pads are set to False
[True,True,True,True,True,False,False],
[True,True,True,True,True,False,False],
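The two masks can be reproduced with plain Python lists. This is a sketch rather than the Harvard implementation: it assumes the ids [0, 3, 4, 8, 1, 2, 2] with pad id 2 (so only the last two positions are padding), reuses the same ids for the target purely for illustration, and omits the batch dimension. The helper names are hypothetical.

```python
PAD = 2
src = [0, 3, 4, 8, 1, 2, 2]   # bos, three words, eos, pad, pad
tgt = [0, 3, 4, 8, 1, 2, 2]   # same ids reused for the target, for illustration

def padding_mask(ids, pad=PAD):
    """True for real tokens, False for padding."""
    return [tok != pad for tok in ids]

def subsequent_mask(n):
    """True where position i may look at position j, i.e. j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def target_mask(ids, pad=PAD):
    """Combine both purposes: hide future positions and hide padding."""
    pad_ok = padding_mask(ids, pad)
    causal = subsequent_mask(len(ids))
    n = len(ids)
    return [[causal[i][j] and pad_ok[j] for j in range(n)] for i in range(n)]

src_mask = padding_mask(src)
tgt_mask = target_mask(tgt)

print(src_mask)
for row in tgt_mask:
    print(row)
```

The printed target mask has the same staircase shape as the matrix above, with the last two columns set to False in every row because those positions are padding.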

The EncoderDecoder code is as follows:

# Inherits from nn.Module
class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture. Base for this and many other models.
    """

    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        """
        The constructor takes 5 components. Passing them in from outside keeps
        the class flexible, because any component can be swapped out.
        """
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder  # encoder object
        self.decoder = decoder  # decoder object
        # Source-language input embedding combined with positional encoding
        self.src_embed = src_embed
        # Target-language output embedding combined with positional encoding
        self.tgt_embed = tgt_embed
        self.generator = generator  # generator object that predicts output tokens

    def forward(self, src, tgt, src_mask, tgt_mask):
        "Take in and process masked src and target sequences."
        # Four arguments: source sequence, target sequence, source mask, target mask
        # 1. Pass src and src_mask to encode(); the encoder output is called memory
        # 2. Pass memory, src_mask, tgt and tgt_mask to decode() for decoding
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)

    # Encoding function; takes the source sequence and its mask
    def encode(self, src, src_mask):
        # 1. Embed src to get the input embedding
        # 2. Compute the positional encoding and add it to the input embedding
        # 3. Run the encoder on the result; its output is called memory
        return self.encoder(self.src_embed(src), src_mask)

    # Decoding function; takes the encoder output (memory), the source mask,
    # the target sequence and the target mask
    def decode(self, memory, src_mask, tgt, tgt_mask):
        # 1. Embed tgt to get the output embedding
        # 2. Compute the positional encoding and add it to the output embedding
        # 3. Run the decoder; self.generator can then make the final prediction from its output
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)

2.4 How to call

The model is invoked as shown in the code below. The forward() function encodes the input into a hidden state (memory) and then decodes it, returning the decoder’s output; the Generator is applied to this output afterwards to turn it into predictions over the vocabulary. Its arguments are all taken from an instance of the Batch class, as follows:

src: A list of source sentences. The shape is [batch size, max sequence length]. Each sentence is extracted from the dataset and processed using a dictionary; for example: [0, 5, 12,..., 1, 2, 2]. Here, 0 represents bos, 1 represents eos, and 2 represents pad.
tgt: List of target sentences. The specific meaning is the same as above.
src_mask: The mask used by the attention layer (which will be analyzed in detail in later chapters).
tgt_mask: The mask for the decoder, which is the mask used by the attention layer (this will be analyzed in detail in later chapters).

out = model.forward(batch.src, batch.tgt, batch.src_mask, batch.tgt_mask)

0x03 Input

3.1 Input Category

Let’s take translating the Chinese phrase “我吃一个苹果” into the English phrase “I ate an apple” as an example. The overall input to the model is a sentence consisting of several words. Different models support different maximum sentence lengths. If the sentence is short, certain special words will be used to fill in the extra spaces. The source language and the target language (or encoder and decoder) correspond to two different independent inputs.

The encoder’s input corresponds to the Inputs in the diagram, which is a list of tokens obtained from the source text. In machine translation, the encoder receives a complete sentence at a time and processes it. For example, the source sentence “我吃一个苹果” is processed by the tokenizer, which maps each token to its index in the vocabulary, giving a list such as [5, 4, 6, 7, 8, 9, 10]. This list is then passed to the encoder as its input.

The decoder actually has two types of inputs:

Outputs (shifted right). Outputs are a concatenation of the decoder’s previous outputs, shifted one position to the right as a whole. The shift is needed because the decoder cannot output “I ate an apple” all at once; it outputs the sentence word by word, executing in a loop like an RNN. That is, the output of the current step is appended to the previous input to form the input for the next step, so that subsequent words can be generated.
The encoder’s output. The encoder layer encodes the inputs into an intermediate hidden state (called memory in the Transformer implementation, corresponding to the Hidden State in RNNs) and outputs it to the decoder. In other words, the encoder encodes the complete sentence of the source language and outputs it to the decoder at once.
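The "shifted right" idea can be sketched as follows. The token ids are made up for illustration, and split_target is a hypothetical helper: during training the decoder input is the target without its last token, and the labels are the target without its first token, so position i is trained to predict token i + 1.

```python
BOS, EOS = 0, 1  # special ids, following the convention used earlier

def split_target(tgt):
    """Return (decoder input, labels) for one target sentence."""
    decoder_input = tgt[:-1]   # starts with BOS, i.e. the outputs shifted right
    labels = tgt[1:]           # what the decoder should predict at each step
    return decoder_input, labels

# Hypothetical ids standing in for "I ate an apple"
tgt = [BOS, 11, 12, 13, 14, EOS]
decoder_input, labels = split_target(tgt)
```

At inference time there are no labels; the decoder input instead grows by one predicted token per step.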

The specific details are shown in the image below.

[Figure 113]

3.2 Input Module

The input module specifically includes the following:

Tokenizer. Note: in the original paper’s diagram the input is already a sequence of tokens. For better illustration, this article starts from a natural language sentence and therefore also includes a tokenizer.
The source language text embedding layer (corresponding to Input Embedding in the graph) and the position encoder (corresponding to Positional Encoding in the graph).
The target language text embedding layer (corresponding to Output Embedding in the diagram) and the position encoder (corresponding to Positional Encoding in the diagram).

[Figure 114]

3.3 Text Conversion

Returning to the example in Figure 1 at the beginning of this article, the sentence “Data visualization empowers users to” cannot be understood by the model. Therefore, we need to encode the natural language, that is, vectorize the text. This involves the following steps:

Tokenize the input text to obtain tokens.
Find the token ID corresponding to each token in the vocabulary (assuming the vocabulary size is 10000).
Convert each token ID into a token embedding (assuming the dimension is 512).
Compute a positional encoding for each position in the sequence.
Add the token embedding and the positional encoding together to obtain the final word embedding.

The specific process is as follows.

[Figure 115]

Let’s see how each step is done.

Word segmentation

Tokenization is the process of breaking down input text into smaller, more manageable semantic segments called tokens. A token is part of the model’s vocabulary, which is a list of lexical units used during LLM training. The tokenizer does two things:

First, the tokenizer breaks down the input text into smaller, more manageable tokens. Tokens can be words or sub-words. For example, the word “Data” is mapped to a token, while the word “empowers” is split into two tokens: “em” and “powers”.
Next, the tokenizer maps the tokens to integers, which are their indices in the vocabulary. Each index is equivalent to a one-hot vector over the vocabulary, with a 1 at that index.
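The two tokenizer steps can be sketched with a toy vocabulary. Everything here is an assumption made for illustration: TOY_VOCAB contains only the pieces needed for the example sentence, and greedy longest-match splitting stands in for whatever sub-word algorithm a real tokenizer uses.

```python
# Toy vocabulary (hypothetical); real vocabularies hold tens of thousands of entries.
TOY_VOCAB = {"<unk>": 0, "Data": 1, "visualization": 2, "em": 3,
             "powers": 4, "users": 5, "to": 6}

def tokenize(text, vocab=TOY_VOCAB):
    """Step 1: split text into tokens by greedy longest-match per word."""
    tokens = []
    for word in text.split():
        start = 0
        while start < len(word):
            for end in range(len(word), start, -1):
                piece = word[start:end]
                if piece in vocab:
                    tokens.append(piece)
                    start = end
                    break
            else:
                # No known piece starts here: give up on the rest of the word.
                tokens.append("<unk>")
                break
    return tokens

def encode(tokens, vocab=TOY_VOCAB):
    """Step 2: map each token to its index in the vocabulary."""
    return [vocab[t] for t in tokens]

tokens = tokenize("Data visualization empowers users to")
ids = encode(tokens)
```

With this vocabulary the five words become six tokens, because “empowers” is split into “em” and “powers”.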

Embedding

Next, let’s look at the word embedding vector generation process in the NLP field, specifically how to convert the indices of words in the vocabulary into vectors that a Transformer can use—that is, embeddings. An embedding is a fixed vector representation of each word; it’s better suited for deep learning than pure integers because it captures the semantic meaning of the word. The size of the embedding vector depends on the model’s dimensionality. When the embeddings of each token in the input are stacked together, they form the input’s embedding matrix.

In the Transformer architecture diagram, there is an Embedding module above Inputs and Outputs, and each module is composed of two sub-modules.

The embedding modules related to inputs include Input Embedding and Positional Encoding.
- Input Embedding is responsible for encoding the token.
- Positional encoding adds positional information to the token. In practice, the Transformer receives the entire embedding matrix of the input sentence at once. The advantage is parallel processing; the disadvantage is that positional information is lost. For example, without it the model cannot distinguish between “I love you” and “you love me”. Positional encoding restores this information by adding position-dependent values to each word.
The embedding modules related to Outputs include Output Embedding + Positional Encoding: similar to the above, so we will not repeat them here.

Token Embedding

Token embedding transforms text into a mathematical representation that the model can understand and process. This involves associating each token ID (one-hot) with a high-dimensional vector whose dimensions jointly encode the token’s semantic features. Embedding is typically a lookup operation: given the token ID, row token_id of the embedding matrix is retrieved as the embedding. The vector dimension depends on the model; the Transformer paper represents each token as a 512-dimensional vector. Token embedding is also called word embedding.
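A minimal sketch of the lookup, assuming a randomly initialized embedding matrix and toy sizes (10 and 4 instead of 10000 and 512). It also checks that looking up a row gives the same result as multiplying a one-hot vector by the matrix, which is why the two views are interchangeable. All names are hypothetical.

```python
import random

VOCAB_SIZE, D_MODEL = 10, 4   # toy sizes; the text assumes 10000 and 512

rng = random.Random(0)
# One row per vocabulary entry; in a real model these values are learned.
embedding_matrix = [[rng.uniform(-1, 1) for _ in range(D_MODEL)]
                    for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Lookup: pick row token_id of the embedding matrix for each id."""
    return [embedding_matrix[t] for t in token_ids]

def one_hot(token_id):
    return [1.0 if i == token_id else 0.0 for i in range(VOCAB_SIZE)]

def one_hot_matmul(token_id):
    """Same result, computed as one_hot(token_id) @ embedding_matrix."""
    v = one_hot(token_id)
    return [sum(v[i] * embedding_matrix[i][d] for i in range(VOCAB_SIZE))
            for d in range(D_MODEL)]

vectors = embed([1, 2, 3])
```

The lookup form is what implementations use in practice, since the one-hot product wastes work on zeros.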

Positional Encoding

The order of words in a sentence is very important. For example, the following two sentences have exactly the same words, but their meanings are completely different because of the different word order.

Buy a train ticket from Shanghai to Beijing.
Buy a train ticket from Beijing to Shanghai.

Therefore, the model also needs to encode the position information of each token in the input sentence.

Word Embedding

Finally, the model adds the token embedding and positional encoding to obtain the final embedding representation. This combined representation captures the semantics of the token and its position in the input sequence. In LLM (Large Language Model), the result of adding positional encoding to word embeddings is often still referred to as “input embeddings” or “word embeddings with positional information.” For ease of explanation, we will continue to refer to it as Word Embedding.

The shape of the output tensor in this stage is [batch size, sequence length, embedding dimension]. Taking “Data visualization empowers users to” as an example, the 5 words are split into 6 tokens, and the final word embedding is a matrix of [batch size, sequence length, embedding dimension]. Since there is only one sentence, the batch size is 1, the sequence length is 6, and assuming the embedding dimension is 512, the matrix dimension is [1, 6, 512].
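The shape bookkeeping can be checked with a small sketch. It assumes the sinusoidal positional encoding from the Transformer paper and uses placeholder token embeddings (constant rows) instead of learned ones; the function name positional_encoding is hypothetical.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

seq_len, d_model = 6, 512
# Placeholder token embeddings; a real model would look these up.
token_embeddings = [[0.01 * (t + 1)] * d_model for t in range(seq_len)]
pe = positional_encoding(seq_len, d_model)

# Element-wise sum, wrapped in a batch of size 1.
word_embeddings = [[[te + p for te, p in zip(tok_row, pe_row)]
                    for tok_row, pe_row in zip(token_embeddings, pe)]]
shape = (len(word_embeddings), len(word_embeddings[0]), len(word_embeddings[0][0]))
```

The addition is element-wise, so the positional encoding must have the same dimension as the token embedding.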

In fact, we can consider the embedding as LLM’s own language system (including textual information feature space and positional information feature space). The role of the input layer is to map (or encode) information from natural language, programming language, visual and auditory language, etc., into this high-dimensional language space. Next, various rich knowledge and structures are extracted from the high-dimensional language space through an attention mechanism, weighted accumulation and association to generate its own language, and finally “encoded” back into human language.

[Figure 116]

0x04 Transformer Layer

The core of Transformer processing lies in the Transformer block, which includes a multi-head attention mechanism and a multilayer perceptron (MLP), also called the feed-forward network (FFN). Most models consist of multiple such blocks, stacked sequentially one after another.

As can be seen from the Transformer’s constructor code, an Encoder class instance is constructed from N EncoderLayer class instances, and a Decoder class instance is constructed from N DecoderLayer class instances, which is consistent with the paper.

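# EncoderLayer contains only one attention layer; Encoder takes the outer parameter N (the number of stacked layers)
Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
# DecoderLayer contains two attention layers; Decoder takes the outer parameter N (the number of stacked layers)
Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),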

EncoderLayer and DecoderLayer are the basic building blocks, and each block mainly includes the following:

Multi-head attention mechanism. It allows tokens to communicate with each other, exchange information, and thus capture contextual information and relationships between words.
FFN layer. A feedforward network that operates independently for each token. The goal of the attention layer is to route information between tokens, while the goal of the MLP is to refine the representation of each token.

This section first provides a brief overview of the multi-head self-attention mechanism and the FFN layer. The following sections will provide a detailed introduction to the EncoderLayer and DecoderLayer.

4.1 Multi-head self-attention mechanism

Self-attention mechanisms enable models to focus on relevant parts of the input sequence, allowing them to capture complex relationships and dependencies within the data. Multi-head attention mechanisms borrow the idea of multi-kernels in CNNs, applying different linear transformations to different heads. Essentially, multi-head self-attention mechanisms construct multiple subspaces, building multiple attention mechanisms on these subspaces instead of a single attention, thus capturing more dimensions of information and interrelationships. Self-attention is the only part of the LLM architecture that computes relationships between words in a sequence; therefore, it forms the core of language understanding, encompassing the understanding of lexical relationships.

Let’s examine how the multi-head self-attention mechanism is computed. For simplicity, we’ll assume the encoder also uses a mask, and break down some complex operations.

Step 1: Calculate the query, key, and value matrix based on the original embedding.

[Figure 117]

The input to the self-attention mechanism is an $n_{\text{tokens}} \times n_{\text{embd}}$ embedding matrix, where each row or vector represents an independent word. The first part of LLM computation extracts the relevant row for each word from the word embedding matrix. The word embedding for each token is transformed into three distinct vectors, called query, key, and value, respectively. These vectors are obtained by multiplying the input embedding matrix with the learned $W^Q$ , $W^K$ , and $W^V$ weight matrices (these matrices are part of the model parameters). The dot product of the query and key is used to determine the similarity between the two vectors, which is also the dot product attention mentioned in the paper.

Using Taobao search as an analogy can help us better understand these matrices. Let’s say we search for “Li-Ning shoes” on Taobao.

The query is the content you enter in the search bar.
The key is the product description and title returned on the page, which are actually keywords related to the candidate products in Taobao’s product database.
The value refers to the Li-Ning shoes themselves. Once a matching product description and title (key) are found based on the search term (query), we want to examine the product details.

By using these QKV values, the model can calculate an attention score, thereby determining how much attention each token should receive from other tokens when generating predictions.
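Step 1 can be sketched with tiny hand-picked matrices. The values of X, W_Q, W_K, and W_V are arbitrary assumptions chosen so the arithmetic is easy to follow; in a real model the weight matrices are learned.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

# 3 tokens, embedding size 2 (toy values)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0]]
W_V = [[2.0, 0.0], [0.0, 2.0]]

Q = matmul(X, W_Q)   # one query vector per token
K = matmul(X, W_K)   # one key vector per token
V = matmul(X, W_V)   # one value vector per token

# Raw attention scores: dot product of every query with every key.
scores = matmul(Q, transpose(K))
```

Entry scores[i][j] measures how strongly token i's query matches token j's key.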

Step 2: Masked Self-Attention

Masked self-attention allows a model to generate sequences by focusing on relevant parts of the input while preventing access to future tokens. The figure below illustrates how to compute masked self-attention using query, key, and value matrices.

[Figure 118]

The diagram shows three steps:

The attention score is calculated using the dot product. The alignment of each Q with each K is determined by the dot product of the Q and K matrices, and the result of the dot product is a matrix that reflects the relationships between all input tokens.
Scaling and masking are applied to the attention score. First, the attention score is scaled (divided by $\sqrt{d_k}$ , where $d_k$ is the key dimension), and then a mask is applied to the upper triangle of the attention matrix to prevent the model from accessing future tokens, as the model needs to learn how to predict the next token without “peeking” into the future.
Softmax and dropout operations are applied. The attention score is converted into a probability through a softmax operation, and then a dropout operation is applied to randomly discard some elements.

The result obtained at this point is the attention weight.
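The scaling, masking, and softmax steps can be sketched as below. The input scores are toy values, dropout is omitted for determinism, and masked_attention_weights is a hypothetical name.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(scores, d_k):
    """Scale by sqrt(d_k), hide future positions (j > i), then apply softmax."""
    n = len(scores)
    weights = []
    for i in range(n):
        row = []
        for j in range(n):
            if j > i:
                row.append(float("-inf"))   # future token: probability becomes 0
            else:
                row.append(scores[i][j] / math.sqrt(d_k))
        weights.append(softmax(row))
    return weights

scores = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 2.0]]
weights = masked_attention_weights(scores, d_k=2)
```

The first token can only attend to itself, so its row of weights is exactly [1, 0, 0].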

Step 3: Assembling

The weights generated in the second step are multiplied with the V matrix to obtain the final output of the self-attention mechanism. Since it is the result of multi-head attention, the outputs of these heads need to be concatenated and fused through a linear layer.
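A small sketch of the assembling step with two heads. The weights, value matrices, and output projection W_O are toy assumptions; each head multiplies its attention weights by its own V, the per-head results are concatenated per token, and a final linear layer mixes them.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def concat_heads(heads):
    """Concatenate the per-head outputs token by token."""
    return [sum((head[t] for head in heads), []) for t in range(len(heads[0]))]

# Two tokens, two heads, each head with a 1-dimensional value vector.
weights_h1 = [[1.0, 0.0], [0.5, 0.5]]
weights_h2 = [[1.0, 0.0], [0.0, 1.0]]
V_h1 = [[2.0], [4.0]]
V_h2 = [[10.0], [20.0]]

head1 = matmul(weights_h1, V_h1)          # weighted sum of values, head 1
head2 = matmul(weights_h2, V_h2)          # weighted sum of values, head 2
concatenated = concat_heads([head1, head2])

W_O = [[1.0, 1.0], [0.0, 1.0]]            # output projection (toy values)
output = matmul(concatenated, W_O)
```

The output has one row per token again, so it can be passed on to the next sub-layer.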

4.2 FFN Layer

After multiple self-attention heads capture the different relationships between input tokens, the concatenated output is processed through an FFN (feed-forward network) layer. The figure below illustrates how an FFN layer projects the self-attention representation to a higher dimension to enhance the model’s representational power.

[Figure 119]

The FFN layer consists of two linear transformations with an activation function in between. The first linear transformation increases the input dimension from 512 to 2048. The second linear transformation reduces the dimension back to its original size 512 to ensure that subsequent layers receive inputs of consistent dimension.

Unlike self-attention mechanisms that communicate with each other within a sequence, FFN computes each element independently, thus avoiding inter-element information exchange (inter-element interaction relies entirely on self-attention). This allows each element to digest and integrate its own information after the information exchange occurs at the attention layer, preparing it for further information exchange through self-attention in the next layer.
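The position-wise behaviour can be sketched as follows. The sizes are toy assumptions (4 and 16, keeping the same 4x expansion as 512 to 2048), the weights are random rather than learned, and ReLU is used as the activation, as in the original paper.

```python
import random

D_MODEL, D_FF = 4, 16  # toy sizes with the same 4x expansion as 512 -> 2048

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W1, B1 = rand_matrix(D_MODEL, D_FF), [0.0] * D_FF
W2, B2 = rand_matrix(D_FF, D_MODEL), [0.0] * D_MODEL

def linear(x, w, b):
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

def ffn(token_vec):
    """Expand, apply ReLU, project back. Uses only this token's vector."""
    hidden = [max(0.0, h) for h in linear(token_vec, W1, B1)]
    return linear(hidden, W2, B2)

def ffn_layer(sequence):
    """Apply the same FFN to every token independently."""
    return [ffn(tok) for tok in sequence]

seq_a = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
seq_b = [[1.0, 0.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0]]   # only the second token differs
out_a, out_b = ffn_layer(seq_a), ffn_layer(seq_b)
```

Changing the second token leaves the first token's output untouched, which is the independence described above.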

4.3 Auxiliary Architecture

In addition to the main modules mentioned above, the Transformer model also employs design methods such as LayerNorm (layer normalization) and ResNet (residual connections). Although not mentioned in the constructor of the source code above, these modules are crucial for improving the overall representational power of the model.

LayerNorm

LayerNorm helps stabilize the training process and improve convergence. It works by normalizing across the features of each token so that the activations have zero mean and unit variance, followed by a learned scale and shift. This normalization is generally believed to help mitigate problems related to internal covariate shift, allowing the model to learn more effectively and reducing sensitivity to initial weights. From the architecture diagram, LayerNorm is applied twice in each Transformer block: once after the self-attention mechanism and once after the FFN layer; however, this is not always the case in practice.
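A minimal sketch of layer normalization for a single token vector. It assumes the population variance and a small eps for numerical stability; gamma and beta stand for the learned scale and shift and default to the identity here.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-6):
    """Normalize one token's features to zero mean and unit variance."""
    gamma = gamma if gamma is not None else [1.0] * len(x)
    beta = beta if beta is not None else [0.0] * len(x)
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

normed = layer_norm([1.0, 2.0, 3.0, 4.0])
out_mean = sum(normed) / len(normed)
out_var = sum((v - out_mean) ** 2 for v in normed) / len(normed)
```

The statistics are computed per token over the feature dimension, so the result does not depend on the other tokens in the batch.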

Dropout

Dropout is a regularization technique used to prevent overfitting by randomly setting a subset of activations (neuron outputs) to zero during training. This encourages the model to learn more robust features and reduces reliance on specific neurons, helping the network better generalize to new, unseen data.
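A sketch of inverted dropout applied to a vector of activations, which is the variant commonly used in deep learning frameworks. The function signature is hypothetical; kept values are divided by 1 - p so the expected value is unchanged, and nothing is dropped outside training.

```python
import random

def dropout(x, p, training, rng):
    """Zero each element with probability p during training; rescale the rest."""
    if not training or p == 0.0:
        return list(x)
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in x]

x = [1.0, 2.0, 3.0, 4.0]
rng = random.Random(0)
train_out = dropout(x, 0.5, True, rng)    # some entries zeroed, others doubled
eval_out = dropout(x, 0.5, False, rng)    # unchanged at inference time
```

This is why the inference example calls eval() on the model: it switches dropout off.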

Residual connection

Residual connections were introduced by Kaiming He et al. in the ResNet paper in 2015. A residual connection adds the network’s input and output to obtain a new output $F(x)+x$ . The core idea is to allow information and gradients in the network to propagate directly across one or more layers, thus preserving some of the original information. This architectural innovation helps alleviate the vanishing gradient problem, enabling the training of deeper neural networks and fundamentally changing deep learning. In the Transformer paper, residual connections are used twice within each encoder block: once around the self-attention sub-layer and once around the feed-forward network (FFN) sub-layer. The decoder block adds a third around its encoder-decoder attention.
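The $F(x)+x$ pattern can be sketched with stand-in sub-layers. The lambdas below are placeholders for attention and the feed-forward network, not real implementations, and LayerNorm and dropout are left out to keep the arithmetic visible.

```python
def residual(x, sublayer):
    """Return x + sublayer(x), element by element."""
    out = sublayer(x)
    return [a + b for a, b in zip(x, out)]

def encoder_block(x, attention, feed_forward):
    """Two residual connections: one around attention, one around the FFN."""
    y = residual(x, attention)
    return residual(y, feed_forward)

# Stand-in sub-layers (assumptions for illustration only)
fake_attention = lambda v: [2.0 * a for a in v]
fake_ffn = lambda v: [a + 1.0 for a in v]

result = encoder_block([1.0, 2.0], fake_attention, fake_ffn)
unchanged = residual([1.0, 2.0], lambda v: [0.0] * len(v))
```

If a sub-layer outputs zeros, the input passes through unchanged, which is what lets information and gradients skip layers.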

0x05 Output Probabilities

After careful processing by a series of Transformer modules, the input data finally arrives at its destination—the final output layer. The output module consists of two parts: a linear layer and a softmax layer. The Harvard code encapsulates these two parts using the Generator class, converting the model’s output embedding into a prediction of the next word (a probability distribution over the entire vocabulary).

5.1 Decoder Results

After the input is processed through all decoding blocks, the final output is still a vector corresponding to several tokens, representing the various complex relationships between things woven together by high-dimensional probability vectors from the Transformer’s perspective. These relationships are essentially Yoneda embeddings of things under category theory (Yoneda embeddings represent an object by using all its relationships). The Transformer’s learning process is a process of kernel function selection and parameterization, and also a process of finding Yoneda embeddings: extracting all the relationships of an object to form its relationship graph—that is, a probabilistic model of the internal world.

Let’s first look at the Harvard code. During inference, the generator doesn’t use all the decoder’s outputs, but only the vector out[:, -1] corresponding to the last token; that is, it only uses the prediction at the last position of the output sequence. During training, the generator uses all the decoder’s outputs. Below is an example of the inference code.

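import torch

# make_model and subsequent_mask are defined elsewhere in the Harvard code.
def inference_test():
    test_model = make_model(11, 11, 2)  # source vocab 11, target vocab 11, N=2 layers
    test_model.eval()
    src = torch.LongTensor([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
    src_mask = torch.ones(1, 1, 10)

    # Encode the source once; memory is reused at every decoding step.
    memory = test_model.encode(src, src_mask)
    # Start the output sequence with a single start token (id 0).
    ys = torch.zeros(1, 1).type_as(src)

    for i in range(9):
        out = test_model.decode(
            memory, src_mask, ys, subsequent_mask(ys.size(1)).type_as(src.data)
        )
        # Only the last position is fed to the generator.
        prob = test_model.generator(out[:, -1])
        # Greedy choice: take the highest-scoring token.
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.data[0]
        # Append the new token and decode again.
        ys = torch.cat(
            [ys, torch.empty(1, 1).type_as(src.data).fill_(next_word)], dim=1
        )

    print("Example Untrained Model Prediction:", ys)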

Below is a training code example.

class SimpleLossCompute:
    "A simple loss compute and train function."

    def __init__(self, generator, criterion):
        self.generator = generator
        self.criterion = criterion

    def __call__(self, x, y, norm):
        x = self.generator(x)
        sloss = (
            self.criterion(
                x.contiguous().view(-1, x.size(-1)), y.contiguous().view(-1)
            )
            / norm
        )
        return sloss.data * norm, sloss

5.2 Conversion

The Transformer’s output is the word most likely to be placed at the end of the input sequence. However, the vector corresponding to the last token in the output is a 512-dimensional vector, which cannot be directly used for inference. Furthermore, every word in the vocabulary has the potential to be the next word, so the model needs to score the probability of all the words it knows and ultimately select the most suitable word to recommend to the user. Therefore, we ultimately need to obtain the probability of each word in the vocabulary being the next word. This requires several stages to complete the transformation from a 512-dimensional vector to 10,000 word probabilities.

In the “Embedding” section, we learned a mapping that converts a given (one-hot) word into a 512-dimensional vector before it can be processed by the Transformer. Now that the encoder-decoder process is complete, we need to reverse this mapping (invert it) to transform the output 512-dimensional vector back into the 10,000-dimensional space corresponding to the vocabulary. This can be understood as a classification task for the last token vector: classifying it into the correct token in the vocabulary, i.e., the classification of the next token. The Transformer achieves this through a linear layer, projecting from the vector dimension to the vocabulary length ( $[1,\text{embed\_size}] \to [1,\text{vocab\_size}]$ ). Each token in the vocabulary has a corresponding value, called a logit. This process is similar to a CNN where a convolutional layer is followed by a linear layer for classification.

Since the goal is prediction, a probability needs to be assigned to each token in the vocabulary based on the model’s output logits. These probabilities determine the likelihood that each token will become the next word in the sequence. Specifically, the softmax function is applied to transform the logits into a probability distribution with a sum of 1.

In the diagram below, Output Probabilities represent the probability distribution produced by the linear layer followed by Softmax. The red circles correspond to the Generator class. We then select the next word from this distribution. The simplest strategy, greedy decoding, takes the index with the largest probability and looks it up in the vocabulary to obtain the predicted next word.

In short, this final linear layer and the softmax function work together to provide the mathematical foundation and probabilistic guidance for the model to predict the next word.
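The projection and normalization can be sketched with toy sizes (a 2-dimensional feature vector and a vocabulary of 3 instead of 512 and 10,000). The weights are made-up values and the function name generator only mirrors the role of the Generator class; the Harvard version returns log-probabilities via log_softmax, which selects the same word under greedy choice.

```python
import math

def generator(feature, W, b):
    """Linear projection to vocabulary size, then softmax."""
    logits = [sum(f * W[i][j] for i, f in enumerate(feature)) + b[j]
              for j in range(len(b))]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return logits, [e / total for e in exps]

feature = [1.0, 2.0]                          # vector of the last token
W = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]]       # shape [embed_size, vocab_size]
b = [0.0, 0.0, 0.0]

logits, probs = generator(feature, W, b)
next_id = max(range(len(probs)), key=lambda k: probs[k])   # greedy choice
```

Greedy choice is only one option; sampling from probs instead gives more varied text.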

[Figure 120]

In subsequent articles, we will gradually delve into the various components of Transformer.

0x06 Explanation

To date, our understanding of Transformers has largely consisted of practical results and empirical accumulation, with a significant lack of theoretical and conceptual understanding. While some aspects intuitively work fine, it’s essential to examine the theoretical (machine learning, biology, and mathematics) explorations undertaken by researchers. These explorations attempt to fundamentally understand the mechanisms underlying neural networks and the Transformer’s internal workings, exploring why it’s a good structure. This understanding can help analyze failure cases, explore better network structures and effective training methods, and reduce model bias and hallucinations. It can also enable Transformers to better handle the length and complexity of prompts, completing complex tasks with limited computational resources, thus providing new ideas and methods for the development of NLP.

Note: The following classifications are not orthogonal and may overlap. Because the embodiment of mathematics is physics, and the ultimate goal of physics is mathematics, mathematics and physics complement each other and become indispensable means to understand neural networks and even the essence of intelligence.

6.1 Mechanistic interpretability

In recent years, mechanistic interpretability has become an important direction in AI interpretability research. This research aims to dissect AI models (especially black-box neural network models) through reverse engineering, hoping to understand the internal workings of LLMs and to locate where knowledge is stored in their parameters. Several major schools of thought on mechanistic interpretability are as follows:

Causal tracing. The main goal is to understand the information flow mechanism of LLM. The basic idea is to observe the changes in the final prediction when the parameters/vectors at a certain point in the model are changed, and to pinpoint the position most important to the final output based on the degree of change.
Circuit analysis. The main goal is to construct an input-to-output circuit using attention heads and FFNs as basic units. Its underlying assumption is that only a small subset of parameters are important for a given input/output. For example, one part of this research explores the functionality of different attention heads, while another part focuses on constructing the entire circuit.
Logit lens and neuron analysis. This approach emphasizes analysis from the perspective of individual neurons. The logit lens applies the LM head to each intermediate layer to observe the characteristics of latent embeddings. Other works project neurons into the unembedding space to gain interpretability.
SAE (sparse autoencoder). This line of work, notably pursued by Anthropic, uses SAEs to increase the interpretability of neurons.

Let’s take causal tracing and circuit analysis as examples to look at the relevant research approaches.

Causal tracing

Understanding the inner workings of a language model means identifying which elements (input elements, representations, and model components) in the forward propagation are responsible for specific predictions. We will now introduce several different approaches to identifying model behavior.

Sparse probing

The paper “Finding Neurons in a Haystack: Case Studies with Sparse Probing” proposes sparse probing, a technique designed to identify LLM neurons associated with specific features and to help understand how high-level human-interpretable features are represented in the neuron activations of such models.

The team applied k-sparse probes (classifiers restricted to k neurons) to autoregressive LLMs and examined their classification performance. They summarized their main findings as follows:

LLM neurons contain a large number of interpretable structures, and sparse probing is an effective method for locating such neurons (even in a superposition state), but it requires careful use and subsequent analysis to draw rigorous conclusions.
Many early-layer neurons are in superposition, where a feature is represented as a sparse linear combination of polysemantic neurons, each of which activates on a large number of unrelated n-grams and local patterns. Furthermore, based on insights from weight statistics and toy models, the paper concludes that the first 25% of fully connected layers use more superposition than the remaining layers.
Higher-level contextual and linguistic features (e.g., is_python_code) appear to be encoded by monosemantic neurons, primarily in the middle layers.
Representation sparsity increases with model size, but different features follow different dynamics: some features acquire dedicated neurons as the size increases; some features split into finer-grained features at larger sizes; many features remain unchanged or appear randomly.

Input Attribution

Input attribution methods are often used to localize model behavior by estimating the contribution of input elements (tokens in the case of LM) to model predictions.

The figure below illustrates three methods for calculating token contributions in attention heads. Relying solely on attention weights ignores the magnitude of the vectors they operate on. The second method mitigates this limitation by considering the norms of the value-weighted or output-value-weighted vectors ( $x'_j$ ). Finally, distance-based analysis estimates the contribution of each weighted vector by how close it is to the attention output.

[Figure 121]

Model component attribution

Direct Logit Attribution (DLA) is a technique for interpreting the output activations of model components in the vocabulary space. DLA applies the unembedding matrix to the model’s internal activations, effectively skipping further computation by downstream components. The following figure shows examples of applying DLA to the output token w: (a) DLA for the attention head; (b) DLA through the intermediate representation of the attention head; (c) DLA for the FFN block; and (d) DLA for a single neuron.

[Figure 122]

Circuit analysis

The paper “Chain of Thought Empowers Transformers to Solve Inherently Serial Problems” treats the Transformer as a circuit of a certain depth and analyzes the complexity of the problems it can solve. In circuit complexity, $TC^0$ denotes the class of problems that can be solved by constant-depth threshold circuits of polynomial size, while a sufficiently long chain of thought (CoT) can extend the expressive power of the Transformer beyond $TC^0$ .

The paper points out that, conceptually, CoT endows models with the ability to perform essentially sequential computations, a capability lacking in Transformers, especially at lower depths. Parallel processing can increase padding information, potentially influencing the sampling probability distribution in terms of width, and thus affecting the final inference performance. However, simple parallel inference can lead to models failing to provide depth information. Sequential processing, on the other hand, introduces intermediate information, deepening the LLM’s traversal of category objects and morphisms, gradually adjusting the sampling probability distribution, and achieving more accurate inference. CoT improves the expressive power of low-depth Transformers in addressing inherently sequential problems, allowing Transformers to avoid simple parallel inference and reason step-by-step sequentially.

The paper further demonstrates that, through T-step CoT, a fixed-depth Transformer with fixed bit precision and $O(\log n)$ embedding size can solve any problem that can be solved by a Boolean circuit of size T.

[Figure 123]

The above diagram shows the relationship between classes with different embedding sizes d(n) and CoT lengths T(n).

6.2 From a Machine Learning Perspective

Forward propagation angle

Some methods focus on the mapping between hidden states and weights in forward propagation, attempting to interpret the internal workings of language models by visualizing the weights and hidden states. Here, we mainly introduce the logit lens. The role of the logit lens is to demonstrate the model’s performance during the generation process by converting the hidden states of the LLM into word probabilities. This projection helps to understand how the LLM gradually constructs the output pattern during the generation process.

The idea and principle of Logit Lens are:

Language models assign features to the input layer by layer. So, can we observe how these features change layer by layer?
Since decoding a new token involves transforming the final hidden states once using a linear layer and then converting them into a probability distribution over the vocabulary using softmax, we can apply the LM head early, to each intermediate layer. By passing the activation values through the final normalization layer of the transformer and then multiplying them with the output embedding matrix, we can convert the activation values into the logit of each word in the vocabulary.
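The idea can be sketched on toy data. The hidden states, the unembedding matrix, and the parameter-free layer_norm below are all assumptions for illustration; a real logit lens reads the hidden states out of a trained model and reuses its final normalization layer and output embedding.

```python
import math

def layer_norm(x, eps=1e-6):
    """Stand-in for the model's final normalization layer (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def logit_lens(hidden_states, unembed):
    """For each layer, project the hidden state to vocabulary logits and keep the top token."""
    top_tokens = []
    for h in hidden_states:
        normed = layer_norm(h)
        logits = [sum(a * b for a, b in zip(normed, col)) for col in zip(*unembed)]
        top_tokens.append(max(range(len(logits)), key=lambda k: logits[k]))
    return top_tokens

# Hidden state of the last token after layers 1, 2, 3 (toy values)
hidden_states = [[2.0, 0.1], [1.0, 1.2], [0.1, 3.0]]
# Unembedding matrix of shape [d_model, vocab_size] (toy values)
unembed = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]

top_per_layer = logit_lens(hidden_states, unembed)
```

In this toy run the top token changes after the first layer and then stays fixed, the kind of layer-by-layer trajectory the logit lens is meant to reveal.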

Back propagation angle

The paper “Backward Lens: Projecting Language Model Gradients into the Vocabulary Space” uses the backpropagation matrix to understand the operating mechanism of Transformer.

Motivation

Backpropagation is the process of applying the chain rule to calculate derivatives and update the weights of a deep learning network model. This process begins with the model performing forward propagation to generate predictions $\hat{y}$ . The prediction is then compared to the desired target, and the difference is quantified using a loss function. Following this, the model begins backpropagation, calculating gradients layer by layer. The weights are then updated using the gradient computed at each layer. This mechanism not only enables the model to learn new information but also provides researchers with opportunities to interpret model behavior. Currently, discussions on how backpropagation gradients affect model learning and knowledge storage remain relatively scarce.

The motivation behind this research is to extend existing interpretability methods, particularly by applying them to the backpropagation process of language models, such as how to effectively incorporate gradient information into knowledge updates and model editing. By analyzing the gradient matrix during backpropagation, researchers can gain a more comprehensive understanding of the flow of information within the model.

Furthermore, this paper proposes a novel approach: by mapping the gradient matrix to the vocabulary space, it reveals the intrinsic mechanism by which language models learn new knowledge. Through this method, the researchers hope to gain a clear understanding of how the model stores and remembers information at multiple levels.

Approach

Apply Logit Lens to the gradient matrix.

In the analysis, the researchers focused on the MLP layer, a crucial area for identifying and editing stored knowledge. The MLP module consists of two closely connected matrices ( $FF_1$ and $FF_2$ ).

Specifically, $FF_1$ maps input from $\mathbb{R}^d$ to $\mathbb{R}^{d_m}$ , and $FF_2$ maps it back to $\mathbb{R}^d$ . Because gradient matrices are high-dimensional and difficult to analyze comprehensively, researchers convert the outer-product form of each gradient matrix into a set of smaller vectors. Each matrix formed by $x_i^\top \cdot \delta_i$ can be interpreted from two perspectives simultaneously: as the span (linear combination) of $x_i$ , and as the span of $\delta_i$ . Researchers used this duality to analyze gradients by focusing on linear combinations of n vectors. They further pointed out that gradients of $FF_1$ use $x_i$ as the span set, while gradients of $FF_2$ use $\delta_i$ as the span set.

124

The image above illustrates how the gradient matrix is computed as the outer product $x^\top \cdot \delta$ . Each row of the matrix is the same vector $\delta$ scaled by one entry of $x$ , and each column is $x$ scaled by one entry of $\delta$ . The top of the image describes the matrix as the span of $\delta$ , and the bottom describes it as the span of $x$ . The vectors are drawn transposed to emphasize the span effect.

125

The figure above illustrates the forward and backward passes through one MLP layer when the model is edited with the prompt “Lionel Messi plays for” so that it answers “Paris,” and shows how the gradients affect the model update. Gradients are drawn in green and weights in blue. The gradient of the first MLP matrix, $FF_1$ , tries to imprint onto the weights the information that $FF_1$ encountered during the forward pass. Using vocabulary projection, the authors found that this information corresponds to the token “team.” The gradient of $FF_2$ shifts the information encoded in $FF_2$ toward the embedding of the new target.

Mechanisms for knowledge storage and model editing

Next, we examine how gradients from backpropagation update MLP weights. The paper proposes a two-stage “imprint and shift” mechanism. This mechanism stores information in MLP layers by combining forward-pass input and target embedding, represented by:

$$\frac{\partial L}{\partial W} = x_i^\top \cdot \delta_i$$

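Here, $x_i$ is the forward-pass input to the layer and $\delta_i$ is the corresponding vector-Jacobian product (VJP), that is, the gradient of the loss with respect to the layer's output. During an update, two phases occur: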

Imprint phase: stamps an imprint of the forward-pass input onto the MLP layer, because $x_i$ is added to or subtracted from the neurons of $FF_1$ .
Shift phase: shifts the output of $FF_2$ by subtracting $\delta_i$ from its neurons. This moves the output toward the embedding of the new target, which is equivalent to promoting tokens that previously had low probability.
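The second phase can be illustrated with a toy calculation. The sketch below is not the paper's code: it assumes a single linear layer standing in for $FF_2$ , a toy unembedding matrix `E` applied directly to the layer output, a cross-entropy loss, and one plain gradient-descent step with a hypothetical learning rate `lr`.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d_m, d, vocab = 16, 8, 5
target = 3                             # index of the new target token (arbitrary)
lr = 0.001                             # hypothetical learning rate

x = rng.normal(size=d_m)               # forward input to the layer
W = rng.normal(size=(d_m, d)) * 0.1    # stand-in for FF_2
E = rng.normal(size=(vocab, d))        # toy unembedding matrix, one row per token

def forward(weights):
    y = x @ weights                    # layer output
    return y, softmax(E @ y)           # next-token probabilities

y_old, p_old = forward(W)

# VJP: delta = dL/dy for the cross-entropy loss L = -log p[target].
onehot = np.eye(vocab)[target]
delta = (p_old - onehot) @ E

# The gradient of the loss with respect to W is the outer product x^T delta.
grad_W = np.outer(x, delta)
W_new = W - lr * grad_W                # one gradient-descent step

y_new, p_new = forward(W_new)

# For the same input, the output moves by -lr * ||x||^2 * delta, that is, toward
# E[target] and away from the probability-weighted average embedding.
shift = y_new - y_old
print(p_old[target], p_new[target])
```

Because `-delta` equals `E[target]` minus the probability-weighted average of all rows of `E`, subtracting the gradient pushes the layer output toward the target embedding, which raises the target token's probability.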

126

The image above illustrates the imprint and shift phases of backpropagation. “grad” denotes a single neuron of the gradient matrix. The $FF_1$ gradients match the forward-pass input patterns, while the $FF_2$ gradients align with the embedding of the new target.

6.3 Biological perspective

With the emergence and rapid development of neural networks, researchers have continuously sought biological interpretations, and biological advances continue to inspire AI model design.

Astrocytes

The paper “Building transformers from neurons and astrocytes” points out that biological networks composed of neurons and astrocytes can perform Transformer-like core computations. It explores astrocytes computationally and proposes a mathematical model showing how to build a biologically plausible Transformer using neuron-astrocyte interactions.

Motivation

Transformers compare all words in a sentence to produce predictions via self-attention. For this to work, the model needs memory-like storage of words, which seems biologically hard under standard neuron communication.

However, studies on Dense Associative Memory suggest that such mechanisms can emerge when interactions involve at least three neurons at once. Astrocytes can form tripartite synapses with neurons and produce slower calcium-based signaling, which allows them to store and integrate neural information as a memory buffer.

Approach

The authors built a mathematical neuron-astrocyte network and showed how combining astrocytes with neurons can produce a biologically plausible Transformer-like system.

127

In the figure above, part A gives an overview of the neuron-astrocyte network. A Transformer block is approximated by a feedforward network with an astrocyte unit covering synapses (matrix H) between hidden and final layers. Part B shows Hebbian updates and presynaptic plasticity updates during writing, and astrocyte modulation of synaptic weights H during reading.

Hippocampus

The paper “RELATING TRANSFORMERS TO MODELS AND NEURAL REPRESENTATIONS OF THE HIPPOCAMPAL FORMATION” analyzes the Transformer from the perspective of hippocampal memory.

Although Transformers were developed without direct biological constraints, mathematically their architecture resembles hippocampal models, especially grid and place cell systems. Transformer-based LLMs may therefore partly mimic hippocampal-entorhinal information processing.

128

The figure above shows how Transformer replicates patterns observed in the hippocampus.

The TEM (Tolman-Eichenbaum Machine) model is a neuroscience sequence learner that captures many known neural phenomena in hippocampus and entorhinal cortex (MEC/LEC). The figure below compares TEM and Transformer.

129

6.4 Mathematical perspective

ODE perspective

Examining and interpreting neural networks using the concept of differential equations is a new research direction that has emerged in recent years. Deep neural networks (DNNs) share a common characteristic: input data is processed layer by layer in sequence, forming a time-discrete dynamic system. Based on this, researchers hypothesize that certain types of neural networks can be viewed as discrete differential equations, so readily available differential equation solvers can be used for computation, hoping to obtain better and more interpretable results.

Neural ordinary differential equations

The paper “Neural Ordinary Differential Equations” proposes a model called neural ordinary differential equations, which is a new class of deep neural networks. Neural ordinary differential equations are not limited to modifying existing architectures; they consider how to model data in a continuous manner using neural networks from a completely different perspective.

Derivation

Currently, commonly used neural networks, such as residual networks, form a hidden state by stacking a series of transformations or residual blocks to establish complex transitions. As the number of network layers increases and each step of the network inference computation becomes sufficiently small, i.e., approaching the limit, we can parameterize the continuous dynamics of the hidden layer neurons to obtain the corresponding Neural ODE. The specific derivation is shown in the figure below. We can define the output layer as the solution to the initial value problem of the ordinary differential equation (ODE) at a certain time, and this value can be calculated by an ODE solver.

130

Advantages

Regardless of their structure, neural networks essentially fit a complex nonlinear composite function, where the number of compositions equals the number of layers. To train the network, we first need the gradients of its parameters, which are obtained with the chain rule. This requires retaining the activations of every layer during forward propagation and reusing them during backward propagation. The memory cost (typically GPU memory) grows with depth, which makes very deep networks difficult to train.

A Neural ODE could compute its gradients by backpropagating directly through the operations of the ODE solver, but as the effective depth increases, memory use grows and numerical errors accumulate. The paper therefore uses the adjoint state method (also called the adjoint sensitivity method), which turns the gradient computation itself into the solution of another ODE. In this formulation the network parameterizes the derivative of the hidden state, so the hidden state is no longer a discrete sequence of layer outputs but the trajectory of a continuous vector field. The backward pass then no longer needs every intermediate value from the forward pass, which greatly reduces the storage required for intermediate results.

In summary, the Neural ODE framework can use the adjoint state method to perform network learning without storing activation values, thus significantly reducing the large amount of memory used in the original backpropagation. It also provides a theoretical framework to study deep learning models from the continuous perspective of ODE.
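To make the memory argument concrete, here is a minimal pure-Python sketch of the adjoint idea under strong simplifying assumptions: a scalar ODE $dh/dt = \theta h$ , the loss $L = h(T)$ , and a fixed-step Euler integrator instead of an adaptive solver. The function name `adjoint_gradient` and its parameters are illustrative, not from the paper. The backward loop integrates the state, the adjoint $a(t) = \partial L / \partial h(t)$ , and the parameter gradient together from $T$ back to 0, so no forward activations are stored.

```python
import math

def adjoint_gradient(theta, h0, T=1.0, steps=10_000):
    """Return (loss, dL/dtheta) for dh/dt = theta * h with loss L = h(T)."""
    dt = T / steps

    # Forward pass: keep only the final state, not the trajectory.
    h = h0
    for _ in range(steps):
        h += dt * theta * h
    loss = h

    # Backward pass: integrate from t = T down to t = 0.
    a = 1.0      # adjoint a(T) = dL/dh(T) = 1 because L = h(T)
    grad = 0.0   # accumulates the integral of a(t) * df/dtheta, where df/dtheta = h
    for _ in range(steps):
        grad += dt * a * h
        a_prev = a + dt * a * theta   # da/dt = -a * df/dh, stepped backward in time
        h_prev = h - dt * theta * h   # recompute the state by running the ODE in reverse
        a, h = a_prev, h_prev
    return loss, grad

theta, h0 = 0.5, 2.0
loss, grad = adjoint_gradient(theta, h0)
exact = h0 * math.exp(theta)   # analytic dL/dtheta = T * h0 * exp(theta * T) with T = 1
print(loss, grad, exact)
```

The memory used by the backward loop is constant in the number of steps, at the price of re-solving the state equation backward in time.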

Example

The core operation of a Neural ODE is to parameterize the derivative of the hidden state, which yields a differential equation tied directly to the hidden layers; the intermediate states are then obtained by solving that equation. For example, the paper points out a close relationship between residual networks and ordinary differential equations (ODEs). Specifically, a residual network can be viewed as an Euler discretization of an ODE: each residual block $h_{t+1} = h_t + f(h_t, \theta_t)$ is one Euler step. In other words, ResNets can be considered a discretization of a particular family of Neural ODEs, so readily available differential equation solvers can be used for the computation.
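The correspondence can be seen in a few lines. The sketch below is illustrative rather than taken from the paper: it uses a fixed linear vector field `f(h) = A @ h` (a rotation generator) in place of a learned residual function, so that the exact ODE solution is known, and treats each residual block as one Euler step of size `T / depth`.

```python
import numpy as np

A = np.array([[0.0, -1.0],
              [1.0,  0.0]])          # dh/dt = A h rotates h counter-clockwise

def f(h):
    """Stand-in for a residual function; a real ResNet would learn this."""
    return A @ h

def resnet_forward(h0, depth, T=1.0):
    """Stack `depth` residual blocks h <- h + step * f(h), i.e. Euler steps."""
    h = h0.copy()
    step = T / depth
    for _ in range(depth):
        h = h + step * f(h)
    return h

h0 = np.array([1.0, 0.0])
exact = np.array([np.cos(1.0), np.sin(1.0)])   # exact ODE solution at T = 1

for depth in (4, 16, 64, 256):
    err = np.linalg.norm(resnet_forward(h0, depth) - exact)
    print(depth, err)
```

As the number of blocks grows and each step shrinks, the stacked residual updates approach the continuous trajectory, which is the limit that a Neural ODE models directly.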

131

Do Residual Neural Networks discretize Neural Ordinary Differential Equations?

The paper “Do Residual Neural Networks discretize Neural Ordinary Differential Equations?” further explores the connection between ResNets and Neural ODEs.

The paper first quantifies the distance between the hidden-state trajectory of a ResNet and the solution of its corresponding Neural ODE. It then shows that gradient descent can preserve the smoothness of the residual functions across depth, which implicitly regularizes the ResNet toward a limiting Neural ODE at a certain rate that holds uniformly in depth and optimization time. Building on this, the paper considers a memory-free adjoint method for training ResNets and shows that the method theoretically succeeds at large depth if the residual functions are Lipschitz with respect to the input. Finally, the paper uses the adjoint method to fine-tune very deep ResNets without memory consumption in the residual layers.

Multi-particle dynamic system perspective

The paper “Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View” proposes a new understanding of the Transformer architecture from the perspective of Multi-Particle Dynamic Systems (MPDS), mathematically interpreting it as a numerical ordinary differential equation (ODE) solver. In fact, this is another perspective on ODE.

Multi-particle dynamic systems are a common type of dynamic system in physics. In fluid mechanics, they are typically modeled with ordinary differential equations (ODEs) known as convection-diffusion equations. In such an equation, the time derivative of a particle’s position is the sum of two functions: the first, F, is a function of all N particles and represents the interaction among them (often called diffusion); the second, G, acts on each particle individually (often called convection). This aligns naturally with the Multi-Head Self-Attention (MHA) and Feed-Forward Network (FFN) sublayers of the Transformer architecture. MHA considers the semantics and dependencies of the different words in a sentence and uses this information to capture the sentence’s inherent structure and representation. FFN, on the other hand, applies the same position-wise transformation to the word at every position independently, encoding each position through a higher-dimensional hidden representation. FFN therefore plays the role of convection, while multi-head self-attention plays the role of diffusion, and the multi-particle convection-diffusion equation can be used to interpret the Transformer architecture. Let’s examine the derivation.

In fluid mechanics, the motion of a single particle is simulated using two parts (corresponding to number 1 in the diagram below):

One is the motion of the particle itself, often referred to as convection.
The other is the effect of other particles on it, often referred to as diffusion.

In fluid mechanics, a multi-particle dynamic system is typically simulated using the equations indicated by symbol 2 in the figure below.

Solving the above ordinary differential equation with the Lie-Trotter splitting scheme yields equation number 3.

The multi-head self-attention mechanism can be represented by formula number 4, which can be transformed into formula number 6, corresponding to the function F in the multi-particle dynamics system above.

FFN can be represented by formula number 5, corresponding to the function G in multi-particle dynamic systems.

Combining formulas 5 and 6 gives formula 7, which describes one layer of the Transformer.

The processing flow of the Transformer can therefore be viewed as follows: tokens start at initial positions in a high-dimensional space and, after evolving for some time, arrive at new positions in that space.

132

The authors also point out that the Lie-Trotter splitting scheme is a dated numerical method that is only a first-order approximation. In practice, Strang splitting is more commonly used to solve such ordinary differential equations: it is a second-order approximation and gives a more accurate numerical solution. The authors therefore propose mapping the Strang-splitting solution of the ODE back onto a neural network. The resulting network is called the Macaron structure, as shown in the figure below.
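The difference in accuracy between the two splitting schemes can be checked on a toy problem. This sketch is not from the paper: it assumes a scalar ODE $dx/dt = F(x) + G(x)$ with $F(x) = -x$ and $G(x) = -x^2$ , chosen only because each sub-flow and the full solution have closed forms. Halving the step size should roughly halve the Lie-Trotter error (first order) and quarter the Strang error (second order).

```python
import math

def flow_F(x, h):
    """Exact flow of dx/dt = -x for time h."""
    return x * math.exp(-h)

def flow_G(x, h):
    """Exact flow of dx/dt = -x**2 for time h (valid for x >= 0)."""
    return x / (1.0 + x * h)

def exact(x0, t):
    """Exact solution of the combined ODE dx/dt = -x - x**2."""
    return 1.0 / ((1.0 / x0 + 1.0) * math.exp(t) - 1.0)

def lie_trotter(x, h):
    # Full F step followed by a full G step.
    return flow_G(flow_F(x, h), h)

def strang(x, h):
    # Half G step, full F step, half G step.
    return flow_G(flow_F(flow_G(x, h / 2), h), h / 2)

def global_error(step, n_steps, x0=1.0, T=1.0):
    x, h = x0, T / n_steps
    for _ in range(n_steps):
        x = step(x, h)
    return abs(x - exact(x0, T))

for name, step in (("Lie-Trotter", lie_trotter), ("Strang", strang)):
    e1, e2 = global_error(step, 50), global_error(step, 100)
    print(name, e1, e2, e1 / e2)
```

The symmetric half-full-half pattern in `strang` mirrors the Macaron layout, in which a half-step FFN sits on either side of the attention sublayer.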

133

This approach is purely mathematical: it views a deep neural network as a numerical solver for an ordinary differential equation, replaces that solver with a more accurate one, and maps the better solver back to a better network structure. The paper reports that the resulting Transformer variant outperforms the standard architecture. Unlike architectures found in industry through trial-and-error experiments, the Macaron structure was designed by mathematical reasoning.

Flow map perspective

The paper “A Mathematical Perspective on Transformers” attempts to provide a general and accessible framework for studying Transformers mathematically. A DNN can be viewed as a flow map from $\mathbb{R}^d$ to $\mathbb{R}^d$ , while a Transformer can be viewed as a flow map on $\mathcal{P}(\mathbb{R}^d)$ , the space of probability measures on $\mathbb{R}^d$ . To realize this flow map between measures, the Transformer implements a mean-field interacting particle system.

The paper’s model focuses only on two key components of the Transformer architecture: self-attention mechanism and layer normalization.

Layer normalization effectively confines the particles to the unit sphere $S^{d-1}$ .
The self-attention mechanism achieves nonlinear coupling between particles through an empirical measure. In other words, the self-attention mechanism is a nonlinear coupling mechanism in an interacting particle system.

The complete Transformer is represented as shown in the figure below.

134

The paper also introduces a simpler, more tractable surrogate model for self-attention dynamics: a Wasserstein gradient flow of an energy functional. Finding the optimal configurations of points on a sphere for such energies is a problem for which well-established methods already exist.

Differential perspective

Transformer architectures tend to allocate too much attention to irrelevant context. As the context grows, the many small attention weights assigned to irrelevant tokens can add up to more than the attention given to the few relevant tokens and drown them out. With longer inputs, the classic Transformer may therefore find it increasingly difficult to pick out key information.
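A back-of-the-envelope calculation shows the effect. The sketch below assumes an idealized case that is not from the paper: one relevant token whose attention logit exceeds that of every irrelevant token by a fixed `margin`, with all irrelevant tokens sharing the same logit of zero.

```python
import math

def relevant_attention(context_len, margin=5.0):
    """Softmax weight on the single relevant token among `context_len` tokens."""
    relevant = math.exp(margin)
    noise = (context_len - 1) * math.exp(0.0)   # total unnormalized weight of irrelevant tokens
    return relevant / (relevant + noise)

for n in (16, 128, 1024, 8192):
    w = relevant_attention(n)
    print(f"context={n:5d}  relevant={w:.3f}  noise={1 - w:.3f}")
```

Even though each irrelevant token receives a tiny weight, their total grows linearly with context length, so the share left for the relevant token keeps shrinking.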

The authors of the paper “Differential Transformer” refer to the attention assigned to irrelevant context as attention noise. To address it, they propose the DIFF Transformer. Specifically, “differential attention” computes the attention scores as the difference between two separate softmax attention maps, so that noise common to both is cancelled by the subtraction. This amplifies attention to the answer span while suppressing noise, prompting the model to focus on the key information in the context and improving its ability to model context.

Overall Architecture

For ease of explanation, the paper uses a decoder-only model as an example to illustrate the architecture. The overall architecture of the model is consistent with the traditional Transformer layout, consisting of L stacked DIFF Transformer layers, each layer formed by a differential attention module and a feedforward network module. Given an input sequence x, the model embeds and packages the input into $X_0$ , which is then further processed layer by layer to finally obtain the output $X_L$ .

Compared to the Transformer, the main difference of the Differential Transformer is that it uses differential attention instead of traditional softmax attention, while maintaining the overall macroscopic layout unchanged. Furthermore, the paper also references LLaMA and adopts two improvements: pre-RMSNorm and SwiGLU. Here, $W_G$ , $W_1$ , and $W_2$ are learnable matrices.

135

Differential Attention

“Differential attention” refers to eliminating attention noise by taking the difference between two softmax attention maps. The idea is similar to the differential amplifier in electrical engineering, which uses the difference between two signals to cancel common-mode noise at the input. Noise-canceling headphones are based on a similar idea.

The differential attention mechanism is as follows:

Given input X, first project it into queries, keys, and values $Q_1$ , $Q_2$ , $K_1$ , $K_2$ , and $V$ (corresponds to number 1 in the diagram below).
The query and key vectors are divided into two groups, and a separate softmax attention map is computed for each group. The difference between the two maps, with the second scaled by $\lambda$ , is then used as the final attention score (corresponds to number 2 in the diagram below).

The trade-off between the two attention maps is controlled by the learnable scalar $\lambda$ , which lets the model adapt to different inputs and task requirements.
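Written out, the two steps compute $(\mathrm{softmax}(Q_1 K_1^\top / \sqrt{d}) - \lambda\,\mathrm{softmax}(Q_2 K_2^\top / \sqrt{d}))\,V$ . The NumPy sketch below follows that formula for a single head with no causal mask. The random projection matrices, the toy sizes, and the fixed value `lam=0.8` are assumptions for illustration; in the paper $\lambda$ is learned.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(X, Wq, Wk, Wv, lam):
    """Single-head differential attention without masking."""
    d = Wq.shape[1] // 2
    Q1, Q2 = np.split(X @ Wq, 2, axis=-1)     # two query groups, each (N, d)
    K1, K2 = np.split(X @ Wk, 2, axis=-1)     # two key groups, each (N, d)
    V = X @ Wv                                 # values, (N, 2d)
    A1 = softmax(Q1 @ K1.T / np.sqrt(d))       # first attention map
    A2 = softmax(Q2 @ K2.T / np.sqrt(d))       # second attention map
    scores = A1 - lam * A2                     # differential attention scores
    return scores @ V, scores

rng = np.random.default_rng(0)
N, d_model, d = 6, 16, 4                       # toy sequence length and widths
X = rng.normal(size=(N, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, 2 * d)) for _ in range(3))

out, scores = diff_attention(X, Wq, Wk, Wv, lam=0.8)
print(out.shape, scores.sum(axis=-1))
```

Unlike ordinary softmax attention, each row of `scores` sums to `1 - lam` rather than 1, and individual entries can be negative.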

136

Multi-head differential attention

Multi-head attention mechanisms can also be used in the DIFF Transformer. Let $h$ represent the number of attention heads. This method uses different projection matrices $W_i^Q$ , $W_i^K$ , and $W_i^V$ , where $i \in [1,h]$ . The scalar $\lambda$ is shared among heads within the same layer. The final result is then obtained by concatenating the outputs of each head and projecting them.

The figure below illustrates the multi-head differential attention mechanism together with its pseudocode. $\mathrm{GroupNorm}(\cdot)$ is used to emphasize that $\mathrm{LN}(\cdot)$ is applied independently to each head. Since differential attention tends to have a sparser pattern, the statistics of different heads are more diverse. To improve the gradient statistics, the $\mathrm{LN}(\cdot)$ operator normalizes each head before the concatenation operation.

137

The implementation is reproduced below. Source: https://github.com/microsoft/unilm/tree/master/Diff-Transformer

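import torch
import torch.nn.functional as F
from torch import nn

# NOTE: lambda_init_fn, RMSNorm, apply_rotary_emb and repeat_kv are helpers that
# are defined or imported elsewhere in the repository linked above; they are
# not shown here.


class MultiheadDiffAttn(nn.Module):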
    def __init__(
        self,
        args,
        embed_dim,
        depth,
        num_heads,
    ):
        super().__init__()
        self.args = args
        self.embed_dim = embed_dim

        # arg num_heads set to half of Transformer's num_heads
        self.num_heads = num_heads

        # arg decoder_kv_attention_heads set to half of Transformer's num_kv_heads if use GQA
        # set to same as num_heads if use normal MHA
        self.num_kv_heads = args.decoder_kv_attention_heads if args.decoder_kv_attention_heads is not None else num_heads
        self.n_rep = self.num_heads // self.num_kv_heads

        self.head_dim = embed_dim // num_heads // 2
        self.scaling = self.head_dim ** -0.5

        self.q_proj = nn.Linear(embed_dim, embed_dim, bias=False)
        self.k_proj = nn.Linear(embed_dim, embed_dim // self.n_rep, bias=False)
        self.v_proj = nn.Linear(embed_dim, embed_dim // self.n_rep, bias=False)
        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=False)

        self.lambda_init = lambda_init_fn(depth)
        self.lambda_q1 = nn.Parameter(torch.zeros(self.head_dim, dtype=torch.float32).normal_(mean=0,std=0.1))
        self.lambda_k1 = nn.Parameter(torch.zeros(self.head_dim, dtype=torch.float32).normal_(mean=0,std=0.1))
        self.lambda_q2 = nn.Parameter(torch.zeros(self.head_dim, dtype=torch.float32).normal_(mean=0,std=0.1))
        self.lambda_k2 = nn.Parameter(torch.zeros(self.head_dim, dtype=torch.float32).normal_(mean=0,std=0.1))

        self.subln = RMSNorm(2 * self.head_dim, eps=1e-5, elementwise_affine=True)

    def forward(
        self,
        x,
        rel_pos,
        attn_mask=None,
    ):
        bsz, tgt_len, embed_dim = x.size()
        src_len = tgt_len

        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)

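        # q and k get twice as many heads (two softmax maps per differential head),
        # each of size head_dim; v keeps one head of size 2 * head_dim per KV head.
        q = q.view(bsz, tgt_len, 2 * self.num_heads, self.head_dim)
        k = k.view(bsz, src_len, 2 * self.num_kv_heads, self.head_dim)
        v = v.view(bsz, src_len, self.num_kv_heads, 2 * self.head_dim)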

        q = apply_rotary_emb(q, *rel_pos, interleaved=True)
        k = apply_rotary_emb(k, *rel_pos, interleaved=True)

        offset = src_len - tgt_len
        q = q.transpose(1, 2)
        k = repeat_kv(k.transpose(1, 2), self.n_rep)
        v = repeat_kv(v.transpose(1, 2), self.n_rep)
        q *= self.scaling
        attn_weights = torch.matmul(q, k.transpose(-1, -2))
        if attn_mask is None:
            attn_mask = torch.triu(
                torch.zeros([tgt_len, src_len])
                .float()
                .fill_(float("-inf"))
                .type_as(attn_weights),
                1 + offset,
            )
        attn_weights = torch.nan_to_num(attn_weights)
        attn_weights += attn_mask
        attn_weights = F.softmax(attn_weights, dim=-1, dtype=torch.float32).type_as(
            attn_weights
        )

        # Re-parameterized scalar: lambda = exp(lq1 . lk1) - exp(lq2 . lk2) + lambda_init
        lambda_1 = torch.exp(torch.sum(self.lambda_q1 * self.lambda_k1, dim=-1).float()).type_as(q)
        lambda_2 = torch.exp(torch.sum(self.lambda_q2 * self.lambda_k2, dim=-1).float()).type_as(q)
        lambda_full = lambda_1 - lambda_2 + self.lambda_init
        # Pair up the two softmax maps of each head and take their weighted difference.
        attn_weights = attn_weights.view(bsz, self.num_heads, 2, tgt_len, src_len)
        attn_weights = attn_weights[:, :, 0] - lambda_full * attn_weights[:, :, 1]

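        attn = torch.matmul(attn_weights, v)
        # Normalize each head independently (the GroupNorm of the paper), then
        # rescale by the fixed factor (1 - lambda_init).
        attn = self.subln(attn)
        attn = attn * (1 - self.lambda_init)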
        attn = attn.transpose(1, 2).reshape(bsz, tgt_len, self.num_heads * 2 * self.head_dim)

        attn = self.out_proj(attn)
        return attn

Turing completeness

The paper “ASK, AND IT SHALL BE GIVEN: TURING COMPLETENESS OF PROMPTING” presents what it describes as the first theoretical proof that prompting a Large Language Model (LLM) is Turing complete.

Turing completeness is a core concept in the theory of computation, used to describe the computational power of a system. Informally, a system is Turing complete if it supports conditional branching and looping or recursion and has access to theoretically unbounded storage. Given sufficient time and resources, such a system can simulate any Turing machine and therefore carry out any computable task. Because each Turing-complete system can simulate every other one, all of them are equivalent in computational power. The Turing machine is regarded as the ultimate abstraction of all computable processes, and in traditional computation theory it serves as the yardstick for the computational power of other systems.

Saying that LLM prompting is Turing complete means that an LLM can be viewed as a general-purpose computer. With a well-designed prompt, a fixed-size Transformer can theoretically compute any computable function and accomplish any programmable task. More importantly, this fixed-size model comes close to the theoretical complexity bounds achieved by the class of all Transformers of unbounded size. This provides a new perspective on using LLMs to solve complex problems and lays a theoretical foundation for prompt engineering.

The main theoretical and technical contributions of this paper are:

Expressive Power: Demonstrates that prompting is Turing complete. The researchers prove that there exists a fixed-size Transformer $\Gamma$ such that, for any computable function $\phi$ , there is a corresponding finite prompt $\pi_\phi$ with which $\Gamma$ computes $\phi(x)$ for every input $x$ . Importantly, the constructed Transformer $\Gamma$ is independent of the specific function $\phi$ , the prompt $\pi_\phi$ is independent of the input $x$ , and the input $x$ can be of arbitrary length.
Chain-of-Thought (CoT) Complexity: The research shows that, for an input of length $n$ , the constructed Transformer $\Gamma$ can compute any $\mathrm{TIME}_2(t(n))$ function within $O(t(n))$ CoT steps and any $\mathrm{TIME}(t(n))$ function within $O(t(n)\log t(n))$ CoT steps. Notably, a single Transformer thus achieves almost the same CoT complexity as the class of all Transformers.
Precision Complexity: The research also demonstrates that the constructed Transformer $\Gamma$ can compute any $\mathrm{TIME}(t(n))$ function using $O(\log(n+t(n)))$ bits of precision. This means that even a single Transformer can achieve the same precision complexity as the class of all Transformers.

The paper reveals the potential of prompting: with proper design, a Transformer can in principle be made to perform any computable task. For prompt engineers, a prompt is no longer just text handed to a model; it can be viewed as a program in a programming language that expresses complex logic and operations through appropriate syntax and structure. When designing prompts, we can therefore consider not only how to make the model understand the task but also, from the standpoint of computation theory, how to make the prompt carry out the computation efficiently. A sufficiently well-designed prompt can simulate an arbitrary computation, which gives prompt engineering a firmer scientific foundation.

Category Theory

In June 2024, Symbolica’s chief scientist, Paul, published a paper aiming to unify the description and research of deep learning architectures through category theory.

Category theory is a branch of mathematics that studies mathematical structures and the relationships between them. It focuses on objects and the morphisms (mappings) between them, together with the rules for composing those morphisms. This gives mathematicians a unified language for describing and comparing different structures, and so for discovering connections and shared patterns across fields. By abstracting a problem into the language of category theory, we can often reduce it to a more general form in which analogies and reasoning become easier. Fundamental concepts such as objects, morphisms, homomorphisms, and natural transformations help build bridges between different areas of mathematics. Category theory is therefore an effective tool for drawing analogies within mathematics, and it has also found wide application in other fields, such as computer science, physics, and philosophy.

If we consider deep learning models as categories, then the layers of a model can be viewed as objects within those categories, and the data flow and transformations between layers can be viewed as morphisms. In deep learning, monads can be used to describe the constraints that the model must satisfy, such as symmetry or equivariance, while algebras can be used to describe the model’s parameters and forward propagation. Monad algebra homomorphisms can be used to describe transformations between model layers, such as from the output of one layer to the input of another. Constructing and analyzing deep learning models with category theory in this way can help make models more trustworthy. For example:

Category theory provides a clear way to describe the components of a model and the interactions between them, which helps in understanding how the model works.
The constraints that the model must satisfy, such as equivariance and symmetry, are defined by monads to ensure that the model behaves as expected.
The properties of a model can be formally verified to ensure that they meet specific mathematical and logical rules.

From a categorical perspective, a transformer is “to find the piecewise linear function of each layer through pre-training” and then parameterize it.

The reason large language models can answer questions so well is partly because their training data contains information from various categories. By learning these categories, the model can make analogies and inferences when answering questions. Large language models are typically trained using massive corpora containing rich language and knowledge. These corpora cover a wide range of topics, domains, and concepts, allowing the model to learn a vast amount of categories and related information. When the model receives a question, it can try to find similar analogies from the categories it has learned, then map the question to similar questions, and thus provide an answer. This process of analogy and inference is achieved through the hierarchical structure and weight parameters of the neural network within the model.

6.5 From a physics perspective

John Hopfield, the physicist and Boltzmann Medal laureate, once said in an interview, “If you can’t describe the brain in mathematical terms, you’ll never know how it works,” and, speaking of his own habits, “If a problem has nothing to do with the physics I know, I won’t make any progress.” As artificial intelligence reshapes all aspects of human society, it is therefore worth understanding how ideas from physics shape our view of neural networks, and even of ourselves.

Data acts as an initialization mechanism, driving continuous updates to the network’s connection weights until an intelligent, adaptive physical model emerges. This update process is the end-to-end optimization of an objective function, which is essentially Langevin dynamics in a high-dimensional space. The mystery of neural networks lies in their high-dimensional weight space, which follows a canonical ensemble distribution. Semi-rigorous physical analysis reveals the distribution of the weight space and the data-driven symmetry breaking of the weights. From a purely physical perspective, one can obtain a full picture of the steady-state dynamics of non-equilibrium neural networks and of their hidden dynamical phase transitions. One can even map examples from large language models onto a two-body spin model, thereby gaining insight into the essence of intelligence.

Basic dynamic characteristics

The paper “THE ASYMPTOTIC BEHAVIOR OF ATTENTION IN TRANSFORMERS” reveals the fundamental dynamic characteristics of the attention mechanism in Transformers through rigorous mathematical analysis. Asymptotic analysis studies how a system or function behaves as a variable approaches infinity or a specific value.

The paper demonstrates that under various conditions, all tokens asymptotically converge, and this convergence can lead to model collapse, limiting output diversity. This finding not only deepens our understanding of the Transformer model but also provides important theoretical guidance for improving model design. Future research can build upon this foundation to further explore more complex model dynamics and develop more effective attention mechanism variants.

Main theorems and proofs:

Theorem 3.2: In the single-head case, when the attention matrix is time-invariant, positive definite, and symmetric, the system dynamics form a Riemannian gradient vector field.
Theorem 4.1: When the initial position of the token is located inside one of the hemispheres of the ellipsoid, the system will converge to the consensus equilibrium point.
Theorem 5.1: In the autoregressive case, for almost all initial conditions, the system will converge to the consensus state determined by the first token.
Theorem 6.1: Under certain assumptions (e.g., U is symmetric), if all tokens start from one of the hemispheres, the tokens will converge to a consensus equilibrium point (which is asymptotically stable).

The figure below shows the results for several specific cases of the continuous model, where $Q(t)$ , $K(t)$ , and $U(t)$ represent the query matrix, key matrix, and value matrix, respectively.

[Figure 138]

The figure below illustrates Theorem 3.2. On the left side of the figure, we can see the motion of 10 tokens on an ellipsoid defined by a randomly generated positive definite symmetric matrix. As expected, all tokens converge to a consensus equilibrium. In this case, the dynamics is a gradient vector field. The right side of the figure shows the time evolution of the corresponding potential.

[Figure 139]

The figure below illustrates Theorem 4.1. The left side of the figure shows the movement of 10 tokens on a sphere. We can see that all tokens start and remain within a hemisphere, eventually converging to a consensus equilibrium. Their time evolution is shown on the right side of the figure.

[Figure 140]
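The hemisphere result can be reproduced numerically with a small sketch. The code below is not taken from the paper: it assumes the simplest setting $Q = K = U = I$, a hypothetical inverse temperature `beta`, and plain Euler integration followed by renormalization onto the unit sphere. Under these assumptions, 10 random tokens that start in the upper hemisphere should end up almost perfectly aligned.

```python
import numpy as np

def simulate_tokens(x0, beta=1.0, dt=0.05, steps=2000):
    """Euler-integrate single-head attention dynamics on the unit sphere.

    Assumption for illustration: Q = K = U = I, so the drift of token i is
    the softmax-weighted average of all tokens, projected onto the tangent
    space of the sphere at x_i.
    """
    x = x0 / np.linalg.norm(x0, axis=1, keepdims=True)
    for _ in range(steps):
        scores = beta * (x @ x.T)                        # <Q x_i, K x_j>
        scores -= scores.max(axis=1, keepdims=True)      # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
        drift = weights @ x                              # attention output (U = I)
        drift -= np.sum(drift * x, axis=1, keepdims=True) * x  # tangent projection
        x = x + dt * drift
        x /= np.linalg.norm(x, axis=1, keepdims=True)    # retract onto the sphere
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(size=(10, 3))
x0[:, 2] = np.abs(x0[:, 2]) + 0.1   # force every token into the upper hemisphere

final = simulate_tokens(x0)
min_cosine = float((final @ final.T).min())  # smallest pairwise cosine similarity
print("smallest pairwise cosine after integration:", min_cosine)
```

A smallest pairwise cosine close to 1 means all tokens have collapsed onto a single point, which is the consensus equilibrium described above.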

The following diagram illustrates Theorem 6.1. On the left side of the diagram, we can observe the tokens converging to the consensus equilibrium point, while the right side shows the corresponding time evolution.

[Figure 141]

Structure of physical spin systems

The blog post “Transformers Are Secretly Collectives of Spin Systems” argues that the neural network blueprint of the Transformer module can be derived from the structure of physical spin systems familiar from classical statistical mechanics. More specifically, the blog's author argues that the forward and backward propagation of a Transformer module can be mapped to computing magnetizations in a vector-spin model. The Transformer can then be pictured as a collection of differentiable spin systems whose behavior is shaped through training.

Training a deep Transformer model amounts to orchestrating a stack of Transformer modules through a differentiable coupling structure, in which the magnetization of one spin system drives the magnetization of the next. Nudging the (billions of) parameters during training steers the cascaded response of this collection of spin systems so that it better fits the collective (meta) task specified by the data and the loss function.

The force perspective

Some researchers believe that the transformer mechanism is essentially describing a motion trajectory. Attention is message passing, which is actually calculating the force. MLP can be seen as calculating the motion trajectory according to the equation of motion under the action of force. The process of transformer optimization is to seek the force and the equation of motion through data training to construct a motion trajectory that meets the requirements.

0x07 Summary

We first present the architecture of LLaMA, a classic application of the Transformer. At each inference step, a sequence of tokens is fed in and transformed by the Embedding layer into a three-dimensional tensor of shape $[b,s,h]$ (batch size, sequence length, hidden size). After a series of calculations, the result is mapped to the vocabulary space by the logits layer, and the output tensor has shape $[b,s,\mathrm{vocab\_size}]$ .

[Figure 142]
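The shape bookkeeping can be traced in a few lines of NumPy. This is only a sketch: the sizes, the random weights, and the names `embedding` and `lm_head` are hypothetical, and the Transformer blocks are omitted because they do not change the tensor shape.

```python
import numpy as np

# Hypothetical sizes chosen only for illustration.
b, s, h, vocab_size = 2, 5, 16, 100

rng = np.random.default_rng(0)
token_ids = rng.integers(0, vocab_size, size=(b, s))   # [b, s] integer ids
embedding = rng.normal(size=(vocab_size, h))           # embedding table
lm_head = rng.normal(size=(h, vocab_size))             # logits layer weights

hidden = embedding[token_ids]          # embedding lookup -> [b, s, h]
# The stacked Transformer blocks would act here; they keep the shape [b, s, h].
logits = hidden @ lm_head              # logits layer -> [b, s, vocab_size]
next_token_logits = logits[:, -1, :]   # last position only -> [b, vocab_size]

print(hidden.shape, logits.shape, next_token_logits.shape)
```

Only the logits at the last position are needed to predict the next token, which is why generation code typically slices `logits[:, -1, :]` before sampling.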

The Transformer’s processing flow is essentially a token flow: over time, a token moves from an initial position to another position in a higher-dimensional space. This is a migration from one semantic space to another; in other words, a token can be viewed as the state of an ordinary differential equation at different times and in different dimensions. In this process, Attention, FFN, and ResNet are all indispensable, but each plays a distinct role. Attention extracts and aggregates information, ResNet provides information bandwidth, and most of the learned knowledge or information is stored in the FFN. Its specific characteristics are as follows:

When a sentence comes in, it is first discretized into a set of individual word tokens. Then, Q, K, and V act like pointers, mapping these word entities to the underlying concepts, thus achieving entity recognition and concept binding.
The encoder in a transformer has the same function as the encoder in an RNN: to extract features from the input sequence at each time step.
$h_{t-1}$ and $y_t$ at each time step in an RNN.
$m_Y=\mathrm{MultiHead}(X,Y)$ is equivalent to the calculation of the attention context between the encoder and decoder in an RNN.
$h_{t-1}$ and $y_t$ of each time step of the RNN unit.
$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)\times V$ , through accumulation and multiplication, realizes a fully connected graph of concepts, representing all possible propositional structures (subject-verb-object), and ultimately obtains a new set of possible propositional structures.
Next, through the subsequent fully connected layer (similar to a dictionary of propositional structures), new propositions (sentences) are obtained.
By increasing the number of layers, transformers can be combined to form nested structures of logic ranging from simple to complex, thus enabling full-text-level reasoning.
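The attention formula in the list above can be written out directly. The following is a minimal single-head NumPy sketch without masking or batching; the input sizes and random values are assumptions made for the example.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # [n_q, n_k] similarities
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # each row sums to 1
    return weights @ V, weights

# Hypothetical input: 4 tokens with 8-dimensional queries, keys, and values.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.sum(axis=-1))
```

Every row of `weights` is a probability distribution over all tokens, so each output vector is a weighted mixture of every value vector. This is the "fully connected graph" reading of the formula.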

7.1 Effects

The Transformer paper compared the mainstream feature extraction frameworks of the time along three dimensions: computational complexity per layer, number of sequential operations, and maximum path length.

[Figure 143]

We can explore these three indicators separately.

First, let’s look at the computational complexity per layer. This is the only weakness of the self-attention mechanism among the three metrics: a self-attention layer costs $O(n^2 \cdot d)$ , compared with $O(n \cdot d^2)$ for a recurrent layer, where $n$ is the sequence length and $d$ is the representation dimension. When $n$ is large, the time complexity is high. The demand for long texts in the era of large models makes this weakness even more pronounced. Currently, there are many methods to address this issue.
Next, let’s look at the number of sequential operations. For self-attention it is $O(1)$ : all positions are processed in a single step, giving the highest parallelism. For an RNN it is $O(n)$ , because each computation depends on the previous result, so $n$ steps are required and they cannot be parallelized across the sequence. This inability to train in parallel is the biggest problem with recurrent layers. Self-attention layers, like convolutions, can be fully parallelized.
Finally, let’s look at the maximum path length, the largest number of steps a signal must traverse to get from one position to another. Attention is essentially a global query operation; any two positions are directly connected, so information can pass between any pair of elements in $O(1)$ steps. This is far shorter than for convolutional and recurrent layers: for a CNN it is $O(\log_k n)$ (with dilated convolutions of kernel width $k$ ), while for an RNN, in the worst case, the distance between the start and end positions is $n$ .
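To make the three metrics concrete, the sketch below evaluates the leading terms for chosen values of $n$ , $d$ , and $k$ . Constants are dropped, so the numbers are orders of magnitude rather than measured costs. The function names and the example sizes are hypothetical.

```python
def steps_to_cover(n, k):
    """Smallest number of layers L such that k**L >= n (dilated convolutions)."""
    layers, reach = 0, 1
    while reach < n:
        reach *= k
        layers += 1
    return layers

def layer_costs(n, d, k=3):
    """Leading-order terms for the three metrics, with constants dropped."""
    return {
        "self_attention": {"ops": n * n * d, "sequential": 1, "path": 1},
        "recurrent": {"ops": n * d * d, "sequential": n, "path": n},
        "convolutional": {"ops": k * n * d * d, "sequential": 1,
                          "path": steps_to_cover(n, k)},
    }

short = layer_costs(n=512, d=1024)    # sequence shorter than the hidden size
long_ = layer_costs(n=8192, d=1024)   # sequence longer than the hidden size

for name, costs in long_.items():
    print(name, costs)
```

With these leading terms, self-attention is cheaper per layer than a recurrent layer whenever $n < d$ , and more expensive once $n > d$ , which is why long contexts expose the quadratic term.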

7.2 Advantages and Disadvantages

In addition to the advantages analyzed above, Transformer has other advantages, such as:

High interpretability: attention weights show how strongly different words are correlated. The distribution of the self-attention results indicates that the model has learned some syntactic and semantic information. RNNs, due to their complex internal state updates, are often considered “black box” models whose internal decision-making is difficult to understand. In contrast, the attention mechanism provides an intuitive way to visualize and understand how the model focuses on different parts of the sequence. By analyzing the attention weights, we can see which input elements play a key role in the model’s predictions.
High adaptability: The Transformer architecture is built from stacked encoders and decoders, a design that makes it highly adaptable not only in natural language processing but also in other fields such as computer vision and speech recognition.

The disadvantages of Transformer are also obvious, such as:

The self-attention mechanism itself has quadratic complexity, which makes the architecture computationally expensive and memory-intensive when dealing with long input sequences or resource-constrained situations.
Positional encoding is inherently a compromise. Word vectors carry the linguistic information (part of speech, semantics) of words, whereas positional encoding has no such variability in semantic space; it is essentially an artificially designed index. Adding it directly to word vectors is therefore hard to justify and may not represent positional information accurately.
Transformers are weaker than RNNs and CNNs at capturing local information.
The cage of parameter thresholds. Because large models have so many parameters, many of them are often redundant. Such parameters may learn nothing useful during training while still adding to the model's complexity and training difficulty, and they can also lead to overfitting. Redundancy may further keep the model from learning higher-level features effectively, which can limit performance, especially on complex tasks. Finally, the sheer number of parameters makes optimization harder: algorithms such as gradient descent may run into local optima, vanishing gradients, or exploding gradients.
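The quadratic cost in the first item of this list is easy to quantify. The sketch below estimates the memory of the attention score matrix for one layer, assuming it is materialized in full. The head count and the 2 bytes per value (half precision) are hypothetical defaults, and fused attention kernels avoid storing the full matrix.

```python
def attention_score_bytes(seq_len, num_heads=32, bytes_per_value=2, batch=1):
    """Bytes needed to store one layer's [batch, heads, seq_len, seq_len] scores."""
    return batch * num_heads * seq_len * seq_len * bytes_per_value

for seq_len in (1024, 4096, 16384):
    gib = attention_score_bytes(seq_len) / 2**30
    print(f"seq_len={seq_len}: {gib:.3f} GiB per layer")
```

Doubling the sequence length quadruples this figure, whereas the activations of shape $[b,s,h]$ only double.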

Because of the various problems with the Transformer, researchers are exploring various methods to optimize it and improve the model’s generalization ability and interpretability. For example:

Many researchers are exploring techniques such as knowledge distillation, model pruning, and parameter sharing to reduce the number of model parameters, thereby reducing computational resource consumption and improving training efficiency.
Many researchers are improving the Transformer architecture itself: sparsifying the attention module, introducing memory information into attention, applying attention to external memory (KV pairs), linear stochastic attention, introducing recurrence into the Transformer, and linear attention (Performer).
Many non-Transformer studies have been working towards the goal of “retaining the advantages of RNNs while trying to achieve the performance of Transformers”.
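Of the directions listed above, linear attention is simple to illustrate. The sketch below is a generic kernel feature-map formulation, not the Performer's random-feature method. The feature map `elu(x) + 1` is an assumption chosen for the example, and the code is non-causal. It shows that reordering the matrix products avoids the $n \times n$ matrix while giving the same result as the explicit quadratic computation with the same kernel.

```python
import numpy as np

def feature_map(x):
    """elu(x) + 1, a positive feature map (an illustrative choice)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(Q, K, V):
    """Compute phi(Q) (phi(K)^T V), normalized, without forming an n x n matrix."""
    phi_q, phi_k = feature_map(Q), feature_map(K)
    kv = phi_k.T @ V                  # [d, d_v] summary of all keys and values
    z = phi_k.sum(axis=0)             # [d] normalizer
    return (phi_q @ kv) / (phi_q @ z)[:, None]

def quadratic_reference(Q, K, V):
    """Same kernel attention computed through the explicit n x n weight matrix."""
    phi_q, phi_k = feature_map(Q), feature_map(K)
    A = phi_q @ phi_k.T               # [n, n] unnormalized weights
    return (A / A.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
fast = linear_attention(Q, K, V)
slow = quadratic_reference(Q, K, V)
print(np.allclose(fast, slow))
```

The two functions agree because matrix multiplication is associative. The saving comes from computing `phi_k.T @ V` first, which costs on the order of $n \cdot d^2$ operations instead of $n^2 \cdot d$ . Note that this replaces the softmax kernel with a different one, so it approximates rather than reproduces standard attention.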

0xFF Reference

interpreting GPT: the logit lens nostalgebraist
A Mathematical Framework for Transformer Circuits (Anthropic blog 2021)
A Mathematical Perspective On Transformers
A PRIMER ON THE INNER WORKINGS OF TRANSFORMER-BASED LANGUAGE MODELS
Analyzing Transformers in Embedding Space
ASK, AND IT SHALL BE GIVEN: TURING COMPLETENESS OF PROMPTING
Backward Lens: Projecting Language Model Gradients into the Vocabulary Space
Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
Differential Transformer https://arxiv.org/pdf/2410.05258
EMNLP 2024 Best Paper: Understanding the Transformer’s Operation Mechanism from the Backpropagation Matrix PaperWeekly
Finding Neurons in a Haystack: Case Studies with Sparse
Four types of emergence: a typology of complexity and its implications for a science of management
Full Stack Transformer Inference Optimization Season 2: Deploying Long-Context Models Yao Fu
GPT4 Technical Principles Five: The Illusion of Large Models, the Solution Lies in the One That Ties the Knot Wang Qingfa Qingxi
Grothendieck Graph Neural Networks Framework: An Algebraic Platform for Crafting Topology-Aware GNNs
Jawahar, Ganesh, et al. “What Does BERT Learn about the Structure of Language?” ACL 2019
Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models
The Working Principle of LLM CoT Wang Qingfa Qingxi
Is the LLM Prompt Turing Complete? The first research of LLM demonstration | Heavyweight AI cat repair prompt
MetaFormer is Actually What You Need for Vision
More About Attention Li Xinchun
Neural Ordinary Differential Equations
nGPT: Normalized Transformer with Representation Learning on the Hypersphere
Reformer Model - Breaking the Limits of Language Modeling Hugging Face Blog
softmax is not enough
State-Free Inference of State-Space Models: The Transfer Function Approach
THE ASYMPTOTIC BEHAVIOR OF ATTENTION IN TRANSFORMERS
The Platonic Representation Hypothesis
TRANSFORMER EXPLAINER: Interactive Learning of Text-Generative Models
transformer Model Structure Explained and Implemented zhang
Transformers Are Secretly Collectives of Spin Systems mcbal
Is the Transformer biologically reasonable? MIT team uses neurons and astrocytes to build ScienceAI drug molecule design
Transformer model (Part 1) OnlyInfo
Transformer model (Part 2) OnlyInfo
The physical principle of Transformer Matthias Bal Qingxi
Understanding and Improving Transformer: From a Multi-Particle Dynamic System (Point of View)
Understanding how LLM inference works with llama.cpp omrimallis
Wavelets based physics-informed neural networks to solve non-linear differential equations
[Official Bilingual] Intuitive Explanation of Attention Mechanisms, the Core of Transformer | [Chapter 6 of Deep Learning] 3Blue1Brown’s
10,000-word article introduces the mathematical framework of “language, statistics, and categories” for large language models. Translated by Tai-Danae Bradley: Wang Qingfa
There’s a Transformer in the Brain Too! Similar to the “Hippocampus” Mechanism The Neuroscience Mechanism Behind Artificial Intelligence and Algorithm Learning of Large Language Models Yamu’s Tool for Opening the Black Box is Here! Transformer Explainer Makes Transformer Models Transparent Xiaozhi’s Future Exploration
AGI Series | Extra 01. (A New Perspective) The Consistency Between Transformer and the Brain’s Neocortex
MetaUniTech Forum | Wang Liwei: From Empirical Accumulation to Filling Theoretical Gaps Beijing Academy of Artificial Intelligence
A Brief Discussion on Several Schools of Thought on LLM Mechanistic Interpretability (Part 1) Time Traveler
Understanding how llama.cpp Completes Large Model Inference Hugulas
Neural Network Theoretical Research on Physics Huang Haiping [Modern Physics Knowledge Magazine]
Phase Transitions of Categories and the Formation of Knowledge
Interpreting Small Models - SLM Half-baked Full-Stack Craftsman Oh Jia ArchiSelf
Paper Reading: Differential Transformer Eddie’s Inevitable Path to Reducing the Illusion of Large Models Qingxi
https://poloclub.github.io/transformer-explainer