Exploring the Transformer Series (10) --- Self-Attention
0x00 Overview
The core of the Transformer, or rather, the key difference between it and other architectures, is its self-attention mechanism. This mechanism allows the model to consider the dependencies between each word and all other words in a sentence, using this information to capture the sentence’s internal structure and representation, and ultimately calculating the correlations (weights) between words. We can divide the self-attention mechanism into three stages:
- Input: As we learned earlier, attention mechanisms accept three inputs: query, key, and value. However, for self-attention, there is only one input sequence, from which Q, K, and V all come. This input sequence is a list of vectors, and the vectors have certain relationships. Taking machine translation as an example, the input sequence is the source or target sentence, and each token in the sentence corresponds to a vector.
- Computation: The self-attention mechanism calculates the relationship between each vector in the sequence and other vectors in the sequence (that is, the relationship between each word and all words in the sentence), so that each token in the sequence can perceive other tokens. For the current vector, the self-attention mechanism calculates the dot product between the query (the current token) and all keys (tokens of interest), applies the Softmax function to obtain weights on the dot product, and uses these weights to perform a weighted average of all associated values. This allows the “understanding” of other words to be incorporated into the currently processed word.
- Output: A sequence, such as a list of vectors, but all vectors in the list take into account their contextual relationships, and are a global feature representation that contains the internal relationships of the sequence.
The specific details are shown in the image below.
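As a minimal sketch of the three stages above, the computation for one token could look like the following. It assumes toy 2-dimensional token vectors and skips the learned Q/K/V projections, so each token vector serves as its own query, key, and value. The numbers and the helper names `softmax` and `attend` are illustrative assumptions, not taken from the article.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Dot-product attention for a single query vector."""
    # Computation: dot product of the query with every key, then softmax.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    # Output: weighted average of the values, dimension by dimension.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Input: one sequence of three token vectors (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Self-attention: every token queries the same sequence it belongs to.
outputs = [attend(t, tokens, tokens)[1] for t in tokens]
weights_first, _ = attend(tokens[0], tokens, tokens)
```

Each entry of `outputs` is a new vector for the corresponding token that blends in the other tokens according to `weights`.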

0x01 Principle
1.1 Design Concept
Self-attention wasn’t invented by the Transformer, but its performance on other models was less than ideal. So we’re curious why the Transformer still uses self-attention. The paper explains its design philosophy as follows:
Motivating our use of self-attention we consider three desiderata. One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required. The third is the path length between long-range dependencies in the network.
We will analyze the three factors in detail below.
- The total computational complexity of each layer. Transformer’s self-attention uses a scaled dot product attention scoring function, which reduces computation compared to additive attention, while achieving similar results.
- Parallel computation. Self-attention can be computed in parallel, and the degree of parallelism can be measured by the minimum number of sequential operations required.
- The path length between long-distance dependencies in a network. RNNs need to read the entire sentence to capture the relationships between words, CNNs need to stack multiple convolutional layers to capture the relationships between words, while self-attention is completely parallel, and each word can be directly associated.
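For reference, the comparison the paper makes along these three axes (its Table 1) can be summarized roughly as follows, where $n$ is the sequence length, $d$ is the representation dimension, and $k$ is the convolution kernel width. This is a condensed restatement, so consult the paper for the exact wording and the additional restricted self-attention row.

```latex
\begin{array}{lccc}
\text{Layer type} & \text{Complexity per layer} & \text{Sequential operations} & \text{Maximum path length} \\
\hline
\text{Self-attention} & O(n^2 \cdot d) & O(1) & O(1) \\
\text{Recurrent} & O(n \cdot d^2) & O(n) & O(n) \\
\text{Convolutional} & O(k \cdot n \cdot d^2) & O(1) & O(\log_k n) \\
\end{array}
```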
While it has several advantages, it’s easier said than done. How does Transformer achieve this? We will analyze it step by step.
1.2 Input
From a macroscopic perspective, the Transformer has only one input sequence, from which Q, K, and V are derived. See the diagram below for details.
From a microscopic perspective, taking the encoder as an example, the Q, K, and V of self-attention have two sources:
- The Q, K, and V of the first encoder layer are derived by linear transformation of the matrix X composed of the input vectors x. The linear transformation is achieved by matrix multiplication: $Q = XW^Q$, $K = XW^K$, $V = XW^V$.
- The Q, K, and V of each subsequent encoder layer are derived by the same kind of linear transformation from the output of the previous encoder layer.

We will now take the first encoder layer as an example and trace the paths of individual words in the source sequence through the Transformer. For the sake of explanation and visualization, we will temporarily ignore the details and only trace the “line” corresponding to each word.
Suppose we are translating from Chinese to English and input the Chinese phrase “我爱你” (I love you). Assume the model dimension is $d_{model}$ and the input sequence length is $L$. The source sequence first passes through an embedding and positional encoding layer, which generates an embedding vector $x_i$ for each word in the sequence. The matrix formed by these embedding vectors is $X \in \mathbb{R}^{L \times d_{model}}$.
The input sequence is then transformed using three matrices: $W^Q$, $W^K$, and $W^V$. Specifically, each element $x_i$ of the input sequence is multiplied by each of these three matrices, resulting in:
$$q_i = x_i W^Q, \quad k_i = x_i W^K, \quad v_i = x_i W^V$$
Stacking the $q_i$ together row by row yields the matrix $Q$, and similarly the matrices $K$ and $V$. Alternatively, this can be expressed directly in matrix form:
$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$
These three independent matrices, Q, K, and V, are used to calculate the attention score. Each “row” of these matrices is a vector corresponding to a word in the source sequence. Each such “row” is generated from its corresponding source word through a series of transformations such as embedding, positional encoding, and linear transformations. All of these transformations are trainable operations. This means that the weights used in these operations are not predetermined but learned using backpropagation.
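The projection step can be sketched in a few lines. This is a minimal illustration, assuming a sequence of 3 tokens, a model dimension of 4, and a projected dimension of 2. The weight values are hand-written placeholders (in a real model they are learned), and `matmul` is a hypothetical helper written here only to keep the example free of dependencies.

```python
def matmul(a, b):
    """Multiply nested-list matrices a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# X: 3 tokens, model dimension 4 (embedding + positional encoding already applied).
X = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
]

# Hypothetical projection weights of shape 4 x 2.
# In a trained model these come from backpropagation, not from hand-written values.
W_Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
W_V = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [1.0, -1.0]]

Q = matmul(X, W_Q)  # row i of Q is the query for token i
K = matmul(X, W_K)  # row i of K is the key for token i
V = matmul(X, W_V)  # row i of V is the value for token i
```

Multiplying the whole matrix at once gives the same rows as projecting each token separately, which is why the per-token and matrix forms are interchangeable.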
1.3 QKV Analysis
The first step in a self-attention mechanism is to generate query vectors, key vectors, and value vectors using tokens. This involves the concepts of query, key, and value (often abbreviated as q, k, and v in various related papers and websites). The query vector represents the token or position currently being processed; it indicates the information the model needs to “query.” The key vector represents a unique identifier for each token in the sequence, used for comparison with the query. The value vector contains the actual content or features of each token in the sequence, contributing to the generation of the current token’s output.
To understand the underlying implementation principles of LLMs, it’s essential to understand the QKV matrices within the Transformer block. A significant portion of cutting-edge research on large models revolves around the QKV matrices, including work on attention variants, quantization, and low-rank compression, aiming to achieve extreme performance and storage compression without compromising results. This is fundamentally because the QKV weights account for a large share of the weights in large language models. During inference, the storage needed for K and V increases linearly with the context length, while the computational cost of attention increases quadratically. In short, query, key, and value are crucial for Transformers and the self-attention mechanism.
Many people have wondered why names like QKV are used, and how to understand this underlying concept. We’ve introduced QKV in previous articles, but haven’t gone into depth. This article will provide a more detailed analysis. Because deep learning is essentially a science with practical applications, a definitive theoretical analysis hasn’t yet been found. Therefore, we will explain it from different perspectives, hoping to help readers gain a better understanding of QKV.
From a psychological perspective
Researchers have traced the attention mechanism back to the concepts of nonvolitional cue, volitional cue, and sensory inputs proposed in the 1890s by William James, the father of American psychology. These three concepts correspond, respectively, to the Key tensor, the Query tensor, and the Value tensor (each Value being paired with a Key), and the attention mechanism is constructed from these three elements.
Let’s analyze this with a simple example. You originally went to buy salt (purposeful attention, i.e., the volitional cue), but when you got to the store, you saw Transformers, and your attention was drawn to the Transformers (subconscious attention, i.e., the nonvolitional cue). We can conclude that:
- Key: A series of items (salt and Transformers).
- Value: The subconscious attraction of this series of items to people (in your subconscious, the attraction of Transformers is definitely higher than that of salt).
- Query: The item you want (salt).
The effect of attention is to increase the weight of salt (the target item). By multiplying the “weight vector” by the “value,” even if the target item’s subconscious attraction (i.e., its value) is not high, the probability of selecting the target item increases because the target item has a high weight and the other items have low weights.
Database perspective
The names of query, key, and value also hint at the overall approach to attention computation. Therefore, understanding Q, K, and V from the perspective of search applications might be more helpful. We can view the attention mechanism as a kind of fuzzy addressing, or a fuzzy, differentiable, vectorized database (or dictionary) lookup mechanism. The relationship between Q, K, and V is: finding the content (Value) corresponding to data (Key) that is similar to or related to existing data (Query). Its specific characteristics are as follows:
- Key and Value are components of a database.
- Each element in the database consists of an address (Key) and a value (Value) (a <Key,Value> data pair). In other words, the value is stored at the address specified by the Key.
- The key is the address, which is the location to be searched. This address summarizes the characteristics of the value at that address, or in other words, the key can reflect the semantic information of the value.
- Value is the value associated with the address key, which is the actual data representing the semantics and the content required by the user.
- Query refers to the information being retrieved; it’s a task-related variable. Suppose we have a query with Key=Query. This query addresses the database by comparing the similarity between the Query and the addresses of all elements with the Key in memory, aiming to retrieve the corresponding Value. In self-attention mechanisms, this question-and-answer format of Q and K inquires about the closeness between the two terms. Intuitively, the Key is the bridge between the Query (what we are looking for) and the Value (what we will actually obtain).
- A standard dictionary lookup is an exact match, returning the value corresponding to the matched key, and retrieving only one entry from the stored content. The attention mechanism, however, uses a combination of vectorization, fuzzy matching, and merging. It calculates the similarity or matching degree (i.e., attention weight) of each key based on the similarity between the query and the key. Then, it retrieves the value from each key address and performs a weighted sum of these values based on their matching degree. This similarity score determines the weight of the corresponding value in the final output.
To explain using a simple example, if we search for “Li-Ning shoes” on Taobao, the Query (Q) is the query you enter in the search bar. The Key (K) is the product description and title returned on the page, which are essentially keywords related to candidate products in the database. The Value (V) is the Li-Ning product itself. The attention mechanism is this query process; that is, attention compares your query Q with the K in Taobao’s database, calculates the similarity between these K and Q, and finally returns the top few products V with the highest similarity. The process is as follows:
- Calculate the similarity between the query and all keys in the database (the relevance of the query, i.e., how likely it is that you are what I am looking for).
- After obtaining the similarity scores, the results are sorted.
- Based on the similarity ranking results, the product IDs that need to be obtained are obtained.
- Retrieve the corresponding Values based on the product ID.
Therefore, the QKV concept in self-attention mechanisms is essentially a database with global semantic integration capabilities. The <Key,Value> data pairs are the elements of the database, and Q is the task-related query vector.
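The contrast between an exact dictionary lookup and the fuzzy lookup described above can be sketched as follows. The key vectors, the scalar values, and the query are made-up numbers chosen only for illustration.

```python
import math

# Exact lookup: the query must equal a key, and exactly one value comes back.
store = {"shoes": 10.0, "socks": 2.0, "hat": 5.0}
exact = store["shoes"]

# Soft lookup: keys and the query are vectors, and every value contributes,
# weighted by how similar its key is to the query.
keys = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]  # hypothetical key vectors
values = [10.0, 2.0, 5.0]                     # one scalar value per key
query = [1.0, 0.0]

scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]        # softmax over the scores
soft = sum(w * v for w, v in zip(weights, values))
```

Unlike the shopping-search analogy, attention does not sort the results or cut off at the top few: every value is blended into `soft` in proportion to its weight, which is what keeps the lookup differentiable.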
The image below illustrates the details of self-attention from a database perspective.


Seq2seq perspective
Let’s return to a specific task for analysis, which might provide more clarity. For example, in machine translation, a query can be defined as the hidden state of the decoder at a certain step, i.e., the predicted output for the previous word. A key is the hidden state of the encoder at each time step. We align each query with all keys, so each step of the decoder obtains a different alignment vector. In other words, the encoder’s encoded sequence provides the key and value vectors, and the hidden state of the decoder provides the query vector. This is analogous to the decoder looking up words in the encoder’s encoded sequence.
Word vector reconstruction perspective
Previously, we looked at how to address and retrieve information from a database perspective, with the ultimate goal of outputting a new vector. In this new vector, the value of each dimension is obtained by weighted summation of the values of several word vectors in that dimension. Therefore, the core of the self-attention mechanism is to reconstruct word vectors (query + aggregation), and the encoded output of each input word will incorporate the encoded information of the other words through the attention mechanism.
Interoperability
When reading an article, to understand the meaning of a sentence, you not only focus on the sentence itself, but also look back at other related sentences or words in the context. Let’s take the previous two sentences as examples for further explanation.
- Several distribution transformers had fallen from the poles, and secondary wires were down.
- Transformer models have emerged as the most widely used architecture in applications such as natural language processing and image classification.
How can we semantically differentiate the polysemous word “Transformer”? We must consider the context of the word to better identify its meaning; that is, we must consider not only the word itself but also the influence of other words, i.e., the influence of context. For example, the neighboring words “pole,” “fallen,” and “wires” in the first sentence suggest that “Transformer” is related to the physical environment. In the second sentence, “model” and “natural language processing and image classification” directly tell us that “Transformer” here is a concept related to deep learning. Ultimately, we can infer the precise meaning of “Transformer” through the context (other words in the sentence).
We understand the principle, but how do we put it into practice? How do we infer context from other words in a sentence? Humans can identify which words provide context, but computers cannot, because they only process numbers. The solution is the attention mechanism, which the Transformer uses to simulate this process.
The Transformer uses the dot product operation to associate each word in the input sequence with every other word, essentially allowing words to interact with each other. Then, it sums these words using a weighted summation, ultimately capturing the interactions between a specific word and every other word in the sentence. The modified words are as follows:
- transformer 1 = 0.7 transformer + 0.1 pole + 0.1 fallen + 0.1 wires
- transformer 2 = 0.6 transformer + 0.1 language + 0.1 image + 0.2 model
Ultimately, the two transformer words reconstructed their own semantics through operations with other words in the sentence.
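To make the two weighted sums concrete, here is a small sketch using the coefficients from the list above. The 2-dimensional embeddings are invented for illustration (axis 0 loosely stands for "electrical hardware" and axis 1 for "machine learning"), and `weighted_sum` is a hypothetical helper.

```python
def weighted_sum(weights, vectors):
    """Combine vectors dimension by dimension using the given weights."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

# Made-up embeddings; the polysemous word starts out ambiguous at [0.5, 0.5].
emb = {
    "transformer": [0.5, 0.5],
    "pole": [1.0, 0.0], "fallen": [0.8, 0.0], "wires": [1.0, 0.0],
    "language": [0.0, 1.0], "image": [0.0, 0.8], "model": [0.0, 1.0],
}

# transformer 1 = 0.7 transformer + 0.1 pole + 0.1 fallen + 0.1 wires
transformer_1 = weighted_sum(
    [0.7, 0.1, 0.1, 0.1],
    [emb["transformer"], emb["pole"], emb["fallen"], emb["wires"]],
)

# transformer 2 = 0.6 transformer + 0.1 language + 0.1 image + 0.2 model
transformer_2 = weighted_sum(
    [0.6, 0.1, 0.1, 0.2],
    [emb["transformer"], emb["language"], emb["image"], emb["model"]],
)
```

Starting from the same ambiguous vector, the first result leans toward axis 0 and the second toward axis 1, which is the disambiguation effect described in the text.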
Next, let’s take a closer look at the two operations: feature extraction and weighted summation.
Feature extraction
How are the features extracted by K obtained? Based on the idea of self-attention and the mechanisms of the human brain, we need to review all items before accurately defining a specific item. Therefore, for each query Q, the attention mechanism will:
- Calculate the similarity or relevance as the inner product of the query with each key, which determines how much influence each element has on the target element (i.e., each key-query pair produces an alignment coefficient). The more similar or relevant the key and query are, the greater the influence of the corresponding value, and the more it contributes to the output.
- The attention mechanism then performs a softmax operation on the dot product result, ensuring that the sum of the weights of all values is 1. This is to guarantee that the total amount of features contributed by all source elements remains constant. If multiple keys are highly similar to the query, then only a portion of their respective channels will be opened (as if “attention is spread across these few source elements”). From this perspective, the output can be understood as the result of interpolating values based on key-query similarity. Clearly, this output representation carries information from other words.
- The final matrix Y is the semantic matrix in the latent space obtained by fusing the input matrix X with contextual information, where each row represents a token.
In the training process, after receiving Y, the model calculates the loss function, and after backpropagation, Q gradually learns the features of V. This mechanism allows the model to learn different behaviors based on the same attention mechanism and capture dependencies of various ranges within a sequence.
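The softmax behavior described above, where weights always sum to 1 and are shared when several keys match equally well, can be seen in a short sketch. The score values are arbitrary examples.

```python
import math

def softmax(scores):
    """Normalize scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# One key clearly matches the query: almost all weight goes to it.
peaked = softmax([8.0, 1.0, 0.5])

# Two keys match equally well: the weight is split between them.
shared = softmax([8.0, 8.0, 0.5])
```

In the second case neither of the two matching keys can claim the full weight, which corresponds to attention being spread across several source elements.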
Weighted summation
Let’s understand the weighted summation operation through a real-world example.
We want to learn about Yu Dafu (Query), but since human energy is limited, we need to focus our limited energy on key information. This way, we can quickly obtain the most useful information with fewer resources, achieving better results. A library (Source) contains many books (Value), and reading books is equivalent to acquiring detailed information (Value) from them. To improve efficiency, we assign each book a number and an information summary (Key). Thus, we can search for many books using the Key, such as A Thin Offering, Sinking, Late-Blooming Osmanthus, A Spring Night’s Intoxication, Returning Voyage, etc., and can also find Popular Literature (of which Yu Dafu was the chief editor).
By comparing the query with the information summary carried in the key, we can determine their relevance. Books with higher relevance have greater weight. The first few books have higher weight and require more attention, while “Popular Literature” has a slightly lower weight and can be skimmed.
Let’s say we spend a total of 11 hours learning about Yu Dafu: 2 hours each on A Thin Offering, Sinking, Late-Blooming Osmanthus, A Spring Night’s Intoxication, and Returning Voyage, and 1 hour on Popular Literature. Normalizing the time to probability values that sum to 1, we get approximately [0.18, 0.18, 0.18, 0.18, 0.18, 0.09]. Therefore, Yu Dafu = 0.18 A Thin Offering + 0.18 Sinking + 0.18 Late-Blooming Osmanthus + 0.18 A Spring Night’s Intoxication + 0.18 Returning Voyage + 0.09 Popular Literature. The final information we obtain is the weighted sum of the contents of all the books. Thus, after reading all of these books, we will have a comprehensive understanding of Yu Dafu in the form of a weighted sum.
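The arithmetic of this example can be checked with a short sketch. The dictionary layout and variable names are illustrative choices.

```python
# Hours spent on each book: 2 hours for five books, 1 hour for the last one.
hours = {
    "A Thin Offering": 2,
    "Sinking": 2,
    "Late-Blooming Osmanthus": 2,
    "A Spring Night's Intoxication": 2,
    "Returning Voyage": 2,
    "Popular Literature": 1,
}

total = sum(hours.values())  # 11 hours in all

# Normalize so that the weights sum to 1.
weights = {book: h / total for book, h in hours.items()}
rounded = [round(w, 2) for w in weights.values()]
```

Here the weights come from simple division by the total. Attention applies softmax to the scores instead, but the role of the normalization step is the same.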
The essence of the attention mechanism is to use Q and K to calculate “attention weights,” and then use these attention weights to perform a weighted summation on V. Mechanistically, the focusing process of the attention mechanism is reflected in the weight coefficients; a larger weight indicates more attention being invested in the corresponding value, meaning the weight represents the importance of the information. The attention mechanism can be interpreted as routing multiple local information sources into a global tree structure of local representations. In this example, calculating the relevance is equivalent to $QK^T$ in the attention mechanism, normalization is the softmax operation, and the final result of the reading (the feature vector) is then obtained through weighted summation.
Elhage reconstructs the representation of the attention head (still equivalent to the design of the vanilla Transformer). The reconstructed representation can be expressed as the following formula, which shows the characteristics of interoperability, feature extraction, and weighted summation.
1.4 Summary
Let’s take “I ate an apple” as an example to see the process of self-attention.
- First, we determine which target token to apply the self-attention mechanism to. Here the target token is “an,” meaning that “an” is used to determine the relationship between itself and the other three words.
- “An” takes the dot product with every token in the sentence “I ate an apple,” and a softmax operation then normalizes the results into weights.
- Multiplying the weights by the V vectors yields a weighted vector, meaning “an” can be represented by a combination of 0.2 * I, 0.1 * ate, 0.4 * an, and 0.3 * apple (the weights sum to 1). In this new vector, the value of each dimension is obtained by weighted summation of the values of the word vectors in that dimension. This new vector is the representation of “an” after the attention mechanism, and it now carries word-association information.
- Decode and output “an.”
It can be seen that, besides itself, “an” in the whole sentence is most closely related to “apple,” followed by “I.” Therefore, “an” can also be understood as “I-an-apple.” This reveals the essence of “an” in this sentence through “transformation.” “An” itself hasn’t changed, but through “transformation,” it shows another variant state: “I-an-apple.” The outward appearance hasn’t changed, but the essence has.

As can be seen, self-attention is a dynamic, data-driven transformation, a dynamic transformation of the input vector space. This transformation is not fixed but depends on the content of the input data. Therefore, the essential idea of attention can be rewritten as follows: calculate weights by calculating similarity and then sum them up by weight.
0x02 Implementation
2.1 Weight Matrix
Modern neural networks rarely feed word embeddings directly into network structure calculations; they typically perform a linear transformation first. In fact, the Transformer multiplies each token’s embedding vector x by three different weight matrices, performing three separate linear projections (or linear transformations) to derive the matrices Q, K, and V. (Note that the “weights” mentioned here refer to connection weights in a neural network, not the semantic association weights between tokens in Attention.) Furthermore, each Transformer block has its own $W^Q$, $W^K$, and $W^V$.
These three weight matrices are trained through backpropagation during model training. At the start of training, the model randomly initializes these three weight matrices. During the model’s execution phase (prediction phase), these three matrices are fixed; they represent the fixed node connection weights in the Transformer neural network architecture, and have been pre-trained. The three matrices are essentially the logic the model has learned for allocating Q, K, and V.

Why introduce a weight matrix? Or, why not just use X directly and instead perform a linear transformation? Why derive three vectors Q, K, and V (i.e., query vector, key vector, and value vector) from the embedding of each token in a sequence? These questions can be considered from the following perspectives.
First, let’s look at the disadvantages of using embedding directly.
- The input embedding actually only undergoes one linear transformation, and its feature extraction or representation learning capabilities are extremely limited.
- There are no learnable parameters in a single dot product operation. To identify different patterns, we need different methods for calculating similarity and more parameters.
- If you directly perform self-attention on the original embeddings, the similarity matrix $XX^T$ is symmetric, and the values on its diagonal are usually the largest. This is because each character/word is most similar to itself, which contradicts the original intention of self-attention.
- After softmax, $XX^T$ will likely yield an attention matrix similar to an identity matrix. This degenerates self-attention into a point-wise linear mapping, limiting its ability to capture context, and it makes the score from position i to position j identical to the score from j to i. However, we generally expect the order of two tokens in a sentence to reflect different information. For two words, the importance of A to B is not necessarily the same as the importance of B to A. For example, “A loves B” and “B loves A” are not necessarily of the same degree.
- The values on the diagonal of this symmetric matrix are most likely to be the largest in the current row. This means that after softmax, the attention on the diagonal will be the largest in the current row. In other words, regardless of the position or combination, the attention of each token is almost entirely on itself, which violates the original intention of the Transformer to capture contextual information using a self-attention mechanism.
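The symmetry problem can be demonstrated on a toy example. The sketch below assumes three unit-length 2-dimensional embeddings and two hand-picked projection matrices; all numbers and the helpers `matmul` and `transpose` are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply nested-list matrices a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

# Toy embeddings, one row per token (each row has length 1).
X = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

# Scores without any projection: X times its own transpose.
raw_scores = matmul(X, transpose(X))

# Scores with separate (hypothetical) query and key projections.
W_Q = [[1.0, 2.0], [0.0, 1.0]]
W_K = [[1.0, 0.0], [3.0, 1.0]]
Q = matmul(X, W_Q)
K = matmul(X, W_K)
proj_scores = matmul(Q, transpose(K))
```

In `raw_scores` the entry for (i, j) always equals the entry for (j, i), and each token scores highest against itself. With distinct projections, `proj_scores` no longer has to be symmetric, so token 0 can attend to token 1 more strongly than token 1 attends to token 0.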
Let’s take a look at the advantages of using a weight matrix.
- Matching. The attention score is computed from the query input and the key input, and the simplest way is to multiply these two matrices directly. However, the two matrices might not match in shape, making direct matrix multiplication impossible. Multiplying them by the matrices $W^Q$ and $W^K$ respectively projects both into a space of the same dimension, which solves this problem.
- Learnable. In attention mechanisms, the query, key, and value of each word should not only be related to the word itself, but also to the corresponding task. The query, key, and value of each word should not be manually specified, but rather learnable. Therefore, we can use learnable parameters to describe the transformation process from word embeddings to query, key, and value. After extensive training, each element will find the most suitable query, key, and value required to complete its respective task. These relevant training parameters are contained in three weight matrices.
- Increased fitting ability. All three weight matrices are trainable, which increases the number of parameters the model can learn, expands the feature space, and increases the model’s fitting ability.
- Information exchange. We specify that the i-th row of the input matrix X represents the input at time i. For the vectors in this matrix, the weight matrix is shared throughout the entire process. That is, different $x_i$ share the same $W$, and through this operation, $x_i$ and $x_j$ have exchanged information to some extent. In other words, words have exchanged information with each other to some extent by sharing weights.
Furthermore, readers might ask, since K and Q have the same dimension, why not use a single weight matrix? Or, why use three different weight matrices? The main reason for using different weight matrices is to provide more flexible model representation capabilities and to capture complex dependencies in the data. In fact, this also answers why Q, K, and V need to be distinguished.
- Increased expressive power. Incorporating different linear transformations is equivalent to making different projections on x, projecting the vector x into different spaces. This means that Q and K can be expressed in different semantic spaces, helping the model capture richer semantic information and dependencies.
- Distinguishing or separating roles: In self-attention mechanisms, Q, K, and V play distinct roles. Q represents the information we want to query, or the information we hope to obtain at the current position; K represents the key we use to match Q, or the information that each position in the sequence can provide; and V represents the value we want to extract once a match is found, or the actual content that should be obtained from each position. Using different weight matrices better distinguishes these different roles, providing Q and K with the ability to capture different dependencies, enhancing the model’s understanding of the input data, and improving model performance.
- Increase flexibility. If Q and K use the same weight matrix $W$, the score matrix $QK^T = (XW)(XW)^T$ is symmetric, just like the result of multiplying X by itself. This strictly restricts the relationship between them to a fixed pattern, limiting the model’s flexibility. Furthermore, directly deriving V from Q ignores the importance of determining the correlation through K, further reducing the model’s flexibility in processing information.
- Parallel processing. The design of the Transformer model allows for efficient parallel computation when processing sequences. The independence of Q, K, and V enables the model to simultaneously calculate attention scores at all positions throughout the entire sequence, which greatly improves computational efficiency.
In summary, while using the same values for their own dot product (or sharing the weight matrix) may work well in some cases, using different weight matrices provides greater flexibility and representation for Q and K, which helps improve model performance and generalization ability.
We further refine the attention formula using three weight matrices as follows.
$$\text{Attention}(X) = \text{softmax}\left(\frac{(XW^Q)(XW^K)^T}{\sqrt{d_k}}\right) XW^V$$

2.2 Calculation Process
The formula for the Scaled Dot-Product Attention module is as follows:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

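In this formula, $d_k$ is the dimension of the query and key vectors, and $d_k = d_{model}/h$, where $h$ is the number of heads. If only one head is set, $d_k$ is the dimension of the model $d_{model}$. If 8 heads are set, then $d_k = d_{model}/8$. If the model dimension is 512, then $d_k$ equals 64. The dimensions of Q and K are both $L \times d_k$, and the dimension of V is $L \times d_v$, where L is the length of the input sequence. $QK^T$ is $L \times L$, and the output is $L \times d_v$.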
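As a quick sanity check of these shapes, the following sketch works through the arithmetic for a model dimension of 512 with 8 heads. The sequence length of 3 and the variable names are illustrative assumptions, and the value dimension is taken to be equal to the key dimension, as in the original paper's configuration.

```python
d_model = 512   # model dimension
num_heads = 8   # number of attention heads
L = 3           # sequence length (assumed for illustration)

d_k = d_model // num_heads  # per-head dimension of queries and keys
d_v = d_k                   # value dimension, assumed equal to d_k

# Shapes (rows, columns) of the matrices within one head.
shapes = {
    "Q": (L, d_k),
    "K": (L, d_k),
    "V": (L, d_v),
    "QK^T": (L, L),
    "output": (L, d_v),
}
```

The score matrix depends only on the sequence length, which is where the quadratic cost in context length comes from.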
Let’s summarize the calculation process as follows (corresponding to the order from bottom to top in the diagram above):
- Input. Q, K, and V are points that map the input to a high-dimensional space, and the relationships between them are captured through subsequent transformations.
- Calculate the score function. The similarity between the query and all keys is calculated to obtain the attention score (query relevance). The formula is $\text{score} = QK^T$, which is the matrix multiplication of Q with the transpose of K (i.e., the dot product of every query with every key). This step calculates the similarity between vectors in a high-dimensional space.
- Scaling. The score matrix is scaled by dividing it by $\sqrt{d_k}$, the square root of the vector dimension.
- Masking. If a mask matrix exists, the score matrix elements corresponding to positions in the mask matrix that have a value of True are set to negative infinity. This is because during the entire model’s operation, it may be necessary to ignore some inputs depending on the actual situation. We will explain this in detail in the next article.
- Normalization (alignment function). This normalizes the scaled dot product result, specifically by using the softmax operation to normalize the weights, thus highlighting the more important weights. The formula is $a = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$. This step maps the real-valued scores to a probability distribution.
- Generating the result (context vector function). The weights $a$ are used to take a weighted average of the Values. This can be understood as the output $y$ being an interpolation of values based on key-query similarity. The calculation formula is $y = aV$.
From a functional analysis perspective, the correlation calculation and weighted summation steps in the Attention mechanism can be viewed as a dynamic transformation of the input vector space. This transformation is not fixed but depends on the content of the input data.
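The steps listed above can be sketched end to end in plain Python. This is a minimal illustration, not the reference implementation: the function name `scaled_dot_product_attention`, the nested-list matrix representation, the toy inputs, and the causal mask used in the example are all assumptions. The mask convention follows the text (True means the position is hidden), and the sketch assumes every row keeps at least one visible position.

```python
import math

def matmul(a, b):
    """Multiply nested-list matrices a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K: L x d_k; V: L x d_v; mask: L x L booleans, True = hide."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)                                   # score: Q K^T
    scores = [[s / math.sqrt(d_k) for s in row] for row in scores]  # scaling
    if mask is not None:                                      # masking
        scores = [[-math.inf if hide else s for s, hide in zip(row, mrow)]
                  for row, mrow in zip(scores, mask)]
    weights = [softmax(row) for row in scores]                # normalization
    return matmul(weights, V), weights                        # weighted sum

# Toy inputs: 3 tokens, d_k = d_v = 2 (made-up numbers).
Q = [[2.0, 0.0], [0.0, 4.0], [2.0, 2.0]]
K = [[0.0, 2.0], [4.0, 0.0], [2.0, 2.0]]
V = [[1.0, 1.0], [2.0, -2.0], [2.0, 0.0]]

# Example mask: each position may not look at later positions.
causal = [[j > i for j in range(3)] for i in range(3)]

out, weights = scaled_dot_product_attention(Q, K, V, mask=causal)
```

With this mask the first token can only see itself, so its output row equals its own value vector, while later tokens blend the values of the positions they are allowed to see.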
Next, we will review some key points from the above process in detail.
2.3 Dot Product Attention Function
The Transformer paper uses a multiplication function, or dot product attention function, to calculate similarity. From an abstract algebraic perspective, the attention mechanism is more like a relational operation: it uses the relationship between two elements (such as similarity) to determine whether two elements belong to the same class. Only after classification is a weighted combination based on the similarity between input elements performed.
Solution Selection
As shown below, common similarity calculations include dot product (multiplication) and addition.
Concatenation similarity involves concatenating the two vectors and then applying learnable weights followed by an inner product to obtain the similarity. This is also known as Additive Attention: $\mathrm{score}(q, k) = v^T \tanh(W[q; k]) = v^T \tanh(W_q q + W_k k)$.
The corresponding value for V is:
As can be seen from the formula, the advantages of the addition method are:
- It can handle keys and queries of different dimensions (while the dot product operation requires the query and key to have the same length).
- The calculation is simpler, but it is wrapped with tanh and v, which is equivalent to a complete hidden layer. Therefore, the overall complexity is actually close to that of the multiplication scheme.
- While the performance difference between dot product attention and additive attention is not significant in most tasks, additive attention may slightly outperform dot product attention in some long sequence tasks.
Moreover, the multiplication approach has a disadvantage: as the vector dimension increases, the upper limit of the dot product result becomes higher and higher, and the difference between the dot product results becomes larger and larger. Therefore, scaling needs to be added when calculating the Attention weights.
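To make the comparison concrete, here is a small sketch of the two scoring functions. The names `dot_score`, `additive_score`, `W_q`, `W_k`, and `v` are hypothetical, and the weights are arbitrary numbers rather than trained parameters. It only illustrates that the additive form accepts a query and a key of different lengths, while the dot product does not.

```python
import math


def dot_score(q, k):
    """Multiplicative (dot product) score; q and k must have the same length."""
    if len(q) != len(k):
        raise ValueError("dot product needs len(q) == len(k)")
    return sum(a * b for a, b in zip(q, k))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def additive_score(q, k, W_q, W_k, v):
    """Additive score v^T tanh(W_q q + W_k k); q and k may differ in length."""
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W_q, q), matvec(W_k, k))]
    return sum(vi * h for vi, h in zip(v, hidden))


q = [1.0, 0.5, -1.0]           # query of length 3
k = [0.2, -0.4]                # key of length 2
W_q = [[0.1, 0.2, 0.3],        # projects q to a hidden size of 2
       [0.0, -0.1, 0.5]]
W_k = [[0.4, 0.1],             # projects k to the same hidden size
       [-0.3, 0.2]]
v = [1.0, -1.0]

print(additive_score(q, k, W_q, W_k, v))
print(dot_score([1.0, 2.0], [3.0, 4.0]))
```

The extra projections and the `tanh` are what make the additive score behave like a small hidden layer.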
In terms of performance, the paper “Massive Exploration of Neural Machine Translation Architectures” conducted a comparative experiment.

The results show that the additive attention mechanism is slightly, but consistently, superior to the multiplicative attention mechanism.
So why does Transformer choose dot product attention instead of additive attention? There are some discussions online, and some points of consideration are:
- Dot multiplication can be efficiently parallelized on hardware using matrix multiplication, enabling fast computation.
- Dot product attention can capture the similarity between queries and keys, giving higher weights when queries and keys are similar, which helps the model capture complex dependencies in the input sequence.
- When performing operations on a representation by dividing it into different parts, dot product is a more flexible and convenient method for calculation.
The reasons given in the paper are based on considerations of efficiency and modeling capability, as detailed below:

Interpretation
Let's further interpret the formula. The dot product reflects the angle between two vectors and can be read as the projection of one vector onto the other, multiplied by the length of the latter. The larger the projection, the higher the correlation between the two vectors. If the angle between two vectors is 90 degrees, the two vectors are orthogonal, their dot product is 0, and they have no correlation. In other words, the dot product calculates the product of their aligned lengths. In machine translation, these vectors are word vectors, numerical mappings of words in a high-dimensional space. A high correlation between word vectors indicates that, to a certain extent, when focusing on the current word A, similar words B will also receive more attention.
Let’s take a look at Su Jianlin’s brilliant interpretation of the formula, which will give us a deeper understanding of it.
Decomposing the dot product of the two vectors yields $q_i \cdot k_j = \|q_i\| \, \|k_j\| \cos(q_i, k_j)$, which means the dot product is decomposed into the product of the magnitude of each vector and the cosine of the angle between them. Where:
- $\|q_i\|$ depends only on the current position i, so it does not change the relative size of the attention, but only the sparsity.
- $\|k_j\|$ is the vector magnitude at other positions. It is capable of changing the relative magnitude of the conditional probability $p(j|i)$, but it does not involve the interaction between i and j, and can only be used to express some absolute signals.
- $\cos(q_i, k_j)$ is used to express the interaction between q and k, and it has the greatest degree of freedom.
To increase the relative importance of a certain position j, the model has two options:
- Increase the modulus $\|k_j\|$.
- Increase $\cos(q_i, k_j)$, which is equivalent to decreasing the angle between $q_i$ and $k_j$.
However, due to the "curse of dimensionality," significantly changing the size of the angle in high-dimensional space is not so easy. Therefore, if the goal can be accomplished by increasing the modulus $\|k_j\|$, the model will preferentially choose to do so. The direct consequence is that the training of $\cos(q_i, k_j)$ may not be sufficient (meaning that the angles seen in training form only a limited set, while length extrapolation has to face a larger set, so the model fails to make correct predictions). This may be the main reason why Attention cannot perform length extrapolation.
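The decomposition discussed above can be checked numerically. The following is a small sketch with made-up vectors; the helper names `norm`, `dot`, and `cosine` are hypothetical. It shows that scaling the key's length raises the score without touching the angle, which is the "cheap" option the model is said to prefer.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    return math.sqrt(dot(x, x))


def cosine(x, y):
    return dot(x, y) / (norm(x) * norm(y))


q = [1.0, 2.0, 2.0]
k = [2.0, 0.0, 1.0]

# The dot product equals |q| * |k| * cos(q, k)
score = dot(q, k)
rebuilt = norm(q) * norm(k) * cosine(q, k)
print(score, rebuilt)

# Doubling the key's modulus doubles the score but leaves the angle unchanged
k_longer = [2.0 * a for a in k]
print(dot(q, k_longer), cosine(q, k), cosine(q, k_longer))
```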
2.4 Softmax
Definition
The significance of the Softmax operation is normalization. We call the result before softmax normalization the attention score, and the result after softmax normalization the attention weight. Given real numbers $x_1, \dots, x_n$, the Softmax function transforms them into a probability distribution $p_1, \dots, p_n$, where each $p_i$ is calculated using the following formula:
$$p_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$
The key properties of Softmax are as follows:
- Nonnegativity: For any $i$, $p_i > 0$.
- Normalization: The sum of all outputs is 1, i.e., $\sum_{i=1}^{n} p_i = 1$.
- The use of the exponential function: The exponential function ensures that the output value is positive and amplifies the difference in large values.
Algorithm
The algorithm requires two loops. First, it iterates over the vector to accumulate the denominator (the sum of the exponentials). Then it iterates again to compute the softmax value of each element, i.e., scaling each element by the denominator. This process requires two memory reads and one memory write-back per element. The specific algorithm is as follows.

Specific examples are shown below.
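The two-loop procedure described in this subsection can be sketched as follows. The function name `softmax_two_pass` is hypothetical, and this is the naive form: it does not subtract the maximum first, so it can overflow for large inputs, which practical implementations avoid with an extra pass.

```python
import math


def softmax_two_pass(x):
    """Naive softmax with two loops over the input."""
    # Loop 1: read every element once and accumulate the denominator
    denom = 0.0
    for value in x:
        denom += math.exp(value)
    # Loop 2: read every element again, scale it, and write the result
    out = []
    for value in x:
        out.append(math.exp(value) / denom)
    return out


probs = softmax_two_pass([1.0, 2.0, 3.0])
print(probs, sum(probs))
```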

Necessity
The softmax function is widely used in neural networks for the following reasons:
- During inference, softmax transforms the raw outputs (logits) of the output layer into an effective probability distribution, giving the outputs a probabilistic physical meaning. This facilitates the interpretation and calculation of the loss function while maintaining numerical stability and symmetry. This makes softmax the preferred activation function for multi-class classification problems.
- Guiding model learning during training. In the training of neural networks, the cross-entropy loss function is typically used in conjunction with the softmax function. The cross-entropy loss function quantifies the difference between the probability distribution of the model’s output and the true label, while the output of the softmax function provides a probability distribution. This mechanism allows the model to more effectively adjust weights during training to improve the accuracy of estimating the true probability distribution.
Why incorporate softmax into the attention mechanism? Because it seems that softmax is more harmful than beneficial, for example:
- Without Softmax, the computational complexity is significantly reduced. This is because removing Softmax leaves the product of three matrices, $QK^TV$. Since matrix multiplication is associative, we can first calculate $K^TV$ to obtain a $d \times d$ matrix, and then left-multiply it by Q. Because $d \ll n$, the overall complexity is approximately $O(n)$ (more precisely $O(nd^2)$, compared with $O(n^2d)$ for the original order). In other words, removing Softmax reduces the complexity of Attention to the ideal linear level of $O(n)$.
- Without Softmax, memory usage would be significantly reduced. This is the challenge that FlashAttention aims to address. Without Softmax, we can perform block-based computation on the matrices. For example, we can cut Q, K, and V into blocks along N (the sequence length dimension), multiply one block of Q by one block of $K^T$, and then immediately perform a matrix-matrix multiplication (GEMM) with one block of V. On the one hand, this avoids moving the intermediate $N \times N$ matrix P between HBM and SRAM; on the other hand, the P matrix does not need to be explicitly allocated, eliminating the $O(N^2)$ storage overhead in HBM.
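The associativity argument above is easy to verify on a toy example. In the sketch below, the helpers `matmul` and `transpose` and the integer test matrices are made up for illustration; the multiplication counts use the usual rows × inner × columns estimate for each matrix product.

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


n, d = 6, 2  # sequence length n, head dimension d (d much smaller than n)
Q = [[(i + j) % 3 for j in range(d)] for i in range(n)]
K = [[(2 * i + j) % 4 for j in range(d)] for i in range(n)]
V = [[(i * j + 1) % 5 for j in range(d)] for i in range(n)]

# Original order: (Q K^T) V builds an n x n intermediate matrix
left = matmul(matmul(Q, transpose(K)), V)
# Reordered: Q (K^T V) only builds a d x d intermediate matrix
right = matmul(Q, matmul(transpose(K), V))

cost_left = n * d * n + n * n * d    # quadratic in n
cost_right = d * n * d + n * d * d   # linear in n
print(left == right, cost_left, cost_right)
```

Once a softmax sits between the two products, the reordering is no longer valid, which is why standard attention cannot use this shortcut directly.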
The softmax function is incorporated into the attention mechanism because it has the following advantages or functions.
- The softmax operation essentially quantifies the information contribution of each word. In attention mechanisms, we intuitively want to focus on semantically related words and downplay irrelevant ones. This is essentially a multi-class classification problem. In multi-class classification, we want the neural network output to reflect the probability of each class, meaning the value of each output node represents the probability of the corresponding class. This requires the output values to meet two conditions: first, each output value should be between 0 and 1; second, the sum of all output values should equal 1. Thus, the output can be interpreted as a probability distribution. A direct linear normalization does not reliably deliver both: min-max scaling keeps each value between 0 and 1, but the values do not sum to 1, and dividing by the plain sum breaks down when some scores are negative. Such schemes also do little to reflect the relative strength or confidence level of the original scores. The softmax function, however, satisfies both conditions simultaneously.
- The Softmax function converts each score into a probability by applying an exponential function and then normalizing these exponential values. This ensures that each output value is between 0 and 1, and that the sum of all output values is 1. More importantly, the use of the exponential function amplifies the differences between scores, better reflecting the relative confidence levels in the original scores.
- Given a vector of numbers, Softmax first widens the differences between these numbers (due to the exp function) and then normalizes them. Essentially, it performs a soft version of the argmax operation, or acts as a smooth approximation of argmax. Unlike argmax, which brute-force selects a maximum value (producing a one-hot vector), softmax smooths this output by distributing the 1s corresponding to the maximum value in the one-hot output to other positions according to the magnitude of the input elements. Therefore, the resulting vector approximates a one-hot vector (the degree of approximation varies depending on the order of magnitude of the numbers).
- In addition, when multiplying two matrices K and Q, the rank of the linear product will not exceed d. Softmax will have a certain degree of rank-increasing effect. If softmax is not used, the rank of the linear attention will be even lower and the expressive power will be worse.
Shortcomings
Softmax also has some inherent limitations. For example, the paper "softmax is not enough (for sharp out-of-distribution)" points out that as the input size increases, the output coefficients of the softmax function tend to become more uniformly distributed (attention becomes more dispersed). That is, even if the attention coefficients are sharp for in-distribution instances, feeding in more tokens leads to more dispersed attention (increased attention entropy), so behavior at prediction time becomes inconsistent with training. See the figure below for details.

This paper reminds us that even widely used functions like softmax can have limitations in their applicability, especially when dealing with data outside the training distribution. We need to place greater emphasis on the generalization ability of models and explore more robust model architectures.
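The dispersion effect can be illustrated with a toy setup that is not taken from the paper: one token has a score that is larger than all the others by a fixed gap, and we only increase the number of tokens. The function name `max_attention_weight` and the gap of 5.0 are assumptions for this sketch.

```python
import math


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def max_attention_weight(n, gap=5.0):
    """Largest softmax weight when one score leads n - 1 others by `gap`."""
    scores = [gap] + [0.0] * (n - 1)
    return max(softmax(scores))


for n in (10, 1000, 100000):
    print(n, max_attention_weight(n))
```

The gap between the best score and the rest never changes, yet the weight on the best token keeps shrinking as the sequence grows.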
Improvements
There have also been many improvements made to softmax.
Log-Softmax
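The standard softmax formula involves many exponentiations and divisions, resulting in high computational costs. We can reduce these costs by taking the logarithm of the values. Specifically:
$$\log \mathrm{softmax}(x)_i = \log \frac{e^{x_i}}{\sum_{j} e^{x_j}} = x_i - \log \sum_{j} e^{x_j}$$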
This approach is called Log-Softmax. Compared to Softmax, Log-Softmax offers many advantages, including improved numerical performance and gradient optimization. These advantages are crucial for implementation, especially when training the model is computationally expensive, where they can yield substantial benefits. Furthermore, the use of log probabilities provides better information-theoretical interpretability. When used in classifiers, Log-Softmax penalizes the model when it fails to predict the correct class.
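A minimal sketch of Log-Softmax follows, computed as each input minus the log of the summed exponentials. Subtracting the maximum before exponentiating (the log-sum-exp trick) is a standard stabilization that this sketch assumes; the function name `log_softmax` is chosen here for illustration and is written in plain Python.

```python
import math


def log_softmax(x):
    """Return x_i - log(sum_j exp(x_j)) for every i, computed stably."""
    m = max(x)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in x))
    return [v - log_sum_exp for v in x]


logits = [1000.0, 1001.0, 1002.0]   # exp(1000.0) alone would overflow
log_probs = log_softmax(logits)
print(log_probs)

# Negative log-likelihood of the correct class, as used in classifiers
target = 2
print(-log_probs[target])
```

The loss is small when the correct class has a high log probability and grows as the model puts probability elsewhere.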
Hierarchical Softmax(H-Softmax)
The standard softmax time complexity is $O(N)$ in the number of classes $N$, which may not be significant in classification tasks, as only one calculation is needed after the last layer. However, in NLP generation tasks, we need to perform softmax on a vector the size of the vocabulary to use the softmax value of each word as its likelihood, thus selecting the word with the highest likelihood. When the vocabulary is very large, the computational complexity becomes a serious problem, requiring $O(|V|)$ time for each token prediction, where $|V|$ is the vocabulary size. Therefore, softmax needs to be modified to adapt to practical tasks, and H-Softmax is used to solve this problem.
H-Softmax is the same as the Hierarchical Softmax in word2vec. Its solution is to introduce a Huffman tree. It decomposes the original multi-class softmax classification into multiple sigmoids, and then uses logistic regression to decide whether to take the left or right subtree at each node of the Huffman tree. Each output value is the probability of taking a certain branch of the tree.
Hierarchical softmax uses a binary tree representation of the output layer. The $W$ words of the vocabulary are the leaf nodes, and each internal node represents the relative probabilities of its child nodes. Each word in the vocabulary has a path from the root of the binary tree to itself. Let $n(w, j)$ represent the $j$-th node on the path from the root node to the word $w$, and let $\mathrm{ch}(n)$ represent an arbitrary fixed child node of $n$. Let $[\![x]\!]$ be 1 if $x$ is true and −1 otherwise. The Hierarchical Softmax can then be represented as shown in the following diagram.
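The idea of replacing one big softmax with a chain of sigmoids can be sketched on a toy tree with four words. Everything here is hypothetical: the words, the tree layout, and the node scores. In a real model each node score would be the inner product of a node vector with the hidden state; fixed numbers are used here to keep the example self-contained.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# One score per internal node of the binary tree (made-up values)
node_scores = {"root": 0.8, "left": -0.3, "right": 1.5}

# Path of each word: (internal node, +1 for one child, -1 for the other)
paths = {
    "the": [("root", +1), ("left", +1)],
    "cat": [("root", +1), ("left", -1)],
    "sat": [("root", -1), ("right", +1)],
    "mat": [("root", -1), ("right", -1)],
}


def word_probability(word):
    """Multiply the branch probabilities along the path from root to word."""
    p = 1.0
    for node, sign in paths[word]:
        p *= sigmoid(sign * node_scores[node])
    return p


probs = {w: word_probability(w) for w in paths}
print(probs, sum(probs.values()))
```

Because sigmoid(z) + sigmoid(-z) = 1 at every node, the leaf probabilities sum to 1 without ever normalizing over the whole vocabulary, and each word costs only as many sigmoids as its path is long.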

Adaptive Softmax
Common methods for improving softmax can be broadly categorized into two types: one is to generate an approximate probability distribution using an accurate model, and the other is to generate an accurate probability distribution using an approximate model. These are optimizations from a mathematical perspective. But from a hardware perspective, how should we optimize it in conjunction with a GPU?
The paper “Efficient softmax approximation for GPUs” proposes Adaptive Softmax, which offers some insights. Adaptive Softmax draws on Hierarchical Softmax and some variations. Unlike previous work, this method leverages the characteristics of GPUs to accelerate computation, thereby improving the computational efficiency of the softmax function and making it suitable for neural networks with very large vocabularies.
The idea is simple: most words in an article come from a small part of the vocabulary, a concept known as the long tail theory or the 80/20 rule. Language models, when predicting words, often need to predict the probability of each word (usually using softmax). The vocabulary can be very large, with many low-frequency words. Therefore, we can leverage the uneven distribution of words by dividing them into high-frequency and low-frequency categories during training. We first predict which category a word belongs to, and then predict the specific word. This simple classification significantly reduces the computational cost of softmax. Previously, each word required a calculation over the whole vocabulary $|V|$; now the expected cost is roughly $|V_h| + p_t |V_t|$, where $V_h$ is the high-frequency set, $V_t$ is the low-frequency set, and $p_t$ is the probability that a word falls into the low-frequency set. $|V_h|$ is significantly smaller than $|V|$, while $|V_t|$, although large, is weighted by a small $p_t$. Furthermore, Adaptive Softmax further accelerates computation by combining modern architectures and matrix multiplication operations, making it more suitable for GPU units.
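A back-of-the-envelope sketch of this saving is given below. It assumes a Zipf-like frequency distribution (frequency proportional to 1/rank) and a simple two-level split into a head and a tail; the function name `expected_softmax_cost` and the sizes are made up, and the GPU-specific cost model from the paper is deliberately ignored.

```python
def expected_softmax_cost(vocab_size, head_size):
    """Expected number of output scores computed per predicted word.

    Assumes word frequency proportional to 1 / rank (Zipf-like).
    Returns (expected cost, probability of falling into the tail).
    """
    freqs = [1.0 / rank for rank in range(1, vocab_size + 1)]
    total = sum(freqs)
    p_tail = sum(freqs[head_size:]) / total
    tail_size = vocab_size - head_size
    # Head softmax: the frequent words plus one entry for "the tail cluster".
    # Tail softmax: only evaluated when the word actually is in the tail.
    cost = (head_size + 1) + p_tail * tail_size
    return cost, p_tail


full_cost = 100000
cost, p_tail = expected_softmax_cost(vocab_size=100000, head_size=2000)
print(full_cost, cost, p_tail)
```

Under these assumptions, the two-level scheme computes far fewer scores on average than a softmax over the full vocabulary.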
The hierarchical model of Adaptive Softmax is organized as follows: (i) The first layer includes the most frequent words and vectors representing clusters. (ii) The second layer consists of clusters associated with rare words, with the largest cluster associated with the least frequent words. Specifically, as shown in the figure below, blue represents high-frequency words (generally grouped into one category), and white represents low-frequency words (there are three white boxes representing three categories of low-frequency words). At each time step, the current word is first predicted to belong to either the high-frequency or low-frequency category, and then Softmax is calculated within the obtained category to obtain the final result.
So how do we perform the classification? The paper also provides a calculation model, assuming B is the batch size, d is the size of the hidden layer, and k is the size of the feature, as shown in the formula in the figure below.

2.5 Scaling
The "Scaled" in Scaled Dot-Product Attention refers to scaling, which involves dividing by a denominator after the dot product. Therefore, we raise a question: why is the attention calculation formula divided by $\sqrt{d_k}$?
Conclusion
We refine the original paper's explanation as follows: When the dimension $d_k$ is large, the entries of $QK^T$ tend to be large in magnitude (their variance is proportional to the dimension). As a result, the dot products are widely spread (large variance, with much of the mass at large absolute values), and the softmax output becomes steep. This pushes the dot products into the flat-gradient region of the softmax function, causing the gradient of the softmax function to become very small. This means the model will have difficulty converging, increasing the learning difficulty. Therefore, the Transformer authors scaled $QK^T$ by $\frac{1}{\sqrt{d_k}}$, thus decoupling the steepness of $QK^T$ from $d_k$. This effectively offsets the growth of the dot product caused by the increase in dimension, ensuring that the dot product result is within a reasonable range regardless of the value of the dimension $d_k$, thereby helping to maintain gradient stability and accelerating the model training process.
For why the values become large, please refer to the footnote in the paper.

$\sqrt{d_k}$ is very similar to the hyperparameter Temperature. Temperature is used to adjust the probability of predicted words in the softmax output layer of the model, controlling the randomness and creativity of the generated text.
Problem Derivation
We have summarized the above issues as follows:
- If $d_k$ increases, the variance of $q \cdot k$ will increase.
- Increased variance leads to larger differences between elements of vectors.
- As the difference between elements increases, softmax degenerates into argmax, where the maximum value is 1 and the other values are 0.
- If only one value is 1 and the others are 0, the gradient during backpropagation will become 0, which is known as gradient vanishing.
We will analyze them one by one.
Variance increases
Essentially, when we multiply two matrices (e.g., Q and $K^T$), we compute many dot products between the row vectors of the first matrix and the column vectors of the second. For example, the dot product of the first row of the first matrix with the first column of the second gives the first entry of the output matrix, and so on.
When the components of q and k are independent with a mean of zero and a variance of 1 (for example, standard normal), the variance of the dot product of q and k is $d_k$. The derivation is as follows:
Take a single query vector $q$ from the Q matrix and a single key vector $k$ from the K matrix. Assume that the components $q_i$ and $k_i$ are independent and identically distributed random variables with a mean of 0 and a variance of 1. Then each product $q_i k_i$ also has a mean of 0 and a variance of 1, because $E[q_i k_i] = E[q_i]E[k_i] = 0$ and $\mathrm{Var}(q_i k_i) = E[q_i^2]E[k_i^2] - 0 = 1$. Summing $d_k$ such independent terms, the dot product $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ has an expected value of 0 and a variance of $d_k$. When $d_k$ is large, the variance of the dot product is also large. In other words, the variance of the dot product is proportional to the number of dimensions. When we perform a dot product on low-dimensional vectors, the variance of the output is small. When we perform a dot product on high-dimensional vectors, the output has a high variance.
In the real world, representing complex semantic concepts requires high query and key dimensions, and high dimensionality leads to a larger variance of $q \cdot k$.
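The claim that the variance of the dot product grows in proportion to the dimension can be checked with a small simulation. This sketch draws the components from a standard normal distribution with a fixed seed; the function name `dot_product_variance` and the sample size are arbitrary choices, and the estimates are approximate by nature.

```python
import math
import random


def dot_product_variance(d_k, trials=2000, scale=False, seed=0):
    """Estimate Var(q . k) for random q, k with i.i.d. N(0, 1) components.

    If scale is True, each dot product is divided by sqrt(d_k) first.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        s = sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(d_k))
        if scale:
            s /= math.sqrt(d_k)
        samples.append(s)
    mean = sum(samples) / trials
    return sum((x - mean) ** 2 for x in samples) / trials


for d_k in (4, 16, 64):
    raw = dot_product_variance(d_k)
    scaled = dot_product_variance(d_k, scale=True)
    print(d_k, round(raw, 2), round(scaled, 2))
```

The unscaled estimate tracks the dimension, while the estimate after dividing by the square root of the dimension stays near 1.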
The difference between elements increases
Increased variance leads to larger differences between elements in a vector. This is an obvious conclusion, as increased variance indicates greater variability between data points. Some numbers will be very large, while others will be very small.
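Softmax degenerates
Each component in softmax is as follows:
$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$
If each element of x is expressed as the maximum element minus a difference, i.e., $x_i = x_{max} - \Delta_i$ with $\Delta_i \ge 0$, then softmax can be rewritten as:
$$\mathrm{softmax}(x)_i = \frac{e^{x_{max} - \Delta_i}}{\sum_{j} e^{x_{max} - \Delta_j}}$$
Since $e^{x_{max}}$ is a common factor, it can be factored out and canceled out:
$$\mathrm{softmax}(x)_i = \frac{e^{-\Delta_i}}{\sum_{j} e^{-\Delta_j}}$$
If $x_i$ is much smaller than $x_{max}$, then $\Delta_i$ is large and $e^{-\Delta_i}$ will be close to 0. Therefore, all terms except the one for the maximum element (where $\Delta = 0$ and $e^{0} = 1$) will be close to 0, i.e.:
$$\mathrm{softmax}(x)_i \approx \begin{cases} 1, & x_i = x_{max} \\ 0, & \text{otherwise} \end{cases}$$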
Therefore, as the variance increases, the softmax function degenerates into the argmax function. When the numbers are large, the degenerate softmax tends to assign higher probabilities; when the numbers are small, softmax tends to assign lower probabilities. Ultimately, the degenerate softmax function will allocate most of the probability distribution to the largest element, which is close to assigning the largest element a value of 1 and all other elements a value of 0.
In other words, softmax distributes the 1s corresponding to the maximum value in the one-hot output to other positions according to the size of the input element values. However, when the magnitudes of the input arrays differ significantly, the “1s distributed out” become less and less. When the magnitudes differ to a certain extent, softmax distributes almost all the probability distribution to the label corresponding to the maximum value, and its effect is reduced.
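The degeneration toward argmax is easy to observe by multiplying a fixed score vector by a growing factor, which increases the differences between its elements. The vector and the factors below are arbitrary illustrative values.

```python
import math


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


base = [0.5, -0.2, 0.1, 0.3]
for factor in (1, 10, 100):
    weights = softmax([factor * v for v in base])
    print(factor, [round(w, 6) for w in weights])
```

As the spread grows, the amount of probability "distributed out" to the non-maximum positions shrinks toward zero and the output approaches a one-hot vector.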
Gradient vanishing
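Assuming our input is on a large order of magnitude, then softmax will produce a vector that is close to one-hot encoding, $y \approx [0, \dots, 1, \dots, 0]$, the derivative of which is as follows:
$$\frac{\partial y_i}{\partial x_j} = y_i \left( \mathbb{1}[i = j] - y_j \right)$$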
The result of the above formula is close to 0, meaning almost all gradients are close to zero. Therefore, dot products with very large absolute values will receive almost zero gradients during training, which is very detrimental to optimization based on gradient descent. During backpropagation, the model continuously adjusts and changes larger numbers, while values with lower probabilities are hardly updated. Therefore, since not all parameters are updated, this leads to slow training progress.
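The vanishing gradient can be seen directly in the softmax Jacobian, whose entry (i, j) is y_i times (1 if i equals j, else 0) minus y_i times y_j. The sketch below compares a moderate input with the same input multiplied by 100; the function names and the input values are illustrative assumptions.

```python
import math


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(x):
    """Matrix of partial derivatives d y_i / d x_j for y = softmax(x)."""
    y = softmax(x)
    n = len(y)
    return [[y[i] * ((1.0 if i == j else 0.0) - y[j]) for j in range(n)]
            for i in range(n)]


def largest_entry(matrix):
    return max(abs(v) for row in matrix for v in row)


moderate = [0.5, -0.2, 0.1, 0.3]
large = [100.0 * v for v in moderate]
print(largest_entry(softmax_jacobian(moderate)))
print(largest_entry(softmax_jacobian(large)))
```

For the large-magnitude input, every entry of the Jacobian is close to zero, so almost no gradient flows back through the softmax.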
How to reduce variance?
Therefore, before applying Softmax, we need to find a solution to help reduce the variance of these numbers. Now, our problem has been simplified to reducing the variance of the product matrix $QK^T$ containing the initial attention scores.
The solution proposed by the vanilla Transformer is to scale the matrix $QK^T$ by $\frac{1}{\sqrt{d_k}}$ to bring the variance back to that of the inputs: $\mathrm{Var}\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1$. Because we divide by the square root of the dimension, even if the dimension is increased (which is necessary to capture more complex patterns), the scaling grows with it and the variance stays consistent.
The role of entropy
Information entropy measures uncertainty. In attention mechanisms, entropy can be used to measure the uncertainty of the model’s output, or in other words, the entropy of the attention weight distribution of a query vector can be used to measure its attention to different key vectors.
- High entropy indicates that the model’s attention is scattered, meaning that the query vector has high attention weights on multiple input data (key vectors). The model distributes attention across multiple inputs, thus extracting information from multiple sources without a clear focus. Therefore, a high-entropy distribution can capture broader contextual information, making it more suitable for scenarios requiring global information.
- Low entropy indicates that the model’s attention is focused, meaning that the query vector has a high attention weight on a few key data points, and the model’s selection of these key vectors is more explicit and definite. Therefore, low-entropy distributions tend to selectively extract information, making them more suitable for scenarios that require focusing on key information.
In order for the model results to generalize better to unknown lengths, the Attention mechanism should be designed so that the attention weights $a_{i,j}$ have entropy invariance as much as possible.
Entropy invariance means that the entropy $H_i = -\sum_{j} a_{i,j} \log a_{i,j}$ should be insensitive to the length $n$. More specifically, if we add a few more tokens to an existing set of tokens, the newly calculated $a_{i,j}$ will naturally change, but we hope that $H_i$ will not change too much.
From another perspective, we can view uncertainty as the "focus" of attention: if the entropy is 0, then attention will focus on a single token; if the entropy is $\log n$, then attention is evenly distributed across all tokens. We want the entropy to remain constant because we want existing tokens to continue focusing on the same tokens even after introducing new ones, rather than having the introduction of new tokens excessively "spread" the existing attention, causing a significant change in the summation result. Scaling attention using $\log n$ can alleviate this problem to some extent, that is, scaling the attention from
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Modified to
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{\log_m n}{\sqrt{d_k}} QK^T\right)V$$
Where $m$ is the training length and $n$ is the prediction length. After this modification, the entropy of attention becomes more stable with changes in length.
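The effect of the length-dependent factor can be sketched on random scores. The setup is an assumption for illustration only: the scores stand for one row of the already scaled score matrix and are drawn from a normal distribution with a fixed seed, with a training length of 64 and a prediction length of 512. The sketch only shows that the factor log(n) / log(m) sharpens the weights and so counteracts the growth of entropy at longer lengths; it does not show exact invariance.

```python
import math
import random


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def attention_entropy(scores, factor=1.0):
    return entropy(softmax([factor * s for s in scores]))


train_len, pred_len = 64, 512
rng = random.Random(0)
scores = [rng.gauss(0.0, 2.0) for _ in range(pred_len)]  # one query row

factor = math.log(pred_len) / math.log(train_len)  # log base m of n
print(attention_entropy(scores))            # standard scaling
print(attention_entropy(scores, factor))    # with the extra log factor
print(math.log(pred_len))                   # upper bound: uniform attention
```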
2.6 Summary
Finally, we summarize the computational process of the self-attention mechanism as shown in the figure below.

0x03 Implementation
3.1 Harvard Code
The attention() function defines the operation process of the standard attention mechanism. The calculation formula is: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, which is also the core operation of the entire transformer.
Input & Output
First, it should be noted that some terms used in the text and footnotes below this section are explained as follows.
- batch_size: the number of sentences in a batch.
- seq_length: the sentence length.
- d_model: the dimension of the model.
- head_num: the number of attention heads.
- head_dim: the attention dimension of a single head; its size is d_model / head_num.
- embedding_size: the size of the word embedding.
Secondly, the input parameters for the attention() function are as follows:
- query, key, value: the input vector sets. These are the Q, K, and V in the paper's formula, obtained after the linear projections with the weight matrices $W^Q$, $W^K$, and $W^V$. This projection takes place within the forward() function of the MultiHeadedAttention class. The query, key, and value have two possible shapes:
  - With a single attention head, i.e., when self-attention calls the attention() function directly, the shape is (batch size, seq_length, d_model).
  - With multi-head attention, the shape is (batch size, head_num, seq_length, head_dim). Since the Transformer uses a multi-head attention mechanism, the shape of the query, key, and value will always be the second type.
- Mask: Used to mask certain locations to prevent attention calculations from being performed on those locations. There are two types of masks: src_mask and tgt_mask.
- Dropout: Dropout rate, used to add randomness, which helps prevent overfitting.
The attention() function has two outputs:
- torch.matmul(p_attn, value): The weighted average of value, with weights given by p_attn.
- p_attn: The weights calculated from the query and key. They are not used later.
Here we need to explain what a mask is. There are two types of masks:
- src_mask: Its shape is (batch size, 1, 1, seq_length). Since the mask is the same for all heads, the second dimension is 1, and broadcasting can be used during masked_fill. Here, it is a self-attention mask, so each time step can attend to all other time steps. The third dimension is also 1, so broadcasting is also used.
- tgt_mask: The shape is (batch size, 1, seq_length, seq_length).
Illustrations & Codes
We can use the following diagram to verify the code.

The corresponding code is as follows.
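import math
import torch

def attention(query, key, value, mask=None, dropout=None):
    """
    Compute Scaled Dot-Product Attention.
    query, key, value: the Q, K, V matrices after the linear projections by the weight matrices. The projections are done in the MultiHeadedAttention class introduced later. query, key, value have two possible shapes:
    1. Single-head self-attention: (batch size, seq_length, d_model).
    2. Multi-head self-attention: (batch size, head_num, seq_length, d_model / head_num).
    The Harvard code uses multi-head self-attention, so the shape is always the second one.
    mask: there are two kinds, src_mask and tgt_mask.
    """
    # Take d_k from the size of the last dimension of query. This works because
    # query has the same layout as the input:
    # for single-head self-attention the last dimension is the word vector size, i.e., d_model;
    # for multi-head self-attention the last dimension is d_model / h, where h is the number of heads.
    d_k = query.size(-1)
    # Compute QK^T / sqrt(d_k) to get the attention scores.
    # 1. key.transpose(-2, -1) swaps the last two dimensions, giving K^T.
    # 2. torch.matmul() takes the dot product of every row of query with every
    #    column of K^T (multiply element-wise and sum). This is label 1 in the figure above.
    # 3. Dividing by math.sqrt(d_k) scales the result. This is label 2 in the figure above.
    # scores is square in its last two dimensions: (batch_size, head_num, seq_length, seq_length).
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    # Check whether a mask tensor is used
    if mask is not None:
        # If so, block the unwanted elements.
        # masked_fill() compares the mask and scores position by position; wherever the
        # mask is 0, the score is set to -1e9. The softmax that follows turns these
        # entries into (almost) 0. Strictly speaking, a tiny amount of the masked
        # information is still used, because -1e9 is a finite value rather than
        # negative infinity. This is label 3 in the figure above.
        scores = scores.masked_fill(mask == 0, -1e9)
    # Apply the softmax in the formula: normalize scores over the last dimension to get
    # the attention weights, which sum to 1. This is label 4 in the figure above.
    # Shape of p_attn:
    # 1. Single-head self-attention: (batch size, seq_length, seq_length)
    # 2. Multi-head attention: (batch size, head_num, seq_length, seq_length)
    p_attn = scores.softmax(dim=-1)  # attention weights
    # Check whether dropout is used to randomly zero some weights
    if dropout is not None:
        # If a dropout module is provided, apply it to p_attn.
        p_attn = dropout(p_attn)
    # Use p_attn to take a weighted sum of the value vectors, giving the final output
    # softmax(QK^T / sqrt(d_k)) V, which we call Z.
    # 1. Single-head self-attention: Z has shape (batch size, seq_length, d_model) and is the final result.
    # 2. Multi-head attention: Z has shape (batch size, head_num, seq_length, d_model / head_num).
    #    This is not the final result; the forward() function of MultiHeadedAttention
    #    later merges the heads into (batch size, seq_length, d_model).
    # torch.matmul is label 5 in the figure above.
    return torch.matmul(p_attn, value), p_attn  # return Z and the weights p_attn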
Reanalysis of attention
After reviewing the source code, let’s take a look back at the attention mechanisms in the encoder and decoder.
The purpose of using self-attention in an encoder is to find the relationships within the input sequence itself.
- In the source sequence, for each token, we collect information on which tokens in the source sequence are most relevant to the token itself. This yields a relevance weight matrix (score).
- Next, we calculate the softmax of the score and convert the score into a probability p_attn.
- Using p_attn, the source sequence is converted into a source hidden state. The specific operation is torch.matmul(p_attn, value), where each token combines the relevant information of all tokens in the source sequence.
- Finally, the returned values are the source hidden state torch.matmul(p_attn, value) and p_attn. The source hidden state is passed to the decoder as the parameter Memory.
The decoder uses two attention structures.
- The purpose of using self-attention is to find the relationships within the target sequence itself.
  - For each token in the target sequence, we collect information on which tokens in the sequence are most relevant to the token itself. This yields a relevance weight matrix, or score.
  - Next, we calculate the softmax of the score and convert the score into a probability p_attn.
  - The target sequence is converted into the target hidden state using p_attn. The specific operation is torch.matmul(p_attn, value), where each token combines the relevant information of all tokens in the target sequence.
  - Finally, the returned values are the target hidden state torch.matmul(p_attn, value) and p_attn.
- The purpose of using cross-attention is to align the source sequence with the target sequence.
  - Let the target hidden state be Q, and the source hidden state (parameter Memory) be K and V.
  - Find which tokens in the source hidden state are most correlated with each token in the target hidden state. Obtain the correlation weight matrix (i.e., the score).
  - Next, we calculate the softmax of the score and convert the score into a probability p_attn.
  - The target hidden state is transformed into a new target hidden state using p_attn. The specific operation is torch.matmul(p_attn, value), where each token combines the relevant information of all tokens in the source hidden state.
  - Finally, the returned values are the new target hidden state torch.matmul(p_attn, value) and p_attn.
3.2 llama3
Let's now look at production code. First, here is the overall code for the Transformer. The Transformer is the core of the model, connecting the word embedding layer, the TransformerBlock layers, the normalization layer, and the output layer. As you can see from the code below, the Transformer contains many TransformerBlock layers. Each layer has its own attention mechanism, and each attention mechanism has its own parameters. Therefore, it can be said that attention is a graph-based information transmission mechanism, where the edges between nodes are learned.
class Transformer(nn.Module):
    def __init__(self, params: ModelArgs):
        super().__init__()
        self.params = params
        self.vocab_size = params.vocab_size
        self.n_layers = params.n_layers
        # Token embedding layer
        self.tok_embeddings = VocabParallelEmbedding(
            params.vocab_size, params.dim, init_method=lambda x: x
        )
        # Store the TransformerBlocks (32 in this configuration) in a ModuleList
        self.layers = torch.nn.ModuleList()
        for layer_id in range(params.n_layers):
            self.layers.append(TransformerBlock(layer_id, params))
        # Normalization layer
        self.norm = RMSNorm(params.dim, eps=params.norm_eps)
        # Output layer: input features = embedding dimension, output features = vocabulary size
        self.output = ColumnParallelLinear(
            params.dim, params.vocab_size, bias=False, init_method=lambda x: x
        )
        # Rotation matrix used by the rotary position embedding
        self.freqs_cis = precompute_freqs_cis(
            params.dim // params.n_heads,
            params.max_seq_len * 2,
            params.rope_theta,
        )

    @torch.inference_mode()
    def forward(self, tokens: torch.Tensor, start_pos: int):
        # Batch size and sequence length
        _bsz, seqlen = tokens.shape
        # Input h after token embedding, shape (bsz, seqlen, dim)
        h = self.tok_embeddings(tokens)
        # Make sure the rotation matrix is on the same device (GPU) as h
        self.freqs_cis = self.freqs_cis.to(h.device)
        # Select the rotation angles (frequencies) for the current positions
        freqs_cis = self.freqs_cis[start_pos : start_pos + seqlen]
        mask = None
        if seqlen > 1:
            # Create a (seqlen, seqlen) tensor filled with negative infinity
            mask = torch.full((seqlen, seqlen), float("-inf"), device=tokens.device)
            mask = torch.triu(mask, diagonal=1)  # keep only the strict upper triangle
            # When performing key-value caching, we compute the attention scores
            # only for the new sequence. Thus, the matrix of scores is of size
            # (seqlen, cache_len + seqlen), and the only masked entries are (i, j) for
            # j > cache_len + i, since row i corresponds to token cache_len + i.
            # Horizontally stack an all-zero (seqlen, start_pos) tensor and mask to form the final mask
            mask = torch.hstack(
                [torch.zeros((seqlen, start_pos), device=tokens.device), mask]
            ).type_as(h)
        # The input h is processed by the TransformerBlocks one after another
        for layer in self.layers:
            h = layer(h, start_pos, freqs_cis, mask)
        # Normalize the output of the last TransformerBlock
        h = self.norm(h)
        # Pass the normalized result through the linear output layer
        output = self.output(h).float()
        return output  # shape (bsz, seqlen, vocab_size)
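The mask construction in forward is easy to misread, so here is a small NumPy sketch of the same shape logic. It is an illustration under assumed sizes (3 new tokens, 2 cached tokens), and build_mask is a hypothetical helper name, not a function from the llama3 code.

```python
import numpy as np

def build_mask(seqlen: int, start_pos: int) -> np.ndarray:
    """Additive causal mask of shape (seqlen, start_pos + seqlen)."""
    rows = np.arange(seqlen)[:, None]
    cols = np.arange(seqlen)[None, :]
    # -inf strictly above the diagonal, 0 elsewhere (same pattern as
    # torch.triu(torch.full(..., -inf), diagonal=1) in the listing above).
    new_part = np.where(cols > rows, -np.inf, 0.0)
    # Cached positions are always visible, so their columns are all zero.
    cached_part = np.zeros((seqlen, start_pos))
    return np.hstack([cached_part, new_part])

mask = build_mask(seqlen=3, start_pos=2)
print(mask)
```

Row i of the result corresponds to token start_pos + i: it can see every cached token and every new token up to and including itself.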
Next, let’s take a look at the code for each layer.
class TransformerBlock(nn.Module):
    def __init__(self, layer_id: int, args: ModelArgs):
        super().__init__()
        self.n_heads = args.n_heads
        self.dim = args.dim
        self.head_dim = args.dim // args.n_heads
        self.attention = Attention(args)
        self.feed_forward = FeedForward(
            dim=args.dim,
            hidden_dim=4 * args.dim,
            multiple_of=args.multiple_of,
            ffn_dim_multiplier=args.ffn_dim_multiplier,
        )
        self.layer_id = layer_id
        self.attention_norm = RMSNorm(args.dim, eps=args.norm_eps)
        self.ffn_norm = RMSNorm(args.dim, eps=args.norm_eps)

    def forward(
        self,
        x: torch.Tensor,
        start_pos: int,
        freqs_cis: torch.Tensor,
        mask: Optional[torch.Tensor],
    ):
        h = x + self.attention(self.attention_norm(x), start_pos, freqs_cis, mask)
        out = h + self.feed_forward(self.ffn_norm(h))
        return out
The code for Attention is as follows.
class Attention(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        # Number of K (Key) and V (Value) heads: n_kv_heads == n_heads gives multi-head attention,
        # n_kv_heads == 1 gives multi-query attention, anything else gives grouped-query attention
        self.n_kv_heads = args.n_heads if args.n_kv_heads is None else args.n_kv_heads
        # Model-parallel world size, i.e. the degree of parallelism or the number of devices (GPUs) involved
        model_parallel_size = fs_init.get_model_parallel_world_size()
        # Split the Q (Query) heads across the model-parallel ranks
        self.n_local_heads = args.n_heads // model_parallel_size
        # Split the K (Key) and V (Value) heads across the model-parallel ranks
        self.n_local_kv_heads = self.n_kv_heads // model_parallel_size
        # With 2 local Q heads and 1 local KV head, the local KV head must be repeated twice
        # so that the matrix multiplication lines up
        self.n_rep = self.n_local_heads // self.n_local_kv_heads
        # Output dimension of each (Q) head; the head outputs are concatenated to form the final attention output
        self.head_dim = args.dim // args.n_heads
        # ColumnParallelLinear and RowParallelLinear are two parallel linear layers used for model parallelism
        self.wq = ColumnParallelLinear(
            args.dim,  # number of input features
            args.n_heads * self.head_dim,  # number of output features
            bias=False,  # linear layers inside attention usually have no bias
            gather_output=False,
            init_method=lambda x: x,
        )
        # Shape: (dim, n_kv_heads * head_dim)
        self.wk = ColumnParallelLinear(
            args.dim,
            self.n_kv_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        # Shape: (dim, n_kv_heads * head_dim)
        self.wv = ColumnParallelLinear(
            args.dim,
            self.n_kv_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        # Shape: (n_heads * head_dim, dim)
        self.wo = RowParallelLinear(
            args.n_heads * self.head_dim,
            args.dim,
            bias=False,
            input_is_parallel=True,
            init_method=lambda x: x,
        )
        self.cache_k = torch.zeros(
            (
                args.max_batch_size,
                args.max_seq_len,
                self.n_local_kv_heads,
                self.head_dim,
            )
        ).cuda()
        self.cache_v = torch.zeros(
            (
                args.max_batch_size,
                args.max_seq_len,
                self.n_local_kv_heads,
                self.head_dim,
            )
        ).cuda()

    def forward(
        self,
        x: torch.Tensor,
        start_pos: int,
        freqs_cis: torch.Tensor,
        mask: Optional[torch.Tensor],
    ):
        bsz, seqlen, _ = x.shape
        # Project the input x with wq, wk and wv to obtain xq, xk and xv
        xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)
        # Reshape xq, xk and xv
        xq = xq.view(bsz, seqlen, self.n_local_heads, self.head_dim)
        xk = xk.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)
        xv = xv.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)
        # Apply rotary position embedding to xq and xk
        xq, xk = apply_rotary_emb(xq, xk, freqs_cis=freqs_cis)
        self.cache_k = self.cache_k.to(xq)  # make sure cache_k is on the same device (GPU) as xq
        self.cache_v = self.cache_v.to(xq)  # make sure cache_v is on the same device (GPU) as xq
        # Write xk and xv into the cache
        self.cache_k[:bsz, start_pos : start_pos + seqlen] = xk
        self.cache_v[:bsz, start_pos : start_pos + seqlen] = xv
        # keys and values are read from the start of the cache
        keys = self.cache_k[:bsz, : start_pos + seqlen]
        values = self.cache_v[:bsz, : start_pos + seqlen]
        # Repeat keys and values n_rep times so that the matrix multiplication lines up
        # repeat k/v heads if n_kv_heads < n_heads
        keys = repeat_kv(
            keys, self.n_rep
        )  # (bs, cache_len + seqlen, n_local_heads, head_dim)
        values = repeat_kv(
            values, self.n_rep
        )  # (bs, cache_len + seqlen, n_local_heads, head_dim)
        xq = xq.transpose(1, 2)  # (bs, n_local_heads, seqlen, head_dim)
        keys = keys.transpose(1, 2)  # (bs, n_local_heads, cache_len + seqlen, head_dim)
        values = values.transpose(
            1, 2
        )  # (bs, n_local_heads, cache_len + seqlen, head_dim)
        # Compute the attention scores
        scores = torch.matmul(xq, keys.transpose(2, 3)) / math.sqrt(self.head_dim)
        if mask is not None:
            scores = scores + mask  # (bs, n_local_heads, seqlen, cache_len + seqlen)
        scores = F.softmax(scores.float(), dim=-1).type_as(xq)
        output = torch.matmul(scores, values)  # (bs, n_local_heads, seqlen, head_dim)
        # Reshape output to (bs, seqlen, n_local_heads * head_dim); this is equivalent to
        # concatenating the head_dim of every head
        output = output.transpose(1, 2).contiguous().view(bsz, seqlen, -1)
        # Project output with wo to obtain the final attention output
        return self.wo(output)
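The listing calls repeat_kv without showing it. The sketch below is a NumPy stand-in written for illustration, not the llama3 implementation: it assumes that each KV head is simply repeated n_rep consecutive times along the head axis, and the tensor sizes are made up.

```python
import numpy as np

def repeat_kv(x: np.ndarray, n_rep: int) -> np.ndarray:
    """(bs, slen, n_kv_heads, head_dim) -> (bs, slen, n_kv_heads * n_rep, head_dim)."""
    if n_rep == 1:
        return x
    return np.repeat(x, n_rep, axis=2)  # each KV head appears n_rep times in a row

bs, slen, n_heads, n_kv_heads, head_dim = 1, 5, 4, 2, 8  # assumed sizes
n_rep = n_heads // n_kv_heads
rng = np.random.default_rng(0)
xq = rng.normal(size=(bs, slen, n_heads, head_dim))
keys = repeat_kv(rng.normal(size=(bs, slen, n_kv_heads, head_dim)), n_rep)

# (bs, n_heads, slen, head_dim) @ (bs, n_heads, head_dim, slen) -> (bs, n_heads, slen, slen)
scores = xq.transpose(0, 2, 1, 3) @ keys.transpose(0, 2, 3, 1) / np.sqrt(head_dim)
```

With 4 query heads and 2 KV heads, query heads 0 and 1 share the first KV head and query heads 2 and 3 share the second, which is the grouped-query arrangement described in the comments.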

0x04 Optimization
4.1 Optimization Strategy
For long sequences and high-dimensional models, self-attention is extremely expensive to compute. This computational burden also limits the model’s ability to scale to sequences longer than those seen in training, which gives rise to the length extrapolation problem. Optimizing self-attention is therefore necessary. Many optimization methods exist; this article analyzes several typical ones.
Optimization from a sequence perspective
In the original self-attention mechanism, each token attends to all other tokens. However, the attention distribution in the Transformer is actually uneven. For example, the study “How Do Language Models put Attention Weights over Long Context” found significant differences in the attention distribution across layers. The initial layers have a roughly uniform attention distribution. In the intermediate layers, however, the attention pattern becomes more complex, with most of the probability mass concentrated on the initial tokens (attention sink) and on the most recent tokens (recency bias). That is, most intermediate layers exhibit a “V-shaped” attention distribution, which suggests that many tokens contribute little in those layers. Compressing tokens to reduce the context length might therefore accelerate inference or support longer contexts.
Therefore, the idea of sparse computation (or restricted attention) was proposed, where each token performs attention computation only with a subset of tokens (rather than all of them). The key lies in determining which subset of tokens the current token is paying attention to. Several typical approaches are as follows.
- Atrous self-attention: Each token performs attention calculations with other tokens at equal intervals.
- Local self-attention: Similar to n-gram, each token performs attention calculations with several nearby tokens.
- Sparse attention: Combines atrous attention and local attention. Based on rules (or determined dynamically), each token has both a local receptive field and a global receptive field.
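The three patterns above can be written as boolean masks over (query, key) positions. The sketch below is only an illustration: the window size and dilation are assumed values, and "sparse" is taken here to be the plain union of the local and atrous masks.

```python
import numpy as np

def local_mask(n: int, window: int) -> np.ndarray:
    """Each token attends to neighbours at distance <= window."""
    i, j = np.indices((n, n))
    return np.abs(i - j) <= window

def atrous_mask(n: int, dilation: int) -> np.ndarray:
    """Each token attends to tokens at equal intervals of `dilation`."""
    i, j = np.indices((n, n))
    return (i - j) % dilation == 0

def sparse_mask(n: int, window: int, dilation: int) -> np.ndarray:
    """Union of the local and atrous patterns."""
    return local_mask(n, window) | atrous_mask(n, dilation)

n = 8
m_local = local_mask(n, window=1)
m_atrous = atrous_mask(n, dilation=3)
m_sparse = sparse_mask(n, window=1, dilation=3)
print(m_sparse.astype(int))
```

Entries that are False would be set to negative infinity before the softmax, so only the True positions receive attention weight.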
Several classic models are as follows:
- Longformer uses a form of sparse attention that combines local (sliding-window) attention, atrous (dilated) attention, and global attention. With global attention, tokens at specific positions (such as CLS in BERT) attend to all tokens in the sequence, and all tokens attend to them.
- Reformer is based on the assumption that, after softmax, the attention output for a query is dominated by the few tokens most similar to it, so attention can be approximated using only the Top-N most similar tokens. Reformer uses Locality Sensitive Hashing (LSH) to place similar vectors in the same bucket, so that similarity in the high-dimensional space is preserved in the low-dimensional space. This significantly reduces computational complexity, especially when dealing with large-scale data.
- Linformer proved that the Transformer’s Attention matrix is a low-rank matrix, indicating that most of the matrix’s content can be obtained from a small subset of the largest singular values. The higher the layer, the more information is concentrated on the largest singular values. Linformer made some modifications to the Self-Attention structure (the main difference being the addition of two linear mapping layers), reducing the complexity to linear.
- BigBird employs a block-sparse attention pattern that combines local (window), global (fixed-position), and random attention. Block-sparse attention applies the attention mechanism to consecutive local blocks rather than to the entire sequence. Global attention tokens, such as the CLS token, are added to preserve global interaction: they attend to every token and every token attends to them. Random attention introduces additional random connections so that there is always a path (possibly through several intermediate nodes) between any two positions.
- Dual Chunk Attention (DCA) is a training-free framework for extending the context window of LLMs at inference time. Instead of linearly scaling the position indices or increasing the base frequency of RoPE, DCA reuses the original position indices and their embeddings from the pre-trained model, but redesigns the construction of the relative position matrix so that it reflects the relative positions between two tokens as accurately as possible. DCA is also compatible with Flash Attention and requires only changes to the inference code, with no extensive retraining.
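To make the "two linear mapping layers" in the Linformer item concrete, the sketch below projects the keys and values along the sequence axis before attention. It is a rough illustration under assumptions: E and F are random here (they would be learned in practice), and the sizes n, d, and k are made up.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n, d, k = 64, 16, 8  # sequence length, head dimension, projected length (assumed)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

# Two extra linear maps that compress the sequence axis from n to k.
E = rng.normal(size=(k, n)) / np.sqrt(n)  # applied to the keys
F = rng.normal(size=(k, n)) / np.sqrt(n)  # applied to the values

K_proj, V_proj = E @ K, F @ V             # both (k, d)
P = softmax(Q @ K_proj.T / np.sqrt(d))    # (n, k) instead of (n, n)
out = P @ V_proj                          # (n, d)
```

Because k is fixed, the attention matrix grows linearly with n rather than quadratically.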
Furthermore, length extrapolation is an inconsistency between training and prediction, and the main approach to resolving this inconsistency is to localize attention. Many improvements with good extrapolation are, in a sense, variations of local attention. Localized attention imparts “translation invariance” to the entire model by limiting the perceptual range of attention.
Window truncation can also serve as a good baseline for length extrapolation, but this comes at the cost of forcibly cutting off attention outside the window and sacrificing remote dependencies, so it is not the final solution.
DCA
Qwen2.5-Turbo uses DCA to process long sequences in blocks, which reduces computational complexity and improves the model’s inference speed, so we take a closer look at DCA here.
DCA consists of three components, each of which helps the model effectively capture long-range and short-range dependencies in a sequence.
- Intra-block attention handles tokens within the same block.
- Inter-block attention handles tokens in different blocks.
- Contiguous block attention handles tokens in adjacent blocks.
The image below shows intra-block attention.

The image below shows inter-block attention.

The image below illustrates contiguous block attention, a special case of inter-block attention. Contiguous block attention preserves the locality of the LLM, that is, its tendency to rely primarily on neighboring tokens when predicting the next token.

After combining the three types of attention, the following is obtained.

Optimization from the multi-head perspective
Specifically, this includes MHA, MQA, and GQA, which we will introduce in detail in a later article.
Optimization of MHA from the hardware and software perspectives
At the hardware level, for example, HBM (High Bandwidth Memory) can be used to improve read speed. A more radical option is to abandon the von Neumann architecture and change the way computing units read data from memory: instead of centering the design on computing units, storage becomes the center, giving “in-memory computing” that integrates computation and storage.
On the software side, there have been many recent optimizations, such as Flash Attention and Paged Attention. We will discuss these in detail in a later article.
Optimization from other perspectives
For example, Transformer², which will be introduced below, does not change the structure of the Transformer, but rather performs inference in two separate steps.
4.2 Case Study
Next, we will introduce a few featured cases.
Attention weight refinement
Traditional self-attention models may be insufficient to capture the complex item dependencies in sequence recommendation scenarios due to a lack of explicit emphasis on attention weights, which play a crucial role in allocating attention and understanding the relevance between items. To better leverage the potential of attention weights and improve the ability of sequence recommendation to learn higher-order dependencies, the paper “Pay Attention to Attention for Sequential Recommendation” proposes the Attention Weight Refinement for Sequential Recommendation (AWRSR) method. This method enhances the effectiveness of the self-attention mechanism by giving additional attention to attention weights, thereby achieving a more refined attention distribution of relevance between items.
Existing attention weight operations revolve around calculating the similarity or distance between item embeddings or distributions, as shown in the lower right part of the figure below. This approach does not consider the potential correlations within the attention weights, which could further reveal higher-order transformations within the attention weights themselves. As shown in the upper right part of the figure below, the attention weight matrix A originates from the representation of items in the sequence, encoding the relationship of each element in the sequence relative to all other elements. For example, the first row of A represents the allocation of attention or importance from the first item to all other items in the sequence, meaning that refining these weights (essentially capturing the correlation between weights/attention) has the potential to model higher-order transformations. Given this, comparing the first and second rows of A can help in understanding the dependencies between the two elements located in the first and second positions of the sequence.

The paper designs four refinement mechanisms for computing the dependencies between attention weights, that is, for computing attention weights over the attention weights themselves (pay attention to attention).
- Simple refinement. This mechanism simply applies a new set of trainable matrices to the attention weight matrix A, transforming it into new queries and keys to compute higher-level weights, as shown in Figure 1 below. Here, W is the trainable matrix. The softmax function is then applied to the new weight matrix B; the product of softmax(B) and the set of items is equivalent to summing the values, meaning each value is scaled through refined higher-order attention weights.
- Value-weighted refinement. Building on simple refinement, this mechanism further transforms A through attention to obtain the new weights B. The aim is to refine and re-express the weight correlation in a more complex higher-order space. A matrix form is provided to simplify calculation, as shown in Figure 2 below.
- Additive refinement. The purpose of this mechanism is to merge/balance attention weights at different levels, as shown in Figure 3 below.
- Stochastic refinement. This mechanism is tailored for STOSA (STOchastic Self-Attention), attempting to transform the original stochastic weights A of STOSA into a new form that may preserve the probabilistic properties of the weights. A new attention distribution is calculated for item k using two random matrices, in the form shown in Figure 4 below.
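As a rough illustration of the first mechanism only, the sketch below applies a second round of attention to the weight matrix A. It is a reading of the description above, not the paper's code: the names W_q2 and W_k2, the scaling by the square root of r, the use of the post-softmax A, and all sizes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n, d, r = 5, 8, 4  # items in the sequence, embedding size, refinement size (assumed)
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))                         # item embeddings
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

# Ordinary self-attention weights A and values V.
A = softmax((X @ W_q) @ (X @ W_k).T / np.sqrt(d))   # (n, n)
V = X @ W_v                                         # (n, d)

# Simple refinement: treat the rows of A as inputs to a second attention step.
W_q2 = rng.normal(size=(n, r))                      # hypothetical trainable matrices
W_k2 = rng.normal(size=(n, r))
B = (A @ W_q2) @ (A @ W_k2).T / np.sqrt(r)          # higher-order weights, (n, n)
out = softmax(B) @ V                                # values scaled by refined weights
```

Each row of softmax(B) compares one item's attention pattern with every other item's pattern, which is what "attention to attention" refers to.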

Linear attention
The self-attention mechanism at the core of the Transformer is a significant source of its computational cost. To optimize it, the research community has proposed many techniques, such as sparse attention, low-rank decomposition, and kernel-based linear attention.
The vanilla Transformer uses softmax attention, which requires constructing an N×N attention matrix over all token pairs. For extremely long sequences this matrix becomes enormous: the cost grows with the square of the sequence length N, giving O(N²) complexity.
To alleviate the efficiency bottleneck of standard self-attention mechanisms, the paper “Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention” proposes a kernel-based linear attention mechanism, which decomposes the similarity function into the dot product of feature maps. Following the notation in the Linear Attention work, we define SM(q, k) = exp(qᵀk) as the softmax kernel function. Mathematically, the goal of linear attention is to approximate SM(q, k) using φ(q)φ(k)ᵀ, where φ is a feature map.
The “right product kernel trick” transforms the traditional quadratic computational complexity into linear complexity, significantly reducing the computational burden of processing long sequences. The complexity of each head can be reduced to O(Nd′²), where d′ is the dimension after feature mapping, so the cost grows linearly with the sequence length. Specifically, linear attention avoids repeatedly calculating the entire attention matrix by recursively updating the product of the key and value matrices, so the cost of each generated token stays constant during inference.
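The reordering can be checked numerically. The sketch below is a minimal NumPy illustration with assumed sizes: it uses the elu(x) + 1 feature map from “Transformers are RNNs”, compares the left product (building the N×N matrix) with the right product (building a small key-value summary first), and also shows the recurrent causal form.

```python
import numpy as np

def phi(x):
    """Feature map elu(x) + 1, which is always positive."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention_left(Q, K, V):
    A = phi(Q) @ phi(K).T                       # (N, N) similarity matrix
    return (A / A.sum(-1, keepdims=True)) @ V

def linear_attention_right(Q, K, V):
    KV = phi(K).T @ V                           # (d', d_v) summary, independent of N
    Z = phi(K).sum(0)                           # (d',) normaliser
    return (phi(Q) @ KV) / (phi(Q) @ Z)[:, None]

def linear_attention_causal(Q, K, V):
    """Recurrent form: the state (S, z) is updated once per token."""
    S = np.zeros((Q.shape[1], V.shape[1]))
    z = np.zeros(Q.shape[1])
    out = []
    for q, k, v in zip(phi(Q), phi(K), V):
        S += np.outer(k, v)
        z += k
        out.append(q @ S / (q @ z))
    return np.stack(out)

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(12, 4)) for _ in range(3))  # N = 12, d = 4 (assumed)
left = linear_attention_left(Q, K, V)
right = linear_attention_right(Q, K, V)
causal = linear_attention_causal(Q, K, V)
```

The left and right products give the same result, but only the left one ever materialises an N×N matrix. In the causal form the state has a fixed size, which is why the per-token cost does not grow with the sequence length.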

The figure below illustrates the computation of softmax attention (left) and linear attention (right). The input length is N and the feature dimension is d, where d ≪ N. Tensors in the same box are associated in the computation. The linearized formulation achieves O(N) time and space complexity.

The TransNormer in the paper “The Devil in Linear Transformer” exemplifies linear attention, as shown in the figure below. To maintain the overall linear complexity of the architecture, TransNormer uses local attention in the lower layers, and then uses normalized linear attention with the denominator removed in the later layers.

PolaFormer
In the paper “PolaFormer: Polarity-aware Linear Attention for Vision Transformers,” the authors propose a polarity-aware linear attention mechanism (PolaFormer) designed to address the limitations of previous linear attention models by incorporating neglected negative interactions. Simultaneously, to address the common problem of excessive information entropy in the attention weight distribution within linear attention, they provide the mathematical foundation demonstrating that if an element-wise computed function has positive first and second derivatives, the q,k response can be rescaled to reduce entropy. These enhancements collectively provide a more robust solution to bridge the gap between linearization and softmax-based attention.
Research Background
Linear attention, as a more feasible solution, replaces the Softmax operation in the q,k dot product with kernelized feature maps, effectively reducing the time and space complexity from O(N²d) to O(Nd²). Although linear attention improves computational efficiency, it still falls short of Softmax-based attention in terms of expressive power. The paper analyzes and identifies two main reasons for this deficiency, both stemming from information loss during the Softmax approximation process:
- Negative values are lost. Linear attention models that rely on non-negative feature maps (such as ReLU) cannot maintain consistency with the original q,k dot product. These feature maps only preserve positive-positive interactions, while crucial positive-negative and negative-negative interactions are completely lost. This selective representation limits the model’s ability to capture a full range of relationships, resulting in weakened expressiveness and reduced discriminative power of attention maps.
- The attention distribution has high information entropy. Without the exponential scaling of softmax, linear attention produces a more uniform weight distribution and therefore higher entropy. This uniformity weakens the model’s ability to distinguish strong and weak q,k pairs, impairs its attention to important features, and degrades performance in tasks requiring fine details.
Ideas
The core idea behind polarity-aware attention is to address the limitations of existing linear attention mechanisms, which often discard valuable information from negative components. PolaFormer decomposes the query and key vectors into their positive and negative parts, which allows the mechanism to consider the effects of positive and negative similarity on the attention weights separately. Substituting these decompositions into the inner product of q and k gives the expression shown in the figure below.
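The decomposition can be verified with a few lines of NumPy. This is a sketch of the algebra only, with random vectors standing in for a real query and key; it is not the PolaFormer implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Split each vector into non-negative parts: q = q_pos - q_neg, k = k_pos - k_neg.
q_pos, q_neg = relu(q), relu(-q)
k_pos, k_neg = relu(k), relu(-k)

same_sign = q_pos @ k_pos + q_neg @ k_neg      # positive-positive and negative-negative
opposite_sign = q_pos @ k_neg + q_neg @ k_pos  # positive-negative interactions
full = same_sign - opposite_sign               # recovers the original inner product

relu_only = q_pos @ k_pos                      # what a ReLU feature map alone keeps
print(q @ k, full, relu_only)
```

The four non-negative terms together reproduce the inner product exactly, whereas a ReLU-only feature map retains just the first of them.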

Previous linear attention methods, such as ReLU-based feature mappings, eliminate negative components by mapping them to zero, which leads to significant information loss when approximating the q,k dot product. To address this issue, polarity-aware attention mechanisms separate q and k according to their polarities and independently compute their interactions. The calculation of attention weights is shown in the figure below.

PolaFormer explicitly separates the q,k pairs based on polarity, handling same-sign and opposite-sign interactions during inner product computation. These interactions are processed in two streams, enabling a more accurate reconstruction of the original softmax attention weights. To avoid unnecessary complexity, the authors split the v vector along the channel dimension, handling both types of interactions without introducing additional learnable parameters. The outputs are then concatenated and scaled by two learnable polarity-aware coefficient matrices, applied with element-wise multiplication, which learn the complementary relationship between same-sign and opposite-sign values.
See the image below for details.

To address the issue of excessively high information entropy in the attention weight distribution, a common problem in linear attention models, the authors provide a mathematical foundation demonstrating that if an element-wise computed function has positive first and second derivatives, the q,k response can be rescaled to reduce entropy. This theory helps clarify why previous feature mappings increase information entropy, leading to an overly smooth attention distribution. For simplification, the authors employ a channel-level learnable power function for rescaling, preserving the sharpness of the exponential function inherent in Softmax. This allows the model to capture sharp attention peaks, improving its ability to distinguish between strong and weak responses. Simultaneously, to differentiate the hierarchy between different channels, the authors designed a learnable power function to capture the varying importance of each dimension.
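A small numerical check illustrates why a power function sharpens the distribution. The sketch below is only an illustration: the responses are random positive numbers, and a fixed exponent of 3 stands in for the learnable channel-level exponent described above.

```python
import numpy as np

def entropy(weights):
    """Shannon entropy of the weights after normalising them to sum to 1."""
    p = weights / weights.sum()
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
responses = rng.uniform(0.1, 2.0, size=16)  # positive q,k responses (assumed values)

h_linear = entropy(responses)        # weights proportional to the raw responses
h_power = entropy(responses ** 3)    # rescaled with x -> x**3 (exponent is an assumption)
h_uniform = float(np.log(16))        # upper bound: perfectly uniform attention
print(h_linear, h_power, h_uniform)
```

On the positive axis, x ↦ x³ has positive first and second derivatives, so large responses are amplified more than small ones and the normalised weights become more peaked.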

MiniMax-01
MiniMax-01 is the first model to deploy a linear attention mechanism at large scale. Although linear attention keeps complexity growing linearly with sequence length, it had so far remained a largely “experimental” approach. MiniMax-01 innovates boldly at the attention level: it is the first in the industry to implement a new linear attention mechanism and put it into production. Of its 80 attention layers, each softmax attention layer is preceded by 7 linear attention (lightning attention) layers.
Model Architecture
MiniMax-01 is based on a linear attention mechanism, employs a hybrid architecture (Hybrid-Lightning), and integrates the MoE architecture.

Lightning Attention
MiniMax’s Lightning Attention is a linear attention mechanism, an I/O-aware optimization based on TransNormer. In causal language modeling, unidirectional attention requires a cumulative summation over previous positions, which prevents the computation from being expressed as efficient matrix operations. Lightning Attention solves this problem so that the right product can be computed first, reducing the complexity to O(nd²). Specifically, Lightning Attention avoids the cumulative summation by tiling (block computation), thus achieving linear complexity in practice as well as in theory. For very long sequences, Lightning Attention divides the Q, K, and V matrices into multiple fixed-size blocks along the row dimension. The attention calculation is then divided into intra-block computation (using the left product) and inter-block computation (using the right product). First, the attention scores between tokens within a block are calculated independently (intra-block). Then, information is progressively passed between blocks through a recursive update (inter-block), ultimately capturing global semantic relationships.
This process is similar to group discussions: first, each group solves its own problem, then the results from all groups are combined to arrive at the global answer. This optimization significantly reduces computation and memory requirements, and also lowers the complexity from the quadratic complexity of traditional Softmax attention to linear complexity. Alternatively, you can think of the model as flipping through a huge book; even if you can only read a few pages at a time, it remembers the previous content and eventually processes all the knowledge in the entire book.
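The intra-block and inter-block split can be sketched in NumPy as follows. This is a simplified illustration rather than the actual Lightning Attention kernel: it omits any decay factor, normalization, and I/O-aware tiling details, and the block size and tensor sizes are assumptions.

```python
import numpy as np

def causal_linear_attention_reference(Q, K, V):
    """Quadratic reference: mask the full (n, n) score matrix."""
    n = Q.shape[0]
    M = np.tril(np.ones((n, n)))                # 1 where key position <= query position
    return ((Q @ K.T) * M) @ V

def blockwise_causal_linear_attention(Q, K, V, block: int):
    """Left product inside each block, right product across blocks."""
    n, d = Q.shape
    KV = np.zeros((d, V.shape[1]))              # running summary of all previous blocks
    out = np.empty((n, V.shape[1]))
    for s in range(0, n, block):
        q, k, v = Q[s:s + block], K[s:s + block], V[s:s + block]
        b = q.shape[0]
        M = np.tril(np.ones((b, b)))
        intra = ((q @ k.T) * M) @ v             # intra-block: masked left product
        inter = q @ KV                          # inter-block: right product with the state
        out[s:s + block] = intra + inter
        KV += k.T @ v                           # recursive update passed to the next block
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(10, 4)) for _ in range(3))  # n = 10, d = 4 (assumed)
ref = causal_linear_attention_reference(Q, K, V)
blk = blockwise_causal_linear_attention(Q, K, V, block=4)
```

Only block-sized score matrices are ever built, and the state KV has a fixed (d, d) size, so the total cost grows linearly with n for a fixed block size.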
The following is an algorithm description of Lightning Attention forward propagation.

Hybrid-lightning
To balance efficiency and global information capture capabilities, MiniMax also proposes a Hybrid-lightning approach based on Lightning Attention. In this approach, 7 out of every 8 layers in the Transformer use Lightning Attention for efficient handling of local relationships, while the remaining layer retains traditional Softmax attention. This alleviates the efficiency problem of softmax attention while still capturing key global context, and also enhances the scaling ability of Lightning Attention. By analogy, the traditional mechanism is like reading every single word in a book, while the hybrid mechanism is like picking out the key points and occasionally glancing at the table of contents to check the overall picture. The efficiency difference is obvious.
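The 7-to-1 layer pattern can be written down as a simple schedule. The helper below is hypothetical and only illustrates the layout described above (80 layers, with one softmax layer closing each group of 8); it is not taken from the MiniMax-01 code.

```python
def hybrid_lightning_schedule(n_layers=80, period=8):
    """Return the attention type of each layer: every `period`-th layer is softmax."""
    return [
        "softmax" if (i + 1) % period == 0 else "lightning"
        for i in range(n_layers)
    ]

schedule = hybrid_lightning_schedule()
print(schedule[:8])
print(schedule.count("lightning"), schedule.count("softmax"))
```

With these assumed settings, each group of eight layers consists of seven lightning layers followed by one softmax layer.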
In addition, it introduces Varlen Ring Attention to directly concatenate the entire text into a continuous sequence, allowing variable-length sequence data to be allocated resources in the model on demand; it uses data packing on pre-trained data to concatenate text of different lengths into continuous long sequences; and it improves Linear Attention Sequence Parallelism (LASP+) in distributed computing, enabling the model to collaborate efficiently across multiple GPUs without the need for windowing the text.
The table below shows the formulas for calculating the number of attention architecture parameters and FLOPs based on the number of layers l, model dimension d, batch size b, and sequence length n. It is clear that the larger the model size, the more significant the advantages of Lightning Attention and Hybrid-lightning over softmax attention become.

Transformer²
The paper “Transformer²: SELF-ADAPTIVE LLMS” proposes a machine learning system called Transformer², which dynamically adjusts model weights based on different tasks. Through singular value tuning and adaptive weight strategies, it selectively adjusts individual components in the weight matrix in real time, improving the generalization and adaptability of LLMs and enabling them to adapt to unseen tasks. Several interesting ideas are as follows:
- Although the paper is titled Transformer², it does not actually change the structure of the Transformer, but rather performs inference in two separate steps.
- Use SVD-based fine-tuning for training instead of LoRA.
- Use the REINFORCE method instead of SFT.
Research Background
Adaptability
In nature, adaptation is a very common phenomenon. For example, octopuses can quickly change their skin color and texture to blend into their environment, thus avoiding predators and catching prey; the human brain can reconnect its neural circuits after injury, enabling individuals to regain lost functions and adapt to new ways of thinking or behaving. The adaptability exhibited by organisms allows life to thrive in constantly changing environments. All of this embodies the classic saying—“survival of the fittest.”
In the field of artificial intelligence, the concept of adaptation is equally compelling. Imagine a machine learning system that can dynamically adjust its weights to continuously learn and evolve in unfamiliar environments. Compared to static AI models deployed in an environment, such adaptive models are significantly more efficient at learning and hold the promise of becoming lifelong models that remain consistent with the dynamic nature of the real world.
SVD
Just as the human brain stores knowledge and processes information through interconnected neural pathways, an LLM stores knowledge in its weight matrices. These matrices are the “brain” of the LLM, preserving the essence of what it has learned from training data. To understand this “brain” and ensure its effective adaptation to new tasks, a careful study of its internal structure is necessary. Singular Value Decomposition (SVD) provides valuable insights into this.
SVD achieves this by identifying principal components in the LLM weight matrix. SVD decomposes the vast and complex knowledge stored in the LLM into smaller, meaningful, and independent parts (such as different components like mathematical and language understanding). Research has found that enhancing the signal of certain components while suppressing the signal of others can improve the performance of the LLM in downstream tasks.
SVD can be viewed as a surgeon meticulously operating on the brain of an LLM, separating the knowledge stored there into these smaller, independent parts.
In other words, pre-training already equips the model with sufficient ability to solve problems, while fine-tuning essentially brings that ability to the surface. Therefore, lightweight fine-tuning can be achieved by adjusting only some principal components of the singular value matrix. Furthermore, since only the scale of the singular values is fine-tuned, the model can be trained with a small amount of data without causing large-scale forgetting or collapse.
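The idea of tuning only the singular values can be sketched as follows. This is an illustrative NumPy example, not the paper's training code: the weight matrix is random, the scale vector z is a made-up stand-in for values that would be learned, and adapt is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))                        # stand-in for one weight matrix
U, S, Vt = np.linalg.svd(W, full_matrices=False)   # W = U @ diag(S) @ Vt

def adapt(z):
    """Rescale the singular values by z while keeping U and Vt frozen."""
    return U @ np.diag(S * z) @ Vt

z = np.array([1.2, 1.0, 0.8, 1.0])                 # hypothetical learned scales
W_new = adapt(z)

print("trainable parameters:", z.size, "instead of", W.size)
```

Only one number per singular value is trained. Setting every entry of z to 1 recovers the original matrix exactly, so the pre-trained behaviour is the starting point of the adaptation.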
Research Motivation
While composability and scalability are crucial for effective adaptation, current LLM training methods struggle to achieve both simultaneously. Sakana AI’s research aims to propose a groundbreaking solution to realize this vision and bridge these gaps.
Traditionally, LLM post-training attempts to optimize a model with broad capabilities through a single comprehensive training run. For an LLM, incorporating even a single sentence of new knowledge requires retraining. While this “one-shot” fine-tuning framework is ideal from the standpoint of simplicity, it’s difficult to implement in practice. For example, post-training remains highly resource-intensive, leading to significant computational costs and extremely long training times. Furthermore, introducing more diverse data often presents clear performance trade-offs, making it challenging to simultaneously overcome overfitting and task interference. Additionally, current fine-tuning processes often suffer from catastrophic forgetting and low generalization. Fine-tuning a model on a specific dataset and then evaluating it will likely reveal a significant drop in performance on other tasks, and even some degradation of the model’s general capabilities (overfitting + forgetting). This is because while the fine-tuning process is simple, it compromises the model’s original capabilities.
In contrast, adaptive models offer a more flexible and efficient approach. Instead of attempting to train an LLM all at once to perform all tasks, it’s better to develop expert modules offline and augment them into the base LLM on demand. This allows the model to dynamically modify its behavior based on the current task without constant readjustment. In addition to the benefits of having independent components, this modularity supports continuous learning, enabling the model to add new skills over time without catastrophic forgetting. Furthermore, adaptive LLMs reflect a well-established principle in neuroscience and computational biology: the brain activates specific regions based on the current task and dynamically reorganizes its functional networks to respond to evolving task demands.
In principle, the first step toward an adaptive LLM can be achieved by developing specialized expert modules, each fine-tuned using techniques such as LoRA. These expert modules can then be dynamically combined at runtime according to task requirements, a process that can be efficiently managed using an MoE-like system. However, making this approach both scalable and composable presents several challenges. First, fine-tuning the LLM to create multiple expert modules significantly increases the number of parameters that need to be trained. In fact, even with parameter-efficient methods like LoRA, the cumulative size of these modules grows rapidly, leading to increased storage and computational demands. Second, these expert modules are prone to overfitting, a phenomenon particularly prevalent when training on smaller datasets or in narrower task domains. Third, the flexible combination of these expert modules also presents challenges that remain unresolved.
Furthermore, multiple LoRAs lack composability. LoRAs can be combined in diffusion models (for example, a style LoRA and a character LoRA can be applied to a single model simultaneously and act independently), but this is not a property of LoRA itself; rather, it reflects the stability of the diffusion algorithm. For large language models, combining multiple LoRAs has no clear mathematical meaning. By contrast, singular value tuning is mathematically meaningful because it is interpretable (vector direction and scale) and composable (e.g., through interpolation), so multiple tuners can be combined and used jointly.
Ideas
Singular Value Fine-Tuning (SVF)
To overcome these limitations on composability and scalability, the Transformer² authors first proposed Singular Value Fine-Tuning (SVF), a novel Parameter-Efficient Fine-Tuning (PEFT) method that uses reinforcement learning (RL) to enhance or suppress the signals of different “brain” components, yielding adaptive and efficient building blocks.
SVF uses SVD to decompose the “brain” (i.e., the weight matrix) of an LLM into several independent components, which may be shared across multiple tasks. For example, some components are shared between language understanding and reasoning tasks. During training, SVF uses RL to train combinations of these components to handle different tasks. During inference, SVF first identifies the task type and then dynamically adjusts the combination of components.
Specifically, during the training phase, SVF learns a set of z-vectors, with one z-vector corresponding to each downstream task. Each z-vector can be viewed as an expert for that task; it’s a compact representation specifying the expected strength of each component in the weight matrix, acting as an “amplifier” or “attenuator” to modulate the influence of different components on the model’s behavior. For example, suppose SVF decomposes the weight matrix into five components [A, B, C, D, E]. For a mathematical task, the learned z-vector might be [1, 0.8, 0, 0.3, 0.5], indicating that component A is crucial for the mathematical task, while component C has almost no impact on its performance. For a language understanding task, the z-vector might be [0.1, 0.3, 1, 0.7, 0.5], indicating that although component C contributes less to the mathematical task, it is crucial for the language understanding task. SVF leverages RL to learn these z-vectors on a predefined set of downstream tasks. The learned z-vectors enable Transformer² to adapt to a variety of new downstream tasks while introducing only a minimal number of additional parameters (i.e., z-vectors). Why is this parameter count so low? This is because matrices U and V are not involved in training, so only a one-dimensional vector z is needed to express the change of a specific dense matrix.
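As a rough illustration of why a z-vector is so compact, the sketch below rescales the singular values of a small random matrix with NumPy, computing W' = U diag(σ * z) Vᵀ. The matrix, the z values, and the helper name `apply_expert` are made up for illustration; the actual method applies this idea to the weight matrices of an LLM and learns z with RL, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one frozen weight matrix of a pre-trained model (assumption).
W = rng.standard_normal((6, 4))

# Decompose once; U, sigma, and Vt stay frozen, only z would be trained.
U, sigma, Vt = np.linalg.svd(W, full_matrices=False)

def apply_expert(z):
    """Rebuild the weight matrix with each singular value scaled by z."""
    return U @ np.diag(sigma * z) @ Vt

# A z-vector of all ones leaves the weight matrix unchanged.
W_same = apply_expert(np.ones_like(sigma))

# Hypothetical expert vector: amplify component 0, switch off component 2.
z_math = np.array([1.2, 0.8, 0.0, 0.3])
W_math = apply_expert(z_math)

trainable = z_math.size   # parameters SVF would train for this matrix
full = W.size             # parameters full fine-tuning would train
print(trainable, full)
```

For this 6x4 matrix, the expert is described by 4 numbers instead of 24, and the gap widens quickly as the matrix grows, because z has only min(rows, columns) entries.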
The figure below provides an overview of SVF. During training, Transformer² uses SVF and RL to learn “expert” vectors z, which scale the singular values of the weight matrix. During inference, Transformer² proposes three different methods to adaptively select/combine the learned expert vectors.

Furthermore, Transformer² trains with REINFORCE instead of SFT. The authors explain that RL places lower requirements on the dataset. Because RL optimizes a reward signal rather than next-token prediction, the training data does not need to contain CoT-style explanations, which allows training on more general datasets.
Adaptability
During the inference phase, Transformer² employs a two-stage adaptation strategy to effectively combine task-specific z-vector sets. The name Transformer² also reflects its two-step process. Its core lies in its ability to dynamically adjust key components within the weight matrix.
First, the model analyzes the incoming task to understand its requirements, and then applies task-specific adjustments to generate the optimal result. That is, the first inference analyzes the task type, and the second inference selects a specific tuner (or, similar to MoE, performs combined inference among experts) based on the task type to solve the specific problem. By selectively adjusting key components of the model weights, this framework allows the LLM to dynamically adapt to new tasks in real time.
In the first stage, Transformer² runs the model and observes its test-time behavior, gathering the information needed to understand which skills the current problem requires. For a given task or a single input prompt, Transformer² identifies the characteristics of the task using one of three adaptation methods.
- Prompt-based adaptation: A specially designed adaptation prompt asks the LLM to classify the input prompt, and the appropriate z-vector is selected.
- Classifier-based adaptation: A task classifier trained with SVF identifies the task during inference, and the appropriate z-vector is selected.
- Few-shot adaptation: Multiple pre-trained z-vectors are combined through weighted interpolation.
These three methods give Transformer² alternative ways to identify the task and to select or combine expert vectors, which is what makes its task adaptation both effective and efficient.
In the second stage, the Transformer² framework modulates the weights accordingly by combining the z vectors, thereby generating the final response best suited to the task.
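The selection and interpolation steps can be sketched in a few lines of plain Python. The expert values below reuse the illustrative z-vectors from the SVF example earlier in this section; the function names, the task labels, and the all-ones fallback for unknown tasks are assumptions made for this sketch. How the interpolation weights are chosen from the few-shot examples is not covered here.

```python
# Illustrative expert vectors (values taken from the example in the text).
EXPERTS = {
    "math": [1.0, 0.8, 0.0, 0.3, 0.5],
    "language": [0.1, 0.3, 1.0, 0.7, 0.5],
}
NUM_COMPONENTS = 5

def select_expert(task_label):
    """Prompt- or classifier-based adaptation: pick one z-vector by label.

    Falling back to all ones (the unmodified base model) for an unknown
    label is an assumption of this sketch.
    """
    return EXPERTS.get(task_label, [1.0] * NUM_COMPONENTS)

def interpolate_experts(weights):
    """Few-shot adaptation: z' = sum over k of alpha_k * z_k."""
    z = [0.0] * NUM_COMPONENTS
    for name, alpha in weights.items():
        for i, value in enumerate(EXPERTS[name]):
            z[i] += alpha * value
    return z

z_selected = select_expert("math")
z_mixed = interpolate_experts({"math": 0.5, "language": 0.5})
print(z_selected)
print([round(v, 3) for v in z_mixed])
```

The resulting vector is then used in the second pass to rescale the singular values of the weight matrices before generating the final answer.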
The following diagram shows the overall architecture of Transformer².

Titans
The paper “Titans: Learning to Memorize at Test Time” proposes a novel neural long-term memory module that lets attention focus on the current context while still drawing on long-term historical information. The advantage of this neural memory is that it can be trained quickly in parallel while keeping inference fast. The paper argues that attention can serve as short-term memory, because its context is limited but its dependency modeling is accurate, while neural memory, because it can memorize data, can serve as longer-term, more persistent memory. Based on these two modules, the paper introduces a new architecture, Titans, and proposes three variants that integrate memory into the architecture.
Research Background and Motivation
In recent years, recurrent models and attention mechanisms have been widely used in deep learning. Recurrent models aim to compress data into a fixed-size memory (the hidden state), while attention mechanisms let the model attend to the entire context window, capturing direct dependencies between all tokens. However, this more precise dependency modeling incurs quadratic computational cost, which limits the model’s context length.
Core Innovation
The core innovation of the paper is a neural long-term memory module that can learn and memorize at test time. It learns the historical context and helps attention process the current context while using past information. The results show that this neural memory can be trained quickly in parallel while keeping inference fast.
This module works as follows:
- Memory Acquisition: This module treats the training process as an online learning problem, aiming to compress past information into its parameters. Inspired by human memory, this module considers “unexpected” events (i.e., surprising inputs) as more memorable. It measures the “surprise level” of the input by calculating the gradient of the neural network relative to the input and uses this metric to update the memory.
- Forgetting Mechanism: To address the problem of limited memory, this module introduces an adaptive forgetting mechanism that takes into account memory size and the degree of surprise of the data, thereby better managing memory.
- Memory Structure: The paper explores different memory structures and finds that deep memory modules (i.e., multilayer perceptrons) are more effective than linear models.
- Memory Retrieval: This module retrieves the memory corresponding to a query through a simple forward pass (without updating weights).
Titans Architecture
Based on the neural long-term memory module, the paper proposes the Titans architecture, which consists of three branches:
- Core branch: Uses attention mechanisms for data processing, focusing on a limited context window.
- Long-term memory branch: Uses neural long-term memory modules to store and recall historical information.
- Persistent Memory: Encodes task-related knowledge using learnable but data-independent parameters.
The paper proposes three Titans variants that integrate memory into the architecture in different ways.
- Memory as a Context (MAC): Treats long-term memory as context for the current information and uses attention to fuse the two.
- Memory as a Gate (MAG): Uses a gating mechanism to fuse long-term memory with the output of the core branch.
- Memory as a Layer (MAL): Treats the long-term memory module as a layer in a deep neural network.
Long-term memory
To design a long-term neural memory module, the model must be able to encode abstractions of past history into its parameters. A simple approach would be to train a neural network and expect it to memorize its training data. However, memorization has almost always been regarded as an undesirable phenomenon in neural networks: it limits the model’s generalization ability, raises privacy concerns, and leads to poor performance at test time.
For this reason, Google argues that an online meta-model is needed, one that learns how to memorize or forget data at test time. In this setup, the model learns a function that is capable of memorization without overfitting the training data, which gives better generalization at test time.
- Learning Process and Surprise Metric. The key idea behind training long-term memory is to treat training as an online learning problem that compresses past information into the parameters of the long-term neural memory module. Inspired by the human tendency to remember events that violate expectations (that is, surprising events), the model’s surprise can be defined simply as its gradient with respect to the input. The larger the gradient, the more the input differs from past data. Using this surprise score, the memory is updated as shown in formula 1 of the figure below.
- This surprise metric alone can miss important information that arrives after a very surprising event. From the perspective of human memory, an event may be memorable without continuing to surprise us over a long period. To address this, Google splits the surprise metric into (1) past surprise, which measures how surprising the very recent past was, and (2) momentary surprise, which measures how surprising the incoming data is. See formula 2 in the figure below.
- The surprise metric above is based on a loss function ℓ(·;·). Google focuses on associative memory, which aims to store past data as key-value pairs. Similar to the Transformer, given an input x_t, two linear layers project x_t into a key and a value, as shown in formula 3 of the figure below. The memory module is expected to learn the association between keys and values, so the loss is defined as shown in formula 4 of the figure below.
- Forgetting Mechanism. When dealing with very long sequences (such as millions of tokens), deciding which past information should be forgotten is crucial, even with a deep or very large matrix-valued memory. Google therefore uses an adaptive forgetting mechanism that lets the memory forget information that is no longer needed, which makes better use of its limited capacity. Given the next token x_t, the update with forgetting is shown in formula 5 of the figure below.
- Retrieving a Memory. Having covered how to design and train a long-term memory module that can learn to memorize at test time, the remaining key question is how to retrieve information from the memory. Google simply uses a forward pass without a weight update (i.e., inference) to retrieve the memory corresponding to a query. Formally, given an input x_t, a linear layer projects the input into a query, and the corresponding (or useful) information is retrieved from the memory as shown in formula 6 of the figure below.
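The steps in this list can be put together in a small NumPy sketch. It makes several simplifying assumptions that are not in the paper: the memory is a single matrix M rather than a deep MLP, the projections W_K and W_V are random stand-ins for learned layers, and the step size theta, the surprise decay eta, and the forgetting rate alpha are fixed constants instead of data-dependent values. The function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Random stand-ins for the learned key and value projections (assumption).
W_K = rng.standard_normal((d, d)) / np.sqrt(d)
W_V = rng.standard_normal((d, d)) / np.sqrt(d)

def loss_and_grad(M, x):
    """Associative loss ||k M - v||^2 and its gradient with respect to M."""
    k, v = x @ W_K, x @ W_V
    err = k @ M - v
    return float(err @ err), 2.0 * np.outer(k, err)

def memory_step(M, S, x, theta=0.05, eta=0.5, alpha=0.01):
    """One test-time update: surprise with momentum, then forgetting."""
    _, grad = loss_and_grad(M, x)
    S = eta * S - theta * grad      # past surprise + momentary surprise
    M = (1.0 - alpha) * M + S       # forget a little, then write
    return M, S

def retrieve(M, q):
    """Retrieval is a forward pass only; M is not modified."""
    return q @ M

# Show the same (unit-norm) token repeatedly and watch the loss fall.
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
M, S = np.zeros((d, d)), np.zeros((d, d))
before, _ = loss_and_grad(M, x)
for _ in range(50):
    M, S = memory_step(M, S, x)
after, _ = loss_and_grad(M, x)
print(before, after)
```

Because of the forgetting term, the loss settles slightly above zero rather than reaching it, which is the intended trade-off: capacity is kept free for newer information.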

Fusion Memory
The next important problem to be solved is: how to effectively and efficiently integrate neural memory into deep learning architectures?
From a memory perspective, the K and V matrix pairs in a Transformer can be interpreted as associative memory blocks. Due to their precise modeling of dependencies and finite context windows, they can be used as short-term memory modules to handle the current context window size. On the other hand, neural memory, capable of continuously learning from data and storing it in its weights, can thus function as long-term memory. Google addresses these questions with three different Titans variants.
If we consider tokens as instructions, then Test Time Scaling Law might be about modifying the token sequence to allow the model to execute according to the new process. Titans might be a prototype of such a self-verifying and error-correcting mechanism. Attention mechanisms can be compared to on-chip instruction caches, and neural memory models can be compared to code segments.
Memory as a Context (MAC)

The architecture of Titans’ first variant, MAC, is shown in the diagram above, using memory as the context of current information.
This architecture has two key advantages. First, the attention module sees both the historical and the current context, so it can decide, based on the current data, whether information from long-term memory is needed. Second, the attention module helps long-term memory store only the useful information from the current context. Not all tokens in a segment are useful, and memorizing every token could cause the memory to overflow, so attention helps the memory determine which information is worth keeping and thereby manage its capacity better.
Memory as a Gate (MAG)

The architecture of the second variant of Titans, MAG, is shown in the diagram above:
One branch uses the input data directly to update the long-term memory, and the second branch uses Sliding Window Attention (SWA). The overall attention mask for this architecture is shown in the figure above: SWA acts as a precise short-term memory, while the neural memory module acts as a fading memory for the model. This design can also be viewed as a multi-head architecture in which the heads have different structures.
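To make the short-term side concrete, here is a minimal sketch of a causal sliding-window attention mask in plain Python. It covers only the SWA part of the mask; any additional memory tokens in the real architecture are left out, and the function name and sizes are illustrative.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Each position sees itself and the (window - 1) positions before it.
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = sliding_window_mask(seq_len=5, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x....
# xx...
# xxx..
# .xxx.
# ..xxx
```

Tokens that fall out of the window are no longer visible to attention, which is why a separate long-term memory branch is needed to carry older information forward.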
Memory as a Layer (MAL)

The third variant of Titans, MAL, uses the neural memory as a layer of a deep neural network. This design is more common in the literature, where hybrid models stack recurrent models with full or sliding window attention.
SANA
The paper “SANA: EFFICIENT HIGH-RESOLUTION IMAGE SYNTHESIS WITH LINEAR DIFFUSION TRANSFORMERS” defines linear attention as follows.

Linear attention can weaken the effectiveness of the attention operation. To compensate, Sana builds its FFN from 1x1 convolutions (the equivalent of a per-token MLP) plus a 3x3 depthwise convolution. This not only improves the fitting capacity of the whole Transformer block but also removes the need for positional encoding, since convolutions naturally model relative positions and break the permutation symmetry of the Transformer.
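For reference, the ReLU-based linear attention used in Sana can be written as O_i = ReLU(Q_i) (Σ_j ReLU(K_j)ᵀ V_j) / (ReLU(Q_i) Σ_j ReLU(K_j)ᵀ). The NumPy sketch below is a single-head illustration of that formula, not Sana’s implementation; the function names and the small `eps` added to the denominator for numerical safety are assumptions. It also checks the result against the explicit N x N form to show that both compute the same thing.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def linear_attention(Q, K, V, eps=1e-6):
    """O(N) form: the two sums over j are computed once and shared."""
    q, k = relu(Q), relu(K)
    kv = k.T @ V               # shape (d, d_v): sum_j relu(K_j)^T V_j
    k_sum = k.sum(axis=0)      # shape (d,):     sum_j relu(K_j)
    return (q @ kv) / (q @ k_sum + eps)[:, None]

def quadratic_reference(Q, K, V, eps=1e-6):
    """Same result via the explicit N x N similarity matrix."""
    scores = relu(Q) @ relu(K).T
    return (scores @ V) / (scores.sum(axis=1) + eps)[:, None]

rng = np.random.default_rng(0)
N, d = 8, 4
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

out_linear = linear_attention(Q, K, V)
out_reference = quadratic_reference(Q, K, V)
print(np.allclose(out_linear, out_reference))
```

The saving comes from associativity: multiplying Kᵀ by V first gives a d x d matrix whose size does not depend on the number of tokens N, so the cost grows linearly with N instead of quadratically.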

0xFF Reference
- 2024 || CorDA: Principal Component Fine-tuning for Content-Related Large Models Machine Learning and Large Models
- 2024 || MiLoRA: Fine-tuning large models while preserving key components. Machine Learning and Large Models
- [NDSS 2024] Zhejiang University proposes a dynamic attention mechanism: improving the robustness of Transformer models without additional resource consumption [For Machine Learning]
- [NLP] Adaptive Softmax listenviolet
- [Detailed Reading of Classics + Code Analysis] Attention is not all you need [Update - March 22, 2021] Lost Little Bookboy
- A Mathematical Framework for Transformer Circuits Nelson Elhage*, Neel Nanda*
- CorDA: Context-Oriented Decomposition Adaptation of Large Language Models
- Efficient softmax approximation for GPUs
- ICLR 2025 | Polarity-Aware Linear Attention! Zhang Zheng’s Team at Harbin Institute of Technology Proposes the PolaFormer Visual Foundation Model [Machine Heart]
- MiLoRA: Harnessing Minor Singular Components for Parameter-Efficient LLM Finetuning
- A Garfield compilation of specific implementations of Lightning Attention in MiniMax-01
- More About Attention Li Xinchun
- Pay Attention to Attention for Sequential Recommendation
- PolaFormer: Polarity-aware Linear Attention for Vision Transformers
- RecSys’24 | Enhancing Self-Attention Mechanisms with Additional Attention for Sequence Recommendation [Qiufeng’s Study Notes]
- softmax is not enough (for sharp out-of-distribution)
- Softmax and its variant DengBoCong
- Synthesizer: Rethinking Self-Attention for Transformer Models
- Titans: Learning to Memorize at Test Time
- Mathematical Framework of Transformer Circuits (Hao Bai)
- Transformer Explainer: Interactive Learning of Text-Generative Models
- Transformer Interpreter: Interactive Learning for Text Generation Models (Wuying Temple [AI Empire])
- Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
- Transformer²: Self-Adaptive LLMs
- Transformer² aims to create “living” AI models, dynamically adjusting weights and adapting to the environment like an octopus. (Machine Learning)
- Why is the attention mechanism scaled in Transformer? Learn by looking at the diagram.
- Why did Transformer introduce MHA? (Tech blogger [Ding Shixiong’s Big Model])
- Transformer’s creators have launched Transformer²! The AI model is now alive, dynamically adjusting its weights. (New Intelligence)
- Transformer Upgrade Path: 15. Key Normalization Aids Length Extrapolation (Su Jianlin)
- A Fundamental Insight into the Multi-Head Self-Attention Mechanism of Transformers Author: Nikolas Adaloglou Translator: Wang Qingfa
- [NLP] What to do when the vocabulary is too large—Adaptive softmax model and code analysis by Nanfeng
- What is Scaled Self-Attention? Why is scaling necessary when calculating attention weights in Transformer? The Way of AI Algorithms
- Explaining the fundamental differences between the Attention mechanism and convolutional neural networks from the perspective of functional analysis. (OxAA55h [套码的汉子])
- Scale Operations of Attention Based on Entropy Invariance (by Su Jianlin)
- Cosine similarity might be useless? For some linear models, similarity isn’t even unique. (Machine Heart)
- Ten times faster! Adaptive softmax HaoC
- Illustrated Transform
- How does the brain “navigate” in society? A Nature sub-journal reveals a grid-like representation of social hierarchies, a tool for collective intelligence scientists.
- How does the brain represent knowledge? Can we see the essence of reality in it? Patricia [Collective Intelligence Club]
- Neuroscience Mechanisms Behind Large Language Models (by Ya Mu)
- How to understand the dot product in linear algebra? Jin Chao will be teaching a class on [Data Analysis Learning and Practice].
- Attention mechanism OnlyInfo
- Radical architecture, 4 million contexts, completely open source: MiniMax-01 has a bit of a “Transformer moment” feel to it. Wang Zhaoyang [Silicon Star Pro]
- Part Six: LLM Attention: A Simple Explanation of Q, K, and V Matrices (Original by Dongdongqiang)
- A simple diagram explaining the linear attention mechanism in Deep Sea.
- Paper Overview | Sana: High-Efficiency Generation of 4K Images Using Linear Transformers [Genius Programmer Zhou Yifan]
- Discussing the evolution of large-scale model architectures, The Art of memory [zartbot]
- Nearly eight years later, Google’s Transformer successor, “Titans,” has arrived, breaking the contextual memory bottleneck. [Machine Heart]
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models: https://arxiv.org/abs/2401.04658
- TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer: https://arxiv.org/abs/2307.14995
- Paper Interpretation: Transformer^2: Self-adaptive LLMs (by Xiaoliyu)
- How Do Language Models put Attention Weights over Long Context https://lyaofu.notion.site/How-Do-Language-Models-put-Attention-Weights-over-Long-Context-10250219d5ce42e8b465087c383a034e
- Llama 3 Architecture and Source Code Analysis [OnlyInfo]
