Exploring the Transformer Series (12) --- Multi-head Self-Attention

0x00 Overview

MHSA (Multi-Head Self-Attention) is the core module of the Transformer model. The Transformer is essentially a general-purpose differentiable computer that combines many excellent features.

The Transformer’s message-passing-like architecture is versatile (i.e., complete) and powerful (i.e., efficient), capable of covering many real-world algorithms, thus giving the Transformer very strong expressive power (in forward propagation).
Through backpropagation and gradient descent, the Transformer can be continuously optimized.
Because the computation graph of the Transformer is shallow and wide, and the self-attention mechanism allows us to compute each element in the sequence in parallel when processing sequential data, the Transformer can be better mapped to our highly parallel computing architectures (such as GPUs) for efficient computation.
Multi-head attention mechanisms, by running multiple self-attention layers in parallel and combining the results, can simultaneously capture information from different subspaces of the input sequence, enhancing the model’s expressive power. This characteristic enables Transformers to better understand complex patterns and semantic information in data, allowing for excellent applications in natural language processing, computer vision, and other fields, demonstrating strong generalization capabilities.

Multi-head attention is the icing on the cake. Its ingenuity lies in running multiple attention heads in parallel, each with its own perspective on the data. This allows the model to analyze the input sequence from multiple angles and capture rich features and dependencies. It is like a group of experts each analyzing a different aspect of a complex problem, or several observers looking at the same thing at once and each noticing different details. The diagram below illustrates the multi-head attention mechanism: Query, Key, and Value are divided among different heads, and self-attention is computed independently within each head.

1201

0x01 Research Background

1.1 Problem

So far, the attention mechanism seems promising, but it has also revealed some shortcomings:

First, during encoding the model may overemphasize the current position while ignoring information from other positions, thus missing important dependencies or features. In computational terms, since Q, K, and V all originate from the same input X, when computing $QK^T$ the model tends to attend to its own position; that is, the diagonal entries of $QK^T$ tend to be significantly larger than the others. This weakens the model’s ability to focus on other high-value positions and limits its understanding and expressive capabilities.

Second, the attention mechanism uses Q to find relevant K, but “relevant” can take different forms and definitions. A thing often has multiple aspects, and the information from these aspects should be used together and measured from multiple angles. For example, the sentence below carries several different kinds of emphasis, such as font size, background color, font color, and bold/underline/italic styling, which need to be considered from multiple perspectives.
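The diagonal effect can be seen in a toy setting. The sketch below is only an illustration under simplifying assumptions: it skips the learned projections entirely (so Q = K = X), omits the scaling factor, and uses made-up unit-length token vectors. Under those assumptions every diagonal score equals 1, which by the Cauchy-Schwarz inequality is at least as large as any off-diagonal score, so each token gives itself the largest weight.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

# Toy token vectors (hypothetical values), normalized to unit length.
X = [normalize(v) for v in ([1.0, 2.0, 0.5], [0.3, -1.0, 2.0], [2.0, 0.1, 0.1])]

# With Q = K = X (no projections, an assumption of this sketch),
# the score matrix is simply X X^T.
scores = [[sum(a * b for a, b in zip(q, k)) for k in X] for q in X]
weights = [softmax(row) for row in scores]

for i, row in enumerate(weights):
    # Print each row of attention weights and the position it favors most.
    print(i, [round(w, 3) for w in row], "argmax =", row.index(max(row)))
```

With learned, distinct $W^Q$ and $W^K$ the effect is no longer guaranteed, which is part of why real models only show it as a tendency.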

1202

Furthermore, the human attention mechanism is naturally capable of processing multiple pieces of information simultaneously. Imagine you are reading a book on a crowded bus; your brain can automatically focus on the content of the book while also paying attention to ambient sounds, such as someone calling your name or the bus arrival announcement.

So far in this series, the attention mechanism we have studied focuses on only a single aspect of things rather than on multiple aspects.

1.2 Root Cause

Embedding is the true underlying cause behind multi-headed attention. Embedding is a mapping of human concepts, or rather, a way or method of expressing human concepts. Human concepts are an extremely complex system because concepts need sufficient internal complexity to cope with the complexity of the external world. For example, a word has multiple dimensions such as semantic logic, grammatical logic, contextual logic, positional logic within a sentence, and classification logic. Moreover, the relationships between words are not limited to the simple proximity caused by semantic classification. There are often hundreds, thousands, or even tens of thousands of intrinsic connections between the things represented by one word and the things represented by other words.

In other words, concepts are vectors configured to work across tasks, removing non-essential information and retaining the most deterministic results. Based on this, a single concept vector stored in long-term memory can be projected using different functions to be used for tasks in different specific domains. Each task can actually be considered an independent vector space. For example, in the above example, font and color are two different subspaces (low-dimensional spaces).

Currently, attention is focused only on a single vector space, which inevitably leads to a situation where, although the generated vectors can effectively map human concepts within that space, they cannot effectively reflect the rich external world. Therefore, we need a mechanism that allows the model to select information from different subspaces.

1.3 Solution

Multi-head attention is the solution proposed by researchers. Multi-head attention can be understood as splitting a high-dimensional vector into H low-dimensional vectors, and then solving for the attention in each of the H low-dimensional spaces. This allows the model to analyze and understand the input information from different perspectives, ultimately producing an output containing encoded representations from different subspaces, thereby enhancing the model’s expressive power. The Transformer paper elaborates on the multi-head attention mechanism as follows.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

Multi-head attention mechanisms extend the self-attention mechanism. In traditional self-attention mechanisms, you can only use a set of queries (Q), keys (K), and values (V) to calculate attention weights. However, in multi-head attention mechanisms, you can use multiple different sets of Q, K, and V for calculation. Each attention head has its own independent set of Q, K, and V, and multiple sets of Q, K, and V are generated through independent linear transformations.

Different attention heads (Qs) search for different aspects of relevance. For example, one Q might capture syntactic dependencies, while another captures semantic dependencies. This allows each attention head to focus on different aspects and features of the text, enabling it to not only grasp the main idea but also understand the connections between words. This allows it to capture context and subtleties from multiple perspectives and learn multiple sets of self-attention weights in parallel. Finally, the results from multiple attention heads are concatenated and integrated through another linear transformation to obtain the final output. The multi-head attention mechanism is illustrated in the figure below. Here, D represents the hidden size, H represents the number of heads, and L represents the Lth token in the sequence.

1203

In the example above, we use multi-head attention, which involves simultaneously focusing on multiple aspects of information, such as font and color. Each attention head focuses on a different representation subspace. This allows us to effectively locate emphasized content on the webpage and flexibly select various relationships and features within the text to extract richer information. The model’s final “attention” is actually a synthesis of attention from different “representation subspaces,” balancing the potential biases of a single attention mechanism.

1204

Two fairly concrete examples can give an intuitive understanding of multi-head self-attention.

Example 1 takes the perspective of specialists who each handle their own job. A team collaborates to complete a software project, with each team member responsible for their area of expertise. The product manager is responsible for overall project planning and requirements analysis; the project manager is responsible for project control; the front-end developer is responsible for user interface-related work; the back-end engineer is responsible for server logic and database management; and the test engineer is responsible for project quality assurance. Each team member independently contributes to the project using their professional skills, and ultimately, their individual results are integrated to form a complete software product.

Example 2 leans more towards a collaborative perspective. In rugby, there’s a saying that a game should be watched four times: the first time for a general overview, the second time from the offensive players’ perspective, the third time from the defensive players’ perspective, and the fourth time to synthesize all the previous understandings and watch it again. However, this requires watching four times. It’s better to have several people watch the game together. During the viewing, some focus on the offensive players’ perspective, some on the defensive players’ perspective, some on the overall picture, some on key players, some on the coach’s strategies, and finally, someone integrates the different opinions and insights to form a complete understanding of the game.

0x02 Principle

2.1 Architecture Diagram

Multi-head attention is an extension of self-attention. The architecture and formula of multi-head attention are shown in the figure below. h Scaled Dot-Product Attention structures (left) are combined in parallel to form Multi-Head Attention (right). Each Scaled Dot-Product Attention structure fuses contextual information from the input features separately. Running several of these fusion operations in parallel yields multiple independent output feature tensors, which are then concatenated.

1205

In the image above, the three matrices $W^Q$ , $W^K$ , and $W^V$ all have the same number of rows, $d_{model}$ . $W^Q$ and $W^K$ must share the same number of columns $d_k$ (so that their dot product is defined), while $W^V$ may use a different number of columns $d_v$ . Here $d_{model}$ is the channel dimension of the input and output tensors of the multi-head attention module, and $h$ is the number of heads. In the paper, $d_{model}=512$ and $h=8$ , therefore $d_k=d_v=d_{model}/h=64$ .
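For reference, since the figure is the only place where the formula appears, the definitions from the Transformer paper can be written out as follows, using the same symbols as the text:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

$$
\mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V), \qquad
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O
$$

where $W_i^Q, W_i^K \in \mathbb{R}^{d_{model} \times d_k}$ , $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$ , and $W^O \in \mathbb{R}^{h d_v \times d_{model}}$ .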

Bias

The three projection layers $W^Q$ , $W^K$ , $W^V$ and the final projection layer $W^O$ (Z * Output_weights) may or may not include a bias term.

For example, the LLaMA3 source code creates all four projections without bias.

import fairscale.nn.model_parallel.initialize as fs_init
from fairscale.nn.model_parallel.layers import ColumnParallelLinear, RowParallelLinear
from torch import nn

# ModelArgs is the configuration dataclass defined elsewhere in the LLaMA source.
class Attention(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        self.n_kv_heads = args.n_heads if args.n_kv_heads is None else args.n_kv_heads
        model_parallel_size = fs_init.get_model_parallel_world_size()
        self.n_local_heads = args.n_heads // model_parallel_size
        self.n_local_kv_heads = self.n_kv_heads // model_parallel_size
        self.n_rep = self.n_local_heads // self.n_local_kv_heads
        self.head_dim = args.dim // args.n_heads

        self.wq = ColumnParallelLinear(
            args.dim,
            args.n_heads * self.head_dim,
            bias=False,  # no bias
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wk = ColumnParallelLinear(
            args.dim,
            self.n_kv_heads * self.head_dim,
            bias=False,  # no bias
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wv = ColumnParallelLinear(
            args.dim,
            self.n_kv_heads * self.head_dim,
            bias=False,  # no bias
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wo = RowParallelLinear(
            args.n_heads * self.head_dim,
            args.dim,
            bias=False,  # no bias
            input_is_parallel=True,
            init_method=lambda x: x,
        )

In addition, the paper PaLM: Scaling Language Modeling with Pathways mentions that if no bias term is added to the fully connected layer and the layer norm, the stability of training can be improved.

No Biases – No biases were used in any of the dense kernels or layer norms. We found this to result in increased training stability for large models.

Weight matrix

With Scaled Dot-Product Attention alone, i.e., a single-head attention mechanism, the parameters to learn are just three matrices: $W^Q$ , $W^K$ , $W^V$ . This is a small parameter set, and the attention matrix it produces tends to be sparse. As the semantics become more complex, a single head may no longer have enough capacity to model them.

Multi-head processing means dividing the word embedding into several blocks, that is, representing each token as H chunks of 512/H dimensions each. These chunks are assigned to different heads, each of which performs its attention computation independently. Each head obtains its own Q, K, and V through its own linear transformation. The process of calculating Q, K, and V remains the same, but the weight matrices used for the transformation change from a single set $(W^Q, W^K, W^V)$ to multiple sets: $(W_0^Q, W_0^K, W_0^V)$ , $(W_1^Q, W_1^K, W_1^V)$ , …, $(W_{h-1}^Q, W_{h-1}^K, W_{h-1}^V)$ . With these weight matrices we obtain multiple sets of Q, K, and V that focus on different contexts.

Multi-head attention uses several sets of weight matrices instead of one, enabling the model to learn more complex representations. In multi-head attention, each head attends within its own subspace of the input representation. Different heads (angles) have different areas of focus, and combining multiple heads allows the model to understand the input data more comprehensively. Put another way, different attention heads can learn different dependencies between positions in the sequence, and combining them captures multiple kinds of dependency, providing richer and more powerful representations. Because each head works in only $d_{model}/h$ dimensions, the Q, K, and V weights of all heads together improve the model’s expressive power with the same number of parameters as a single full-width head.
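The "same number of parameters" point can be checked with simple arithmetic. The helper below is a hypothetical function written for this illustration; it assumes no bias terms and that $d_{model}$ is divisible by the number of heads, with $d_k = d_v = d_{model}/h$ as in the paper.

```python
def mha_projection_params(d_model, num_heads):
    """Count weights (no biases) in the Q/K/V and output projections."""
    d_head = d_model // num_heads
    per_head_qkv = 3 * d_model * d_head           # W_i^Q, W_i^K, W_i^V
    output_proj = (num_heads * d_head) * d_model  # W^O
    return num_heads * per_head_qkv + output_proj

d_model = 512
for h in (1, 2, 8, 16):
    # Every choice of h gives 4 * d_model^2 weights.
    print(h, mha_projection_params(d_model, h))
```

The count is $4 d_{model}^2$ regardless of $h$: adding heads re-partitions the same weights rather than adding new ones.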

These self-attention “heads” do not focus on predetermined points, but rather start randomly, processing large amounts of data and learning on their own to naturally identify various linguistic features. Some of the features they learn are understandable to us, while others are more elusive.

$W^O$ matrix

The above operation is equivalent to splitting a process into 8 independent subprocesses, each processing 1/8 of the original embedding. The vector produced by each subprocess is 1/8 of the original embedding length. How are the outputs of the different attention heads combined? The system concatenates the results of the 8 subprocesses along the feature dimension, directly stitching them into one long vector. However, concatenation alone does not organically fuse the 8 “small embeddings”; it simply places the matrices side by side. This raises two issues:

The operation of directly concatenating multiple heads assumes that each head or subspace has the same importance, and that the similarity learned in each subspace has the same importance—that is, the weights of these heads are the same. However, the weights of each head are definitely different in reality. How can they be organically integrated? Or, how can the weight ratios between different heads be adjusted?
The self-attention mechanism module is connected to a fully connected network. FFN requires a single matrix as input, not multiple matrices. Furthermore, due to the presence of residual connections, the input and output dimensions of the multi-head attention mechanism should be the same.

In summary, we need a method for compression, transformation, and fusion that organically integrates the eight small semantic subspaces into a unified embedding. We also need to restore the multi-head attention output to the original embedding’s dimensionality, i.e., a 512-dimensional vector. However, organic fusion is a complex matter and is difficult to achieve with hand-designed rules. Therefore, researchers made the fusion process learnable: if training finds a particular head important, the learnable parameters applied to that head grow, which enlarges that head’s contribution to the output and effectively increases its weight.

The result is the $W^O$ matrix. $W^O$ compresses and fuses the multi-head outputs to improve feature representation and generalization. Like $W^Q$ , $W^K$ , and $W^V$ , $W^O$ is a weight matrix learned during model training. The dimensions remain unchanged before and after the $W^O$ operation, so the final output has the same shape as the input word embedding.
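One way to see why $W^O$ acts as a learnable per-head weighting is that "concatenate, then multiply by $W^O$" equals "multiply each head by its own row block of $W^O$, then sum". The NumPy sketch below checks this identity; the head outputs are random stand-ins and all variable names are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, num_heads, d_head = 4, 8, 64
d_model = num_heads * d_head  # 512

# Stand-ins for the per-head attention outputs (random, for illustration only).
head_outputs = [rng.standard_normal((seq_len, d_head)) for _ in range(num_heads)]
W_o = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

# Path 1: concatenate the heads, then apply W^O once.
concat = np.concatenate(head_outputs, axis=-1)  # (seq_len, d_model)
fused = concat @ W_o                            # (seq_len, d_model)

# Path 2: give each head its own row block of W^O and sum the results.
blocks = np.split(W_o, num_heads, axis=0)       # each (d_head, d_model)
fused_as_sum = sum(z @ w for z, w in zip(head_outputs, blocks))

print(fused.shape, np.allclose(fused, fused_as_sum))
```

Seen this way, a head whose block of $W^O$ has larger entries contributes more to the fused output, and the output keeps the $d_{model}$ width needed for the residual connection.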

2.2 Design Concept

Let’s work backwards or speculate on the design philosophy of the Transformer authors: to process data using a divide-and-conquer + fusion approach. Divide-and-conquer involves treating data differently, while fusion involves combining the data.

Subspace & Divide and Conquer

Embedding

As mentioned earlier, embedding is the true underlying cause behind multi-head attention. Let’s examine the semantic logic subspaces within the embedding. Assume there are eight attention heads, each with its own learnable weight matrices $W_i^Q$ , $W_i^K$ , and $W_i^V$ . These are all learned from a massive corpus during the training phase of the Transformer model, and they are used to decompose each token’s embedding into logical subspaces.

These weight matrices decompose the original high-dimensional vector into eight subdivided embedding vectors, each corresponding to a subdivided semantic logic subspace (semantic logic, syntactic logic, contextual logic, classification logic, etc.). Essentially, the attention mechanism operates within these subdivided logical subspaces of the embedding. Each attention head independently focuses on the context of a different subspace, so several questions are considered at once and richer feature information is obtained.

Feature extraction

The multi-head attention mechanism in Transformers likely borrows from the idea of using multiple convolutional kernels within the same convolutional layer in CNNs. CNNs use different convolutional kernels to focus on different features in an image and learn different information. Then, in CNNs, channel-wise convolution is performed, and finally, feature fusion is achieved by summing along the channels.

The Transformer is positioned as a feature extractor or a universal function approximator. We aim to capture more patterns to facilitate fine-tuning for diverse downstream tasks. Once these patterns prove useful, they can be activated for downstream tasks to learn. Therefore, the Transformer uses multiple heads to slice a vector into different dimensions to capture different patterns, allowing the model to understand the meaning of the input sentence from multiple perspectives. A single concept vector can be projected using different functions for different domain-specific tasks. This is followed by a feature fusion process. Mapping to different subspaces essentially mimics the convolutional neural network’s support for multi-channel output patterns.

Ensemble & fusion

The above mainly focused on splitting the input and then extracting information from different subspaces. Next, we will explain from another perspective that the core idea of the multi-head strategy is ensembling.

Numerous academic papers have shown that it is difficult to capture positional, syntactic, and lexical information simultaneously using only a single head, which is why multiple heads are needed. The heads in a multi-head system play different roles: some may not pick up much information at all, some may primarily pick up positional information, some syntactic information, and some lexical information. The purpose of a multi-head system is to ensure that all of these patterns can be extracted.

We can view the multiple attention calculations in MHA as multiple independent small models, with each head acting as a weak learner. The final concatenation is equivalent to fusing the results of these small models, allowing the final embedding to take multiple aspects of information into account. Furthermore, a single head tends to put most of its attention weight on its own position, while multiple heads (given enough of them) reduce this tendency in a way that resembles voting, which gives better results and matches intuition. To illustrate, it is like eight translators with different reading habits translating the same sentence together. Each of them might have a different reading order and focus, and combining their opinions would likely yield a more accurate translation.

Alleviating sparsity

By observing the attention matrices of a large number of samples, we found that the attention of almost every token in the whole sentence is sparse, that is, each token only pays attention to a very limited number of other tokens, and the remaining attention can be regarded as 0 (softmax cannot be strictly 0).

Sparsity means that a large sparse matrix can be approximated by combining several smaller matrices, with similar results but much less computation per matrix. It is therefore reasonable to divide Q, K, and V into multiple small segments, calculate an attention matrix for each segment, and then integrate the results in some way. The total computational cost is then roughly the same as calculating a single attention matrix directly, while the fused result should be at least no worse than that of a single attention matrix, and may even be better. Hence, multi-head attention was developed.
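The claim that the cost stays roughly the same can be checked by counting floating-point operations. The function below is a rough sketch written for this illustration: it counts only the two attention matrix multiplications, ignores the softmax and the linear projections, and assumes $d_{model}$ is divisible by the number of heads.

```python
def attention_matmul_flops(seq_len, d_model, num_heads):
    """Multiply-add FLOPs for QK^T and (weights @ V), summed over heads."""
    d_head = d_model // num_heads
    scores = 2 * seq_len * seq_len * d_head    # Q_i K_i^T for one head
    weighted = 2 * seq_len * seq_len * d_head  # softmax(...) V_i for one head
    return num_heads * (scores + weighted)

single = attention_matmul_flops(seq_len=1024, d_model=512, num_heads=1)
multi = attention_matmul_flops(seq_len=1024, d_model=512, num_heads=8)
print(single, multi, single == multi)
```

Each head is cheaper by a factor of $h$, and there are $h$ of them, so the matrix-multiplication cost is unchanged. The softmax is the part that does grow, since $h$ score matrices are normalized instead of one.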

2.3 Calculation

Calculation process

The computational process of multi-head attention involves splitting a high-dimensional vector into several low-dimensional vectors and computing Scaled Dot-Product Attention (SDPA) in each of these low-dimensional spaces. The overall process consists of four parts: splitting, computation, concatenation, and fusion. This involves many steps and matrix operations, which we will illustrate with a large diagram.

The inputs remain the original Q, K, and V.
Splitting. Each attention head has its own learnable weight matrices $W_i^Q$ , $W_i^K$ , and $W_i^V$ . The inputs Q, K, and V pass through these weight matrices to obtain h sets of Queries, Keys, and Values. Each set can be understood as a linear projection of the high-dimensional input vector onto a lower dimension. Each newly formed Q asks for a different type of relevant information, which allows the attention model to bring more information into the context vector computation. This is label 1 in the figure below.
Computation. Each head runs Self-Attention on its own Q, K, and V, producing h output vectors. Each head can focus on learning a different part of the input, allowing the model to attend to more information. This is label 2 in the figure below.
Concatenation. Our goal is to create a single context vector as the output of the attention model. Therefore, the context vectors generated by the individual attention heads are concatenated into a single vector. This is label 3 in the figure below.
Fusion. The concatenated vector is multiplied by a weight matrix $W^O$ , which ensures that the resulting context vector has the same dimensionality as the original embedding. This is both a dimension-mapping operation and a fusion operation. This is label 4 in the figure below.

1206
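The four steps can also be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: it assumes a single unbatched sequence, no masking, no bias, and random weights, and it stacks the per-head matrices $W_i^Q$ , $W_i^K$ , $W_i^V$ side by side into one $d_{model} \times d_{model}$ matrix each, which is a common implementation trick. The function and variable names are made up for this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over x of shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # 1. Split: project, then reshape so each head gets its own d_head-wide slice.
    def project(w):
        return (x @ w).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    q, k, v = project(w_q), project(w_k), project(w_v)  # (heads, seq_len, d_head)

    # 2. Compute: scaled dot-product attention inside every head independently.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (heads, seq_len, d_head)

    # 3. Concatenate: put the head outputs side by side again.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)

    # 4. Fuse: mix the heads with W^O; the output is d_model wide again.
    return concat @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 512, 8
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v, w_o = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)

out, attn = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, attn.shape)
```

The output has the same shape as the input, and `attn` holds one `seq_len × seq_len` attention matrix per head, each row summing to 1.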

Arithmetic intensity

We will use the following diagram as a basis for thinking about arithmetic intensity, where D represents the hidden size, H represents the number of Heads, and L represents the Lth Token in the sequence.

1207

When the batch size is 1, all matrix multiplications in the red, purple, and blue dashed boxes in the figure are matrix-vector multiplications, which are memory-bound operations with an arithmetic intensity of less than 1.
When the batch size is greater than 1 (e.g., in continuous batching):
- In the red and blue dashed boxes, weights are multiplied by activations, so different requests can share the same weights. The operation becomes a matrix-matrix multiplication, and the larger the batch size, the higher the arithmetic intensity and the closer the operation gets to being compute-bound (the FFN layer behaves similarly).
- In the purple dashed box, the attention calculation over Q, K, and V multiplies activations by activations, so nothing is shared between different requests. Even with batching it remains a batched matrix-vector multiplication, and because sequence lengths may vary, the work is irregular across requests. In other words, the arithmetic intensity stays below 1, which is clearly memory-bound.

As can be seen from the above, continuous batching can effectively turn the memory-bound projections into compute-bound ones, but the arithmetic intensity of the attention calculation over Q, K, and V stays below 1. The longer the sequence length, the larger the share of time spent in this attention calculation, so it becomes a bottleneck of the system.
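A back-of-the-envelope estimate makes the contrast concrete. Here arithmetic intensity is taken as FLOPs per byte moved. The two helper functions are hypothetical and rest on several assumptions: 2-byte (fp16/bf16) storage, one new token per request in a decoding step, equal context lengths for all requests, and attention scores that stay on chip. Real kernels will differ in the details.

```python
BYTES_PER_ELEMENT = 2  # assuming fp16/bf16 storage

def projection_intensity(batch, d_model):
    """FLOPs per byte for Y = X @ W with X (batch, d_model), W (d_model, d_model)."""
    flops = 2 * batch * d_model * d_model
    elements = d_model * d_model + 2 * batch * d_model  # W once, plus X and Y
    return flops / (elements * BYTES_PER_ELEMENT)

def decode_attention_intensity(batch, context_len, d_model):
    """FLOPs per byte when each request's new query attends to its own K/V cache."""
    flops = batch * 4 * context_len * d_model      # q K^T and weights @ V
    elements = batch * (2 * context_len * d_model  # K and V caches (not shared)
                        + 2 * d_model)             # query in, output out
    return flops / (elements * BYTES_PER_ELEMENT)

for batch in (1, 16, 256):
    # Columns: batch size, projection intensity, attention intensity.
    print(batch,
          round(projection_intensity(batch, 4096), 2),
          round(decode_attention_intensity(batch, 2048, 4096), 2))
```

Under these assumptions the projection intensity grows roughly in proportion to the batch size because the weight matrix is read once and reused, while the attention intensity stays just under 1 for every batch size because each request reads its own K and V.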

2.4 Effects

The Transformer paper concludes with a visualization of the attention of two heads, as shown below. In the figure, thicker lines indicate greater attention weights. It can be seen that the two heads focus on different areas: the green head focuses more on global information, while the red head focuses more on local information.

1208

The paper “What Does BERT Look At? An Analysis of BERT’s Attention” also provides examples of different attention heads. The thickness of the lines represents the strength of the attention weights (some attention weights are too low to be seen).

1209

2.5 Integration Method

In the vanilla Transformer, the outputs of different attention heads are directly concatenated. The paper “Multi-Head Attention: Collaborate Instead of Concatenate” proposes an alternative to concatenation. It finds that the information captured by the attention heads is inherently redundant, with a significant amount of common information shared between them. For the concatenated heads’ product $W_QW_K^T$ , only about one-third of the dimensions are needed to capture the vast majority of the information. Therefore, the authors designed mixing vectors to extract the information shared between attention heads. These vectors are learned together with the model and applied to the original multi-head attention computation. This approach allows for a more flexible representation of attention heads, whose dimension can be changed according to the actual situation, and it also makes the use of parameters more efficient.

The image below shows the original concatenation method of the vanilla Transformer on the left and the proposed scheme, CollabHead, on the right.

1210

(a) is the original concatenation method of the vanilla Transformer (equivalent to different heads extracting different dimensions of the matrix), and is also a special case of the CollabHead method. $m_i$ is a vector consisting of only 1s and 0s, where the positions of the 1s mark where the mapping matrix of the corresponding attention head sits within the concatenated overall matrix. This keeps each attention head independent of the others when the model integrates them.
(b) allows all heads to share the mapping matrix.
(c) builds on the shared mapping matrix and further compresses the dimension of the final integrated matrix.
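Case (a) can be verified numerically. The sketch below follows the description above: the per-head score matrix computed from a head's own column block of $W_Q$ and $W_K$ is compared with the score matrix computed from the full shared matrices and a 0/1 mixing vector $m_i$. The sizes and variable names are assumptions chosen for the illustration, and scaling and softmax are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, num_heads, d_head = 4, 32, 4, 8
x = rng.standard_normal((seq_len, d_model))

# Concatenated (all-heads) query and key projections: (d_model, num_heads * d_head).
w_q = rng.standard_normal((d_model, num_heads * d_head))
w_k = rng.standard_normal((d_model, num_heads * d_head))

for i in range(num_heads):
    block = slice(i * d_head, (i + 1) * d_head)

    # Vanilla head i: use only its own column block of W_Q and W_K.
    vanilla = (x @ w_q[:, block]) @ (x @ w_k[:, block]).T

    # Case (a): shared W_Q, W_K plus a 0/1 mixing vector m_i selecting the same block.
    m_i = np.zeros(num_heads * d_head)
    m_i[block] = 1.0
    collab = (x @ w_q) @ np.diag(m_i) @ (x @ w_k).T

    assert np.allclose(vanilla, collab)

print("concatenation matches the 0/1 mixing-vector form for every head")
```

Cases (b) and (c) then correspond to letting $m_i$ take arbitrary learned values and to shrinking the shared dimension, respectively.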

2.6 Analysis

Researchers have conducted in-depth analyses of multi-head attention (such as the paper “What Does BERT Look At? An Analysis of BERT’s Attention”), and some of their insights and perspectives are as follows:

Number of heads

The fewer the number of heads, the more attention will tend to focus on the token itself or other relatively simple patterns, such as focusing on CLS.
Existing research has shown that more heads are not always better (increasing the number of heads leads to smaller subspaces, thus reducing the content that each subspace can represent; when there are enough heads to focus on positional information, syntactic information, and the ability to identify rare words, adding more heads may be an improvement or just noise). Too many or too few heads will both degrade performance; the optimal number depends on the model size and the task. The current trend is that the larger the model (i.e., the larger the hidden size), the more heads it contains, and the greater the average performance gain.

Learning mode

For most queries, each head learns a certain fixed pattern.
Each head does learn something different, but the differences between most heads are not as great as we think (such as the obvious distinction between one learning syntax and another learning word meaning).
A small number of heads can capture various kinds of textual information relatively well without paying excessive attention to their own position, which alleviates, to some extent, the problem mentioned above of excessively large diagonal elements in $QK^T$ .

The image below shows the attention heads displayed. Some attention heads focus on all words (broadly), some focus on the next token, some focus on the SEP symbol, and some focus on punctuation marks. The thickness of the lines indicates the strength of the attention weight (some attention weights are too low to be seen).

1211

The relationship between heads and layers

The closer a layer is to the bottom, the more diverse its attention patterns and the more positions it attends to.
The patterns gradually stabilize as the layer number increases. The difference (or variance) between heads decreases with depth, meaning the heads carry more and more positional information.
The higher the layer, the more similar the patterns of most attention heads become.
The few distinct attention heads that remain are the ones that represent the semantic information of the model. This also explains why multiple Transformer layers are needed: some information may not be captured in one layer and needs to be captured in other layers.

The paper “What Does BERT Look At? An Analysis of BERT’s Attention” also analyzes BERT’s performance in recognizing dependency relationships between words. Dependency relationships are the relationships of dependence between words; for example, the “predicate” is the core of a sentence, and other components are directly or indirectly related to the verb. Through the analysis of dependency relationships between words, the authors found that BERT cannot handle all dependency relationships well, but certain layers are better at recognizing specific dependency relationships.

1212

The paper “Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned” analyzes multiple heads and finds that most of the functions of multiple heads are redundant, and many can be pruned. Through experiments on multiple datasets, the paper finds that most heads can be categorized as follows:

Positional Head: This head primarily attends to neighboring tokens. Its largest weight usually points to an adjacent word; the criterion is that in at least 90% of cases the head assigns its highest weight to the word immediately to the left or right.
Syntactic Head: This head points to tokens in a specific syntactic relationship. Its weights typically reflect relationships between words, such as the dependency between a noun and its verb.
Rare Head: This head points to the uncommon words in a sentence and usually assigns the greatest weight to rare words.

The best way to demonstrate the importance of their head classification is to prune other categories. Here is an example of their pruning strategy, which classifies based on the 48 heads (8 heads multiplied by 6 blocks) of a normal transformer.

1213

The image above shows the functions of the encoder heads that are retained after pruning. Each column corresponds to a different amount of pruning, and the number below each column is the number of heads remaining. By keeping the attention heads that belong to the main categories, they retained 17 out of 48 heads. Note that this is roughly one third of the encoder heads, i.e. about two thirds were pruned.

The paper also analyzes how to simplify the Heads, and the optimization method is as follows (adding a weight to each Head, which is equivalent to gating):

1214
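
The gating idea can be sketched as follows. This is a minimal illustration only: the function name `gate_heads` and the fixed 0/1 gate values are assumptions made for the example, whereas the paper learns its gates during training with a relaxation of L0 regularization.

```python
import torch

def gate_heads(head_outputs: torch.Tensor, gates: torch.Tensor) -> torch.Tensor:
    """Scale each head's output by its own scalar gate.

    head_outputs: (batch_size, h, seq_len, d_k)
    gates:        (h,) one scalar per head; 0 prunes the head, 1 keeps it.
    """
    # Broadcast the per-head gate over batch, sequence and feature dimensions.
    return head_outputs * gates.view(1, -1, 1, 1)

torch.manual_seed(0)
heads = torch.randn(2, 4, 5, 8)               # 4 heads, hypothetical sizes
gates = torch.tensor([1.0, 0.0, 1.0, 0.0])    # hypothetical: heads 1 and 3 are pruned
gated = gate_heads(heads, gates)
```

A head whose gate reaches zero contributes nothing to the concatenated output, so it can be removed from the model entirely.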

2.7 Advantages

The advantages of multi-headed attention are as follows:

Richer contextual understanding: multiple heads enhance the model’s expressive and learning capabilities, enabling it to capture richer features and information.
Computational efficiency: since each head operates in a lower-dimensional space of size d_k = d_model / h, the cost of attention per head is reduced. The cost of computing attention scores is proportional to the square of the sequence length and linear in the head dimension, so h heads of dimension d_k cost about the same in total as a single head of dimension d_model. The extra heads therefore come at almost no additional computation.
Parallelization capability: The multi-head attention mechanism allows the model to learn in parallel on different representation subspaces, which improves the efficiency of training and inference.
Improved generalization ability: Because the multi-head attention mechanism can analyze input data from multiple perspectives, the model’s generalization ability is improved. It also makes the model more robust to noise and variations in the input. Even if some heads are disturbed by noise or irrelevant information, other heads can still provide useful information.
Increased model capacity: even though each head works in a lower-dimensional subspace, combining multiple heads captures information from different subspaces, thereby increasing the model’s capacity.
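
As a rough check on the cost of splitting into heads, the sketch below counts the multiply-adds needed to form the score matrix QK^T. The helper `score_multiply_adds` is a hypothetical name introduced for this example, and the count ignores the softmax and the projections.

```python
def score_multiply_adds(seq_len: int, dim: int, heads: int = 1) -> int:
    """Multiply-adds needed to form QK^T for `heads` heads of width dim // heads."""
    d_head = dim // heads
    # Each head builds a (seq_len, seq_len) matrix of dot products of length d_head.
    return heads * seq_len * seq_len * d_head

single = score_multiply_adds(seq_len=10, dim=512, heads=1)
multi = score_multiply_adds(seq_len=10, dim=512, heads=8)
```

Under this count, one head of width 512 and eight heads of width 64 need the same number of multiply-adds, which is why the multi-head split is usually described as roughly cost-neutral.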

0x03 Implementation

1215

3.1 Definition

Multi-head attention is implemented using the MultiHeadedAttention class, whose key parameters and variables are as follows.

d_model is the model’s dimension, specifically the vector dimensions of the query, key, value, and word embeddings under single-head attention. We assume it to be 512.
h represents the number of attention heads, let’s say 8.
d_k is the attention dimension of a single head, and its size is d_model / h, so 512/8 = 64.

Additionally, in the comments:

seq_len is the sentence length, which is the number of tokens (which can be considered as the maximum number of words in the sentence). Let’s assume it’s 10 words. shape refers to the tensor shape.
batch_size is the batch size.

The initialization code for MultiHeadedAttention is as follows.

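class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        # The word features are later divided evenly among the heads, i.e. the embedding is split into h groups of Q/K/V, so d_model must be divisible by h to guarantee d_k = d_v = d_model / h
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h # attention dimension of a single head
        self.h = h # number of attention heads
        # Define the W^Q, W^K, W^V and W^O matrices as four linear layers, each with input dimension d_model and output dimension d_model. The first three linearly transform the Q, K and V vectors; the fourth fuses the multi-head results
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None # attention weights, initialized to None and filled in by forward()
        self.dropout = nn.Dropout(p=dropout) # dropout probability (fraction of elements zeroed), default 0.1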
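
The class above relies on a `clones` helper that is not shown in this section. The sketch below assumes it deep-copies a module N times into an nn.ModuleList, as in the Harvard code, and counts the parameters of the four linear layers for d_model = 512.

```python
import copy
import torch.nn as nn

def clones(module: nn.Module, n: int) -> nn.ModuleList:
    "Produce n independent deep copies of a module (assumed definition)."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(n)])

d_model, h = 512, 8
d_k = d_model // h                                   # 64 dimensions per head
linears = clones(nn.Linear(d_model, d_model), 4)     # W^Q, W^K, W^V, W^O
# Each layer holds a (512, 512) weight and a bias of length 512.
n_params = sum(p.numel() for p in linears.parameters())
```

The copies start with identical values but are separate parameters, so each of the four layers is trained independently.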

3.2 Operational Logic

The computation process of multi-head attention (demonstrated here from the first encoding layer, thus covering word embeddings) is summarized in the following diagram by combining specific functions in the Harvard code.

Note:

For easier understanding, the batch_size dimension has been removed from the diagram below, focusing on the remaining dimensions.
The diagram shows only two heads. Note: the code does not split the linear-layer weights W^Q, W^K, W^V per head. Each of them stays a single combined matrix rather than separate parts, so the split of the weights is omitted in the diagram.
In practice, you can ignore the concat when implementing the code. The simplest implementation is to reshape the channel dimension into multiple heads and then perform two matrix multiplications.

1216

Input

The encoder’s input is the word embeddings, with shape (batch_size, seq_len, d_model). It is important to note that in the paper’s architecture diagram, projection and splitting are performed by 3 × h small weight matrices.

Projection

This corresponds to number 1 on the diagram.

In single-head attention, the input is multiplied by the matrices W^Q, W^K, W^V. These are three independent linear layers, each with its own weights, and multiplying the input by each of them produces Q, K, and V. For multi-head attention, the Harvard code still uses three large weight matrices W^Q, W^K, W^V here, rather than the 3 × h small weight matrices listed in the paper. The two are equivalent: each large physical weight matrix can be viewed as h small logical weight matrices concatenated along the output dimension, and training shapes each block into the projection of one head.
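
The relationship between one large matrix and h small ones can be checked numerically. The sketch below uses hypothetical sizes and omits the bias for brevity; it compares a single (d_model, d_model) projection followed by a view with h separate (d_model, d_k) projections taken from slices of the same weight.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, h = 512, 8
d_k = d_model // h
x = torch.randn(2, 10, d_model)                 # (batch_size, seq_len, d_model)
w_q = nn.Linear(d_model, d_model, bias=False)   # one large W^Q

with torch.no_grad():
    # Large-matrix route: project once, then view the last dimension as (h, d_k).
    q_big = w_q(x).view(2, 10, h, d_k)
    # Small-matrix route: nn.Linear stores its weight as (out_features, in_features),
    # so rows i*d_k:(i+1)*d_k form the small matrix of head i.
    q_small = torch.stack(
        [x @ w_q.weight[i * d_k:(i + 1) * d_k].T for i in range(h)], dim=2
    )
    same = torch.allclose(q_big, q_small, atol=1e-4)
```

Both routes give the same per-head queries up to floating-point error, which is why the code can keep three large matrices without losing the per-head structure.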

Data splitting

This corresponds to number 2 in the diagram. The segmentation is not simply dividing the word embeddings into h parts at the physical level, but rather performing a dimensionality reduction transformation. That is, using a weight matrix to map them from the original dimension to a lower dimension, resulting in h small “Embeddings” with independent semantic logic in different subspaces.

Logical perspective

The Q, K, and V matrices output from the linear layer will be split into multiple attention heads so that each attention head can process them independently, which will change the shape of the Q, K, and V matrices. Logically, this is done as follows.

$Q_i = W_i^Q \times Q$, $K_i = W_i^K \times K$, $V_i = W_i^V \times V$

From a vector perspective, the split breaks each row of the tensor (an original word embedding of length d_model) into h row vectors of length d_k (small embeddings with sub-semantic logic). That is: (batch_size, seq_len, d_model) -> (batch_size, seq_len, h, d_k). Although the dimension seen by each head is reduced from d_model to d_k, each head can still express a certain fine-grained semantic logic in its own subspace.

From a neural network perspective: in a single-layer fully connected network, the input layer combined with any subset of the output nodes again forms a complete fully connected network. In other words, the split can be viewed as breaking the previous fully connected network from input_depth nodes to d_model nodes into h smaller fully connected networks, each from input_depth nodes to d_k nodes.

Physical perspective

In practice, the code uses the large matrix approach. Specifically, the projected Query, Key, and Value are split into multiple heads with view(nbatches, -1, self.h, self.d_k). This adds a dimension: the last dimension d_model is split into (h, d_k), so the last dimension becomes d_k. Each slice along the h dimension now corresponds to the matrix of one head.

As mentioned earlier, the projection is a logical projection, so the segmentation is also only a logical segmentation. For the parameter matrices Query, Key, and Value, there is no physical segmentation into independent matrices corresponding to each attention head; logically, each attention head corresponds to an independent part of Query, Key, and Value. Similarly, each attention head does not have a separate linear layer; instead, all attention heads share a linear layer, with different attention heads operating on their own unique logical parts. This logical segmentation is achieved by evenly distributing the input data and linear layer weights across the attention heads.

Based on this, the computation of all heads can be carried out with a single matrix operation, without requiring h separate operations. This makes the computation more efficient while keeping the model simple: fewer linear layers are required, while the effect of multi-head attention is still achieved.

A small matrix approach can also be used: physically split the query, key, and value, use a for loop to compute the heads one by one, and finally concatenate the list of results. This approach is clearer in code, but it is slower than the large matrix approach.
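
The two approaches can be compared side by side. The sketch below uses random tensors and hypothetical sizes, and leaves out masking and dropout; it computes scaled dot-product attention once with a single batched operation over all heads and once with an explicit loop over heads.

```python
import math
import torch

torch.manual_seed(0)
batch_size, h, seq_len, d_k = 2, 8, 10, 64
q, k, v = (torch.randn(batch_size, h, seq_len, d_k) for _ in range(3))

# Large matrix approach: one batched matmul covers every head at once.
scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
z_batched = scores.softmax(dim=-1) @ v

# Small matrix approach: compute each head separately, then stack the results.
z_list = []
for i in range(h):
    s_i = q[:, i] @ k[:, i].transpose(-2, -1) / math.sqrt(d_k)
    z_list.append(s_i.softmax(dim=-1) @ v[:, i])
z_loop = torch.stack(z_list, dim=1)
```

The results agree up to floating-point error; the batched version simply hands the whole job to one kernel launch instead of h smaller ones.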

Summary

The input has shape (batch_size, seq_len, d_model).
The linear layers W^Q, W^K, W^V each have shape (d_model, d_model), and no actual splitting is performed for the multiple heads. Physically, the multi-head W^Q, W^K, W^V are still three single matrices, but each can be viewed as logically independent blocks, one per attention head.
The Q, K, and V produced by W^Q, W^K, W^V, viewed as per-head logical matrices, have shape (batch_size, seq_len, h, d_k).

Adjust dimensions

This corresponds to number 3 on the diagram.

To achieve better parallelism, the shapes of the Q, K, and V matrices will be changed by swapping the dimensions h and seq_len. The batch dimension is not shown in the diagram; in reality, the dimension of Q for each attention head is (batch_size, h, seq_len, d_k).
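
The shape changes of steps 2 and 3 can be traced on a dummy tensor. The sizes below (batch_size = 5, seq_len = 10, d_model = 512, h = 8) are example values chosen for this sketch.

```python
import torch

batch_size, seq_len, d_model, h = 5, 10, 512, 8
d_k = d_model // h
q = torch.randn(batch_size, seq_len, d_model)

# Step 2: split the last dimension d_model into (h, d_k).
q_split = q.view(batch_size, -1, h, d_k)      # (5, 10, 8, 64)
# Step 3: swap the seq_len and head dimensions.
q_heads = q_split.transpose(1, 2)             # (5, 8, 10, 64)
```

Head 1 of token 3 in `q_heads` holds exactly features 64 to 127 of token 3 in the original tensor, so no data is changed, only how it is indexed.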

Calculate attention for each head

As mentioned earlier, there are two ways to calculate the attention for each head.

The large matrix approach folds all eight attention heads into the 0th dimension (batch_size) of a 3D input tensor and performs the dot products there. The result is then reshaped and transposed so that the eight heads are concatenated along the 2nd dimension. This approach is computationally fast.
The loop approach computes the heads one by one with a for loop and then concatenates the list of results, which makes the code clearer.

The vanilla Transformer uses a large matrix approach. This corresponds to number 4 in the diagram.

Grouped separately

Logically, each query, key, and value has been divided into several segments along the feature dimension, forming several independent groups of query, key, and value, each corresponding to one attention head. Dot products and weighted averages are then performed within each group. For example, the first segment of the query is dot-producted only with the first segment of the key, and the result is used only to weight the first segment of the value, and so on. The groups are independent: attention is computed within each group, with no cross-group operations. From a theoretical perspective, this runs the attention mechanism separately in different logical subspaces of the embedding (semantic logic, syntactic logic, contextual logic, classification logic, etc.). That is, instead of measuring the relevance between any two words once in one high-dimensional space, it measures the relevance between any two words in 8 separate lower-dimensional subspaces.

Parallelism

The attention calculation for each head is essentially the same as single-head attention, with one point to note: the calculation uses only the last two dimensions (seq_len, d_k) and leaves the first two dimensions (batch_size, h) untouched. The output over all heads has shape (batch_size, h, seq_len, d_k). This layout exists purely for computational reasons. Since the first two dimensions of Q, K, and V (batch and head) play the same role, both being parallel dimensions, they can be merged into a single dimension of size batch_size * num_heads. Depending on what the computation needs, the attention weights (QK^T) are sometimes stored as a three-dimensional tensor (batch_size * num_heads, tgt_seq_len, src_seq_len) and sometimes as a four-dimensional tensor (batch_size, num_heads, tgt_seq_len, src_seq_len), switching between the two as needed.

Independent computations are usually simple to parallelize, although the details depend on the low-level implementation in the GPU threads. Ideally, we would allocate one unit of GPU work for each batch element and each head. For example, with batch=2 and heads=3, the computation can run in 6 independent pieces, each working on vectors of size d_k=d_model/heads. Because the heads are computed in parallel (different heads receive the same input and perform the same kind of computation), the model can efficiently handle large-scale inputs. Compared to sequential RNNs, the attention mechanism itself supports parallelism, and the multi-head mechanism further enhances this.
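
The switch between the four-dimensional and three-dimensional layouts can be illustrated as follows. This is a sketch with hypothetical sizes, using torch.matmul for the 4D form and torch.bmm for the 3D form.

```python
import torch

torch.manual_seed(0)
batch_size, h, seq_len, d_k = 2, 3, 4, 8
q = torch.randn(batch_size, h, seq_len, d_k)
k = torch.randn(batch_size, h, seq_len, d_k)

# 4D layout: matmul treats (batch_size, h) as batch dimensions.
scores_4d = torch.matmul(q, k.transpose(-2, -1))      # (2, 3, 4, 4)

# 3D layout: fold batch and head into one dimension of size batch_size * h.
q3 = q.reshape(batch_size * h, seq_len, d_k)
k3 = k.reshape(batch_size * h, seq_len, d_k)
scores_3d = torch.bmm(q3, k3.transpose(1, 2))         # (6, 4, 4)
```

Reshaping `scores_4d` to (6, 4, 4) gives the same values as `scores_3d`, which shows that batch and head are interchangeable as parallel dimensions.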

Merging the Z from each head

We now have a separate Z for each head, but the next layer of the encoder expects one matrix, not h matrices. Therefore, the earlier split has to be undone: the Zs of the heads are concatenated and then fused into a single output Z by a fully connected layer. The merge is essentially the inverse of the split, accomplished by reshaping the result so that the head dimension disappears. The steps are as follows:

To make the multi-head results easy to concatenate, we first transpose h to the penultimate dimension, i.e. swap the head and sequence dimensions of the attention output. The shape changes from (batch_size, h, seq_len, d_k) to (batch_size, seq_len, h, d_k). This corresponds to number 5 in the diagram.
The attention output is copied into a contiguous block of physical memory. This is a deep copy and does not modify the original data. This corresponds to number 6 in the diagram.
The head dimension is folded away by reshaping to (batch_size, seq_len, d_model). This effectively concatenates the output vectors of all heads into one merged vector per token. This corresponds to number 7 in the diagram.
The concatenated outputs are fused by a linear transformation in a fully connected layer. The last dimension after this layer is still d_model. This corresponds to number 8 in the diagram.

As can be seen, the output matrix Z of Multi-Head Attention has the same dimension as its input matrix X.
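
Steps 5 to 8 can be reproduced on a dummy tensor. In this sketch the sizes are example values and `w_o` is a freshly initialized stand-in for the W^O layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, h, seq_len, d_k = 5, 8, 10, 64
d_model = h * d_k
z = torch.randn(batch_size, h, seq_len, d_k)   # per-head attention outputs
w_o = nn.Linear(d_model, d_model)              # stand-in for W^O

z_t = z.transpose(1, 2)                        # step 5: (5, 10, 8, 64)
was_contiguous = z_t.is_contiguous()           # the transposed view is not contiguous
z_cat = z_t.contiguous().view(batch_size, -1, d_model)   # steps 6 and 7: (5, 10, 512)
out = w_o(z_cat)                               # step 8: (5, 10, 512)
```

Calling view directly on `z_t` would raise an error because its memory layout no longer matches the requested shape, which is why contiguous() comes first.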

forward() function

The above operational logic corresponds to the forward() function of MultiHeadedAttention, as detailed below.

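 def forward(self, query, key, value, mask=None):
     """
     This function implements Figure 2 of the paper (the multi-head attention architecture).
     - query, key, value: not the Q, K, V of the paper's formula (already multiplied by W^Q, W^K, W^V) but the raw input X. Each has shape (batch_size, seq_len, d_model)
     - mask: the mask tensor the attention mechanism may need, None by default
     """
     if mask is not None:
         # The same mask is applied to all h heads
         # With single-head attention, mask and X are both 3-D tensors (X is (batch_size, seq_len, d_model)). With multi-head attention a head dimension is inserted at position 1, so X becomes (batch_size, h, seq_len, d_model/h); the mask must likewise be expanded to 4-D so that it can be applied to the attention scores later
         mask = mask.unsqueeze(1) # add one dimension to mask
     nbatches = query.size(0) # get batch_size
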
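     # 1) Do all the linear projections in batch from d_model => h x d_k
     """
     1). Perform the linear projections from d_model to h x d_k in one batch, i.e. compute the multi-head Q, K, V, so the shape of query, key and value changes from (batch_size, seq_len, d_model) to (batch_size, h, seq_len, d_model/h).
        zip(self.linears, (query, key, value)) pairs the three linear layers (self.linears[0], self.linears[1], self.linears[2]) with (query, key, value).
        The for loop then passes query, key and value through their linear layers in turn. Each iteration does the following:
         1.1 Compute the self-attention Q, K, V with W^Q, W^K, W^V (the first three entries of self.linears). At this point Q, K, V have shape (batch_size, seq_len, d_model). Code: lin(x).
         Taking self.linears[0](query) as an example: self.linears[0] is a (512, 512) matrix and query is (batch_size, seq_len, d_model), so after the multiplication each token of the new query is still a 512-dimensional (d_model) vector.
         key and value are handled in exactly the same way.
         1.2 Split the projected output into heads, i.e. add a dimension by turning the last dimension into (h, d_model/h). The shape changes from (batch_size, seq_len, d_model) to (batch_size, seq_len, h, d_model/h). Code: `view(nbatches, -1, self.h, self.d_k)`, where -1 is an inferred dimension whose value PyTorch computes automatically.
         This gives a 64-dimensional query, key and value for each of the 8 heads, so each head sees the sentence through one part of the word features.
         1.3 Swap the seq_len and head dimensions so that the head dimension comes first. The final shape is (batch_size, h, seq_len, d_model/h). Code: `transpose(1, 2)`. The swap makes the later matrix multiplications and per-head attention convenient. It also puts the sentence-length dimension and the feature dimension next to each other, so that attention can relate word meaning to sentence position: as the attention function shows, it operates on the last and second-to-last dimensions of its input. This yields the input of each head.
         Heads and batch are both parallel dimensions, so they are treated alike during computation. On a GPU, work is mostly partitioned by batch_size * number of heads: several samples are processed in parallel, and for any given token the n heads can be understood as being computed in parallel.
     """
     query, key, value = [
         lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2) # numbers 2 and 3 in the diagram
         for lin, x in zip(self.linears, (query, key, value)) # number 1 in the diagram
     ]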

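     # 2) Apply attention on all the projected vectors in batch.
     """
     2) Apply attention to the projected vectors in one batch: once Q, K, V are available, the attention function computes the attention result. Because the head dimension has been moved to position 1, each head of Q is dot-producted with the matching head of K and V. Then:
        x has shape (batch_size, h, seq_len, d_model/h).
        self.attn has shape (batch_size, h, seq_len, seq_len)
     """
     x, self.attn = attention( # number 4 in the diagram
         query, key, value, mask=mask, dropout=self.dropout
     )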

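     # 3) "Concat" using a view and apply a final linear.
     """
     3) Concatenate the outputs of the heads so that the result has the same shape as the input.
        After multi-head attention we have a 4-D tensor holding the result of every head. It must be converted back to the input shape for later computation, i.e. the heads are merged again, which is the inverse of step 1: transpose dimensions 1 and 2, then reshape x from (batch_size, h, seq_len, d_model/h) to (batch_size, seq_len, d_model).
        3.1 Swap the head and seq_len dimensions, giving (batch_size, seq_len, h, d_model/h). Code: `x.transpose(1, 2).contiguous()`. `contiguous()` copies the tensor into one contiguous block of physical memory (a deep copy that leaves the original data unchanged); only then can view be applied to the transposed tensor.
        3.2 Merge the h and d_model/h dimensions, giving (batch_size, seq_len, d_model). Code: view(nbatches, -1, self.h * self.d_k).
        For example, the 64-dimensional vectors of 8 heads are concatenated into one 512-dimensional vector, and a (512, 512) linear transformation is then applied, leaving the shape unchanged. Because of the residual connection, the output must have the same dimension as the input.
        That is, (5, 8, 10, 64)  ==> (5, 10, 512)
     """
     x = (
         x.transpose(1, 2) # number 5 in the diagram
         .contiguous() # number 6 in the diagram
         .view(nbatches, -1, self.h * self.d_k) # number 7 in the diagram
     )
     del query
     del key
     del value
     # After multi-head attention, each sample is a matrix of shape [src_len, d_model], i.e. the z_i of the heads stacked horizontally. A linear layer (the W^O matrix) transforms this result into the final output of multi-head attention, which is returned.
     # self.linears[-1] has shape (512, 512), so the final output is still (batch_size, seq_len, d_model).
     return self.linears[-1](x) # number 8 in the diagram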
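
The forward() function calls an `attention` helper that is not shown in this section. The sketch below is one plausible definition, modeled on scaled dot-product attention as used in the Harvard code; treat the exact signature and the -1e9 fill value as assumptions.

```python
import math
import torch

def attention(query, key, value, mask=None, dropout=None):
    "Scaled dot-product attention over the last two dimensions (assumed definition)."
    d_k = query.size(-1)
    # (..., seq_len, d_k) @ (..., d_k, seq_len) -> (..., seq_len, seq_len)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 get a large negative score, so softmax gives ~0.
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn

torch.manual_seed(0)
q = k = v = torch.randn(5, 8, 10, 64)     # (batch_size, h, seq_len, d_k)
x, attn = attention(q, k, v)
```

Because the function only touches the last two dimensions, the same code serves single-head input of shape (batch_size, seq_len, d_k) and multi-head input of shape (batch_size, h, seq_len, d_k).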

3.3 Call

Next, let’s see how to call it. In the Transformer, MultiHeadedAttention is used in three places: once in the Encoder layer and twice in the Decoder layer.

encoder

The purpose of self-attention in the Encoder is to find relationships with itself. Therefore, for its internal multi-head attention mechanism, when calling MultiHeadedAttention.forward(query, key, value, mask), the query, key, and value are all the same input value X or the output of the lower layer (corresponding to the Transformer architecture). In the code, this corresponds to the following:

class EncoderLayer(nn.Module):
    "Encoder is made up of self-attn and feed forward (defined below)"
    def forward(self, x, mask):
        # This calls MultiHeadedAttention.forward(query, key, value, mask)
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)

decoder

The Decoder uses attention for two purposes:

Self-attention finds semantic relationships within the output sequence itself, i.e. it identifies which tokens in the target sequence are most relevant to each token.
Cross-attention aligns the source sequence with the target sequence.

Therefore:

For the masked multi-head attention at the beginning of the decoder layer, when calling MultiHeadedAttention.forward(query, key, value, mask), the query, key, and value are all the same value X (the input to the decoder). The mask prevents each position from accessing future inputs: so that the whole decoder input can be fed in parallel at once, the mask covers the positions after the current time step.
For the cross-attention in the Decoder, the Encoder output memory is used as the key and value, and the output of the lower sublayer is used as the query of this layer.

The code is as follows:

class DecoderLayer(nn.Module):
    "Decoder is made of self-attn, src-attn, and feed forward (defined below)"
    def forward(self, x, memory, src_mask, tgt_mask):
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)
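
The effect of tgt_mask can be sketched as follows. The helper name `subsequent_mask` and the all-zero scores are assumptions made for this example; the point is how unsqueeze(1) lets one mask broadcast over every head.

```python
import torch

def subsequent_mask(size: int) -> torch.Tensor:
    "Boolean mask of shape (1, size, size): True where attention is allowed."
    upper = torch.triu(torch.ones(1, size, size), diagonal=1)
    return upper == 0

tgt_mask = subsequent_mask(4)             # (1, 4, 4), lower-triangular
mask_4d = tgt_mask.unsqueeze(1)           # (1, 1, 4, 4): broadcasts over batch and heads
scores = torch.zeros(2, 8, 4, 4)          # dummy scores: (batch_size, h, seq_len, seq_len)
masked = scores.masked_fill(mask_4d == 0, -1e9)
attn = masked.softmax(dim=-1)
```

With equal scores, position 0 attends only to itself and position 1 splits its weight evenly between positions 0 and 1, so no position receives weight from the future.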

0x04 Improvement

Some improvements have also been made to multi-head attention. The figure below shows a comparison of some head composition approaches to attention.

1217

4.1 MOHSA

The main reason for the success of the Transformer model is the effective information exchange between different tokens, allowing each token to gain a global view of the context. However, the query, key, and value in each head are separate and do not overlap, and there is no information exchange when calculating attention across heads. In other words, when calculating the attention of the current head, it has no information from other heads. Although tokens are processed through linear projection after attention, the information exchange at that time is limited to each token.

The paper “Improving Vision Transformers by Overlapping Heads in Multi-Head Self-Attention” explores this topic. The authors propose that information exchange can improve performance during attention computation in Vision Transformers. This can be achieved by overlapping the queries, keys, and values of each Head with those of adjacent Heads. To this end, the authors propose a method called MOHSA (Multi-Overlapped-Head Self-Attention), which improves the multi-head self-attention mechanism by overlapping Heads. This allows the Q, K, and V values in each Head to be influenced by the Q, K, and V values of its adjacent Heads during attention computation. This inter-Head information exchange can lead to better performance for Vision Transformers. (See figure).

1218

To enable information exchange between Heads, the authors use a soft split instead of a hard split when partitioning Q, K, and V into Heads. Because adjacent Heads overlap, information from neighboring Heads also takes part in the attention calculation of the current Head. Since overlapping increases the token dimension after the outputs of the Heads are concatenated, a linear projection reduces it back to its original size.
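
One way to picture the soft split is a sliding window over the feature dimension. The sketch below is an illustration only: the function name `overlapped_split`, the symmetric overlap, and the zero padding at the two boundary heads are assumptions of this example and may differ from the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

def overlapped_split(x: torch.Tensor, h: int, overlap: int) -> torch.Tensor:
    """Split the last dimension into h windows that overlap their neighbours.

    x: (batch_size, seq_len, d_model) -> (batch_size, seq_len, h, d_k + 2 * overlap)
    """
    d_k = x.size(-1) // h
    padded = F.pad(x, (overlap, overlap))     # zero-pad the feature dimension on both sides
    # Windows of width d_k + 2 * overlap, moved in steps of d_k.
    return padded.unfold(padded.dim() - 1, d_k + 2 * overlap, d_k)

x = torch.arange(8.0).view(1, 1, 8)           # one token with features 0..7
hard = overlapped_split(x, h=4, overlap=0)    # ordinary non-overlapping split
soft = overlapped_split(x, h=4, overlap=1)    # each head also sees one feature per neighbour
```

With overlap=1, head 1 receives features [1, 2, 3, 4] instead of [2, 3], and the concatenated width grows from 8 to 16, which is the increase the final linear projection has to undo.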

1219

4.2 MoH

The paper “MoH: Multi-Head Attention as Mixture-of-Head Attention” draws on the idea that not all attention heads are equally important, proposing a new architecture of Mixture-of-Head (MoH). This architecture treats attention heads as experts in a Mixture-of-Experts (MoE) mechanism, thus upgrading the core of the Transformer model—the multi-head attention mechanism. MoH has two significant advantages:

It allows each token to select the attention heads that suit it, thereby improving inference efficiency without sacrificing accuracy or increasing the number of parameters.
The use of weighted summation instead of standard summation in multi-head attention introduces flexibility to the attention mechanism without increasing the number of parameters and unlocks additional performance potential.

The overall architecture of MoH is shown on the right side of the figure below, which includes multiple attention heads and a router (activating the Top-K heads). The output of MoH is a weighted sum of the outputs of the K selected heads.

1220
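
The Top-K weighted sum can be sketched as follows. This is a simplified single-stage router written for illustration: the function name `mixture_of_heads`, the softmax over only the selected scores, and the assumption that each head output is already projected to d_model are choices of this example, and shared heads and load balancing are left out.

```python
import torch

def mixture_of_heads(head_outputs: torch.Tensor, router_logits: torch.Tensor, k: int) -> torch.Tensor:
    """Weighted sum over the top-k heads chosen per token.

    head_outputs:  (batch_size, seq_len, h, d_model), each head already projected
    router_logits: (batch_size, seq_len, h)
    """
    top_vals, top_idx = router_logits.topk(k, dim=-1)
    weights = torch.zeros_like(router_logits)
    # Selected heads get softmax-normalized weights; all other heads stay at zero.
    weights.scatter_(-1, top_idx, top_vals.softmax(dim=-1))
    return (weights.unsqueeze(-1) * head_outputs).sum(dim=2)

torch.manual_seed(0)
outs = torch.randn(2, 5, 8, 16)      # 8 heads, hypothetical sizes
logits = torch.randn(2, 5, 8)        # stand-in for the router's scores
y = mixture_of_heads(outs, logits, k=2)
```

Heads that are not selected for a token receive weight zero, so in a real implementation their computation can be skipped for that token.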

The main improvements of MoH are shown in the figure below.

Shared heads: a subset of heads is designated as shared heads that are always active. These shared heads consolidate common knowledge and reduce redundancy among the other, dynamically routed heads.
Two-stage routing: the routing score is determined by the score of each individual head and the score associated with the head type. The formula for the routing score is shown in Figure 1 below.
Load balancing loss: To avoid unbalanced loads, load balancing loss is applied. The formula is shown in Figure 2 below.
Overall training objective: The total training loss is a weighted sum of the task-specific loss and the load balancing loss, as shown in Figure 3 below. Here, β is the tradeoff hyperparameter, set to 0.01 by default.

1221

4.3 DCMHA

The paper “Improving Transformers with Dynamically Composable Multi-Head Attention” proposes replacing the core component of the Transformer, the Multi-Head Attention Module (MHA), with Dynamically Composable Multi-Head Attention (DCMHA). This removes the fixed binding between the lookup-selection circuit (QK) and the transformation circuit (OV) of each MHA attention head, allowing them to be combined dynamically according to the input, fundamentally improving the expressive power of the model.

DCMHA can be roughly understood as follows: where each layer originally had a fixed number of H attention heads, now, with almost the same number of parameters and computing power, up to H x H attention heads can be dynamically combined as needed. This plug-and-play approach allows it to replace MHA in any Transformer architecture, resulting in the general, efficient, and scalable new architecture DCFormer.

Research Background

In the Transformer’s Multi-Head Attention Module (MHA), each attention head operates completely independently. This design has been highly successful in practice due to its simplicity and ease of implementation, but it also has drawbacks: the low rank of the attention score matrix weakens expressive power, and the redundancy of attention head functions wastes parameters and computational resources. Based on this, recent research has attempted to introduce some form of interaction between attention heads.

Motivation

According to Transformer circuits theory, in MHA the behavior of each attention head is characterized by four weight matrices $W^Q$, $W^K$, $W^V$, $W^O$ (where $W^O$ is obtained by splitting the output projection matrix of MHA), where:

$W^QW^K$ is called the QK circuit (or lookup-selection circuit), which determines which token(s) in the context to focus on from the current token.
$W^OW^V$ is called the OV circuit (or projection-transformation circuit), which determines what information (or what attributes) to retrieve from the token of interest and write into the residual stream at the current position, thereby predicting the next token.

Researchers noted that finding (where to get) and transforming (what to get) are two separate things that should be specified separately and combined freely as needed. MHA forcibly put them into a single attention head QKOV and “bundled them together,” limiting flexibility and expressiveness.

Ideas

Starting from this point, the research team introduced the compose operation into MHA, resulting in DCMHA as shown in the figure below.

1222

To maximize expressive power, the researchers proposed determining dynamically how attention heads are combined, i.e. the mapping matrix is generated from the input. To reduce computational overhead and memory usage, they further decomposed the mapping matrix into an input-independent static matrix $W_b$, two low-rank matrices $w_1$, $w_2$, and a diagonal matrix $Diag(w_g)$. These terms are summed and are responsible, respectively, for the basic combination, the dynamic low-rank combination between attention heads (i.e., rank R <= 2), and the dynamic gating of each head itself. The latter two are generated dynamically from the Q matrix and the K matrix. The specific formulas are shown in the figure below:

1223

The following diagram illustrates how compose is calculated.

1224

0xFF Reference

On the Role of Attention Masks and LayerNorm in Transformers

MoH: Multi-Head Attention as Mixture-of-Head Attention

Improving Transformers with Dynamically Composable Multi-Head Attention

High score at ICML 2024! Modified attention mechanism allows small models to outperform models twice their size.

Quantum Bit

PaLM: Scaling Language Modeling with Pathways

BERT Performance Optimization: Integrating Multi-Head Attention in an Alternative Way (Qiu Zhenyu)

What exactly is the role of Multi-Head-Attention? (MECH)

Metaphysics Transformer Mejpoldan Winter

Understanding Attention: From its Origins to MHA, MQA, and GQA

[Hardcore] Thoroughly Understanding Multi-Head Attention: A Comprehensive Analysis of Andrej Karpathy’s MHA Code (Choosing a Good Name is Really Hard)

Understanding Transformers from the Bottom Up (5): From Attention Layers to Transformer Networks

Multiscale Visualization of Attention in the Transformer Model

What Does BERT Look At? An Analysis of BERT’s Attention

Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention

Adaptively Sparse Transformers

Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

Are Sixteen Heads Really Better than One?

A Fundamental Insight into the Multi-Head Self-Attention Mechanism of Transformers Author: Nikolas Adaloglou Translator: Wang Qingfa

Transformer Series: Multi-Head Attention Network Structure and Code Analysis (xiaogp)

Transformer Series: Detailed Analysis and Code Demonstration of Residual Connectivity Principles (xiaogp)