Exploring the Transformer Series (1): Attention Mechanism

0x00 Overview

Due to various commitments, I haven’t blogged for a long time, and I haven’t had time to organize some drafts I wrote earlier. (I also hadn’t logged into my blog or WeChat, so I only recently discovered many unread comments and private messages; I sincerely apologize to everyone.) I’m resuming updates now because several students from non-AI fields recently asked me for good learning materials to get started with Transformers quickly. I searched online but couldn’t find material that was both suitable and systematic, so I decided to write a series myself. While organizing these materials, I also discovered that many things I thought I understood were plausible but wrong, so this series is also a process of organizing, learning, and improving for myself.

This series attempts to analyze Transformer from scratch, with the goal of…

Explain how and why the Transformer works, serving as a beginner’s guide to the Transformer.
Incorporate some newer or more distinctive papers and ideas, so that even experienced readers can gain new insights from the series.

A few points to note:

This series is a study and interpretation of papers, blogs, and code, drawing heavily on articles from online friends. I would like to express my gratitude to them, and their names will be listed in the references. Because there are so many references in this series, there may be omissions in the citations. If the original authors find any omissions, please point them out, and I will add them to the bibliography.
Some content in this series is the result of my personal analysis and reflection (reverse reasoning or conjecture), and may differ from the original authors’ thinking or the actual historical trajectory. I’ve written it this way because this derivation method allows me to provide an intuitive and reasonable explanation. If there are any errors in my understanding, please point them out.
For certain topics, this series incorporates newer or more distinctive explanations. However, my time and energy are limited, so I cannot read all of the literature; if any excellent work has been omitted, please point it out.

This is the first article in a series, primarily aimed at introducing the concept of Transformer and its related background. In 2017, Vaswani et al. from Google Brain published the Transformer in their paper “Attention is All You Need.” The original paper defines Transformer as follows:

Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence aligned RNNs or convolution.

This definition mentions concepts such as sequence transduction, RNNs, convolution, and self-attention, so we will begin our analysis with these concepts. We start with an introduction to seq2seq, then gradually move to the attention mechanism, and finally derive the Transformer architecture.

0x01 Background Knowledge

In this section, we will introduce some background knowledge and concepts.

1.1 seq2seq

The concept of seq2seq (Sequence to Sequence) goes back to the 2014 paper “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation” by Cho et al. (with Bengio as a co-author), and refers to generating a target sequence from a source sequence. Since machine translation is a familiar and easily understood task, we will mainly use it in our explanations to avoid introducing too many concepts.

1.2 Text Generation Mechanism

Machine translation is essentially text generation. Language models treat text as a time series. From this perspective, each word is related to the words preceding it, and by learning the statistical patterns of the preceding word sequences, the next word can be predicted. Therefore, machine translation models language from a probabilistic perspective, aiming to maximize the probability that the newly predicted word combined with the previous words forms a complete sentence. This involves autoregressive models.

1.3 Autoregressive Model

Autoregressive models are generative models whose language modeling goal is to predict the next word based on a given context. Following the causal principle (the current word is only influenced by the words preceding it), the core idea of autoregressive models is to use the historical values of a variable to predict its future values. It models the generation of sequence data as a process of progressively predicting the conditional probability of each new element. At each time step, the model predicts the probability distribution of the current element based on the previously generated elements.

The image below shows an example of an autoregressive model. In each inference iteration, the model predicts only one token. The output token is appended to the previous input tokens to form the input for the next iteration. This process continues until an end-of-sentence symbol is output or the sentence length reaches a preset maximum. In the image below, the model executes as follows:

The input to the first round of the model is “You should”.
The first round of model inference outputs “wear”.
The predicted first word “wear” is combined with the original input and provided to the model, so that the second input to the model is “You should wear”.
The second model inference outputs “shoes”.
The predicted second word “shoes” is combined with the original input and fed to the model, so that the third input to the model is “You should wear shoes”.
The third model inference outputs the end-of-sentence symbol, which concludes this prediction.

Each prediction step in this process depends on the result of the previous prediction, and starting from the second round, the inputs of the two rounds differ by only one token.

Image source: the paper “LLMCad: Fast and Scalable On-device Large Language Model Inference”.
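The decoding loop described above can be sketched in a few lines of Python. This is only a minimal illustration: the lookup table `NEXT_TOKEN` and the function `toy_model` are hypothetical stand-ins for a trained model, and the `<eos>` marker and `max_len` threshold are assumptions chosen for the example.

```python
# Hypothetical lookup table standing in for a trained model:
# it maps the full context seen so far to the next token.
NEXT_TOKEN = {
    ("You", "should"): "wear",
    ("You", "should", "wear"): "shoes",
    ("You", "should", "wear", "shoes"): "<eos>",
}

def toy_model(context):
    """Return the next token for a context; unknown contexts end the sentence."""
    return NEXT_TOKEN.get(tuple(context), "<eos>")

def generate(prompt, max_len=10):
    """Greedy autoregressive decoding: exactly one new token per iteration."""
    tokens = list(prompt)
    while len(tokens) < max_len:          # preset maximum length
        next_token = toy_model(tokens)    # predict a single token
        if next_token == "<eos>":         # stop at the end symbol
            break
        tokens.append(next_token)         # the output joins the next input
    return tokens

result = generate(["You", "should"])
print(result)
```

Note that consecutive inputs to `toy_model` differ by exactly one token, which is the property the text points out.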

The autoregressive model has several drawbacks:

Errors can easily accumulate, leading to poor training results because subsequent inferences depend on the outputs of previous inferences. In the early stages of training, the model is still immature, and a few random outputs can make it difficult for subsequent training to learn anything. “One wrong step leads to another,” making training extremely unstable, difficult to converge, and wasting training resources.
It can only be done serially, which means it is difficult to conduct training in a parallel manner and improve efficiency.

1.4 Autoregressive Model with Latent Variables

Latent variable models introduce latent variables to represent past information. When making predictions, such a model summarizes the past observations in a state, denoted $h_t$, and then uses it to predict $x_t$. The state is updated as $h_t = g(h_{t-1}, x_{t-1})$.

The model then estimates $x_t$ from the conditional distribution $P(x_t \mid h_t)$.

Because $h_t$ is never directly observed, $h_t$ is a latent variable, and such models are also known as latent autoregressive models.

After introducing $h_t$ , prediction transforms into two sub-problems:

One problem is how to use the previous latent variable $h_{t-1}$ and the previous input $x_{t-1}$ to obtain the current latent variable $h_t$. The other is how to use the current latent variable $h_t$ and the previous input $x_{t-1}$ to obtain the current $x_t$.

In fact, this is the problem that encoder-decoder models have to solve.
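To make the two sub-problems concrete, here is a small sketch under stated assumptions: the state is a single number, the vocabulary has three token ids, and the tables `EMBED` and `READOUT` are made-up values rather than learned parameters. The function `g` plays the role of the state update, and `p_next` plays the role of $P(x_t \mid h_t)$.

```python
import math

EMBED = [-1.0, 0.0, 1.0]      # hypothetical scalar embedding per token id
READOUT = [1.0, -0.5, -1.0]   # hypothetical read-out weight per token id

def g(h_prev, x_prev):
    """Sub-problem 1: h_t = g(h_{t-1}, x_{t-1})."""
    return math.tanh(0.5 * h_prev + EMBED[x_prev])

def p_next(h):
    """Sub-problem 2: a distribution P(x_t | h_t) via softmax over token ids."""
    logits = [w * h for w in READOUT]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def step(h_prev, x_prev):
    """One prediction step: update the latent state, then pick the likeliest token."""
    h = g(h_prev, x_prev)
    probs = p_next(h)
    x = max(range(len(probs)), key=probs.__getitem__)
    return h, x, probs

h1, x1, probs1 = step(0.0, 2)
```

The latent state `h` is never observed directly; only the token ids are, which is why it is called a latent variable.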

1.5 Encoder-Decoder Model

Currently, most neural network models for sequence translation are encoder-decoder models. A plain RNN that emits one output per input step is only suitable for tasks where the input and output have equal length. However, in machine translation the output and input usually differ in length. Therefore, to handle variable-length inputs and outputs, researchers used a fixed-length state as a bridge between the two. This led to a new architecture: the first part of the RNN only takes input, and the second part only produces output (the output of the previous step is fed back as the input of the next step to supplement information). The two parts exchange information through a hidden state. If we regard the hidden state as an encoding of the input, the first part can be called the encoder and the second part the decoder. This is the encoder-decoder architecture, shown in the figure below, in which the encoder and decoder exchange information through an intermediate hidden state C.

The functions of the encoder and decoder are as follows:

The encoder compresses all semantic information of the input sentence into a fixed-length intermediate semantic vector (also called a context vector, latent vector, or latent state). This vector contains computable, learnable features representing the linguistic characteristics and meaning of the sentence; it is a condensed summary of the input. The specific logic is as follows:
- The encoder processes the input sentence $X = (x_1, \ldots, x_n)$ .
- Each word is processed, and a hidden state is generated after each word is processed.
- Starting from the second word in the input, the encoder’s input at each time step is the hidden state of the previous time step and the new word in the input.
- The hidden state at the last time step of the encoder is the semantic context that encodes the meaning of the entire sentence. This is a fixed-length, high-dimensional feature vector $C = (z_1, \ldots, z_d)$, whose length $d$ does not depend on the input length $n$. The information from every time step of the input sentence is contained within this context.
The decoder uses this intermediate semantic context vector to decode the output sentence. In other words, the decoder transforms the feature information learned by the encoder into the corresponding sentence. The specific logic is as follows:
- At each time step, the decoder is autoregressive, meaning it iterates through the output of the previous time step (the generated character) $y_{t-1}$ as one input at the current moment $t$ , generating the character at the current time $y_t$ .
- The decoder’s initial input is the intermediate semantic context vector $C$ . Based on $C$ , the first output word and the new hidden state are calculated, meaning each prediction of the decoder is subtly influenced by the previous output word and hidden state.
- The decoder then uses the new hidden state and the first output word as joint input to compute the second output word, and so on, until the decoder produces an EOS (end-of-sequence) marker or reaches the predetermined maximum sequence length.

From a macro perspective, the core of sequence modeling is to study how to compress the context of a long sequence into a smaller state.
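The encoder and decoder roles described above can be sketched as two small functions. Everything here is an illustrative assumption rather than a real model: the context size `D`, the character-based `embed` function, and especially `toy_step`, a hypothetical decoder step that emits a fixed reply and ignores the numbers in the state. The point is only the interface: any input length is squeezed into a vector of length `D`, and decoding is a loop that stops at `<eos>` or at `max_len`.

```python
import math

D = 4  # hypothetical size of the fixed-length context vector C

def embed(token):
    """Hypothetical deterministic embedding: D numbers derived from the characters."""
    code = sum(ord(ch) for ch in token)
    return [math.sin(code * (i + 1)) for i in range(D)]

def encoder_step(h, token):
    """New hidden state from the previous hidden state and the new word."""
    x = embed(token)
    return [math.tanh(0.5 * h_i + x_i) for h_i, x_i in zip(h, x)]

def encode(tokens):
    """Fold the whole input into one vector; the last hidden state is C."""
    h = [0.0] * D
    for token in tokens:
        h = encoder_step(h, token)
    return h

def decode(context, step_fn, max_len=10, eos="<eos>"):
    """Autoregressive decoding: each step sees the state and the previous output."""
    state, prev, out = context, "<bos>", []
    while len(out) < max_len:
        state, token = step_fn(state, prev)
        if token == eos:
            break
        out.append(token)
        prev = token
    return out

def toy_step(state, prev):
    """Hypothetical decoder step with a hard-coded reply (not a trained model)."""
    reply = {"<bos>": "我", "我": "爱", "爱": "你", "你": "<eos>"}
    return state, reply[prev]

C = encode(["I", "love", "you"])
translation = decode(C, toy_step)
```

Because `encode` always returns `D` numbers, a three-word sentence and a fifty-word sentence are forced into the same amount of space, which is the compression question the text raises next.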

1.6 How to compress

How should we compress? One natural idea is the Markov assumption, which states that the future state of a system depends only on its current state. This is also known as the recency effect: from a text generation perspective, the current word is most relevant to the words closest to it. If we condition on the preceding n-1 words, we get the N-gram model, in which the probability of the current word depends only on the preceding n-1 words. However, models based on the Markov assumption struggle with long-distance dependencies (where a word depends on a much earlier word in the sentence) and do not capture deep semantic relationships. Furthermore, the size of an N-gram model grows roughly exponentially with n: with a vocabulary of $V$ words there are up to $V^n$ possible n-word combinations, so a large n consumes excessive resources. Therefore, a new model is needed. It should not only capture word frequency and order but also handle longer-distance dependencies, without explicitly enumerating so many word combinations. Thus the idea of fitting with neural networks emerged.
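As a rough illustration of the N-gram idea above, the following sketch trains a bigram model (n = 2) by counting, so each word is predicted from the single preceding word. The three-sentence corpus and the helper names `train_bigram`, `next_word_probs`, and `table_size` are made up for this example.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return counts

def next_word_probs(counts, prev):
    """Relative frequencies of the words seen after `prev` (empty if unseen)."""
    total = sum(counts[prev].values())
    return {word: c / total for word, c in counts[prev].items()}

def table_size(vocab_size, n):
    """Upper bound on the number of distinct n-word combinations."""
    return vocab_size ** n

corpus = ["you should wear shoes", "you should wear a coat", "you should rest"]
model = train_bigram(corpus)
probs = next_word_probs(model, "should")
```

With a hypothetical vocabulary of 10,000 words, `table_size(10000, 3)` is already 10^12 possible trigrams, which is why simply increasing n is not a practical route to longer dependencies.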

MLP (multilayer perceptron) is one of the most basic neural network models. It maps a sequence of word vectors to a fixed-length vector representation, then feeds this vector into a softmax layer to compute the probability distribution of the next word. While MLPs in theory do not suffer from the distance-dependency problem, they are difficult to train well. CNN, RNN, and Transformer architectures can be seen as adding constraints to MLPs. These prior constraints reduce the optimization difficulty for the same number of parameters, making it easier for the model to find a good solution. The previously mentioned idea of fitting with neural networks refers to using CNNs, RNNs, or Transformers to implement the encoder and decoder.

Since this series focuses on the Transformer, the implication is that the Transformer has advantages for implementing encoders and decoders. To see why, let’s first look at the problems with CNN and RNN solutions.

0x02 CNN and RNN solutions

Note: This section only explains the issues in a general sense or on typical problems, and is not a definitive conclusion. Because CNN and RNN solutions are constantly evolving, solutions at a certain stage may have solved (or alleviated) the problems mentioned below.

2.1 Technical Challenges

When faced with long and information-dense input sequences, the encoder-decoder model’s ability to maintain relevance throughout the decoding process may weaken. To better illustrate this, let’s first look at the main technical challenges faced by sequence transformation: alignment problems and long dependency problems (or forgetting problems).

Alignment

Let’s look at why alignment is necessary. First, in some fields (such as speech recognition), although the input and output are in the same order, there isn’t a one-to-one correspondence. Second, in some fields (such as machine translation), the model’s output sequence may not be in the same order as the input sequence. Taking machine translation as an example, if we want the model to translate the English “how are you” into the Chinese “你好吗”, or “Where are you” into “你在哪儿”, we find that the word order of the translation differs from that of the original sentence. Furthermore, some words in the translation do not correspond one-to-one with words in the English.

The biggest challenge posed by the misalignment problem is that at time t in the time series, we cannot be certain whether the model has obtained all the information needed to output the correct result. Therefore, people often encode all inputs into a hidden state first, and then decode this hidden state step by step. This ensures that the model has received all the necessary information during the decoding process. Although this approach guarantees the integrity of the input information, it has a significant drawback: the contribution cannot be determined during the decoding process. For example, when translating “I love you” into “我爱你” (I love you), “我” (I) should be aligned with “I” (because its contribution is the largest), but in this approach, the contributions of “I,” “love,” and “you” to “我” are all equal.

Long-term dependency

Let’s take the following sentence as an example for analysis. “The weak, lingering sound of cicadas in autumn is a specialty of the North, because trees grow everywhere in Beijing and the houses are low, so their singing can be heard no matter where you are.”

When translating the example sentence between Chinese and English, there is a clear alignment between the two languages, so it is necessary to know which English word corresponds to which Chinese word. For example, translating “他们” as “they” raises a question: does “他们” refer to the trees, the houses, or the cicadas? Based on the word “singing” and our prior knowledge, we know that “他们” refers to the cicadas. However, if we change “听得见他们啼唱” (their singing can be heard) to “看见他们树荫” (their shade can be seen), then “他们” refers to the trees.

Humans can easily see the words “cicadas” and “they” at the same time and associate them, understanding that there is a long-range dependency between them, and thus comprehend the entire sentence. For a computer or model, however, the distance between “cicadas” and “they” in the sentence is large, so the model is easily distracted by the words in between. To arrive at the correct answer, the neural network needs to model the interaction between “the weak, lingering sound of cicadas” and “they.” Some neural networks struggle with long-range dependencies because a key factor in handling them is the length of the path a signal must traverse within the network. The shorter the path between two positions, the easier it is for the network to learn the dependency; the longer the path, the harder it is to model. If the model cannot handle long-range dependencies, information from far back is lost, resulting in the forgetting problem.

Next, we’ll examine how CNN and RNN solutions address these two technical challenges.

2.2 CNN Solution

The essence of CNNs is to learn local dependencies in spatial data. While CNN convolution operations can extract important features, because the length of a single convolutional kernel is generally small, the receptive field of CNN convolution is local, extracting local features and performing local information computation. In other words, CNNs are sensitive to relative positions but not to absolute positions, making it difficult to extract long-range dependencies in sequences.

To enable CNNs to process long sequences, researchers typically stack more convolutions, expanding the local receptive field by layering multiple convolutional regions. This allows the convolutional network to compensate for the lack of global information through depth, thereby capturing long-range dependencies. In this approach, different convolutional layers provide features at different levels, and long-range dependency calculations are then performed at higher layers. This compresses the information of long sequences into a single convolutional window, giving the model the opportunity to capture long-range dependencies and complex structural relationships.

For example, as shown in the diagram below, the bottom-level CNN uses sliding windows to process the text sequence, with each window processing the data. Window A1 captures the information “autumn cicada,” and window A3 captures the information “they.” However, because “autumn cicada” and “they” are too far apart, no single window can establish a dependency between these two words; that is, no single window can simultaneously see both words. Therefore, the model can only continuously stack convolutional networks, deepening the entire network, so that window C1 can simultaneously contain the information of both “autumn cicada” and “they.”

However, depth implies indirectness, and indirectness implies loss. In CNN schemes, because information is “processed and abstracted layer by layer,” and the information transmission process is not transparent enough, only a portion of the information is retained after passing through too deep a network, leading to a decrease in model performance. Therefore, CNNs are generally used less in scenarios involving long dependency modeling and are more suitable for short text computation.
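The stacking argument above can be quantified with a small calculation. The sketch assumes the simplest setting, which the text does not specify: 1D convolutions with stride 1, no dilation, and the same kernel size in every layer. Under those assumptions each extra layer widens the receptive field by `kernel_size - 1` positions.

```python
import math

def receptive_field(num_layers, kernel_size):
    """Input positions visible to one output unit after stacking the layers."""
    return 1 + num_layers * (kernel_size - 1)

def layers_needed(distance, kernel_size):
    """Fewest layers so that two tokens `distance` positions apart share a window."""
    return math.ceil(distance / (kernel_size - 1))

depth_for_30 = layers_needed(30, 3)
```

For example, with a kernel of size 3, two words that are 30 positions apart only meet in a single window after 15 stacked layers, so the signal connecting them has to survive 15 rounds of abstraction.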

2.3 RNN Solution

Superficially, RNNs appear to be temporal structures, with later time steps naturally depending on the outputs of earlier time steps. Essentially, RNNs are connectionist models with the ability to selectively pass information between sequential steps, capturing contextual information and local dependencies between elements of different ranges. The unique feature of RNNs is the introduction of a “memory” function, allowing the network to remember previously inputted information. As data flows through the RNN, the memory from previous time steps serves as input to the processing of current data, enabling the model to dynamically integrate temporal context and historical information of the sequence. Because they can effectively handle variable-length sequential data, theoretically, RNNs can predict infinitely long sentences, utilizing all preceding information, making them ideal for translation scenarios.

Approach

In fact, before the advent of Transformers, encoders and decoders were typically built from RNNs or their variants (such as LSTM or GRU). Let’s first look at how to implement an encoder using an RNN. Taking the following diagram as an example, the encoder needs to encode the sentence “北国的特产” (“a specialty of the North”) into a hidden state. Each square in the diagram is a simple RNN unit. Each RNN unit receives two inputs (an input word and a hidden state) and outputs a new hidden state.

In the first step, the model receives h0 and “北” (north), calls the function f(), and obtains the output h1 = f(h0, north). h0 is the initial hidden state (usually all zeros or random values). In the second step, the model receives h1 and “国” (country), calls f() again, and obtains h2. This continues until the model finally outputs h5. At each step t, the information obtained from all previous steps is stored in the hidden state $h_{t-1}$ computed in the previous step, so the computation for each new word uses the results of all previous words.

The hidden state ht can be viewed as a carrier of information loops, carrying information that can be passed across time steps within the RNN. As data flows through the RNN, the activation states from previous time steps become inputs to the processing of the current data, allowing the model to dynamically integrate temporal context and historical information of the sequence. Therefore, theoretically, an RNN can obtain the dependency between any two words through the hidden state; regardless of how far apart the two words are, their information will converge at some point in the computation.
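The recurrence described above can be sketched with a scalar hidden state. This is a toy under explicit assumptions: the weights `W_H` and `W_X` are made-up constants shared across all steps, `f` is a single tanh unit, and the five numbers in `codes` are hypothetical stand-ins for the five characters of the example sentence.

```python
import math

W_H, W_X = 0.5, 1.0   # hypothetical shared weights, reused at every time step

def f(h_prev, x):
    """One RNN unit: combine the previous hidden state with the current input."""
    return math.tanh(W_H * h_prev + W_X * x)

def run_rnn(inputs, h0=0.0):
    """Return [h0, h1, ..., hT]; each state is computed from the one before it."""
    states = [h0]
    for x in inputs:
        states.append(f(states[-1], x))
    return states

codes = [0.1, -0.3, 0.5, 0.2, -0.4]      # hypothetical codes for five characters
states = run_rnn(codes)

# Change only the first input and compare the resulting states.
altered = run_rnn([0.9] + codes[1:])
effect_on_h1 = abs(altered[1] - states[1])
effect_on_h5 = abs(altered[5] - states[5])
```

Changing the first input does change `h5`, so the information does reach the end of the sequence, but with these weights its effect on `h5` is much smaller than its effect on `h1`.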

We then present the encoder-decoder architecture diagram, where $s_i$ is the decoder state at time $t_i$. The encoder reads each input token $x_i$ and generates a hidden state $h_i$ for it. From the first $h_1$ to the last $h_m$, these hidden states continuously accumulate information from the previous ones. The final $h_m$, carrying information about the entire input sequence, becomes the initial state $s_0$ of the decoder.

Advantages

The advantages of RNNs are as follows:

Suitable for processing sequential data. RNNs are naturally well-suited for processing data with time series or sequence structures, such as text, speech, and video. RNNs can flexibly handle input sequences of different lengths and capture dependencies within the sequence.
Capturing long-term dependencies. The hidden state of an RNN at any step contains almost all the information from all previous time steps. Therefore, RNNs can capture long-term dependencies in a sequence, thus overcoming the main limitation of Markov models, which is crucial for many sequence learning tasks.
Weight sharing. RNNs employ a weight sharing strategy when processing sequences, meaning the same weights are used at different time steps. This reduces the number of model parameters and lowers the risk of overfitting.
Fast inference. Each step depends only on the current input and the corresponding hidden state h, so the inference cost per token is essentially constant, and the overall inference time grows linearly with the length of the context.

Disadvantages

The drawbacks of RNNs are equally apparent. In RNN schemes, at each time step, the RNN compresses all previous information in the sequence into a fixed-length hidden vector. Ultimately, the encoder and decoder only exchange information through this fixed-length hidden state. The fixed-length hidden state, or limited memory capacity, leads to several problems when processing long sequences, such as information loss and information bottlenecks.

Lack of expressive ability

The characteristics of RNNs lead to a lack of expressive power, which is reflected in the following aspects:

Since the latent vector has a fixed length, this compression process is lossy compression, which means that the latent vector’s ability to preserve context is inherently limited. Taking text summarization as an example, the latent vector can store all the semantic information of a few hundred words of prose, but it will be insufficient for a novel of tens of thousands of words.
RNNs are partial order structures. Although the word order and grammar of a language itself constitute a partial order structure, there are often additional modifiers, complements, and various clauses, which means that the overall word order does not completely satisfy the partial order structure. Therefore, RNNs are not good at handling complex grammatical structures with long-distance associations.
During decoding, the hidden state at each time step is constructed from the same hidden vector generated by the encoder, which is unreasonable because words at different positions may require different levels and aspects of information. Furthermore, weight sharing leads to assigning the same weight to every word in the input, making it impossible to distinguish the importance of words.

Information loss

Because RNNs lack expressive power, they can lead to information loss problems.

Information may be lost or confused. Furthermore, when new input arrives, existing information may be overwritten or diluted by the new information. This causes the model to focus more on inputs closer to the end of the sequence, while the memory of earlier parts of the sequence decays with increasing distance. The further back in the input, the more information decays; if crucial information appears at the beginning of the sequence, it is easily overlooked.
It is difficult to capture long-distance dependencies. Taking the above figure as an example, the most information contained in h4 is the current input “特”, while the information carried by the initial “北” will be ignored, making it difficult to effectively build the dependency relationship between the two.

Difficult to parallelize

RNNs process the sequence step by step, with the output at each step depending on the previous hidden state and the current input. This computation is inherently sequential, which hinders parallel computation during training, resulting in low training efficiency and long training times.

Difficult to train

The inherent structure of RNNs makes them difficult to train. RNNs rely on a single information transmission path, and this path involves many non-linear activation operations. When processing long sequences, the activation functions nested across time steps cause gradients to decay or grow exponentially during backpropagation; these are the vanishing gradient problem and the exploding gradient problem. When gradients vanish, the gradient signal from later time steps cannot be effectively propagated back to earlier ones, meaning the greater the distance between words, the weaker the influence of earlier words on later ones, which makes it difficult for RNNs to learn long-distance dependencies. When gradients explode, the weight updates become extremely large, making the network unstable. Furthermore, when dealing with long sequences, RNNs require significant memory to maintain their hidden states, for example when an entire sentence or even a whole article must be understood to make a judgment. This memory burden also poses a significant challenge to training.
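The exponential behaviour of the gradient can be seen in a scalar toy recurrence. The sketch assumes $h_t = \tanh(w_h h_{t-1} + x_t)$ with a single recurrent weight `w_h` (a simplification, not any particular RNN implementation) and applies the chain rule to get the derivative of the final state with respect to the initial one. All inputs are set to zero so the state stays at 0, where tanh is not saturated.

```python
import math

def grad_wrt_initial_state(w_h, inputs, h0=0.0):
    """d h_T / d h_0 for h_t = tanh(w_h * h_{t-1} + x_t), via the chain rule."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w_h * h + x)
        grad *= w_h * (1.0 - h * h)   # derivative of tanh(a) is 1 - tanh(a)^2
    return grad

steps = [0.0] * 50                    # hypothetical 50-step sequence of zero inputs
vanishing = grad_wrt_initial_state(0.5, steps)
exploding = grad_wrt_initial_state(2.0, steps)
```

Each time step multiplies the gradient by one more factor, so after 50 steps a recurrent weight of 0.5 leaves a gradient of 0.5^50, while a weight of 2.0 yields 2^50 in this unsaturated case.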

2.4 Current Issues

We summarize the main problems of CNN and RNN as follows:

Alignment issues. Both CNNs and RNNs struggle to achieve perfect alignment between the source and target sequences.
The hidden state length is fixed. This problem mainly exists in RNNs because the size of their hidden vectors is fixed, so the inference performance is limited by the ability to compress information, leading to information loss.
The relational distance problem exists in both RNNs and CNNs. The relational distances between two words in a sequence vary, and when the distances between words are too long, both approaches struggle to determine the dependencies between them. This can weaken the model’s ability to maintain relevance throughout the decoding process when faced with long and information-dense input sequences.

Let’s take a closer look at the “relationship distance problem.” For CNN schemes, the first and last words of a sequence need to go through multiple convolutional layers to establish a relationship. The thick line in the figure below represents the longest distance required to establish a relationship between two words in a CNN structure.

The RNN approach requires examining the sequence “from beginning to end” to determine the distance needed to establish a relationship between the two words. The thick line in the diagram below represents the longest distance required to establish a relationship between two words in an RNN structure.

Therefore, the problem we face is how to compress a large number of tokens into a hidden state that effectively captures their underlying structure and relationships. To solve this problem fundamentally, we have the following options:

Increase the information capacity, either by extending the length of the hidden state or by adding a new influence factor (which reflects the influence of each position in the input sequence on the current output of the decoder).
Shorten the distance between two words in the sequence, or establish direct connections between words. For example, in an RNN the hidden states would need to be independent of temporal order, thus breaking the sequential structure.
Treat each word in the sequence equally, avoiding the tendency of RNNs to focus on later content and ignore earlier inputs.
Highlight the key points in a sequence, differentiating the amount of information carried by different elements and giving them varying degrees of attention.
Look in both directions. Most real-time causal data only knows the past and is used to influence future decisions, but for certain tasks (such as translation) we want to look both forward and backward.

The attention mechanism we will introduce next can solve the above problems to some extent.

0x03 Attention Mechanism

The attention mechanism was proposed by Bengio’s team in their 2015 paper “Neural Machine Translation by Jointly Learning to Align and Translate.” Its main idea is to reduce algorithmic complexity and improve performance by mimicking how humans observe things. Humans selectively focus on and process relevant information during perception, cognition, and decision-making, thereby improving cognitive efficiency and accuracy. For example, humans can choose to focus on certain information while ignoring or suppressing the rest based on their interests and needs, and can allocate attention across multiple tasks to process several pieces of information at once. A familiar example is looking at a picture: we first scan it quickly and then pay special attention to the key areas. Similarly, at each stage of text generation, not all fragments of the input context are equally important. For instance, in machine translation, the word “boy” in the sentence “A boy is eating the banana” can be translated accurately without understanding the context of the entire sentence.

3.1 Principle

Attention mechanisms can be understood from different perspectives, and people have made many insightful observations, such as:

The essence of attention mechanisms is that context determines everything.
Attention mechanisms are a type of resource allocation scheme.
Attention mechanisms are information exchange, or “global information retrieval”.

These perspectives are interconnected yet each has its own unique characteristics, and we will analyze them one by one.

Context determines everything

The essence of attention mechanisms can be summarized in one sentence: context is everything. The meaning of a word in a text usually depends on its context. For example, both sentences below contain the word “transformer,” but the first “transformer” should be translated as the electrical device (a power transformer), while the second is best left untranslated.

Several distributor transformers had fallen from the poles, and secondary wires were down.
Transformer models have emerged as the most widely used architecture in applications such as natural language processing and image classification.

How can we semantically differentiate the polysemous word “transformer”? We must consider the context of the word to better identify its meaning; that is, we must consider not only the word itself but also the influence of other words, i.e., the influence of context. For example, the neighboring words “pole,” “fallen,” and “wires” in the first sentence suggest that “transformer” here is related to the real physical environment. In the second sentence, “model” and “natural language processing and image classification” directly tell us that “transformer” here is a concept related to deep learning. Ultimately, we can infer the precise meaning of “transformer” through the context, and thus translate these two English sentences as follows:

Several transformers fell off the utility poles, and the secondary lines also drooped down.
The Transformer model has become the most widely used architecture in applications such as natural language processing and image classification.

This is the function of the attention mechanism: to associate each word with other words in the sequence and to infer the semantics of the word we are interested in from the other words in the sentence.

Resource allocation

Attention mechanisms are also a form of resource allocation. We now understand the importance of context, but this is not enough, because the context of a word includes many other words, and different words often have varying influences on the target word. Taking translation as an example, because the input sentence is a coherent whole, each input word $X_i$ has an impact on each output word $Y_j$. Therefore, when considering the context of a word, it is also necessary to consider how much attention each element in the context should receive. For example, in the second English sentence, the influence of “model” on “Transformer” is undoubtedly the greatest. We therefore need a mechanism that focuses on different information in different contexts. This way, first, important elements in the sequence receive higher attention, so important information is not drowned out; second, limited computing resources are spent on the more important information, so that the model grasps the essentials.

The attention mechanism is this kind of resource allocation mechanism. During the learning process, it adaptively assigns different attention weights to different words in the input, thereby distinguishing the influence of different parts of the input on the output. Adaptive learning should focus on the key positions and make accurate judgments; that is, attention gives the model the ability to distinguish.

In fact, a passage in the paper “Recurrent Models of Visual Attention” illustrates this point about resource allocation well. In paraphrase: a key characteristic of human perception is that we don’t process the entire scene at once. Instead, we selectively focus our attention on certain parts of the visual space to acquire information when and where needed, and combine information from different gaze points over time to build an internal representation of the scene, guiding future eye movements and decisions. Concentrating computational resources on parts of the scene saves “bandwidth” because fewer “pixels” need to be processed. It also significantly reduces task complexity, because the object of interest can be placed at the center of the gaze, while irrelevant features of the visual environment outside the gaze area (“clutter”) are naturally ignored.

Information exchange

Once the principles of resource allocation are established, information exchange can begin. The attention mechanism’s computation process is essentially the exchange of information between elements in a sequence. The input to the attention mechanism is a sequence or set, from which the mechanism selectively extracts information and calculates a set of weights. These weights represent the importance of each piece of information, and multiplying these weights by the original information yields the weighted information after attention processing.

Information exchange, to some extent, functions as the memory module in an RNN, enabling the attention encoder to understand and parse complex statements or scenarios, much like an RNN.

Applying attention mechanisms between the source and target sequences in sequence transformation allows the two sequences to exchange information, achieving information alignment.
Applying attention mechanisms within a sequence (often referred to as self-attention) allows each word in the sequence to be associated with other words. This gives each element in the sequence the opportunity to selectively absorb information from every other element based on its own features and the relevance between words, dynamically adjusting itself. This enables the model to capture long-distance dependencies, unaffected by distance. Consider the following two sentences: in the first sentence, “it” refers to a cat, so “it” absorbs more information about “cat.” In the second sentence, “it” refers to milk, so “it” absorbs more information about “milk.”
- The cat drank the milk because it was hungry.
- The cat drank the milk because it was sweet.
The purpose of self-attention is to create an abstract and rich representation for the current word. This representation is the result of the word being influenced by other words in the same sequence. After processing by self-attention, each new word now incorporates partial information from other words; this is a data-dependent weighted average, resulting in a richer representation. If we must compare self-attention with previous attention mechanisms:
- In a self-attention mechanism, the query is equivalent to the hidden state of the decoder in an attention mechanism.
- In a self-attention mechanism, the key and value are equivalent to the encoder’s hidden state in an attention mechanism.
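
To make the “data-dependent weighted average” concrete, here is a minimal sketch of self-attention in plain Python. It is only an illustration under simplifying assumptions: there are no learned projections (each token vector serves directly as query, key, and value), the score is a plain dot product, and the 2-d embeddings are hypothetical numbers hand-picked so that “it” points toward “cat”.

```python
import math

def self_attention(X):
    """Each row of X plays the role of query, key, and value (no learned projections)."""
    outputs, all_weights = [], []
    for q in X:
        # Score the current token against every token in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) for k in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # New representation: weighted average over every position, near or far.
        out = [sum(w * v[d] for w, v in zip(weights, X)) for d in range(len(q))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-d embeddings, hand-picked so that "it" points toward "cat".
tokens = ["cat", "milk", "it"]
X = [[2.0, 0.0], [0.0, 2.0], [1.5, 0.2]]
outputs, weights = self_attention(X)
print(dict(zip(tokens, weights[2])))  # how much "it" attends to each token
```

With these toy vectors, “it” places most of its weight on “cat”. In a real model the embeddings and projections are learned, so the same mechanism can shift the weight toward “milk” when the context calls for it.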

3.2 General Structure

Based on the above analysis, we can see that the core idea of attention is to help the model assign different weights to different parts of the input. This allows for the extraction of key information, making the model’s judgments more accurate and saving computational and storage resources. How, then, is the attention mechanism implemented? This requires answering two questions:

Where is the attention calculation performed?
How is the attention calculation performed?

Let’s continue learning.

Task Model

The paper “A General Survey on Attention Mechanisms in Deep Learning” summarizes the general structure of attention models in the following diagram. The authors call this general architecture a task model. The task model consists of four parts:

Feature model. We assume that matrix X is the input to the task model, and the columns of the matrix may be words from a sentence. The task model uses a feature model to transform X into a set of feature vectors F. The feature model may be an RNN, a CNN, an embedding layer, or another model.
Query model. q is the query vector generated by the query model, used to determine which feature vectors in F the task model needs to focus on and what information to extract. In other words, q can be interpreted as a general question: which feature vector contains the most important information for q?
Attention model. The feature vectors F and the query vector q are the inputs to the attention model, and the output of the attention model is the context vector c. We will discuss the attention model in detail later.
Output model. The output model uses the context vector $c$ to combine the various parts into the final high-level feature vector $\hat{y}$ . For example, the output model could be a softmax layer, which outputs a prediction.

Let’s use the sentence “Specialties of the North” as an example to explain the diagram below. In the diagram, X represents the complete sentence “Specialties of the North,” F is the list of feature vectors extracted from “Specialties of the North,” Z represents “Specialty,” and q is the feature vector extracted from “Specialty.” Our goal is to determine which feature vector in the feature vector list contains the most important information about “Specialty.”

Attention Model

The internal process of the attention model is shown in the figure below. The goal of this model is to generate a weighted average of the vectors in V. The specific calculation process is as follows.

Label 1 is the input (two inputs). The feature vector F generated from the input will further generate the key matrix K and the value matrix V.
Label 2 uses matrix $K$ and query vector $q$ as input, and calculates the attention score vector $e$ using a similarity calculation function. $q$ represents the request for information, and $e_l$ represents the importance of the $l$ -th column to $q$ .
Label 3 further processes the attention score through an alignment layer (such as a softmax function) to obtain the attention weight ‘a’.
Label 4 uses the attention weights a and matrix V to calculate and obtain the context vector c.
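
The four labeled steps can be sketched in a few lines of plain Python. This is a minimal illustration rather than the survey’s implementation: the projection matrices `W_K` and `W_V`, the dot-product score function, and the toy numbers are all assumptions made for the example.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Alignment layer: turn scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_model(F, q, W_K, W_V):
    # Label 1: derive keys and values from the feature vectors F.
    K = [matvec(W_K, f) for f in F]
    V = [matvec(W_V, f) for f in F]
    # Label 2: attention scores e_l (dot product is an assumed score function).
    e = [sum(ki * qi for ki, qi in zip(k, q)) for k in K]
    # Label 3: attention weights a.
    a = softmax(e)
    # Label 4: context vector c, a weighted average of the value vectors.
    dim = len(V[0])
    c = [sum(a_l * v[d] for a_l, v in zip(a, V)) for d in range(dim)]
    return a, c

# Hypothetical toy inputs: three 2-d feature vectors and one query.
F = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q = [1.0, 0.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
a, c = attention_model(F, q, identity, identity)
print(a, c)
```

Here the first and third feature vectors match the query equally well, so they receive equal weights, and the context vector leans toward them.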

QKV

Because the terms Q, K, and V are mentioned here, we’ll start with text translation as an example for a preliminary introduction. Subsequent chapters will analyze Q, K, and V in more depth. As the analysis above shows, each word in a sequence needs to understand the information of other words in the sequence to determine their relationships. Therefore, each word asks other words: “Are we closely related?” The other words reply with their relationship status. After obtaining the relationships, each word incorporates the information from other words for information fusion. This operation is essentially a search and merging operation, and we need to find a suitable mechanism to implement this operation. In the attention model shown above, there are two inputs: q (the sequence being processed/target sequence) and F (the sequence being focused on/source sequence). F is further converted into K and V. Using these three variables together can meet our needs.

Q (Query Matrix): Each element of the target sequence summarizes its own features into a vector called “query,” which can be understood as a word sending a query to other words. The queries of all elements in the target sequence constitute the query matrix Q.
K (Key Matrix): Each element of the source sequence summarizes its features into a vector called “key.” This can be understood as the features of a particular word, or as how a word answers queries from other words based on its own features. The keys of all elements in the source sequence constitute the key matrix K.
V (Value Matrix): The actual value (the information ultimately provided) of each word in the source sequence is a vector called “value.” The values of all elements in the source sequence constitute the value matrix V.

We will then use query, key, and value to represent the relevant vectors, and Q, K, and V to represent the matrix formed by the relevant vectors.

A dictionary analogy might aid understanding. The query is the content you’re looking for, the key is the dictionary index (what information the dictionary contains), and the value is the corresponding information. A typical dictionary lookup is an exact match, returning the value whose key matches. The attention mechanism, however, combines vectorization, fuzzy matching, and information merging. It not only finds the best match but also performs a weighted sum based on the degree of matching. Each element in the source sequence is transformed into a <key, value> pair, and together these pairs form the dictionary of the source sequence. Each element in the target sequence provides a query, the content to be retrieved. During the search, each element in the target sequence uses its query to calculate an alignment coefficient with the key of each element in the source sequence. This alignment coefficient represents the similarity or relevance between the elements. The more similar the query and key are, the greater the influence of the value on the query, and the more information the query needs to absorb from the value. The query then decides, based on how closely the two words are related, how much information to extract from the value and incorporate into itself.
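
The contrast between an exact dictionary lookup and the “fuzzy” lookup performed by attention can be sketched as follows. The keys, values, and query below are hypothetical toy numbers, and the dot product followed by softmax is just one assumed way of measuring the degree of matching.

```python
import math

# A hypothetical source "dictionary": each key is a 2-d vector, each value a number.
keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
values = [10.0, 20.0, 30.0]

def hard_lookup(query):
    """Ordinary dictionary behavior: exact match or nothing."""
    for k, v in zip(keys, values):
        if k == query:
            return v
    return None

def soft_lookup(query):
    """Attention behavior: every key matches to some degree."""
    scores = [sum(ki * qi for ki, qi in zip(k, query)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Merge all values, weighted by how well each key matches the query.
    return sum(w * v for w, v in zip(weights, values)), weights

print(hard_lookup([0.9, 0.1]))   # no exact key, so None
print(soft_lookup([0.9, 0.1]))   # a blend of all values
```

The hard lookup fails as soon as the query differs slightly from every key, while the soft lookup still returns a useful blend dominated by the closest keys.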

Through the interaction of the query, key, and value vectors, the model measures the attention each word receives from other words. Ultimately, each element of the source sequence places its actual data, obtained by integrating information from other words, into a vector.

We assume the source and target sequences are the same sequence. The image below shows the similarity between “one” and other words in the sequence. The dashed lines represent the relevance between the key and the query, and the distribution of line thickness is called the “attention distribution.” In other words, the line thickness represents the weight; the thicker the line, the more relevant the key is to the query, the more important it is for understanding the query, and the greater the weight of the value.

3.3 Calculation Process

Let’s introduce the attention mechanism into the seq2seq domain to take a closer look at its computational process.

Ideas

Let’s first look at the overall approach. A self-attention layer is a computational unit equivalent to a recurrent layer and a convolutional layer. Their purpose is to map one vector sequence to another; for example, an encoder maps x to an intermediate representation z. Recall the translation scenario. With an RNN approach, the encoder generates a latent vector, which is then passed to the decoder for decoding. We’ve already analyzed the drawbacks of this approach, such as the latent vector being fixed. To overcome this drawback, we should generate a latent vector at each time step $t$ , denote it as $h_t$ , and save these $h_t$ values. When a new output is generated, we let the model review all the previously saved hidden states, and use the key information found in the hidden states. This avoids the drawback of fixed-length hidden vectors in RNNs. But how do we determine whether a certain hidden state is important to the currently generated word? This requires the model to learn through some mechanism to understand how much attention to give to this hidden state. In short, the task of the attention mechanism should be to find the relationship between the current hidden vector of the decoder and all the hidden vectors of the encoder. Following the above idea, the computation of the attention mechanism can be divided into two steps:

The attention distribution is computed over all input information. The encoder passes not just the last hidden state, but all hidden states to the decoder.
The input information is calculated using a weighted average based on the attention distribution. It’s important to note that this is a data-dependent weighted average, a flexible and efficient global pooling operation.

Specifically, we can break these two steps down into six detailed steps, as follows:

Generate semantic vectors. The source sequence is fed into the encoder in order, and the encoder encodes it step by step, turning the information of the source sequence <X1,X2,X3,X4> into step_len semantic vectors <C1,C2,C3,C4> for use by the subsequent decoder. That is, for each word in the input sequence, the encoder outputs a hidden state vector representing the word and its context information.
Calculate the alignment coefficient ‘a’. For each word Yi output by the decoder, we need to consider the relevance between all words in the source sequence and the current word in the target sequence. Therefore, before the decoder outputs each prediction, we calculate attention scores for all semantic vectors <C1,C2,C3,C4> output by the encoder. We can consider the source sequence as the key and the target sequence as the query, which corresponds to the “internal flow of the attention model” diagram above.
Calculate the probability distribution. Collect the alignment coefficients and normalize them with softmax to obtain the attention weights w. Applying softmax to these scores amplifies high-scoring hidden states and suppresses low-scoring ones. We will refer to the alignment coefficients before softmax normalization as attention scores, and to the result after softmax normalization as attention weights.
Calculate the current context vector Context. Using the attention weights w, compute a weighted sum of all the encoder vectors <C1,C2,C3,C4> to obtain the decoder’s current context semantic vector Context. The attention weights represent the importance of each input word to the current output word. The context vector represents the source language information required for the current output word.
Update the decoder state to the hidden state Hi.
Calculate the predicted output word. Taking the decoder’s previous output, current state, and current context semantic vector (Context) as input, the function f() is called to calculate the decoder’s current output. This output is a probability distribution representing the probability of each possible target language word being the current output word. Then, a mapping is made between these probability values and the target vocabulary (or, if the attention mechanism is used in a classification model, a mapping is made to each category), thus obtaining the next output word. The above steps correspond to numbers 1 to 6 in the diagram below.
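
Steps 2 to 4 above can be sketched as a single function that is called once per decoder step. This is only an illustration: the dot product stands in for the unspecified alignment function, the encoder vectors and decoder states are hypothetical toy values, and steps 5 and 6 (updating the decoder state and predicting the word) are left out because they depend on the concrete decoder.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_step(decoder_state, encoder_states):
    # Step 2: one alignment score per encoder vector (dot product is an assumption).
    scores = [dot(decoder_state, c) for c in encoder_states]
    # Step 3: normalize the scores into attention weights w.
    w = softmax(scores)
    # Step 4: context = weighted sum of all encoder vectors.
    dim = len(encoder_states[0])
    context = [sum(w_i * c[d] for w_i, c in zip(w, encoder_states)) for d in range(dim)]
    return w, context

# Step 1 (assumed already done): hypothetical encoder outputs C1..C4.
C = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical decoder states for two output positions.
for H in ([2.0, 0.0], [0.0, 2.0]):
    w, context = attention_step(H, C)
    print([round(x, 3) for x in w], [round(x, 3) for x in context])
```

The two decoder states produce different weight distributions over the same four encoder vectors, which is exactly what replaces the single fixed vector of the plain RNN approach.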

Taking “I ate an apple” as an example, the context of each output is as follows:

Input “I”, and you get C1 = (I * 1).
Input “ate”, and you get C2=(I * 0.5, ate * 0.5).
Input “one”, and you get C3=(I * 0.45, ate * 0.45, one * 0.1).
Input “apple” and get C4=(I * 0.3, ate * 0.3, one * 0.1, apple * 0.3).

The attention weights obtained for the third output word are

$(w_{3,1}, w_{3,2}, w_{3,3}, w_{3,4}) = \operatorname{softmax}(g(Y_2, C_1, C_2, C_3, C_4))$

Next, we will analyze a few key concepts.

Attention Score

When a model needs to decide how much “attention” to give to a word in a sequence, it calculates the attention score between that word and other words. The attention score is a metric that measures the importance of different words in the sequence to the current word, or the likelihood that the target word is aligned with a word in the input. A higher likelihood should be given greater weight. A large weight means that when generating the output, the currently predicted word should pay more attention to information about its corresponding word in the source text.

Attention scores are obtained through a similarity calculation function. This function typically takes a key vector and a query vector as input and outputs the relevance between them, i.e., the attention score. The following figure provides an overview of these functions, where $q$ is a query vector, $k_l$ is the $l$ -th key vector, and $K_l$ denotes the corresponding column of matrix $K$ . In seq2seq, $q$ can be considered the hidden vector output by the decoder, and $k$ a hidden vector inside the encoder. In practice, the similarity function is evaluated in matrix form over all keys at once, rather than one column at a time.
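
As a rough illustration, here are three commonly used score functions written for a single query and key: the dot product, the scaled dot product, and the additive form. This sketch does not reproduce the full list from the figure, and the parameters `W1`, `W2`, and `w` are hypothetical untrained values chosen only so the code runs.

```python
import math

def dot_score(q, k):
    """Multiplicative (dot-product) score."""
    return sum(qi * ki for qi, ki in zip(q, k))

def scaled_dot_score(q, k):
    """Dot product divided by sqrt(d), the form used in the Transformer."""
    return dot_score(q, k) / math.sqrt(len(k))

def additive_score(q, k, W1, W2, w):
    """Additive score: w^T tanh(W1 q + W2 k), with learned W1, W2, w."""
    hidden = [
        math.tanh(dot_score(r1, q) + dot_score(r2, k))
        for r1, r2 in zip(W1, W2)
    ]
    return dot_score(w, hidden)

# Toy vectors and hypothetical (untrained) parameters.
q, k = [1.0, 2.0], [3.0, 4.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[0.5, 0.0], [0.0, 0.5]]
w = [1.0, 1.0]
print(dot_score(q, k))         # 11.0
print(scaled_dot_score(q, k))
print(additive_score(q, k, W1, W2, w))
```

All three take a query and a key and return a single number; they differ only in how much learned structure sits between the two vectors.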

Attention weight

After obtaining the attention score, the model uses a softmax operation to normalize the attention score to obtain the attention weight. This ensures that the sum of all weights is 1, keeping the total amount of features contributed by all source elements constant, and also highlights the important weights.
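
A small sketch shows both properties mentioned above: the weights sum to 1, and softmax emphasizes the highest score more strongly than simple proportional normalization would. The three scores are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 2.0, 4.0]                  # hypothetical attention scores
weights = softmax(scores)
proportional = [s / sum(scores) for s in scores]

print([round(w, 3) for w in weights])
print([round(p, 3) for p in proportional])
```

With these numbers, softmax gives the top score a weight of roughly 0.84, compared with about 0.57 under proportional normalization, while the lowest score is pushed further down.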

Weighted summation

After obtaining the attention weights, each query can extract relevant information from the values associated with the keys. At this point, the output function is needed to combine these parts into a final high-level feature vector for output. In this stage, the attention mechanism processes the data by weighted summation. This means that the new representation of each word contains not only its own information but also information from other words, helping the model capture dependencies in the input sequence.

$w_i = \frac{f_{\text{score}}(K_i, Q)}{\sum_j f_{\text{score}}(K_j, Q)}, \quad \text{Out} = \sum_i f_{\text{out}}(w_i \cdot V_i)$

Summary

We summarize the three key steps in the calculation process as follows:

Calculate the score (score function). A similarity score is computed between the query and every key to obtain the attention scores. The formula is as follows: $s_i = f_{\text{score}}(q, k_i)$ .
Normalization (alignment function). The softmax operation normalizes the scores into weights. The formula is as follows: $a_i = \operatorname{softmax}(s)_i = \frac{\exp(s_i)}{\sum_j \exp(s_j)}$ .
Generate the result (context vector function). The weights $a$ are used to compute a weighted average of the values. This can be understood as the output being an interpolation of the values based on key-query similarity. The formula is as follows: $\text{Attention Value} = \sum_i a_i v_i$ .

Therefore, the idea of attention can be written as the following formula: compute weights from similarity, then sum the values according to those weights.

$\text{Attention}(\text{Target}, \text{Source}) = \text{Attention}(\text{Query}, \text{Source}) = \sum_{i=1}^{\text{Length}_{\text{Source}}} \text{Similarity}(\text{Query}, \text{Key}_i) \cdot \text{Value}_i$

The specific details are shown in the image below.

3.4 Problem Solving

Let’s recall the problems encountered by the RNN and CNN solutions mentioned earlier: alignment issues, information loss, and long dependencies. To solve these problems fundamentally, we proposed several improvement strategies:

Increase the information content by extending the length of the hidden state or adding new influence factors.
Bring two words in a sequence closer together, or establish direct connections between them.
Treat each word in the sequence equally regardless of its order, and avoid ignoring earlier inputs.
Highlight the key points in the sequence and distinguish the amount of information carried by different elements.
Predict forward and look backward simultaneously.

Next, we will take the self-attention mechanism as an example to see how it can solve the problems mentioned earlier by leveraging its advantages, and also conduct a comparative analysis with the two solutions.

Increase information content

Compared to RNN and CNN approaches, self-attention mechanisms can increase information content, thus effectively solving the information loss problem in RNNs. To some extent, all sequence modeling involves the following operation:

Store the historical context in a hidden state.
The hidden state is transformed according to certain update rules.

The diagram below illustrates the characteristics and typical examples of sequence modeling. These typical examples all include three components: initial state, update rules, and output rules.

In RNN schemes, the decoder compresses all past contextual information into a fixed-size low-dimensional vector (hidden state). This hidden state is used at different stages of the decoder. The advantage of this scheme is its linear (rather than quadratic) complexity in long contexts. However, in long contexts, RNNs are limited by the fixed-size hidden state (limited expressive power) and find it difficult to utilize additional conditional information.

Test-Time Training (TTT) compresses the context into the model’s weights, which has the advantage of maintaining a fixed size over time while greatly enhancing expressive power.

The self-attention mechanism uses a list (which we will learn from later articles is actually a key-value cache) as its hidden state. All contexts are stored in the list and are not compressed. All the contexts in the list together constitute a unified hidden state (the hidden state of each stage is an item in the list), allowing the encoder to pass more data to the decoder.

Therefore, we can see the advantage of the self-attention mechanism (Transformer): instead of just passing the encoder’s final hidden state like an RNN, it passes all hidden states (corresponding to all processed tokens) to the decoder. This allows new tokens to interact with all past contexts. Of course, as the context length increases, the cost of using a list also increases dramatically; its processing time increases sharply with the context length, and the list’s memory usage also increases dramatically.
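
The difference between a fixed-size hidden state and a list-style hidden state can be illustrated with two toy classes. The class names and the RNN-style update rule are hypothetical and chosen only to show how memory usage behaves as the context grows.

```python
class FixedState:
    """RNN-style memory: every token is folded into one fixed-size vector."""
    def __init__(self, size):
        self.state = [0.0] * size

    def update(self, token_vec):
        # Toy update rule (an assumption): blend the old state with the new input.
        self.state = [0.5 * s + 0.5 * x for s, x in zip(self.state, token_vec)]

class ListState:
    """Attention-style memory: keep every token's vector, uncompressed."""
    def __init__(self):
        self.items = []

    def update(self, token_vec):
        self.items.append(token_vec)

rnn, attn = FixedState(size=2), ListState()
for t in range(1000):
    vec = [float(t), 1.0]
    rnn.update(vec)
    attn.update(vec)

print(len(rnn.state))   # stays at 2 no matter how long the context is
print(len(attn.items))  # grows with the context: 1000 entries
```

The fixed state must overwrite old information to make room for new input, whereas the list keeps the very first token intact at the price of memory that grows with the context.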

Reduce word spacing

Let’s use the previous diagrams for further analysis.

In self-attention mechanisms, when a word acquires information from other words, the positional distance between that word and the other words remains constant. Thus, the closeness between two words depends only on their actual relevance, not on their distance. In other words, it is insensitive to absolute position; any two words can be directly modeled. This characteristic can solve long-distance dependency problems. Let’s analyze this in detail.

First, while CNNs can expand their field of view and fuse information by increasing the number of convolutional layers, information is easily lost during propagation in excessively deep networks, leading to a decline in model performance. Self-attention mechanisms abandon the locality assumption of CNNs and widen the receptive field to the entire sequence. RNNs, for their part, suffer from vanishing and exploding gradients and therefore struggle with long-distance dependencies. When processing any word, self-attention can attend to all words in the entire sentence, which reduces the distance between any two positions in the sequence to a constant. It can capture dependencies between them in a constant number of steps, establishing connections directly and eliminating the notion of distance. Because of this characteristic, self-attention suffers minimal loss during information propagation. The following figure compares the distances required for the three methods to construct relationships between words.
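
The distance comparison can also be expressed as a small calculation. The sketch below counts how many sequential steps or stacked layers are needed before two words that are `distance` positions apart can interact. The CNN case assumes stride-1, undilated convolutions, where each layer widens the receptive field by `kernel_size - 1` positions; other CNN designs give different numbers.

```python
import math

def rnn_steps(distance):
    """An RNN passes information one position per step."""
    return distance

def cnn_layers(distance, kernel_size):
    """Stacked stride-1, undilated convolutions widen the receptive field
    by (kernel_size - 1) positions per layer."""
    return math.ceil(distance / (kernel_size - 1))

def self_attention_steps(distance):
    """Any two positions interact directly within one layer."""
    return 1 if distance > 0 else 0

for d in (4, 64, 1024):
    print(d, rnn_steps(d), cnn_layers(d, kernel_size=3), self_attention_steps(d))
```

As the distance grows, the RNN and CNN counts grow with it, while the self-attention count stays at 1.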

Furthermore, the constant distance characteristic means that the self-attention mechanism does not have an “order assumption.” Unlike RNNs, which tend to focus on later content and ignore earlier input, the self-attention mechanism treats the order of each word in the sequence equally when processing each position. This effectively considers other positions in the input sequence and better integrates the “understanding” of other words into the currently processed word, resulting in high information fusion efficiency.

Finally, the constant distance characteristic frees the self-attention mechanism from the sequential constraint of recurrence, allowing it to be parallelized within each layer, just like a CNN.

Selective processing of information

Attention mechanisms can selectively focus on and process relevant information, which both helps the model grasp the essentials and effectively solves alignment problems.

Weighted summation

Weighted summation can be viewed from two aspects: weighting and summation. The former treats data differently, while the latter performs data fusion. Combined, it involves processing data in a concise and focused manner (giving different levels of attention to elements with varying amounts of information). We will then compare and analyze weighted summation with CNNs and fully connected layers from different perspectives.

First, the attention mechanism generates its weights dynamically. The weights in CNNs or fully connected layers are static: they are learned during training and then stay fixed at inference time, with no relationship established between the weights and the particular input. The attention mechanism, on the other hand, uses dynamic weights, calculating attention weights from the similarity between the query and key derived from the input. It decides which inputs the output should focus on based on the input itself, so the weights change with the input. This is an adaptive operation.
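
The difference between static and dynamic weights can be seen in a toy example. Everything here is a simplifying assumption: the inputs are scalars, the “trained” static weights are made-up numbers, and the attention score is a plain product of query and key.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Static weights: fixed once training ends, identical for every input.
STATIC_WEIGHTS = [0.2, 0.3, 0.5]  # hypothetical trained values

def static_mix(inputs):
    return sum(w * x for w, x in zip(STATIC_WEIGHTS, inputs))

def attention_weights(query, keys):
    """Dynamic weights: recomputed from the input itself (scalar toy version)."""
    return softmax([query * k for k in keys])

seq_a = [1.0, 2.0, 3.0]
seq_b = [3.0, 2.0, 1.0]
print(static_mix(seq_a), static_mix(seq_b))  # same weights applied to both inputs
print(attention_weights(1.0, seq_a))  # largest weight on the last element
print(attention_weights(1.0, seq_b))  # largest weight on the first element
```

The static layer always emphasizes the third position, whatever it contains, while the attention weights follow the content and move to wherever the best match is.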

Secondly, from another perspective, attention mechanisms consider the context and its relationships from the perspective of a particular input object, using these relationships as weights to determine how to extract information from other objects. Therefore, attention mechanisms are a form of relative relationship modeling. Furthermore, such operations often employ global attention to model a broader range of dependencies, making it a global operation.

In summary, weighted summation, with its dynamic, relative, and global approach, better facilitates the processing of sequence information.

Alignment mechanism

Alignment mechanisms involve applying attention to the source and target sequences in sequence transformation. In machine translation, every word in the source sequence can influence the output, not only the input before the word to be translated but also the input after it. Therefore, it is necessary to consider both the source and target sequences comprehensively. For example, in the Chinese-to-English translation example below, when translating “一个” (a/one), we cannot determine whether the English should be translated as “a” or “an” until we see “苹果” (apple). Therefore, we need to align the “a or an” operation with “苹果” during translation.

Chinese (translated literally): I ate one apple, and then I ate one banana.

English: I ate an apple and then a banana.

Attention allows relevance to be calculated between different parts of the input and output sequences. This determines which parts of the input the model focuses on when it outputs a word. The input and output sequences are thereby aligned, enabling the model to learn relationships between sequences more accurately and improving its generalization ability and performance. In the diagram below, the dashed lines represent the relevance between the key and the query; thicker lines indicate greater relevance. We can see that the source words “一个” and “苹果” are more important (more relevant or similar) to the encoding of “an,” and should therefore carry more weight in predicting “an.” Thus, when the model generates “an,” it needs to extract semantics not only from “一个” but also from “苹果” to determine whether the output should be “an” or “a.” Ultimately, “一个” and “苹果” together determine that the corresponding English word should be “an.”

Simultaneously looking forward and looking back

The encoder can read the input sequence from left to right and from right to left simultaneously, and concatenate the hidden states at each time step as the output. The advantage of this is that it allows the encoder to consider the contextual information surrounding each word in the input sequence, thus generating a richer and more complete representation.
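
Reading in both directions and concatenating the hidden states can be sketched with a toy recurrence. The scalar update rule `h_t = 0.5 * h_{t-1} + x_t` is a hypothetical stand-in for a real RNN cell, used only to make the data flow visible.

```python
def run_rnn(inputs):
    """Toy scalar recurrence (an assumption): h_t = 0.5 * h_{t-1} + x_t."""
    h, states = 0.0, []
    for x in inputs:
        h = 0.5 * h + x
        states.append(h)
    return states

def bidirectional_encode(inputs):
    forward = run_rnn(inputs)                 # reads left to right
    backward = run_rnn(inputs[::-1])[::-1]    # reads right to left, then re-aligned
    # Concatenate both directions at each time step.
    return [(f, b) for f, b in zip(forward, backward)]

states = bidirectional_encode([1.0, 2.0, 3.0])
print(states)
```

Each position now carries one number summarizing everything to its left and one summarizing everything to its right, so the first word already “knows” about the last one.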

3.5 Advantages and Disadvantages

In summary, the attention mechanism ensures that each decoding step is provided with information from the most relevant context fragments, offering a robust solution to the long-standing problem of long-distance dependencies and thus redefining the landscape of sequence modeling. Moreover, the introduction of the attention mechanism is not only a technological breakthrough but also opens up new horizons for the field of machine learning, reflecting the profound influence of human cognition. As Andrej Karpathy commented, “Attention is a major unlocking, a leap forward in neural network architecture design.”

Attention mechanisms are not perfect and have several drawbacks, the most significant being slow computation speed and high storage consumption.

High computational requirements: Attention mechanisms need to calculate the relationship between each element in the sequence and every other element, so the computational cost grows quadratically with the length of the input sequence. This is especially problematic for long sequences and limits the maximum sequence length N of large language models. This is why, in the early stages of development, large models often only supported 2K or 4K token inputs. In contrast, an RNN only needs to consider the previous hidden state and the current input at each step, so its cost grows linearly with sequence length.
High memory consumption: As with computational cost, attention mechanisms must store a large number of intermediate results when processing long sequences, which places high demands on memory.
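To make the quadratic growth concrete, here is a minimal sketch that counts pairwise attention scores against RNN state updates for a few sequence lengths. The function names are hypothetical, and the 4 bytes per score (float32) is an assumption used only to give a rough memory figure for a single score matrix.

```python
def attention_score_count(seq_len: int) -> int:
    """Pairwise scores computed by full self-attention (one head, one layer)."""
    return seq_len * seq_len


def rnn_update_count(seq_len: int) -> int:
    """An RNN performs one state update per token."""
    return seq_len


def score_matrix_bytes(seq_len: int, bytes_per_value: int = 4) -> int:
    """Memory for one N x N score matrix; 4 bytes per value is an assumption (float32)."""
    return seq_len * seq_len * bytes_per_value


for n in (2048, 4096, 8192):
    mib = score_matrix_bytes(n) / 2**20
    print(f"N={n}: {attention_score_count(n)} scores, "
          f"{rnn_update_count(n)} RNN updates, {mib:.0f} MiB per score matrix")
```

Doubling N doubles the number of RNN updates but quadruples the number of attention scores and the size of the score matrix.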

0x04 The History of Attention Development

Attention mechanisms are just an idea, applicable to many tasks. Let’s look at some classic examples of using attention mechanisms to improve the Encoder-Decoder model, and also review some key historical milestones. In Fan Wei’s words: we need to know how the Transformer came about. Of course, we also hope to see how the Transformer disappears, because if a new solution replaces the Transformer, it signifies a new historical breakthrough in the field of AI.

Let’s start by showing a diagram illustrating the history of attention development. As you can see, the Transformer is a culmination of many advancements built upon the shoulders of giants.

This section references “Learning Big Models Through Pictures: The Past and Present of Transformers (Part 1)”.

4.1 RCTM

The paper “Recurrent Continuous Translation Models” is considered a seminal work in neural machine translation (NMT), and its characteristics are as follows:

It uses an encoder-decoder framework. The encoder is implemented using a CNN, and the decoder is implemented using an RNN.
The encoder converts the source text into a continuous context vector, and the decoder converts the context vector into the target language.
Each step in the decoding process uses the context vector generated by the encoder.
It is an end-to-end neural network model.

4.2 RNN Encoder-Decoder

The paper “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation” is a result of Bengio’s team, and its characteristics are as follows:

The concepts of encoder and decoder are clearly defined, and both are implemented using RNN.
Each step of the decoding process uses the latent vectors generated by the encoder.
Instead of making the encoder-decoder an end-to-end model, the output probability of the decoder was simply fed as a feature into the statistical machine translation (SMT) model.

4.3 Sequence to Sequence Learning with Neural Networks

The paper “Sequence to Sequence Learning with Neural Networks” also played an important role, and its characteristics are as follows:

It uses an end-to-end architecture.
Both the encoder and the decoder are implemented using RNNs.
The context generated by the encoder is used as the initial hidden state of the decoder.

4.4 Bahdanau Attention

The Bahdanau Attention technique, proposed in the paper “Neural Machine Translation by Learning to Jointly Align and Translate,” is one of the pioneers of the Attention mechanism. Its characteristics are as follows:

End-to-end model.
An attention mechanism was added to the encoder-decoder architecture, which led to the introduction of the concept of attention.
A single fixed context vector generated by the encoder is no longer fed to the decoder at every step; instead, a separate context vector is computed for each decoding step.
Because the decoder is still an RNN, the parallelism problem is not solved.

The authors used an attention mechanism in their paper to address the problems of how to model “distance-independent dependencies” and how to more effectively pass the information generated by the encoder to the decoder. In this way, the attention mechanism, together with RNNs, solves the forgetting and alignment problems.

Alignment issues. The authors believe that during translation, words or phrases with the same meaning in the input and output languages should be aligned, which earlier models did not do explicitly. Therefore, the authors used a network to calculate the association score between words in the two languages, i.e., the attention score.
The forgetting problem. The authors use an attention mechanism to break the limitation of traditional RNN schemes that rely on fixed-length context information.
- The authors argue that the fixed length of the context vector limits model performance and is the bottleneck, so the decoder needs access to more effective contextual information from the encoder. Furthermore, the decoder previously used the same context generated by the encoder at every time step, i.e., the hidden state of the encoder’s last time step. The authors consider this unreasonable because the semantics of a word typically depend more on the words near it in the original sentence than on the output of the last time step. For the decoder, the context vector at each time step should be different.
- To address these issues, the authors constructed an attention mechanism that uses all the hidden states of the encoder. At each time step, the decoder considers the entire encoded sequence and determines which parts of it are most important, assigning them higher weights. The context vector for that step is then the weighted sum of all encoder hidden states.
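The per-step context vector described above can be sketched in plain Python. This is a minimal illustration of additive scoring, assuming a score of the form v · tanh(W_s s + W_h h_j) over the previous decoder state s and each encoder state h_j. The parameter names (`W_s`, `W_h`, `v`) and the toy values are illustrative assumptions, not trained weights or the authors’ code.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(prev_decoder_state, encoder_states, W_s, W_h, v):
    """Return (weights, context) for one decoding step."""
    projected_state = matvec(W_s, prev_decoder_state)
    scores = []
    for h_j in encoder_states:
        projected_h = matvec(W_h, h_j)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        scores.append(sum(v_k * x for v_k, x in zip(v, hidden)))
    weights = softmax(scores)
    # Context vector: weighted sum over ALL encoder hidden states.
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context


# Toy values (illustrative only).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = additive_attention(
    [1.0, 0.0], encoder_states, identity, identity, [1.0, 1.0])
print(weights, context)
```

Because the weights are recomputed from the decoder state at every step, each step gets its own context vector rather than one fixed summary of the source sentence.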

Andrej Karpathy shared private emails from Dzmitry Bahdanau, from which we can see the development process of Dzmitry Bahdanau’s ideas in order to overcome the bottleneck between the encoder and decoder.

The initial design was inspired by the concept of “two cursors,” which uses dynamic programming to move two cursors between the source and target sequences, respectively. However, this approach is too complex and difficult to implement.
Dzmitry Bahdanau opted for a less desirable approach, attempting a “hard-coded diagonal attention,” which, while yielding acceptable results, was still clumsy.
The real breakthrough came in a flash of inspiration from Dzmitry Bahdanau: why not let the decoder autonomously learn to focus on relevant parts of the source sequence? This idea stemmed from English translation exercises. During translation, the eyes (attention) repeatedly move between the source and target sentences. Dzmitry Bahdanau designed this soft search as a softmax operation and combined it with a weighted average of the states of a bidirectional RNN.

It’s clear that groundbreaking ideas often come from innovators who seek to solve problems in practice, rather than from idealistic theorists. Dzmitry Bahdanau put it very well in his email:

My AI ambition is to launch more amazing application projects like the machine translation project. Excellent R&D contributes far more to the advancement of fundamental technologies than the complex theories we often consider “real” AI research.

4.5 Luong Attention

The paper “Effective Approaches to Attention-based Neural Machine Translation” explores the diversity of computational methods for attention mechanisms based on Bahdanau Attention.

First, based on the size of the region attended to, attention is divided into two mechanisms: global attention and local attention. In global attention, every decoding step attends to the hidden states of all source words produced by the encoder. In local attention, each step attends only to the hidden states of a subset (a window) of source words.
Secondly, based on the information used, alignment functions are divided into content-based alignment and position-based (location-based) alignment. The former considers both the encoder hidden state $\bar{h}_s$ and the decoder hidden state at the current step $h_t$. The latter considers only the decoder hidden state at the current step $h_t$.
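As a rough sketch of these variants, the functions below implement the dot, general, and concat content-based scores, a location-based weighting computed from the decoder state alone, and the window of source positions used by local attention. Here `h_t` is the decoder hidden state and `h_s` an encoder hidden state; the function names and the toy matrix `W_a` are illustrative assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def score_dot(h_t, h_s):
    """Content-based, dot: h_t . h_s"""
    return dot(h_t, h_s)


def score_general(h_t, h_s, W_a):
    """Content-based, general: h_t . (W_a h_s)"""
    return dot(h_t, matvec(W_a, h_s))


def score_concat(h_t, h_s, W_a, v_a):
    """Content-based, concat: v_a . tanh(W_a [h_t; h_s])"""
    return dot(v_a, [math.tanh(x) for x in matvec(W_a, h_t + h_s)])


def location_weights(h_t, W_a):
    """Location-based: softmax(W_a h_t), one weight per source position."""
    return softmax(matvec(W_a, h_t))


def local_window(p_t, D, source_len):
    """Source positions seen by local attention: [p_t - D, p_t + D], clipped."""
    return list(range(max(0, p_t - D), min(source_len, p_t + D + 1)))


h_t, h_s = [1.0, 2.0], [3.0, -1.0]
W_a = [[0.0, 1.0], [1.0, 0.0]]  # toy matrix that swaps the two components
print(score_dot(h_t, h_s), score_general(h_t, h_s, W_a))
print(local_window(5, 2, 8))
```

Global attention would apply one of the content-based scores to every source position, while local attention applies it only to the positions returned by `local_window`.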

The image below illustrates local attention.

The image below illustrates global attention.

The specific differences between Luong Attention and Bahdanau Attention are as follows:

There are differences in how semantic vectors are generated.
- Luong Attention uses the current decoder’s hidden state to compute the alignment vector, while Bahdanau Attention uses the previous hidden state.
- Luong uses multiple alignment functions, while Bahdanau uses an additive approach.
There are also differences when passing context vectors for prediction.
- The basic modules are different. When calculating the hidden state, Luong Attention uses LSTM, while Bahdanau Attention uses a bidirectional RNN unit.
- The decoder computes the hidden state differently. The Luong model uses a separate attentional hidden state $\tilde{h}_t$ to calculate the output $y_t$.
- The decoder’s input and output are different:
  - In Bahdanau Attention, $c_t$ and $h_{t-1}$ are concatenated and used to compute $h_t$, which then produces the final output $y_t$.
  - In Luong Attention, $c_t$ and $h_t$ are concatenated and passed through an additional network layer to compute $\tilde{h}_t$, which then produces the final output $y_t$.

4.6 ResNet

ResNet is a classic work by Kaiming He and colleagues. Its residual (skip) connections greatly ease the training of very deep networks, mitigating the vanishing/exploding gradient problem and breaking through the bottleneck on the number of layers in neural networks. A detailed explanation will follow in later articles, so it is not covered here.

4.7 Self Attention

Previous attention mechanisms focused on the attention between different sequences, i.e., cross-attention. The paper “Long Short-Term Memory-Networks for Machine Reading” proposes self-attention, or intra-attention, building upon cross-attention. The authors argue that when reading each word sequentially, the relevance of each word to the preceding words is different, and therefore use an attention mechanism to perform this analysis.

The motivation for this paper is as follows. Traditional LSTM treats sentences as sequences of words when processing input and recursively combines each word with previous memories until a semantic representation of the entire sentence is obtained. This approach faces two problems:

Memory compression during recursion. It is unclear which parts of the input survive as the sequence is recursively compressed, so the ability of an LSTM to remember long sequences is questionable. An LSTM assumes that the current state summarizes all tokens it has seen so far. A traditional LSTM exposes only two vectors, a memory cell and a hidden state, and at time t these two must represent everything before time t. Under a Markov assumption this state could in principle represent sequences of unbounded length, but this is not feasible in practice: when the sequence is long enough, the assumption no longer holds.
Lack of structure in the input. An LSTM integrates information one token at a time in sequence, but it has no explicit mechanism for reasoning over structure, making it difficult to model the relationships between tokens.

This paper addresses these two issues.

To address the memory compression problem, the paper proposes a memory/hidden tape approach, where each token corresponds to a hidden state vector and a memory vector.
For structured input problems, the paper introduces intra-attention, which uses the hidden tape to calculate the correlation between a word and its preceding words. Essentially, it uses the hidden state vector and memory vector of each word for attention calculation.
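The tape idea can be sketched as follows: every previous token keeps its own hidden vector and memory vector, and one set of attention weights summarizes both tapes. In this simplified sketch the relevance scores are passed in directly; how they are computed from the current input and the hidden tape is left out, and the function name is hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, vectors):
    """Weighted sum of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(w * vec[k] for w, vec in zip(weights, vectors))
            for k in range(dim)]


def intra_attention(scores, hidden_tape, memory_tape):
    """Summarize previous tokens with one shared set of attention weights.

    scores[i] is the relevance of the current token to previous token i
    (assumed to be given here).
    """
    weights = softmax(scores)
    hidden_summary = weighted_sum(weights, hidden_tape)
    memory_summary = weighted_sum(weights, memory_tape)
    return weights, hidden_summary, memory_summary


# Two previous tokens, each with its own hidden and memory vector (toy values).
hidden_tape = [[1.0, 0.0], [0.0, 1.0]]
memory_tape = [[2.0, 2.0], [4.0, 0.0]]
weights, h_sum, c_sum = intra_attention([0.0, 0.0], hidden_tape, memory_tape)
print(weights, h_sum, c_sum)
```

The two summaries then take the place of the single previous hidden state and memory cell that a standard LSTM update would use.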

The specific approach is shown in the diagram below.

This paper proposes two methods for modeling the alignment (semantic alignment) of two sequences.

Shallow fusion model: Replace LSTM in the traditional encoder-decoder with LSTMN and align it using inter-attention.
Deep fusion model: Combines inter-attention and intra-attention to update the state, and deeply fuses the relationships within the sequence and the relationships between sequences.

4.8 QKV-Attention

The paper “Frustratingly Short Attention Spans in Neural Language Modeling” is probably the earliest paper to propose the concept of QKV.

The authors argue that existing attention mechanisms overload a single output vector with too many roles, which hurts model performance: the representation a word needs as context differs from the one it needs as a query. Therefore, this paper splits the output vector generated at each time step into three vectors: key, value, and predict, each with a distinct function. The following outlines the modification logic and process.

Attention for Neural Language Modeling

The original model structure is shown in the figure below. The neural network language model has only one output vector, which is used to calculate the attention vector, encode historical context information, and predict the distribution of the next word.

Key-Value Attention

The authors of the paper divided the output of the original model into two parts: key and value. The key is used to calculate the attention vector, and the value is used to encode the distribution and contextual information of the next word.

However, the value is still used to encode both contextual information and the distribution representation of the next word (responsible for both the query results and the aggregation results), which is prone to errors due to multitasking. Therefore, the authors made further improvements.

Key-Value-Predict Attention

The new model structure is shown in the figure below. The authors further divide the output of the original model into three parts: key, value, and predict. The key is used to calculate the attention vector, the value is used to encode contextual information, and the predict is used to encode the distribution of the next word.
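A minimal sketch of the key-value-predict split: each output vector is cut into three equal parts, keys are used only for scoring, values only for building the context, and the predict part is kept aside for the next-word distribution. Plain dot-product scoring is an assumption made here for brevity and stands in for the paper’s learned scoring network; the function names are hypothetical, and the vector length is assumed to be divisible by 3.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def split_kvp(output):
    """Split one output vector into equal key, value and predict parts."""
    k = len(output) // 3
    return output[:k], output[k:2 * k], output[2 * k:]


def kvp_attention(history, current):
    """Attend over past outputs: keys score, values carry the content."""
    key_t, _, predict_t = split_kvp(current)
    keys, values = [], []
    for past in history:
        k, v, _ = split_kvp(past)
        keys.append(k)
        values.append(v)
    # Assumption: dot-product scoring between the current key and past keys.
    scores = [sum(a * b for a, b in zip(key_t, k)) for k in keys]
    weights = softmax(scores)
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return weights, context, predict_t


history = [
    [1.0, 0.0, 10.0, 0.0, 0.0, 0.0],  # key=[1,0], value=[10,0]
    [0.0, 1.0, 0.0, 10.0, 0.0, 0.0],  # key=[0,1], value=[0,10]
]
current = [5.0, 0.0, 0.0, 0.0, 7.0, 8.0]  # key=[5,0], predict=[7,8]
weights, context, predict_part = kvp_attention(history, current)
print(weights, context, predict_part)
```

In the toy data the current key matches the first past key, so the context is dominated by the first value, while the predict part passes through untouched.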

N-gram Recurrent Neural Network

The authors further extended the model by dividing the output vector into N-1 parts and using a portion of the output vectors from the previous N-1 time steps to calculate the distribution representation of the next word. The model structure is shown in the figure below.

4.9 MultiHead Self Attention

The paper “A Structured Self-Attentive Sentence Embedding” applies attention mechanisms to text representation learning, allowing these mechanisms to perform multiple tasks. The attention calculation process in the paper is as follows:

The input is a sentence containing n tokens.
Using bidirectional LSTM to process sentences, we obtain some dependencies of adjacent words in a single sentence, i.e., latent vectors.
Concatenate the latent vectors into H.
The hidden states $H$ are transformed by a self-attention mechanism to obtain the attention weight vector $a$.
The LSTM hidden states $H$ are summed using the weights in $a$ to obtain a vector representation $m \in \mathbb{R}^{1 \times 2u}$ of the input sentence. Multiple $m$ vectors focus on different parts of the sentence, and together they represent its overall semantics.
Multi-head self-attention is then computed: stacking several weight vectors $a$ gives a matrix $A$, and the sentence embedding is obtained by multiplying $A$ with the LSTM hidden states $H$. The matrix multiplication $AH$ here is somewhat similar to the attention computation of the Transformer.
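The steps above can be sketched in a few lines, assuming the attention matrix has the form A = softmax(W_s2 tanh(W_s1 Hᵀ)) with the softmax taken over the n tokens, followed by M = AH. The weight matrices `W_s1` and `W_s2` are toy values chosen for illustration, and `H` stands in for the BiLSTM hidden states.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(M):
    return [list(col) for col in zip(*M)]


def structured_self_attention(H, W_s1, W_s2):
    """H is n x 2u. Returns (A, M) with A of shape r x n and M of shape r x 2u."""
    hidden = [[math.tanh(x) for x in row]
              for row in matmul(W_s1, transpose(H))]     # d_a x n
    A = [softmax(row) for row in matmul(W_s2, hidden)]   # each row sums to 1
    M = matmul(A, H)                                     # r weighted sums of H
    return A, M


H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # n=3 tokens, 2u=2
W_s1 = [[1.0, 0.0], [0.0, 1.0]]           # d_a=2
W_s2 = [[1.0, 0.0], [0.0, 1.0]]           # r=2 attention rows ("heads")
A, M = structured_self_attention(H, W_s1, W_s2)
print(A)
print(M)
```

Each row of `A` is one attention distribution over the tokens, so each row of `M` is one sentence vector $m$ that focuses on a different part of the sentence.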

4.10 Multi-step Attention

The paper “Convolutional Sequence to Sequence Learning” uses a CNN model combined with an attention mechanism to solve the problem that RNNs cannot be parallelized.

Compared to RNNs, CNNs have certain advantages. Firstly, CNNs can process data in parallel, resulting in faster training speeds. Secondly, RNNs are not well-suited for handling structured information within sentences. Therefore, the authors used CNNs for both the encoder and decoder, leveraging hierarchical structures to capture long-range dependencies between words, and also to better capture more complex relationships. During decoding, each convolutional layer performs an attention operation, known as multi-step attention.

4.11 Summary

As we can see from the above evolution, the components included in or required by the Transformer have been implemented step by step. However, the presence of RNNs and CNNs in the attention schemes mentioned above is a hindrance. For example, RNNs cannot be trained in parallel, which is not conducive to large-scale rapid training and deployment, nor is it conducive to the development of the entire algorithm field.

Therefore, the authors of Transformer completely abandoned RNNs and CNNs, constructing a completely new sequence transformation architecture. The entire Transformer network structure is composed entirely of attention mechanisms. By directly comparing each pair of sequence elements, Transformer can learn the relevance of all words in the input sequence, capturing global connections in one step. Furthermore, because Transformer does not analyze sequentially, it can operate in parallel, making it more computationally efficient and scalable than RNNs. Ultimately, Transformer is a comprehensive and integrated solution.

Of course, we must also recognize that RNNs and CNNs have not stood still, and their respective developments have been remarkable. RNNs, in particular, are playing a significant role in reinforcement learning. We also expect that more innovative models and methods will emerge in the future, allowing Transformers to play an even greater role in reinforcement learning.

In the next article, we will introduce the overall architecture of Transformer.

0xFF Reference

A General Survey on Attention Mechanisms in Deep Learning Gianni Brauwers and Flavius Frasincar
Andrej Karpathy First public private email: Revealing Transformer The truth about attention mechanism AI Cambrian
Attention Is All You Need
Bahdanau and Luong Attention An intuitive introduction to Honoria
Convolutional Sequence to Sequence Learning
Effective Approaches to Attention-based Neural Machine Translation
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
Long Short-Term Memory-Networks for Machine Reading
NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE
QKV-Attention: FRUSTRATINGLY SHORT ATTENTION SPANS IN NEURAL LANGUAGE MODELING
Recurrent Continuous Translation Models
Self Attention & MultiHead Attention: A STRUCTURED SELF-ATTENTIVE SENTENCE EMBEDDING
Self Attention 1: Long Short-Term Memory-Networks for Machine Reading
Two attention mechanisms in seq2seq (Figure + Formula) Hu Wenxing
Sequence to Sequence Learning with Neural Networks
Thang Luong’s Thesis on Neural Machine Translation Minh-Thang Luong
Transformer Bottom-up Understanding (4) Attention without RNN marsggbo
《FRUSTRATINGLY SHORT ATTENTION SPANS IN NEURAL LANGUAGE MODELING》Reading Notes Simple
One article to understand Bahdanau and Luong’s two attention mechanisms The differences in mechanisms: Flitter
attention mechanism ; OnlyInfo
understanding Attention: from its origins to MHA, MQA, and GQA ; Linsight
learning large models through graphs: the past and present of Transformers (Part 1) ; Learning through graphs
Part 4: Understanding the three attention mechanisms of Transformer architecture in one article; AIwithGary
review: attention mechanisms in image processing; progress and conjectures of non-Transformer architectures in extreme market platforms ; StormBlafe
