Exploring the Transformer Series (4) --- Encoder & Decoder

0x00 Summary

For machine translation, the complete forward computation process of the Transformer is shown in the following diagram (compared to the flowchart in the overall architecture chapter, this diagram emphasizes the internal structure of the encoder and decoder and their interrelationships). The left side of the diagram shows the encoder stack, and the right side shows the decoder stack; these two constitute the “body” of the Transformer. The specific process is as follows.

The input sequence is converted into an embedding matrix, and positional encoding (representing the position of each word) is added to form the word embedding. The word embedding is then fed into the encoder. This step corresponds to label 1 in the figure below.
The encoder receives word embeddings from the input sequence and generates relevant latent vectors. The encoder operates in parallel, therefore performing only one forward propagation. Internally, the encoder stack uses a self-attention mechanism to extract features from the source sequence, obtaining the correlations between elements within the source sequence and preserving high-dimensional, latent logical information. In other words, self-attention is responsible for modeling latent vectors that each output can reference based on all its input vectors. This step corresponds to label 2 in the diagram below.
Unlike the encoder, the decoder executes in a loop until all results are output. Starting from the start token and the encoder’s output, the decoder generates the next token of the output sequence. Just as with the encoder’s input, the decoder’s input is embedded and positional encoding is added before it is passed to the decoder stack. This step corresponds to label 3 in the diagram below.
- The decoder stack internally extracts features from the target sequence through a masked self-attention mechanism, obtaining the correlations between elements within the target sequence and preserving high-dimensional hidden logical information. In other words, masked self-attention is responsible for modeling each decoder output vector based on the decoder’s input vectors.
- Self-attention can only extract and deconstruct the relevance features of the current sequence. Therefore, the encoder and decoder stacks communicate with each other through cross-attention, ensuring asymmetric compression and reconstruction of features. In other words, the cross-attention layer is responsible for further modeling each decoder output vector based on all the latent output vectors of the encoder.
Use a linear layer to generate logits. This step corresponds to number 4 in the diagram below.
A softmax layer is applied to generate probabilities, and finally, the most likely word is output according to a certain strategy. This step corresponds to number 5 in the figure below.
The decoder takes its newly output token together with the previously generated tokens as the input sequence to generate the next token in the output sequence. This step corresponds to label 6 in the diagram below.
Repeat steps 3-6 (for step 3, the input changes each time) to decode and predict the output for the next time step until an EOS marker is generated to indicate the end of decoding or the specified length is reached.
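Steps 3 to 6 above amount to a loop. The following is a minimal sketch in plain Python, not the article's implementation: `toy_next_logits` is a hypothetical stand-in for the decoder plus the linear layer, the token ids `BOS = 0` and `EOS = 3` are assumptions, and greedy selection is only one possible output strategy.

```python
import math

BOS, EOS = 0, 3  # hypothetical start / end token ids


def softmax(logits):
    """Turn a list of logits into probabilities (step 5)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_next_logits(memory, prefix):
    """Hypothetical stand-in for decoder + linear layer (steps 3 and 4).

    It favours the token whose id equals len(prefix), so the loop
    emits 1, 2, 3 and then stops because 3 is EOS.
    """
    vocab_size = 4
    target = min(len(prefix), vocab_size - 1)
    return [5.0 if i == target else 0.0 for i in range(vocab_size)]


def greedy_decode(memory, max_len=10):
    ys = [BOS]                        # the decoder input starts with the start token
    while len(ys) < max_len:          # stop at the specified length ...
        probs = softmax(toy_next_logits(memory, ys))
        next_token = max(range(len(probs)), key=probs.__getitem__)  # greedy pick
        ys.append(next_token)         # step 6: feed the new token back in
        if next_token == EOS:         # ... or when EOS is generated
            break
    return ys


result = greedy_decode(memory=None)
```

The encoder output (`memory`) is computed once and reused in every iteration; only the decoder input `ys` grows.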

This decoding process is the standard seq2seq process; what sets the Transformer apart is the attention mechanism, which is its “soul”. The Transformer establishes global connections within and between sequences through three kinds of attention: self-attention in the encoder, masked self-attention in the decoder, and cross-attention between the two.

401

This chapter will still use machine translation for analysis and explanation.

0x01 Encoder

The encoder takes a sequence of word embeddings as input, which is a low-order semantic vector sequence. The encoder’s goal is to extract and transform features from this low-order semantic vector sequence, ultimately mapping it to a new semantic space to obtain a high-order semantic vector sequence. Because the encoder uses an attention mechanism, this high-order semantic vector sequence has richer and more complete semantics and is context-aware. This high-order semantic vector sequence will be used by the subsequent decoder to generate the final output sequence. Furthermore, the encoder generates a context vector for each word to be predicted. Why generate a new context vector for each word at every step? We can answer this with an example.

Chinese (literal gloss): I ate one apple, and then ate one banana.
English: I ate an apple and then a banana.

If we translate word for word, when we reach “I ate one”, should the English be “I ate a” or “I ate an”? That depends on the following word, “apple”. Therefore, after translating “apple”, we need to use it to decide between “a” and “an”, and then update the context vector for the word “one”.

1.1 Structure

The encoder module structure of Transformer is shown in the purple box in the figure below. The encoder is composed of multiple identical EncoderLayers (the yellow part in the figure below) stacked together. The original diagram in the Transformer paper only shows one EncoderLayer, and then writes Nx next to it. This Nx indicates how many EncoderLayers are stacked together. In the paper, N = 6.

402

Each EncoderLayer consists of the following modules:

Multi-head self-attention (MHA) is characterized by the following:
- MHA is an attention calculation performed on the input sequence itself, used to obtain the correlation between different words in the input sequence.
- The encoder’s input is transformed (i.e., after the input passes through the embedding layer and position encoding, it is multiplied by the parameter matrices of Query, Key, and Value respectively) and then passed to MHA as the three parameters Query, Key, and Value. In other words, QKV all come from a single sequence.
- MHA allows the network to allocate different attention to different positions in the input sentence during prediction. Multi-head attention means that the model has multiple sets of attention parameters (heads); each head outputs vectors that are weighted sums of the values according to its own attention weights, and the outputs of all heads are then merged into a final output vector.
The first residual connection. The residual connection combines the new data generated by the attention mechanism with the original input data. This combination is a simple addition, which helps mitigate the vanishing gradient problem in deep neural networks. The residual connection corresponds to “Add” in the diagram above.
The first Layer Normalization. This module makes the data more stable, facilitating subsequent processing. It corresponds to “Norm” in the diagram, which normalizes the layer’s input so that its mean is 0 and its variance is 1.
FFN (Feed-Forward Network). This module consists of two linear transformations with a ReLU activation function between them. It transforms the word vector at each position independently. This layer further processes the attention-processed vectors, producing a new representation. The FFN output is a more abstract and richer contextual representation, and the FFN increases the model’s non-linear representation capability.
The second residual connection has the same function as the first residual connection.
The second Layer Normalization has the same function as the first Layer Normalization.

Put more simply, the encoder module consists of a series of identical layers, each divided into two important sub-modules: MHA and FFN. Each sub-module is wrapped in a residual connection, and the output of each sub-module undergoes Layer Normalization.
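To make the “multiple heads” idea concrete, here is a small plain-Python sketch of how one d_model-sized vector is divided among the heads and merged back. It is an illustration only: the Query/Key/Value projections and the attention computation itself are omitted, the helper names `split_heads` and `merge_heads` are hypothetical, and the sizes d_model = 512 and 8 heads are the values used in the original paper.

```python
def split_heads(vec, num_heads):
    """Split one d_model-sized vector into num_heads chunks of size d_k."""
    d_model = len(vec)
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_k = d_model // num_heads
    return [vec[i * d_k:(i + 1) * d_k] for i in range(num_heads)]


def merge_heads(heads):
    """Concatenate the per-head vectors back into one d_model-sized vector."""
    return [v for head in heads for v in head]


d_model, num_heads = 512, 8            # values from the original paper
token_vec = [float(i) for i in range(d_model)]
heads = split_heads(token_vec, num_heads)   # 8 chunks of 64 values each
restored = merge_heads(heads)               # back to 512 values
```

Because the heads partition d_model, the merged output has the same size as the input, which is what allows the residual addition that follows.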

1.2 Inputs and Outputs

Because encoders are stacked structures, the inputs and outputs of different EncoderLayers are not exactly the same.

The input to the first EncoderLayer in the encoder stack is the word embedding plus positional encoding; that is, the result of adding the Input Embedding and Positional Encoding in the figure, which we call the Word Embedding. Positional encoding is added because the Transformer model has neither recurrence nor convolution. To allow the model to use word order information, positional encoding needs to be added to the input embedding. Since the EncoderLayers are chained in series, the input to every other EncoderLayer in the stack is the output of the previous EncoderLayer.
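The addition of positional encoding can be sketched as follows. The article does not say which encoding scheme is used, so this sketch assumes the sinusoidal scheme from the original paper; the function names and the toy embedding values are hypothetical.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: one d_model-sized row per position."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)          # even dimensions use sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd dimensions use cosine
    return pe


def add_position(embeddings):
    """Element-wise sum of token embeddings and positional encoding."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(row, pe_row)]
            for row, pe_row in zip(embeddings, pe)]


# The same word appearing at positions 0 and 1 (toy embedding values)
token_embeddings = [[0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4]]
encoder_input = add_position(token_embeddings)
```

The two rows of `encoder_input` differ even though the token embeddings are identical, which is how the model can tell the two occurrences apart.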

After multiple layers of computation, the output of the last EncoderLayer is the encoder’s output (the hidden state between the encoder and decoder). This output is fed into each DecoderLayer in the decoder stack. In code, this output is typically called memory. The encoder’s output is a higher-order abstraction of the original input; “higher” refers to the level of semantic abstraction, since the tensor keeps the same dimensions as the input.

The input dimensions are typically [batch_size, seq_len, embedding_dim]. To facilitate residual connections, the output matrix of each EncoderLayer has the exact same dimensions as the input, and its shape is also [batch_size, seq_len, embedding_dim]. This design ensures that the model can effectively transfer and process information between multiple encoder layers, while also preparing for more complex computation and decoding stages.

1.3 Process

We will continue to refine the encoding process. The work done by a Transformer encoding block is shown in the following diagram. The diagram is divided into two parts, including both the encoder and its input (for better illustration, the input processing part is also included).

The upper part (#2) is a single EncoderLayer within the Encoder module. This module transforms the input X into another vector R of the same dimension. R then serves as the input to the next layer, and this process is repeated layer by layer.
The lower part (#1) is the embedding processing for the two words, which is the input processing part of the EncoderLayer.

403

Our analysis of the process shown in the diagram above is as follows.

The first step is to generate word embeddings using token embeddings and positional encoding, corresponding to circle number 1 in the diagram.
The second step is the self-attention mechanism, corresponding to circle number 2 in the diagram. The specific operation is as follows:

$\operatorname{Attention}(Q,K,V)=\operatorname{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$

All input vectors participate in this process; that is, X1 and X2, through some form of information exchange and mixing, obtain intermediate variables Z1 and Z2 respectively. The self-attention mechanism involves each word in a sentence considering the influence of other words on itself and determining which words it should focus on more. In the input state, X1 and X2 are unaware of each other’s information, but because information exchange occurs during the self-attention mechanism, Z1 and Z2 each receive information from X1 and X2.
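The mixing of X1 and X2 into Z1 and Z2 can be reproduced with a few lines of plain Python. This is a simplified sketch: it assumes Q = K = V = X (the learned Query, Key, and Value projections are left out), uses a single head, and applies the scaling by the square root of d_k from the original paper.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no projections)."""
    d_k = len(X[0])
    out = []
    for q in X:
        # Similarity of this position to every position, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in X]
        weights = softmax(scores)
        # Weighted sum of all value vectors
        z = [sum(w * v[j] for w, v in zip(weights, X)) for j in range(d_k)]
        out.append(z)
    return out


X = [[1.0, 0.0], [0.0, 1.0]]   # X1 and X2 know nothing about each other
Z = self_attention(X)          # Z1 and Z2 each contain a share of both inputs
```

With these orthogonal inputs, each Z row keeps most of its own vector but also receives a non-zero share of the other one.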

The third step is the residual connection and layer normalization, corresponding to circle number 3 on the diagram. The specific operations are as follows:

$\operatorname{Norm}\bigl(x + \operatorname{Sublayer}(x)\bigr)$

The fourth step is the FFN layer, corresponding to circle number 4 in the diagram. The corresponding operation is:

$\operatorname{FFN}(x)=\max(0,\;xW_1+b_1)W_2+b_2$

This means two linear mappings with an activation function in between. Because the FFN is applied to each position separately, Z1 and Z2 are each passed independently through the same fully connected network to obtain R1 and R2.
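A position-wise FFN can be sketched in plain Python as follows. The weights are hypothetical toy values chosen so the arithmetic is easy to follow (d_model = 2, hidden size 3); real models learn them.

```python
def linear(x, W, b):
    """Compute x @ W + b for a single vector x (W has shape in_dim x out_dim)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]


def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for one position."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]   # ReLU
    return linear(hidden, W2, b2)


# Hypothetical tiny weights: d_model = 2, hidden size = 3
W1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.1, -0.1]

Z = [[1.0, 2.0], [3.0, -1.0]]               # Z1 and Z2
R = [ffn(z, W1, b1, W2, b2) for z in Z]     # same weights, one position at a time
```

Each row of `R` depends only on the matching row of `Z`; the positions share weights but do not exchange information in this step.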

The fifth step is the second residual connection and layer normalization, corresponding to circle number 5 on the diagram. Its specific operations are as follows:

$\operatorname{Norm}\bigl(x + \operatorname{Sublayer}(x)\bigr)$

At this point, one EncoderLayer has completed its execution. Its output can then be used as the input for the next EncoderLayer, and steps 2-5 are repeated until all N EncoderLayers have finished processing. We can also see that the output at each position can be computed without waiting for the output at any other position. That is, each EncoderLayer operates on all positions of the input sequence simultaneously, rather than processing each position one by one as in an RNN. This is the key to the Transformer model’s efficient parallel processing.

1.4 Tensor Shape Changes

Let’s look at the tensor shape changes during the encoding process. The encoder’s input is the sentence sequence X to be inferred: [batch_size, seq_len, d_model].

Note: If the maximum length limit is taken into account, it should be [batch_size, max_seq_len, d_model] each time, but it has been simplified here.

The changes in tensor shape during data conversion within the encoder are shown in the table below. For the Input Embedding tensor of input X, its shape remains unchanged within the encoder, as detailed below.

In the input layer, the tensor shape changes during embedding operations; however, the tensor shape remains unchanged when added to the positional encoding.
During the flow within the encoder layer, the tensor shape remains unchanged.
Within the encoder, that is, during the interaction between multiple encoder layers, the tensor shape remains unchanged.

The table below provides detailed operations and tensor shapes.

Stage	Operation	Shape of the result tensor
Input layer	X (token indices)	[batch_size, seq_len]
Input layer	X = embedding(X)	[batch_size, seq_len, d_model]
Input layer	X = X + PE	[batch_size, seq_len, d_model]
Inside the encoder layer	Z = MHA(X)	[batch_size, seq_len, d_model]
Inside the encoder layer	X = X + Z	[batch_size, seq_len, d_model]
Inside the encoder layer	X = LayerNorm(X)	[batch_size, seq_len, d_model]
Inside the encoder layer	X = FFN(X)	[batch_size, seq_len, d_model]
Encoder layer	X = EncoderLayer(X)	[batch_size, seq_len, d_model]
Encoder	X = Encoder(X), i.e. 6 stacked EncoderLayers	[batch_size, seq_len, d_model]

1.6 Implementation

Next, let’s take a look at the encoder implementation in the Harvard source code.

Encoder

The Encoder class is an implementation of an encoder, and its forward() function returns the encoded vector.

import torch.nn as nn

# The Encoder class implements the encoder; it inherits from nn.Module.
# clones() and LayerNorm are helpers defined elsewhere in the Harvard code.
class Encoder(nn.Module):
    "Core encoder is a stack of N layers"

    def __init__(self, layer, N):
        """
        layer: the encoder layer to stack (the EncoderLayer class below)
        N: how many times to stack it, i.e. the number of EncoderLayers
        """
        super(Encoder, self).__init__()  # initialize the parent class nn.Module
        # Make N copies of layer with clones() and keep them in self.layers
        self.layers = clones(layer, N)
        # LayerNorm applied once to the output of the whole stack
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        """
        Pass the input (and mask) through each layer in turn.

        x: the input after embedding and positional encoding,
           shape (batch_size, seq_len, embedding_dim)
        mask: the attention mask
        """
        # Each layer consumes the output of the previous one, so after the
        # loop x has been processed by all N encoder layers in order.
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)  # apply layer normalization to the final output

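The code for the clones function is as follows:

import copy
import torch.nn as nn

def clones(module, N):
    "Produce N identical layers."
    # deepcopy gives every layer its own parameters; ModuleList registers them
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])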

EncoderLayer

The EncoderLayer class is the implementation of the encoder layer. As a unit of the encoder, each EncoderLayer completes one feature extraction process from the input, i.e., the encoding process.

class EncoderLayer(nn.Module):
    "Encoder is made up of self-attn and feed forward (defined below)"

    def __init__(self, size, self_attn, feed_forward, dropout):
        """
        size: corresponds to d_model, the word embedding dimension,
              which is also the size of the encoder layer
        self_attn: an instance of the multi-head self-attention module
        feed_forward: an instance of the FFN layer
        dropout: the dropout rate
        """
        super(EncoderLayer, self).__init__()  # initialize the parent class nn.Module
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        # Two SublayerConnection instances with the same configuration:
        # one wraps self-attention, the other wraps the FFN
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        """
        Follow Figure 1 (left) for connections.

        x: the embedded source sentence, or the output of the previous encoder layer
        mask: the attention mask
        """
        # The first SublayerConnection wraps the self-attention call; it takes
        # care of the layer normalization, dropout, and residual connection.
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))

        # The second SublayerConnection wraps the FFN in the same way.
        return self.sublayer[1](x, self.feed_forward)

We’ll briefly explain the SublayerConnection mentioned in the code, and elaborate on it in a later article. Looking at the diagram in the paper, both the self-attention module and FFN first perform their own business logic, then perform residual connections and layer normalization, and also incorporate Dropout. Because some logic is reusable, the Harvard code encapsulates it in the SublayerConnection class. Internally, the SublayerConnection class constructs instances of LayerNorm and Dropout. Self-attention or FFN is still constructed within the EncoderLayer, and then passed to the SublayerConnection by the EncoderLayer during forward propagation. The decoder calls SublayerConnection in a similar manner.

However, the implementation of SublayerConnection differs slightly from the paper.

The implementation mechanism of the original paper is:

$\operatorname{LayerNorm}\bigl(x + \operatorname{Sublayer}(x)\bigr)$

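SublayerConnection normalizes the input of the sub-layer instead, before the residual addition (with Dropout applied to the sub-layer output):

$x + \operatorname{Dropout}\bigl(\operatorname{Sublayer}(\operatorname{LayerNorm}(x))\bigr)$

Researchers call the arrangement in the Transformer paper “Post LN” because LayerNorm comes last, after the residual addition, and the arrangement in SublayerConnection “Pre LN” because LayerNorm is applied before the sub-layer. Because a Pre LN layer does not normalize its own output, the Encoder class applies one extra LayerNorm to the output of the whole stack. Both methods have their advantages and disadvantages, which we will analyze in subsequent articles.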
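The difference between the two orderings can be seen in a small plain-Python sketch. It is a simplification under stated assumptions: Dropout and the learned gain and bias of LayerNorm are omitted, the variance is the population variance, and `double` is a hypothetical stand-in for the attention or FFN sub-layer.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and (almost exactly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln(x, sublayer):
    """Post LN: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln(x, sublayer):
    """Pre LN: x + Sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def double(v):
    """Hypothetical stand-in sub-layer: multiplies every element by 2."""
    return [2.0 * t for t in v]


x = [1.0, 2.0, 3.0, 4.0]
post_out = post_ln(x, double)   # normalized: mean 0
pre_out = pre_ln(x, double)     # residual path untouched: mean stays at mean(x)
```

In the Post LN result the output itself is normalized, while in the Pre LN result the residual path carries `x` through unchanged, so its mean of 2.5 survives.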

0x02 Decoder

First, it’s important to clarify that the decoder has two inputs: the hidden state generated by the encoder and the previously predicted output of the decoder. The decoder uses these two inputs to predict the next output token. There’s a fairly apt and easy-to-understand analogy online describing the relationship between the encoder and decoder; the author has further refined this analogy based on their own understanding:

The input sequence is a toy that needs to be assembled.
The encoder is like a salesperson. The salesperson studies the various components of the toy, and the encoder’s output (hidden state) is the toy’s assembly instructions, which explain how each component of the toy is used (how it needs to be combined with other components).
The decoder is the buyer. To assemble the toy, the buyer needs to consult the assembly instructions for each part, then find the most similar part (attention matching) based on the instructions and assemble it. The toy during assembly is the decoder’s predicted output from the previous steps. Because assembly requires simultaneously consulting the assembly instructions and observing the toy during assembly, the decoder has two inputs: the assembly instructions and the toy during assembly.
The buyer ultimately outputs an assembled toy.

2.1 Structure

The decoder structure, as shown in the figure below, also consists of multiple decoder layers. The purpose of stacking sub-layers within the decoder is to refine and optimize the representation of the generated words layer by layer, enabling the model to generate more accurate and context-appropriate target words. Each sub-layer has a different function and role.

404

Each decoder layer comprises three key sub-modules: Masked Multi-Head Attention, Cross Attention, and FFN. The output of each sub-module is passed to the next, further enriching and optimizing the representation of the generated sequence. Similar to the encoder, each key sub-module is wrapped in a residual connection, and the output of each key sub-module undergoes layer normalization. The functions of these three sub-modules are as follows:

Masked multi-head attention. This involves calculating the attention of the input sequence to itself. In each decoding step, the masked self-attention layer calculates the relationship between the current word and the generated words in the output sequence through a self-attention mechanism. Details are as follows.
- One of the inputs to the decoder (the concatenation of the decoder’s previously predicted outputs) is passed to all three parameters, Query, Key, and Value, i.e., QKV, which all come from a single sequence.
- The masked multi-head attention module encodes the sequence of “predicted outputs from the decoder” in a manner similar to the global self-attention layer in an encoder.
- The difference between masked multi-head attention and global self-attention lies in the handling of positions in the sequence (masking operation), which we will analyze shortly.
- This layer outputs a new representation, which is a context vector that incorporates information from the generated partial sequence and contains the contextual dependencies between the currently generated word and the generated words.
Cross-attention. This module aligns the source and target sequences, acting as a bridge between the decoder and encoder. The purpose of this sub-layer is to allow the decoder to incorporate the encoder’s output, meaning it references the contextual information of the source sentence during decoding. Through this layer, the decoder can perform attention calculations on each token of the source sentence, ensuring that the structure and semantics of the source sentence are considered during decoding. Cross-attention is the second difference between the encoder and decoder. Details are as follows.
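- No look-ahead mask is applied here; in the Harvard code, only the source mask (src_mask) is passed to this attention call.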
- The input sources for cross-attention are two: the encoder’s output and the masked multi-head attention output. The K and V matrices come from the encoder’s output, and Q comes from the masked multi-head attention output. The authors designed it this way to mimic the decoding process of a traditional encoder-decoder model, allowing for the comprehensive consideration of the content of both the target and source sequences.
- The output of this layer combines the contextual information from the encoder output with the target word representation generated by the decoder in the current step, further optimizing the target word representation. The output is a contextual representation containing information from the source sentence (integrating the structure of the source sentence and a portion of the target sequence).
FFN functions the same as the encoder’s FFN.
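The cross-attention data flow described above, with Q taken from the decoder side and K and V taken from the encoder output, can be sketched in plain Python. This is a single-head illustration without learned projections or masks; the vectors are hypothetical toy values.

```python
import math


def attention(Q, K, V):
    """Scaled dot-product attention; returns (outputs, weights)."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
        all_weights.append(weights)
    return outputs, all_weights


memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # encoder output, src_len = 3
dec_states = [[0.5, 0.5], [2.0, 0.0]]           # masked self-attention output, tgt_len = 2

# Q comes from the decoder, K and V come from the encoder output
out, weights = attention(Q=dec_states, K=memory, V=memory)
```

The weight matrix has one row per target position and one column per source position (2 x 3 here), which is why this layer acts as the alignment between the two sequences.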

The ultimate reason why the masked multi-head self-attention of the decoder differs from that of the encoder lies in the difference between training and inference, namely the difference between the inputs and operations at each time step during training and inference.

During training, the input at each time step is the entire target sequence. In the multi-head self-attention of the Encoder, each position is free to pay attention to all other positions in the sequence. This means that there are no positional restrictions when calculating the attention score. This setup is because during the encoding phase, we assume a complete input sequence, and each word can rely on any other word in its context to obtain its representation.
During inference, the input at each time step is the entire output sequence generated up to that time step, and inference is inherently serial. In other words, the decoder is autoregressive.

To enable parallel operation, a teacher forcing mechanism (which requires masking) is used. This allows the decoder to decode and predict multiple tokens in the same sequence simultaneously. Since the decoder’s input is the entire target sentence, when predicting the output at a certain position, we want the word to only see itself and the words preceding it. We don’t want the attention to focus on words following a word while predicting it, as this could allow the model to use existing future words to assist in the generation of the current word—essentially “cheating.” To prevent preceding tokens from observing information from following tokens, masking is used. The purpose of masking is to ensure that the decoder can only focus on words already generated and cannot see future words. The masking logic is specifically designed for training.

Furthermore, during inference the future words have not been generated yet, so there is nothing to “peek” at and the mask would seem unnecessary. The mask is nevertheless kept in the code and model structure during inference to stay consistent with the training computation. This ensures that the behavior during inference matches that during training and avoids discrepancies between the two. The masked self-attention mechanism thus still acts as a constraint, ensuring that at each step the decoder focuses only on the already generated context and does not assume the existence of future information. Subsequent chapters will explain masks in detail.
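A look-ahead mask and its effect on the attention weights can be sketched in plain Python. This list-based `subsequent_mask` is an illustrative stand-in, not the tensor-based helper of the same name in the Harvard code; the large negative fill value -1e9 is an assumption commonly used for this purpose.

```python
import math


def subsequent_mask(size):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(size)] for i in range(size)]


def masked_softmax(scores, allowed):
    """Softmax over one row of scores; masked positions get zero weight."""
    masked = [s if ok else -1e9 for s, ok in zip(scores, allowed)]
    m = max(masked)
    exps = [math.exp(v - m) for v in masked]
    total = sum(exps)
    return [e / total for e in exps]


mask = subsequent_mask(3)
scores = [[0.0, 0.0, 0.0]] * 3   # uniform raw scores, for illustration only
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]
```

The first position can only attend to itself, the second splits its attention over the first two positions, and only the last position sees the whole sequence.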

2.2 Inputs and Outputs

The decoder is also a stacked structure, so the inputs and outputs of different decoder layers are not the same.

The first decoder layer has two inputs:

The concatenation of the decoder’s previously predicted outputs, i.e., the decoder’s input from the previous time step plus the decoder’s output from the previous time step (the predicted result). Additionally, a Shifted Right operation is applied.
The encoder output, i.e., the K and V of the cross-attention in the first decoder layer, both come from the encoder output (the output of the last encoder in the encoder stack).

The subsequent decoder layer has two inputs:

The output of the previous decoder layer.
The encoder’s output, i.e., the K and V of cross-attention in each subsequent decoder layer, all come from the same encoder output (the output of the last encoder in the encoder stack).

The decoder ultimately outputs a real-valued vector, which is passed to the topmost linear layer in the architecture. There, a final linear transformation and a softmax layer convert it into a probability distribution, which is then used to predict the next word. As mentioned earlier, the encoder’s output is a higher-order representation of the original input. The decoder operates on vectors in this representation space, finding a higher-order vector that matches the attention score. Subsequent modules, such as the generator, map that vector to scores over the vocabulary, from which a word that is understandable to humans can be read off.

405

2.3 Process

The decoder process needs to distinguish between the training phase and the inference phase, because the processes for these two phases are not entirely the same.

Training

Note: The decoder uses Teacher Forcing mode during training, which inputs the entire target sequence into the decoder and allows for parallel prediction of all words in the target sequence. We will analyze Teacher Forcing in detail in later chapters.

Suppose our training task yields the following text pairs:

Chinese: 我爱你
English: I love you.

The model is called as follows, where batch.src is “我爱你” and batch.tgt is “I love you”.

out = model.forward(batch.src, batch.tgt, batch.src_mask, batch.tgt_mask)

The encoder accepts “我爱你” as input. “我爱你” is first split into tokens by the tokenizer, and the tokens are then converted into a set of vectors for further processing to obtain the output memory.

memory = self.encode(src, src_mask)

The decoder has two inputs: the encoder’s output (corresponding to “memory” in the code below) and the English sentence “I love you” (corresponding to “tgt” in the code below). “I love you” is processed into tokens by the tokenizer, and then converted into a vector. The decoder combines this vector with the encoder’s output to generate the final translation output.

self.decode(memory, src_mask, tgt, tgt_mask)

We need to further explain the input to the decoder.

tgt: The English sentence “I love you” is the ground-truth label in the training set. It needs to be combined with a mask and shifted right by one position to obtain the shifted and masked ground truth. The reason for the right shift is as follows: the decoder predicts the output for time step T from the inputs up to time step T-1, so a start symbol must be added to the beginning of the input sentence. This shifts the whole sentence one position to the right, so that each input token is paired with the token that follows it as its label, which makes it possible to predict the first token (and, in Teacher Forcing mode, the subsequent tokens as well). For example, if the original input is “I love you”, after shifting it becomes the start symbol followed by “I love you”, which lets the start symbol predict “I”. Without the start symbol, the first output could not be predicted.
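The right shift can be illustrated with a short plain-Python sketch. The token ids, the toy vocabulary, and the helper name `make_training_pair` are all hypothetical; the point is only how the decoder input and the labels are offset by one position.

```python
BOS, EOS = 0, 1                          # hypothetical special-token ids
vocab = {"I": 2, "love": 3, "you": 4}    # hypothetical toy vocabulary


def make_training_pair(sentence):
    """Return (decoder_input, labels) for teacher forcing."""
    ids = [BOS] + [vocab[w] for w in sentence.split()] + [EOS]
    decoder_input = ids[:-1]   # shifted right: starts with BOS, drops the last token
    labels = ids[1:]           # the token each position is expected to predict
    return decoder_input, labels


decoder_input, labels = make_training_pair("I love you")
# decoder_input: [BOS, I, love, you]
# labels:        [I, love, you, EOS]
```

Position by position, the start symbol is asked to predict “I”, “I” to predict “love”, and so on, with the last position predicting the end marker.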

406

src_mask and tgt_mask are the masks for the source and target sequences, respectively. As previously explained, the target mask is added to hide future information. Since the decoder’s input is the entire target sentence, when predicting the output at a certain position, we want that position to see only its own word and the words preceding it. We don’t want the attention to focus on the words that follow a word while predicting it; that would be “cheating.” Therefore, masks are needed to hide the information of subsequent words, so that the conditions of actual inference are simulated during training.

The specific input and output during training are shown in the figure below. Because training is parallel, one of the inputs to the decoder is the whole sentence “I love you”. Assuming that all predictions are correct, the output is also “I love you”, and all of its tokens are produced in a single pass.

407

Inference

The decoder’s inference workflow is much simpler. There are no future words to hide, although the Harvard code below still passes a mask to stay consistent with training. The decoder uses an autoregressive approach, where the current output is appended to the previous input to form the next input. This way, each decoding iteration uses the embedding information of all previously decoded words. The inference task therefore does not use the actual target sequence to guide generation; it only uses the Chinese sentence “我爱你” (I love you). In other words, the encoder’s input remains unchanged, while the decoder’s inputs are the concatenation of its own previous outputs and the encoder’s output.

As shown in the figure below, the model starts from a special start-of-sequence symbol and generates the output sequence token by token. To generate the next token, it appends the previously predicted token to the previous input sequence and feeds the new sequence to the decoder. Thus the first inference step outputs “I”; the start symbol is then concatenated with “I” and fed into the decoder, which infers “love” in the second step; the start symbol followed by “I love” is then fed into the decoder, and so on. When the decoder generates the end-of-sequence marker, it stops generating.

408

The specific code flow is as follows.

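def inference_test():
    # make_model() and subsequent_mask() come from the Harvard code
    test_model = make_model(11, 11, 2)   # source vocab 11, target vocab 11, N = 2
    test_model.eval()                    # evaluation mode: dropout is disabled
    src = torch.LongTensor([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
    src_mask = torch.ones(1, 1, 10)      # every source position is visible

    memory = test_model.encode(src, src_mask)   # the encoder runs only once
    ys = torch.zeros(1, 1).type_as(src)         # decoder input starts with token id 0

    for i in range(9):
        # Decode the whole sequence generated so far
        out = test_model.decode(
            memory, src_mask, ys, subsequent_mask(ys.size(1)).type_as(src.data)
        )
        # Only the last position is used to predict the next token
        prob = test_model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)   # greedy choice
        next_word = next_word.data[0]
        # Append the new token to the decoder input for the next step
        ys = torch.cat(
            [ys, torch.empty(1, 1).type_as(src.data).fill_(next_word)], dim=1
        )

    print("Example Untrained Model Prediction:", ys)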

The code corresponding to test_model.decode() is as follows.

class EncoderDecoder(nn.Module):
    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)

The decoder calls the logic of each decoder layer in turn.

class Decoder(nn.Module):
    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

Finally, the decoder layer performs the actual computation, such as the attention calculations and residual connections.

class DecoderLayer(nn.Module):
    def forward(self, x, memory, src_mask, tgt_mask):
        "Follow Figure 1 (right) for connections."
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)

2.4 Tensor Shape Changes

Let’s examine the tensor shape changes during decoding (for clarity, the decoder’s input and output layers are included as well). The decoder’s input is a tensor Y whose length varies: token indices of shape [batch_size, seq_len], which become [batch_size, seq_len, d_model] after embedding. At the first step, each matrix in this tensor contains only one row, the encoding of the first token.

Note: If the maximum length limit is taken into account, then each time it should be [batch_size, max_seq_len, d_model].

Once Y has been embedded, its shape stays the same throughout the decoder, as follows.

In the input layer, the embedding operation changes the tensor shape; adding the positional encoding does not.
The tensor shape remains unchanged as it flows through a decoder layer.
Inside the decoder, that is, as the tensor passes from one decoder layer to the next, the shape remains unchanged.
After entering the output layer, the shape of the tensor changes.

Note that the tensor shape stays fixed during a single inference step, but after each step the matrix gains a row and seq_len is incremented by 1.

Stage	Operation	Shape of the resulting tensor
Input layer	Y (token index)	[batch_size, seq_len]
Input layer	Y = embedding(Y)	[batch_size, seq_len, d_model]
Input layer	Y = Y + PE	[batch_size, seq_len, d_model]
Decoder layer interior	Y = Masked-MHA(Y)	[batch_size, seq_len, d_model]
Decoder layer interior	Y = LayerNorm(Y + Masked-MHA(Y))	[batch_size, seq_len, d_model]
Decoder layer interior	Y = Cross-MHA(Y, M, M)	[batch_size, seq_len, d_model]
Decoder layer interior	Y = LayerNorm(Y + Cross-MHA(Y, M, M))	[batch_size, seq_len, d_model]
Decoder layer interior	Y = FFN(Y)	[batch_size, seq_len, d_model]
Decoder layer interior	Y = LayerNorm(Y + FFN(Y))	[batch_size, seq_len, d_model]
Decoder layer	Y = DecoderLayer(Y)	[batch_size, seq_len, d_model]
Decoder	Y = Decoder(Y) = N × DecoderLayer(Y)	[batch_size, seq_len, d_model]
Output layer	logits = Linear(Y)	[batch_size, seq_len, d_voc]
Output layer	prob = softmax(logits)	[batch_size, seq_len, d_voc]
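The table can be replayed as a small shape-tracing sketch. The function name `decoder_shapes` is hypothetical, and the numbers (d_model = 512, a vocabulary of 11) are taken from the examples elsewhere in this article purely for illustration; no tensors are actually computed here.

```python
def decoder_shapes(batch_size, seq_len, d_model, d_voc):
    """Return (stage, shape) pairs for one decoding step."""
    hidden = (batch_size, seq_len, d_model)
    return [
        ("token indices", (batch_size, seq_len)),
        ("embedding + positional encoding", hidden),
        ("masked self-attention sublayer", hidden),
        ("cross-attention sublayer", hidden),
        ("feed-forward sublayer", hidden),
        ("decoder stack output", hidden),
        ("logits / softmax", (batch_size, seq_len, d_voc)),
    ]


# seq_len grows by one after every generated token.
for step_len in (1, 2, 3):
    print(f"seq_len = {step_len}")
    for stage, shape in decoder_shapes(1, step_len, d_model=512, d_voc=11):
        print(f"  {stage:32s} {shape}")
```

Only the first and last rows differ from the hidden shape, which is why decoder layers can be stacked N times without any reshaping between them.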

2.5 Implementation

Decoder

The Decoder class is the implementation of the decoder, a stack of N decoding layers. The encoder passes its output latent vector encoding matrix C to the decoder. These latent vectors help the decoder know which parts of the input sequence it should focus on. Each decoding layer uses the same latent vector encoding matrix C, and these latent vectors are used by each layer in its own Encoder-Decoder cross-attention module.

The Decoder class translates the next word (the (i+1)-th) based on the words translated so far (up to the i-th). During decoding, when the (i+1)-th word is being predicted, a mask hides the target positions from (i+1) onward. The code for the Decoder class is as follows.

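class Decoder(nn.Module):
    "Generic N layer decoder with masking."

    def __init__(self, layer, N):
        """
        layer: the DecoderLayer (defined below) to be stacked. N: the number of decoder layers.
        """
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)  # clone the DecoderLayer N times and keep the copies in the layers list
        self.norm = LayerNorm(layer.size)  # layer normalization instance

    def forward(self, x, memory, src_mask, tgt_mask):
        """
        The forward function takes four arguments:
        x: embedded target data, shape (batch_size, seq_len, d_model). During prediction the number of tokens keeps growing: (1,1,512) on the first step, (1,2,512) on the second, and so on.
        memory: the encoder output
        src_mask: mask for the source sequence
        tgt_mask: mask for the target sequence
        """
        # Stack the decoder layers and run the whole forward pass
        for layer in self.layers:  # pass x through each decoder layer in turn
            x = layer(x, memory, src_mask, tgt_mask)
        # Apply layer normalization to the output of the stacked decoder layers and return the result
        return self.norm(x)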
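The tgt_mask argument is what keeps each position from looking ahead. A minimal pure-Python sketch of such a causal mask follows. The helper name `causal_mask` is hypothetical and stands in for the role that `subsequent_mask` plays in the article's code; it is not that function's implementation.

```python
def causal_mask(size):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(size)] for i in range(size)]


for row in causal_mask(4):
    # 1 = visible, 0 = hidden; the result is a lower-triangular pattern.
    print("".join("1" if allowed else "0" for allowed in row))
```

Row i of the mask describes what the i-th target position can see, so the last row always sees the entire sequence generated so far.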

DecoderLayer

The DecoderLayer class implements a single decoder layer. As the building block of the decoder, each decoder layer extracts features from its input in the direction of the target sequence, that is, it carries out the decoding process. Internally, DecoderLayer and EncoderLayer are very similar; the difference is that EncoderLayer has a single attention module (multi-head self-attention), while DecoderLayer has two (masked multi-head self-attention and cross-attention). In the code, the only difference is that DecoderLayer has an additional src_attn member variable. The implementations of self_attn and src_attn are exactly the same, except that they receive different inputs for the Query, Key, and Value.

The main member variables of DecoderLayer are as follows:

size: The dimension of the word embedding, i.e. the size of the decoder layer.
self_attn: The masked multi-head self-attention module, which performs self-attention over the decoder’s previous output (i.e., its current input). Here $Q = K = V$: the query, key, and value all come from the output of the layer below or, for the first layer, from the original input.
src_attn: The cross-attention module, which performs attention between the encoder’s output and the decoder’s intermediate result. Here $Q \neq K = V$: the Query comes from the output of self_attn, while the Key and Value are the output of the last encoder layer (the memory variable in the code).
feed_forward: The FFN module.
sublayer: Applies the residual connections and layer normalization.
dropout: The dropout rate, i.e. the probability of zeroing an element.

These member variables are passed to the initialization function as parameters.

class DecoderLayer(nn.Module):  # inherits from PyTorch's nn.Module
    "Decoder is made of self-attn, src-attn, and feed forward (defined below)"

    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        # Create three SublayerConnection instances, one each for self_attn, src_attn, and feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        """
        Follow Figure 1 (right) for connections.

        The forward function takes four arguments:
        x: embedded target data, shape (batch_size, seq_len, d_model). During prediction the number of tokens keeps growing: (1,1,512) on the first step, (1,2,512) on the second, and so on. x may be the embedded input or the output of the previous decoder layer.
        memory: the encoder output
        src_mask: mask for the source sequence
        tgt_mask: mask for the target sequence
        """
        m = memory  # short alias for memory
        # Sublayer 1: masked multi-head self-attention. This runs self_attn() and then self.sublayer[0](). Q, K, and V are all x. tgt_mask prevents a position from seeing future words.
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        # Sublayer 2: cross-attention. Q is the input x; K and V are the encoder output m. src_mask hides padding symbols, which avoids meaningless computation. Note that the two attention calls use different masks: tgt_mask above, src_mask here.
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        # Sublayer 3: the FFN; its result is returned
        return self.sublayer[2](x, self.feed_forward)
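The way the three sublayers are chained can be sketched without PyTorch. Everything in this block is an assumption made for illustration: the function names are hypothetical, dropout and the learned scale and shift of layer normalization are omitted, and the ordering LayerNorm(x + Sublayer(x)) follows the shape table in section 2.4. The SublayerConnection class used by the article's code is not shown in this section and may order these operations differently.

```python
import math


def layer_norm(vec, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def residual_sublayer(x, sublayer):
    """Apply LayerNorm(x + sublayer(x)) to every position (row) of x."""
    out = sublayer(x)
    return [layer_norm([a + b for a, b in zip(x_row, o_row)])
            for x_row, o_row in zip(x, out)]


def toy_decoder_layer(x, self_attn, src_attn, feed_forward):
    x = residual_sublayer(x, self_attn)        # sublayer 0: masked self-attention
    x = residual_sublayer(x, src_attn)         # sublayer 1: cross-attention
    return residual_sublayer(x, feed_forward)  # sublayer 2: feed-forward


def double(rows):
    """Toy sublayer that keeps the [seq_len, d_model] shape."""
    return [[2.0 * v for v in row] for row in rows]


x = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
y = toy_decoder_layer(x, double, double, double)
print(len(y), len(y[0]))
```

Because every sublayer returns a tensor of the same shape as its input, the residual addition is always well defined.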

0x03 In-depth Cross-Attention

We have already briefly introduced the classification of attention. Here, we will conduct a more in-depth analysis of cross-attention in conjunction with code.

3.1 Classification

In the Transformer, attention is used in three places: Multi-Head Attention in the Encoder; Masked Multi-Head Attention in the Decoder; and Multi-Head Attention in the interaction between the Encoder and Decoder. Attention layers (Self-attention layers and Encoder-Decoder-attention layers) accept input with three parameters: Query, Key, and Value. We will now analyze how the Q, K, and V parameters of each attention module in the Transformer are derived.

Name	Location	Source of Q	Source of K/V
Multi-head self-attention	Encoder	Q, K, and V all come from the same sequence.	Q, K, and V all come from the same sequence.
Masked multi-head self-attention	Decoder	Q, K, and V all come from the same sequence.	Q, K, and V all come from the same sequence.
Cross-attention	Decoder	Output of the masked multi-head self-attention	The encoder output is used as K and V.

The functions of the three attention modules are as follows:

Multi-head self-attention: This involves calculating the attention of the input sequence to itself, allowing each position to freely attend to the entire sequence, thus capturing the correlations between different words in the input sentence. This can be considered the biggest innovation of Transformer.
Multi-head self-attention with masking: This involves calculating the attention of the input sequence to itself, used to capture the relevance between different words in the output sentence (the already translated portion). Simultaneously, sequence masking is used to limit the scope of attention, preserving autoregressive properties and ensuring the correctness of the generation process. This design is key to the parallel training capability of the Transformer model.
Cross-attention: This involves calculating the attention of the target sequence on the input sequence. This design is key to the Transformer model’s ability to effectively handle tasks like text translation.

In other words, the attention mechanism seeks similarities or correlations at different locations within a sequence (self-attention) or between sequences (cross-attention).

3.2 Business Logic

Next, let’s look at the business logic of cross-attention. Cross-attention calculates the interaction or similarity between each source sequence word and each target sequence word, aligning the input and output sequences.

In seq2seq scenarios, a causal relationship exists, which is reflected in the context (including both the input and output sequences). This is because each character in the source text should have a certain correspondence with each character in the target translation text. Therefore, the token at each position in the source text should have a different impact (weighting) on the token to be predicted. Taking machine translation as an example, the decoder’s goal is to produce the correct target language output sequence given a source language sequence. To achieve this, the decoder needs to:

Learn all the information in the translation history. Only by knowing what has already been generated does it have a basis for correctly outputting the next token.
Learn which parts of the source text are relevant to the current output token. This means the decoder needs access to all the information in the encoder’s output: only when each decoder position can obtain information from every position in the input sequence can the decoder learn to generate the correct output token.

However, masked self-attention only guarantees that the decoder learns the content of the translation history. A method is still needed to learn the source text and to integrate it with the translation history. Cross-attention was proposed to fill this role, which is also a manifestation of causality. “Encoder-decoder cross-attention” takes its input from two different sources, so it can be understood as a two-tower variant of self-attention.

Q is the translation, which comes from the decoder’s output. Because Q comes from the decoder’s masked self-attention, it naturally already contains the content of the previous translations.
K and V are the original text, which comes from the encoder’s output and already contains information about the input sequence (e.g., “I like apples”).

Cross-attention lets the decoder and encoder interact, so that the decoder can “question” the encoder about the input sequence and focus on different parts of the source sentence. In other words, the latent vectors output by the encoder are essentially a database (V) that aggregates information from the input sequence, while each decoder input token is essentially a query (Q) that retrieves the most similar (and most attention-worthy) tokens from that database. Ultimately, each row of the matrix $QK^{\top}$ represents the attention that one decoder input token pays to all tokens in the latent vectors. For example, when translating “I like apples” to “I love apples,” the decoder might ask the encoder, “Based on your understanding of ‘I,’ what should I output next?” In this way, the decoder can predict the next word of the target sequence more accurately. This “question and answer” picture illustrates how the decoder uses information from the encoder to determine the most appropriate output.

3.3 Business Process

Next, let’s look at how cross-attention works between the encoder and decoder.

When the decoder produces the current token, it uses all previous predictions as input to predict the next one. Suppose we need to translate a source sentence meaning “I ate an apple” into English. The decoder has already output the words “I ate” and must now predict the next token, “an”. The whole process is shown in the diagram below. The blue part is the decoder: the small blue box on the left is the masked multi-head self-attention module in the decoder, and the large blue box on the right is the cross-attention module in the decoder. The red dashed box in the lower left is the encoder.

[Figure: cross-attention between the encoder and decoder, with steps numbered 1 to 9]

The specific process is explained below:

The matrix in the upper left corner (number 1) is the result of the decoder’s masked self-attention encoding its three input tokens: the start token, “I”, and “ate”.
The matrix in the bottom left corner (number 2) is the result of the encoder encoding the source sentence “I ate an apple”. This is the final output of the encoder, which captures from multiple perspectives how each word relates to itself and to the other words; it is denoted as memory.
Because both matrices have passed through self-attention, each vector in each matrix contains encoded information from the other positions in its sequence.
Take number 1 as Q (number 3), and number 2 as K and V (numbers 4 and 5).
Next, Q and K are used to compute an attention score matrix (number 6). Each row of the matrix describes how one decoding position should distribute its attention over the positions in memory (V in the figure).
Perform a masking operation (number 7); for cross-attention the mask is src_mask, which hides padding positions in the source.
Perform a softmax operation (number 8) to obtain the attention weights $A = \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)$. These weights can be seen as Q (the vectors to be decoded) querying each position of K (essentially memory) for the information relevant to it.
The output of the cross-attention module (number 9) is obtained by linearly combining the values with the attention weights, i.e., $Z = AV$. This output takes into account the encoded information at every position in memory; in other words, it captures which positions in memory should be focused on when decoding the current time step.
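The steps above can be sketched for a single attention head in plain Python. This is an illustration under stated assumptions rather than the article's implementation: the function names and the toy numbers are made up, there are no learned projection matrices, and masked positions are simply given a large negative score before the softmax.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(Q, K, V, src_mask=None):
    """Return (A, Z) with A = softmax(Q K^T / sqrt(d_k)) and Z = A V."""
    d_k = len(K[0])
    A = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        if src_mask is not None:
            # Hidden (padding) source positions get a large negative score.
            scores = [s if keep else -1e9 for s, keep in zip(scores, src_mask)]
        A.append(softmax(scores))
    Z = [[sum(a * v[j] for a, v in zip(a_row, V)) for j in range(len(V[0]))]
         for a_row in A]
    return A, Z


# 3 target positions (Q) attend over 4 source positions (K = V = memory).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
A, Z = cross_attention(Q, memory, memory, src_mask=[True, True, True, False])
print(len(A), len(A[0]))  # one row per target position, one column per source position
print(len(Z), len(Z[0]))  # same shape as Q
```

Note that the score matrix is rectangular (target length by source length), which is the visible difference from self-attention, where it is square.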

3.4 Code Logic

Let’s analyze the code logic.

[Figure: call flow of the attention code, with steps numbered 1 to 11]

The specific process is explained below.

The run_epoch() function calls the model’s forward propagation function, which is the forward() function of the EncoderDecoder class (number 1 in the diagram). At this point, the Batch class supplies the data and masks (batch.src, batch.tgt, batch.src_mask, batch.tgt_mask) to the model for forward propagation.
The EncoderDecoder class performs encoding through the encode() function.
- The encode() function uses src and src_mask to call the forward() function of the Encoder class (number 2 in the diagram). src contains word indices, and encode() calls self.src_embed(src) to generate the word embeddings for src.
- The Encoder.forward() function then calls the EncoderLayer.forward() function (number 3 in the diagram), which is where the encoder’s encoding is implemented.
- The EncoderLayer.forward() function calls the MultiHeadedAttention.forward() function (number 4 in the diagram), which is where the encoder layer’s encoding is implemented.
- The MultiHeadedAttention.forward() function finally calls the attention() function to complete the attention calculation (number 5 in the diagram).
- The encode() function returns memory, which is then passed as an argument to the decode() function.
The EncoderDecoder class performs decoding through the decode() function.
- The decode() function uses memory, src_mask, tgt, and tgt_mask to call the Decoder.forward() function (number 6 in the diagram). tgt contains word indices, and decode() calls self.tgt_embed(tgt) to generate the word embeddings for tgt.
- The Decoder.forward() function calls the DecoderLayer.forward() function (number 7 in the diagram), which is where the decoder’s decoding is implemented.
- The DecoderLayer.forward() function first calls the MultiHeadedAttention.forward() function. Because the call is self.self_attn(x, x, x, tgt_mask) (number 8 in the diagram) and x is the embedded tgt, this is masked multi-head self-attention.
- The DecoderLayer.forward() function then calls the MultiHeadedAttention.forward() function again. Because the call is self.src_attn(x, m, m, src_mask) (number 10 in the diagram), where x is derived from the ground-truth target (tgt) and m is the encoder output memory, this is cross-attention.
- The masked multi-head self-attention calculation calls the attention() function with tgt and tgt_mask (number 9 in the diagram).
- The cross-attention calculation calls the attention() function with tgt, memory, memory, and src_mask (number 11 in the diagram).

In fact, this part is exactly the same as the Multi-Head Attention in the Encoder mentioned above in terms of specific implementation details.

In summary, the cross-attention mechanism enables the model to consider all the information of the input sequence at every step of generating the output sequence, thereby capturing the complex dependencies between the input and output and achieving excellent performance in various seq2seq tasks.

0x04 Decoder Only

The architecture of the Transformer is not fixed; it can take the form of an encoder-only, decoder-only, or classic encoder-decoder model. Each architectural variant is tailored to specific learning objectives and tasks.

4.1 Classification

The Transformer architecture was initially introduced as an encoder-decoder model for machine translation tasks. In this architecture, the encoder takes the entire source language sentence as input and passes it through multiple Transformer encoder blocks to extract high-level features from the input sentence. These extracted features are then fed into the decoder, which generates tokens in the target language one after another based on the source language features from the encoder and the tokens it has previously generated. In subsequent work, researchers also introduced encoder-only and decoder-only architectures, which keep only the encoder or only the decoder component of the original encoder-decoder architecture, as shown in the figure.

(a) is a model that only contains the encoder and performs inference for all tokens in parallel.
(b) is a model that only contains a decoder and performs inference in an autoregressive manner.
(c) is a model containing an encoder-decoder, which uses the output of the encoded sequence as the input to the cross-attention module.

[Figure: (a) encoder-only, (b) decoder-only, and (c) encoder-decoder architectures]

The figure below presents the evolutionary tree of modern LLMs, tracing the development of language models in recent years and highlighting some of the most prominent models. Models on the same branch are more closely related. Transformer-based models are shown in non-gray:

The blue branches represent Decoder-Only models. Over time, more and more Decoder-Only models have been introduced.
The pink branches represent Encoder-Only models. These models are primarily used for encoding and representing input sequences.
The green branch represents the Encoder-Decoder model. It combines the features of the former two, enabling it to both encode the input sequence and generate the output sequence.

The vertical position of a model on the timeline represents its release date. Open-source models are represented by solid squares, while closed-source models are represented by hollow squares. The stacked bar chart in the bottom right corner shows the number of models from different companies and institutions.

[Figure: evolutionary tree of modern LLMs]

From the evolutionary tree, we can make the following observations relevant to this article:

Decoder-only models have gradually come to dominate the development of LLMs. In the early stages of LLM development, decoder-only models were less popular than encoder-only and encoder-decoder models. However, after 2021, with the introduction of GPT-3, a game-changing LLM, decoder-only models experienced a significant boom. Meanwhile, after the initial explosive growth brought by BERT, encoder-only models gradually began to disappear.
The encoder-decoder model remains promising because this type of architecture is still being actively explored, and most of these models are open source. Google has made significant contributions to open-source encoder-decoder architectures. However, the flexibility and versatility of decoder-only models seem to make Google’s commitment to this direction less promising.

4.2 Decoder Only

The Decoder-Only model uses only the Decoder part of the standard Transformer, but with slight modifications. A key difference is the absence of the encoder-decoder attention layer; that is, the Decoder-Only model does not need to receive input from the encoder. The Decoder-Only model does not have an explicit encoder module and does not explicitly distinguish between the “understanding” and “generation” stages. The model implicitly analyzes, understands, and models user input through a self-attention mechanism, while simultaneously providing the foundation for generation tasks.

As mentioned earlier, after the initial explosive growth brought by BERT, encoder-only models gradually disappeared. Therefore, only Encoder-Decoder models and Decoder-only models remain. However, Decoder-only models can be further subdivided into Causal Decoder architectures and Prefix Decoder architectures. Therefore, the mainstream architectures of existing LLMs can be broadly categorized into three main types: encoder-decoder, causal decoder, and prefix decoder. The following diagram compares the attention patterns in these three mainstream architectures. Here, the blue, green, yellow, and gray rounded rectangles represent attention between prefix tokens, attention between prefix and target tokens, attention between target tokens, and mask attention, respectively.

[Figure: attention patterns of the causal decoder, prefix decoder, and encoder-decoder architectures]

Next, we will analyze two Decoder Only architectures.

Causal Decoder architecture. The causal decoder uses a one-way attention mask to ensure that each input token attends only to past tokens and itself. Input and output tokens are processed in the same way by the decoder. The GPT series of language models is the representative example of this architecture.
Prefix Decoder architecture. The prefix decoder (also known as a non-causal decoder) modifies the causal decoder’s masking so that attention is bidirectional over the prefix tokens and unidirectional only over the generated tokens. Thus, like the encoder-decoder architecture, the prefix decoder can encode the prefix sequence bidirectionally and predict the output tokens one by one autoregressively, with the same parameters shared between encoding and decoding. Rather than pre-training from scratch, a practical suggestion is to keep training a causal decoder and then convert it into a prefix decoder to accelerate convergence. Existing representative LLMs based on prefix decoders include GLM-130B and U-PaLM.
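The difference between the two masking schemes is easiest to see by printing the masks. The sketch below is an illustration only; the function names and the choice of a 5-token sequence with a 3-token prefix are assumptions made for the example.

```python
def causal_decoder_mask(seq_len):
    """Every token sees itself and earlier tokens only."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def prefix_decoder_mask(seq_len, prefix_len):
    """Prefix tokens see the whole prefix; generated tokens stay causal."""
    return [[j <= i or j < prefix_len for j in range(seq_len)]
            for i in range(seq_len)]


def show(mask):
    for row in mask:
        print("".join("1" if allowed else "0" for allowed in row))


print("causal decoder:")
show(causal_decoder_mask(5))
print("prefix decoder (prefix_len = 3):")
show(prefix_decoder_mask(5, prefix_len=3))
```

In the prefix mask the top-left block is fully visible (bidirectional attention within the prefix), while the rows for generated tokens are identical to the causal mask.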

4.3 Architecture Selection

Many studies have investigated the performance of decoder-only architectures and encoder-decoder architectures, but with sufficient training and model size, there is no conclusive evidence that one architecture ultimately outperforms the other. There’s a well-known post on Zhihu: “[Why are current LLMs all Decoder-only architectures?]” Various experts have offered insightful perspectives. A partial summary is as follows:

Note: The “decoder only” below mainly refers to the Causal Decoder architecture.

Suitable for generation tasks.
- Decoder-only models adapt well to different tasks. For “pure generation” tasks (such as dialogue and text continuation), there is no clear boundary between “input” and “output”, so introducing an encoder would be redundant.
- Decoder-only models are better suited for generation tasks. Many practical applications care more about the coherence and semantic richness of the generated text than about a complex understanding of the input.
It has better generalization performance.
- Su Jianlin raised the “full-rank attention” argument: the attention matrix of a decoder-only architecture is always full-rank, which implies stronger expressive power, whereas a bidirectional attention matrix can be rank-deficient.
  - In a pure decoder-only architecture, causal masking (which prevents the model from seeing future tokens) constrains the attention matrix to lower triangular form. The determinant of a triangular matrix is the product of its diagonal entries, and softmax makes every diagonal entry (each token’s attention to itself) strictly positive, so the determinant is positive and the matrix is full-rank. Full rank implies theoretically stronger expressive power.
  - The other two generative architectures introduce bidirectional attention, thus failing to guarantee the full-rank state of their attention matrices. Intuitively, this makes sense. Bidirectional attention is a double-edged sword: it speeds up the learning process but also disrupts the deeper predictive patterns necessary for the model to learn and generate. You can think of it as learning how to write: filling in the blanks is easier than writing an entire article word for word, but it’s a less efficient way to practice. However, with extensive training, both methods can achieve the goal of learning how to write.
- Encoder-Decoder models, because they are bidirectional, offer advantages in prediction but reduce the learning difficulty during training, resulting in a lower upper limit. Decoder-Only models, on the other hand, have a higher upper limit in learning general representations when the model is large enough and there is sufficient data.
- The paper “What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization” compares various combinations of architectures and pretraining methods. They found that:
  - The Decoder-Only model performs best in zero-shot mode without any tuning data. Our experiments show that the purely causal decoder model trained according to the autoregressive language modeling objective exhibits the strongest zero-shot generalization ability after purely self-supervised pre-training.
  - The encoder-decoder model requires multitask fine-tuning on a certain amount of labeled data to achieve optimal performance. However, in experiments, for inputs with non-causal visibility, the model that is first trained using a mask-based language modeling target and then fine-tuned through multitasks performs best.
- Decoder-Only models employ causal attention, which gives them an implicit positional encoding and breaks the permutation invariance of the Transformer. In contrast, bidirectional attention models without positional encoding are less effective at distinguishing word order.
- In-context learning from prompts. When using LLMs, we can apply prompt engineering, such as providing a small number of examples to help the LLM understand the context or task. In the paper “Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers,” the researchers argue mathematically that this in-context information has an effect similar to gradient descent, acting as an update to the attention weights of the zero-shot task. If we view prompts as introducing gradients to the attention weights, we might expect them to act more directly on Decoder-Only models, since the prompt does not have to be transformed into an intermediate contextual representation before being used for generation. Logically, it should still work in Encoder-Decoder architectures, but this requires careful tuning of the encoder to achieve optimal performance, which can be difficult.
High efficiency.
- Decoder-only models process both the input and output sequences within the same module, avoiding complex model structures. According to Occam’s Razor: if you have two competing perspectives to explain the same phenomenon, you should choose the simpler one. Therefore, we should generally prefer model structures that only use the decoder.
- The Decoder-Only model only needs one forward propagation during the inference process, instead of the Encoder and Decoder performing forward propagation separately, resulting in higher inference efficiency.
- The Decoder-Only model supports continuous reuse of the KV cache, making it more suitable for multi-turn dialogues. This is difficult to achieve with the other two architectures. In a pure decoder model, the key (K) and value (V) matrices of previous tokens can be reused for subsequent tokens during decoding. Since each position only focuses on previous tokens (due to the causal attention mechanism), the K and V matrices of these tokens remain unchanged. This caching mechanism avoids recalculating the K and V matrices for already processed tokens, thus improving efficiency and facilitating faster generation and reduced computational costs during inference in autoregressive models such as GPT.
- Easier to scale up. The Encoder-Decoder network is not uniformly symmetrical (it is not a linear chain but has many branches), which creates complex data dependencies and makes parallel optimization difficult. The Decoder-Only architecture does not have this problem.
- Training data is used efficiently and the training cost is low.
  - The training objective of the Decoder-Only model is to predict the next token, which is the core objective of large-scale pre-training tasks. This objective is directly aligned with the network architecture and can efficiently utilize massive amounts of unstructured text data.
  - The Causal Decoder model performs well due to its strong zero-shot generalization ability, which aligns well with the current practice of self-supervised learning on large-scale corpora.
  - Encoder-Decoder models require additional input-output pairing data design. To realize the full potential of the Encoder-Decoder structure, we need to perform multi-task fine-tuning on the labeled data (essentially instruction fine-tuning), which can be very expensive, especially for large models.
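The KV cache point in the efficiency list can be illustrated with a single-head toy in plain Python. Everything here is an assumption made for the sketch: the function names, the toy projections `to_k` and `to_v`, and the 2-dimensional embeddings. The sketch only shows that, under causal attention, appending the new token's K and V to a cache gives the same result as recomputing K and V for the whole prefix at every step.

```python
import math


def attend(q, keys, values):
    """Single-query attention: softmax(q . k / sqrt(d)) weighted sum of values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]


def to_k(x):
    """Toy key projection (assumption): a fixed scaling."""
    return [2.0 * v for v in x]


def to_v(x):
    """Toy value projection (assumption): a fixed scaling."""
    return [0.5 * v for v in x]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # embeddings arriving one by one

k_cache, v_cache, cached_outputs = [], [], []
for x in tokens:
    k_cache.append(to_k(x))  # only the new token's K and V are computed
    v_cache.append(to_v(x))
    cached_outputs.append(attend(x, k_cache, v_cache))

# Reference: recompute K and V for the whole prefix at every step.
recomputed = [
    attend(tokens[i],
           [to_k(t) for t in tokens[: i + 1]],
           [to_v(t) for t in tokens[: i + 1]])
    for i in range(len(tokens))
]
print(cached_outputs == recomputed)
```

The cache is valid because a token's K and V depend only on that token and the tokens before it, so nothing computed earlier changes when a new token arrives.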

0xFF Reference

Anatomizing the Transformer, Part 2: Can you assemble a Transformer using attention mechanisms? (Generous)

A Learning Algorithm for Continually Running Fully Recurrent Neural Networks

Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks

Professor Forcing: A New Algorithm for Training Recurrent Networks, 2016.

Section 10.2.1, Teacher Forcing and Networks with Output Recurrence, Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016.

The era of large-scale models: Is it spring or winter for infrastructure? (Cheng Cheng)

Why are modern LLMs all decoder-only architectures? (Su Jianlin)

A Detailed Explanation of the Decoder-Only Model, One of the Three Major Variants of Transformer ( by Langzi [Niushan AI Park])

Transformer Series: Detailed Explanation of Decoder Principles (xiaogp)

Detailed Explanation of Decoders in Transformer ( by Langzi Niushan AI Park)

Detailed Explanation of Encoders in Transformer (by Langzi Niushan AI Park)

Why does a Decoder-only LLM need positional encoding? (Su Jianlin )

Why do most LLMs only use a Decoder-Only architecture? The Way of AI Algorithms

Transformers Fundamentals—How Does the Decoder Decode? Python Eden

Dai, D., Sun, Y., Dong, L., Hao, Y., Sui, Z., & Wei, F. (2022). Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers. https://arxiv.org/abs/2212.10559