Exploring the Transformer Series (5) --- Training & Inference

0x00 Overview

The purpose of Transformer training is to fit the true target sequence by learning from the input source sequence and the model’s output sequence. The purpose of inference is to generate the target sequence using only the source sequence: the model receives the source sequence but not the target sequence, and the decoder consumes its own previous outputs instead.

This article will continue to use text translation as an example for learning.

0x01 Training

An LLM is an autoregressive model, which can only make predictions sequentially. To improve training efficiency, the ideal training method is parallel computation: feed in the entire sequence at once and decode it in parallel, producing the predictions for all positions in a single pass. The Transformer uses Teacher Forcing combined with masks to meet this requirement. Let’s look at how the key aspects of training are implemented in detail.

1.1 Input

The training data consists of two parts:

Source sequence, such as “I like to eat apples”.
Target sequence, such as “I love apple”.

The source sequence is fed into the encoder. The target sequence is fed into the decoder. Simultaneously, the target sequence is converted into ground truth labels and passed to the loss function. We want the decoder’s output to be as close to the ground truth labels as possible, so the optimizer minimizes the cross-entropy between them. Because training runs in parallel, each sequence is split into tokens, assembled into a matrix, and fed to the Transformer all at once. Parallelism is achieved through matrix operations, which produce the predictions for all positions in one pass. The effect, however, is equivalent to feeding the words into the decoder one by one for serial decoding.

1.2 Dropout

The dropout rate, or the proportion of neurons randomly dropped, is a hyperparameter during training that needs to be adjusted based on the specific task. Because Dropout introduces randomness, it is usually disabled during the testing (or inference) phase to ensure that all neurons participate in the computation and obtain the most stable model output.

Principle

Regularization is a widely used family of methods in machine learning and deep learning that limits the range of parameter values by adding constraints, and Dropout is one such method. Its purpose is to prevent the model from overfitting, thereby improving the algorithm’s generalization ability. Regularization can also alleviate the gradient explosion problem to some extent: constraining the parameters limits their range during updates, which keeps gradients from exploding because of excessively large parameter values.

The concept of Dropout was proposed by Hinton et al. in the paper “Improving neural networks by preventing co-adaptation of feature detectors”. As shown in the figure below, after applying dropout, the original network becomes a thinner and sparser network. Dropout weakens the dependencies between nodes by randomly discarding neurons, which can effectively alleviate overfitting and achieve a certain degree of regularization. This helps the model converge faster and improves performance, thus mitigating two common problems of deep neural networks trained on small datasets: overfitting and long training time.

[Figure 501]

From an ensemble learning perspective, each Dropout step is equivalent to sampling a sub-network from the original network. Each batch step therefore optimizes a different subset of the parameters, training a different sub-network that shares parameters with the original network, and training keeps building on top of these sub-networks. The final network can therefore be viewed as approximately an ensemble of many different networks. In other words, averaging over the Dropout sub-networks provides a cheap approximation of a Bagging ensemble.

Furthermore, Dropout can actually be seen as a form of sparsity. The paper “On the Effectiveness of Parameter-Efficient Fine-Tuning” points out two main advantages of sparsity in model training: enhancing model robustness and reducing generalization error.

[Figure 502]

Dropout can achieve the effect described by this sparsity analysis to some extent.

Location

Dropout layers are ubiquitous in the Transformer architecture. As shown in the figure below, they fall into four types:

Dropout on the input, after the token embedding and positional encoding are summed (number 1 in the figure).
Dropout on the attention weights inside the attention mechanism (number 2 in the figure).
Dropout between the two fully connected layers of the FFN (number 3 in the figure).
Dropout inside “Add & Norm”, on each sublayer output before the residual addition (number 4 in the figure).

[Figure 503]

The specific code snippet is as follows.

After summing the token embedding and positional encoding, there is Dropout.

class PositionalEncoding(nn.Module):
    "Implement the PE function."

    def forward(self, x):
        x = x + self.pe[:, : x.size(1)].requires_grad_(False)
        return self.dropout(x)  # Dropout is applied here

In attention, after obtaining the weights through scaling, masking, and softmax, they must be processed by Dropout before being multiplied with V. The purpose of randomly “discarding” some weights at this point is to prevent the model from becoming overly reliant on certain specific inputs. This is illustrated mathematically below.

$Z = Attention(Q, K, V) = Dropout\left(softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)\right)V$

The specific code is as follows.

import math

import torch


def attention(query, key, value, mask=None, dropout=None):
    "Compute scaled dot-product attention."
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Masked positions get a very negative score, so softmax gives them ~0 weight
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)  # Dropout is applied here
    return torch.matmul(p_attn, value), p_attn

Dropout also exists between two fully connected layers in FFN.

class PositionwiseFeedForward(nn.Module):
    def forward(self, x):
        return self.w_2(self.dropout(self.w_1(x).relu()))  # Dropout is applied here

Dropout is applied to the output of each attention layer and FFN layer (before the residual connection).

class SublayerConnection(nn.Module):
    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))  # Dropout is applied inside Add & Norm

Source code

You can learn about the internal mechanism of Dropout by looking at the PyTorch source code below.

template<bool feature_dropout, bool alpha_dropout, bool inplace, typename T>
Ctype<inplace> _dropout_impl(T& input, double p, bool train) {

  if (p == 0 || !train || input.numel() == 0) {
    return input;
  }

  if (p == 1) {
    return multiply<inplace>(input, at::zeros({}, input.options()));
  }

  at::Tensor b; // used for alpha_dropout only
  auto noise = feature_dropout ? make_feature_noise(input) : at::empty_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT);
  noise.bernoulli_(1 - p);
  if (alpha_dropout) {
    constexpr double alpha = 1.7580993408473766;
    double a = 1. / std::sqrt((alpha * alpha * p + 1) * (1 - p));
    b = noise.add(-1).mul_(alpha * a).add_(alpha * a * p);
    noise.mul_(a);
  } else {
    noise.div_(1 - p);
  }

  if (!alpha_dropout) {
    return multiply<inplace>(input, noise);
  } else {
    return multiply<inplace>(input, noise).add_(b);
  }
}

ALIAS_SPECIALIZATION(_dropout,               false, false)
ALIAS_SPECIALIZATION(_feature_dropout,       true,  false)
ALIAS_SPECIALIZATION(_alpha_dropout,         false, true )
ALIAS_SPECIALIZATION(_feature_alpha_dropout, true,  true )

Tensor make_feature_noise(const Tensor& input) {
  auto input_sizes = input.sizes();
  std::vector<int64_t> sizes;
  sizes.reserve(input.dim());
  sizes.push_back(input_sizes[0]);
  sizes.push_back(input_sizes[1]);
  for (const auto i : c10::irange(2, input.dim())) {
    (void)i;
    sizes.push_back(1);
  }
  return input.new_empty(sizes);
}
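The plain path of the routine above (neither feature dropout nor alpha dropout) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it works on a flat list of floats instead of a tensor, and the function name and the `rng` parameter are made up for this example.

```python
import random


def dropout(values, p, train=True, rng=random):
    """Inverted dropout over a flat list of floats (illustrative sketch)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    # Same early exits as the C++ routine: p == 0, eval mode, or empty input.
    if p == 0 or not train or len(values) == 0:
        return list(values)
    if p == 1:
        return [0.0 for _ in values]
    keep = 1.0 - p
    out = []
    for v in values:
        # Bernoulli(keep) draw: 1 keeps the element, 0 drops it.
        kept = 1.0 if rng.random() < keep else 0.0
        # Dividing by keep rescales the survivors so the expected value is unchanged.
        out.append(v * kept / keep)
    return out


rng = random.Random(0)
x = [1.0] * 10000
y_train = dropout(x, p=0.1, rng=rng)
y_eval = dropout(x, p=0.1, train=False)
mean_train = sum(y_train) / len(y_train)
```

Because the survivors are scaled by 1 / (1 - p) during training, the mean of `y_train` stays close to the mean of `x`, and nothing has to be rescaled at inference time, where the input passes through unchanged.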

Development

Dropout may be more effective in small models: they are typically built for a specific domain and trained on limited data, which makes them prone to overfitting. However, is dropout necessary for large models? The answer is debatable.

The main arguments for why large models do not need dropout are as follows:

Large models are deep and are often trained with lossy, low-precision arithmetic. Although dropout can improve generalization, the noise it introduces can make training unstable.
Dropout costs extra compute and memory. A mask has to be generated (requiring additional GPU memory), the masked results also have to be stored (requiring additional GPU memory), and backpropagation has to perform additional operations, so overall efficiency is lower.
Modern large models are mostly decoder-only structures that employ techniques such as MQA, multi-head attention, pre-norm, and residual connections, and they are pre-trained on large amounts of multi-domain data, which already improves generalization to some extent. Removing dropout therefore has little impact.

However, dropout is still used in some large models, with the same purpose as before. It is typically applied in two places:

On the output representation of self-attention.
On the output representation of the MLP.

Of course, its settings will be adjusted according to the characteristics of the large model, for example:

For neurons in the input layer, the retention rate is usually set close to 1 so that the input does not change significantly. Discarding neurons in the input layer is equivalent to adding noise to the data, which improves the robustness of the network.
For neurons in the intermediate hidden layers, a retention rate of 0.5 often works well across many networks and tasks. With 0.5, half of the neurons are discarded during training and only the other half can be activated, which yields the most diverse set of randomly sampled network structures.
Dropout is generally not added to the output layer.

Furthermore, recent work has explored further applications of dropout. For instance, the paper “LoRA Dropout as a Sparsity Regularizer for Overfitting Control” applies random dropout to the input and output dimensions of the LoRA matrices 𝐴 and 𝐵 to achieve better results. Dropout is not applied along the dimension 𝑟 because that would reduce the matrix rank, which is equivalent to using a smaller 𝑟 in the structure and thus weakens the model’s expressive power.

[Figure 504]

1.3 Loss Function

The loss function evaluates the difference between the model’s predictions and the true values. It gives an intuitive measure of the model’s predictive performance and provides a clear goal and direction for the optimization algorithm: the model parameters are gradually optimized by minimizing the loss. For autoregressive language models, the key is whether the model can correctly predict the next word, so the optimization objective is to minimize cross-entropy. Cross-entropy is closely tied to information entropy: it equals the entropy of the true distribution plus the KL divergence between the true and predicted distributions. In other words, the quantitative goal of learning during the pre-training phase is to minimize this cross-entropy across data from different domains.

In the Transformer architecture, a Generator module follows the decoder output. This module maps the latent vectors output by the decoder from the word embedding dimension to the vocabulary size, obtaining logits. After softmax, the logits give the probability of each vocabulary token being the next token. The model then uses these probabilities and a sampling rule to pick the next token. The model’s performance is judged by whether it classifies the next token as the token given by the ground truth. Predicting a new token is therefore a classification task, and the Generator acts as a classification head. Training calculates the loss based on the classification results.
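As a minimal sketch of the last step, the snippet below turns a vector of logits into log-probabilities and picks the next token greedily. The six-word vocabulary and the logit values are invented for illustration, and greedy selection is only one possible sampling rule.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


# Hypothetical vocabulary and logits for a single decoding position.
vocab = ["<pad>", "I", "love", "you", "apple", "<eos>"]
logits = [-2.0, 3.0, 0.5, 0.1, -1.0, 0.0]

log_probs = log_softmax(logits)
probs = [math.exp(v) for v in log_probs]

# Greedy rule: take the token with the highest probability.
next_token = vocab[max(range(len(vocab)), key=lambda i: log_probs[i])]
```

Subtracting the maximum logit before exponentiating does not change the result but avoids overflow for large logits.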

Cross-entropy

The Harvard code compares the model’s predicted distribution (the Generator output) with the true distribution (the targets). Note that it does so with KLDivLoss rather than a dedicated cross-entropy loss: for a fixed target distribution the two differ only by the entropy of the target, which is constant with respect to the model, so they produce the same gradients. The gradient of the loss is then computed, and backpropagation slightly adjusts all of the model’s weights so that the output more closely approximates the target. The specific code is as follows.

self.criterion = nn.KLDivLoss(reduction="sum")

We’ll analyze this using the following diagram. Assuming the vocabulary contains 6 words, we want to obtain a probability distribution that matches the expected target sequence “I love you”. The top of the diagram shows the target probability distribution. In the probability distribution of the first output word, the probability of “I” should be 1, while the probabilities of all other words in the vocabulary should be 0. Similarly, in the probability distributions of the second and third output words, the probabilities of “love” and “you” should both be 1, while the probabilities of all other words in the vocabulary should be 0. The bottom of the diagram shows the probability distribution of the model’s predicted output. The loss function calculates the difference between the two.

[Figure 505]
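The relationship between the cross-entropy described in the text and the KL divergence used in the code can be checked numerically. The sketch below uses a 6-word vocabulary as in the example; the predicted distribution and the soft target are made-up numbers.

```python
import math


def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i), skipping terms where p_i is 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    """H(p) = -sum_i p_i * log(p_i)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


pred = [0.05, 0.7, 0.1, 0.05, 0.05, 0.05]       # hypothetical model output
one_hot = [0, 1, 0, 0, 0, 0]                     # target word has index 1
soft = [0.04, 0.8, 0.04, 0.04, 0.04, 0.04]       # a non one-hot target

ce_one_hot = cross_entropy(one_hot, pred)
kl_one_hot = kl_divergence(one_hot, pred)
ce_soft = cross_entropy(soft, pred)
kl_soft = kl_divergence(soft, pred)
```

With a one-hot target the entropy term is zero, so cross-entropy and KL divergence coincide and both reduce to the negative log-probability of the correct word. With a soft target they differ by `entropy(soft)`, which does not depend on the model.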

The code for calculating the loss is shown below, where the parameter criterion is the loss function. Besides the loss calculation, this class also contains the forward pass of the model’s generator. The code includes a normalization detail. Suppose there are two batches: the first has 6 tokens, so its loss is the sum of the losses of those 6 predictions, and the second has 60 tokens, so its loss is the sum over 60 predictions. The second loss is clearly larger simply because the batch is longer, which makes the two values incomparable. Therefore the loss is averaged by dividing by the number of valid tokens.

class SimpleLossCompute:
    "A simple loss compute and train function."

    def __init__(self, generator, criterion):
        self.generator = generator  # Generator instance: predicts the next token from the decoder output
        self.criterion = criterion  # LabelSmoothing instance: smooths the labels and computes the loss

    def __call__(self, x, y, norm):
        """
        x: decoder output
        y: batch.tgt_y, all tokens to be predicted. For example, if src is
           `<bos>我吃了一个苹果<eos>`, then tgt_y is "I ate an apple<eos>"
        norm: batch.ntokens, the number of valid tokens in tgt_y
        """
        x = self.generator(x)  # produce the predicted output
        # Compute the summed loss with KLDivLoss, then divide by batch.ntokens
        # to normalize it per token.
        sloss = (
            self.criterion(
                x.contiguous().view(-1, x.size(-1)), y.contiguous().view(-1)
            )
            / norm  # normalize the loss
        )
        return sloss.data * norm, sloss
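The effect of dividing by the token count can be seen with plain numbers. In this sketch every prediction is assumed to have the same loss of 0.5, and the helper name is hypothetical.

```python
def normalized_loss(per_token_losses):
    """Return (summed loss, loss per valid token)."""
    total = sum(per_token_losses)      # what reduction="sum" would return
    ntokens = len(per_token_losses)    # stands in for batch.ntokens
    return total, total / ntokens


short_batch = [0.5] * 6     # batch with 6 valid tokens
long_batch = [0.5] * 60     # batch with 60 valid tokens

short_total, short_mean = normalized_loss(short_batch)
long_total, long_mean = normalized_loss(long_batch)
```

The summed losses differ by a factor of ten (3.0 versus 30.0) although the model is equally good on both batches, while the per-token losses are both 0.5.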

Label Smoothing

The Transformer paper also uses Label Smoothing (Label Smoothing Regularization) as a regularization technique to prevent overfitting. Real-world training data is noisy, and a model trained on hard one-hot labels tends to fit that noise too confidently. Noise is therefore added to the ground truth to constrain the model. Below is an excerpt from the paper.

“Label Smoothing: During training, we employed label smoothing of value $\epsilon_{ls} = 0.1$ [36]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.”

Label smoothing primarily targets the softmax layer. The idea is to introduce noise into the ground truth. Instead of labeling the ground truth as purely zero or one, it smooths the labels: some probability mass is taken from the true class and distributed evenly among the other classes. After this adjustment the probabilities of all classes still sum to 1, but the model becomes less confident, which reduces overfitting.

Label smoothing effectively suppresses feature normalization, eliminating smooth regions on the loss function surface and introducing large gradients pointing towards class centers, thus causing features to cluster together. The principle of label smoothing is illustrated in the figure below.

[Figure 506]

Let’s demonstrate with an example. Suppose our label is 2 and the dictionary size is 6. The original truth vector is: [0, 0, 1, 0, 0, 0]. Now, we take a smoothing factor ϵ = 0.2, and the smoothed label is: [0.2/5, 0.2/5, 1-0.2, 0.2/5, 0.2/5, 0.2/5] = [0.04, 0.04, 0.8, 0.04, 0.04, 0.04]. This way, even if the model predicts correctly, we shouldn’t be overconfident; instead, we should penalize the model to prevent it from overestimating its predictions.
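The arithmetic of this example can be reproduced with a small helper. This is a sketch rather than the Harvard implementation: the function name is hypothetical, and the optional `padding_idx` argument shows the variant in which the padding token is excluded, so the smoothing mass is shared among K - 2 classes instead of K - 1.

```python
def smooth_labels(target, size, eps, padding_idx=None):
    """Build a smoothed label distribution for one target index."""
    if padding_idx is not None and target == padding_idx:
        # A padding position carries no label at all.
        return [0.0] * size
    # Classes that share the eps mass: everything except the target,
    # and also except <pad> when a padding index is given.
    others = size - 1 if padding_idx is None else size - 2
    dist = [eps / others] * size
    dist[target] = 1.0 - eps
    if padding_idx is not None:
        dist[padding_idx] = 0.0
    return dist


plain = smooth_labels(target=2, size=6, eps=0.2)
padded = smooth_labels(target=2, size=6, eps=0.2, padding_idx=0)
```

Here `plain` is [0.04, 0.04, 0.8, 0.04, 0.04, 0.04] as in the example, and `padded` is [0.0, 0.05, 0.8, 0.05, 0.05, 0.05]. Both still sum to 1.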

The code for Label Smoothing is shown below. This class is responsible for smoothing labels and calculating the loss. Because the dictionary includes the padding token, which should never be predicted, K-1 in the formula is changed to K-2 in the algorithm. A common alternative is to set ignore_index in PyTorch’s CrossEntropyLoss to PAD_IDX, such as loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX).

class LabelSmoothing(nn.Module):
    "Implement label smoothing."
    # Besides smoothing the labels, this class also computes the loss

    def __init__(self, size, padding_idx, smoothing=0.0):
        """
        size: vocabulary size of the target language
        padding_idx: index of <pad> in the vocabulary
        smoothing: smoothing factor; 0 means no smoothing
        """
        super(LabelSmoothing, self).__init__()
        self.criterion = nn.KLDivLoss(reduction="sum")  # the loss function that is finally used
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing
        self.size = size
        self.true_dist = None  # the smoothed labels

    def forward(self, x, target):
        """
        x: distribution output by the generator (log-probabilities, as
           KLDivLoss expects). Shape: (batch_size, voc_size)
        target: ground truth labels, given as token indices. Shape: (batch_size)
        """
        # Make sure the generator output dimension matches the vocabulary size;
        # otherwise the loss computation below would fail
        assert x.size(1) == self.size
        # Create a tensor with the same shape as x
        true_dist = x.data.clone()
        # Fill all of true_dist with self.smoothing / (self.size - 2).
        # Assume smoothing=0.2, vocabulary size 6, batch size 2. Then every
        # entry becomes 0.2 / (6 - 2) = 0.05, and true_dist is:
        # [[0.05, 0.05, 0.05, 0.05, 0.05, 0.05],
        #  [0.05, 0.05, 0.05, 0.05, 0.05, 0.05]]
        true_dist.fill_(self.smoothing / (self.size - 2))  # K - 2 = 6 - 2
        # target.data.unsqueeze(1) adds a dimension to target.data: if
        # target.data is [2, 3], the result is [[2], [3]].
        # Along dimension 1 of true_dist, the entries indexed by
        # target.data.unsqueeze(1) are set to self.confidence.
        # In this example the labels of the two samples are 2 and 3, so after
        # scatter_ true_dist becomes:
        # [[0.05, 0.05, 0.8, 0.05, 0.05, 0.05],
        #  [0.05, 0.05, 0.05, 0.8, 0.05, 0.05]]
        true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence)  # 1 means dimension 1
        # Set the column at the <pad> index to 0
        true_dist[:, self.padding_idx] = 0
        # Find the labels in target that are <pad>. For example, if target is
        # ['i', 'love', 'you', '<pad>', '<pad>'], mask is [[3], [4]], meaning
        # that positions 3 and 4 are padding.
        mask = torch.nonzero(target.data == self.padding_idx)
        if mask.dim() > 0:
            # Set the rows whose label is <pad> to 0
            true_dist.index_fill_(0, mask.squeeze(), 0.0)
        # Keep the smoothed labels
        self.true_dist = true_dist
        # Compute the loss with the smoothed labels. The <pad> parts were
        # masked out above, so they do not contribute to the loss.
        return self.criterion(x, true_dist.clone().detach())

The following diagram shows a partial data flow example from the code above.

[Figure 507]

Because training is performed in parallel, it is difficult to demonstrate. Therefore, the following diagram is simplified, showing only the loss calculation of a single output in the first three steps.

[Figure 508]

The simplified code for calling the loss function is as follows. Since the LabelSmoothing class encapsulates the loss function as mentioned earlier, the criterion here is an instance of the LabelSmoothing class.

criterion = LabelSmoothing(size=V, padding_idx=0, smoothing=0.0)
model = make_model(V, V, N=2)
batch_size = 80
for epoch in range(20):
    model.train()
    run_epoch(
        data_gen(V, batch_size, 20),
        model,
        SimpleLossCompute(model.generator, criterion),
        optimizer,
        lr_scheduler,
        mode="train",
    )

# The loss function is called inside run_epoch()
def run_epoch(
    data_iter,
    model,
    loss_compute,
    optimizer,
    scheduler,
    mode="train",
    accum_iter=1,
    train_state=TrainState(),
):
    for i, batch in enumerate(data_iter):
        out = model.forward(
            batch.src, batch.tgt, batch.src_mask, batch.tgt_mask
        )
        # Compute the loss
        loss, loss_node = loss_compute(out, batch.tgt_y, batch.ntokens)

Below is another example of smoothing. It shows that once the model becomes very confident it starts to be penalized: beyond the smoothed target, the more confident it is, the greater the loss becomes.

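def loss(x, crit):
    # x is a steadily increasing integer (1 to 99 below); d = x + 3 is slightly larger than x.
    d = x + 3
    # Simulate the model's output.
    # At x = 1 the output is [[0.0000, 0.2500, 0.2500, 0.2500, 0.2500]]:
    # the model cannot predict well yet.
    # At x = 99 the output is [[0.0000, 0.9706, 0.0098, 0.0098, 0.0098]]:
    # the model confidently says that the answer is 1.
    predict = torch.FloatTensor([[0, x / d, 1 / d, 1 / d, 1 / d]])
    # Compute the loss. KLDivLoss expects log-probabilities, so take the log of predict.
    return crit(predict.log(), torch.LongTensor([1])).data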

def penalization_visualization():
    crit = LabelSmoothing(5, 0, 0.1)
    loss_data = pd.DataFrame(
        {
            # x starts at 1 and keeps increasing, simulating a model that performs better and better
            "Loss": [loss(x, crit) for x in range(1, 100)],
            "Steps": list(range(99)),
        }
    ).astype("float")

    return (
        alt.Chart(loss_data)
        .mark_line()
        .properties(width=350)
        .encode(
            x="Steps",
            y="Loss",
        )
        .interactive()
    )

show_example(penalization_visualization)
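The shape of that curve can also be reproduced without PyTorch or a plotting library. The sketch below hard-codes the smoothed target that LabelSmoothing(5, 0, 0.1) builds for label 1 (padding column zeroed, 0.1 shared among the three remaining classes) and evaluates the KL loss for the same simulated predictions. The helper name `kl_loss` is made up for this example.

```python
import math


def kl_loss(target, pred):
    """KL(target || pred), skipping entries where the target is 0."""
    return sum(t * math.log(t / p) for t, p in zip(target, pred) if t > 0)


# Smoothed target for label 1: 5 classes, padding index 0, smoothing 0.1.
target = [0.0, 0.9, 0.1 / 3, 0.1 / 3, 0.1 / 3]

losses = {}
for x in range(1, 100):
    d = x + 3
    pred = [0.0, x / d, 1 / d, 1 / d, 1 / d]   # same simulated output as above
    losses[x] = kl_loss(target, pred)

best_x = min(losses, key=losses.get)
```

The minimum is reached at x = 27, where the prediction [0, 27/30, 1/30, 1/30, 1/30] matches the smoothed target exactly and the loss is 0. For larger x the prediction is more confident than the smoothed label allows, and the loss rises again.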

1.4 Learning Rate

The learning rate determines the step size for updating model parameters. If the learning rate is set too high, the model parameters may jump out of the optimal solution range due to the excessively large step size during updates. Simultaneously, an excessively high learning rate can make the model overly aggressive in updating parameters, exacerbating gradient fluctuations and leading to gradient explosion. If the learning rate is too low, the model’s convergence speed may slow down, increasing training time. Therefore, the choice of learning rate needs to be adjusted according to the specific task and model structure. In practical applications, adaptive learning rate algorithms can be used to adjust the learning rate based on the statistical information of parameter gradients. For example, optimization algorithms such as Adam, Adagrad, and RMSprop can dynamically adjust the learning rate based on historical gradient information, thereby improving training stability and efficiency.

Warmup

Warm-up is a type of dynamic learning rate adjustment: the learning rate is gradually increased from 0 to a specified value at the beginning of training, rather than starting at a fixed value from the outset. Without warm-up, the model may update too aggressively at the beginning of training. Because of the vanishing gradient phenomenon, the model is more sensitive to its later layers, meaning that later layers learn faster. However, later layers learn from the outputs of earlier layers. If the earlier layers have not yet learned well, the later layers learn on an incorrect foundation, which can ultimately cause training to collapse.

Noam

The Transformer paper employs a special learning rate schedule known as the “Noam” warm-up strategy. It consists of two parts, warmup and decay, so the learning rate first increases and then decreases. The schedule is illustrated in the figure below. It is a piecewise function with warmup_steps as the dividing point: d_model is the model dimension, step_num is the current training step number, and warmup_steps is the number of warm-up steps.

The warmup phase, from step 0 to warmup_steps, is the initial phase, in which the learning rate increases linearly to its maximum value. Large networks are unstable in the early stages of training, and a large learning rate makes convergence harder. Using a smaller learning rate during warmup helps the model get through this early stage stably. Furthermore, large networks often use extremely large batch sizes. For large batches to behave like small ones, the updates from “k minibatches, size = n, learning rate = η” and “1 minibatch, size = kn, learning rate = kη” need to be approximately equal. This equality breaks down when the model changes drastically, as it does early in training. Warmup effectively alleviates this problem.
The decay phase is the cooling phase, during which the learning rate gradually decreases so that the model can stabilize in the later stages of training. Common methods include exponential decay, piecewise constant decay, and inverse-time decay. The Transformer uses a negative power form (proportional to step_num^{-0.5}), so the decay is fast at first and then slows down.
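The equivalence mentioned in the warmup item is often called the linear scaling rule. It can be written out as follows, as a sketch in notation that is assumed here rather than taken from the text above: $l(x, w)$ is the per-example loss, $w_t$ are the weights at step $t$, and $\mathcal{B}_0, \dots, \mathcal{B}_{k-1}$ are $k$ minibatches of size $n$.

```latex
% k small steps: minibatches of size n, learning rate \eta
w_{t+k} = w_t - \eta \, \frac{1}{n} \sum_{j=0}^{k-1} \sum_{x \in \mathcal{B}_j} \nabla l(x, w_{t+j})

% one large step: the union of the same minibatches (size kn), learning rate \hat{\eta}
\hat{w}_{t+1} = w_t - \hat{\eta} \, \frac{1}{kn} \sum_{j=0}^{k-1} \sum_{x \in \mathcal{B}_j} \nabla l(x, w_t)

% with \hat{\eta} = k\eta the two updates coincide whenever
\nabla l(x, w_{t+j}) \approx \nabla l(x, w_t) \quad \text{for all } j < k
```

The last condition says that the weights barely move within $k$ small steps. Early in training the weights change quickly, so the condition fails, which is why a small learning rate is used at the start.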

[Figure 509]

The Noam mechanism can be understood by analogy with human learning: when we learn a new field, we initially need to explore and experiment, so the learning pace is slow. As we absorb more foundational knowledge, our learning speed gradually increases. After mastering a large amount of diverse knowledge, however, we typically hit a bottleneck that requires integration and reflection, which slows the learning process down again. In summary, human learning is a spiral-like, gradual process that combines slow and fast phases, and the Noam mechanism mirrors this process.

The following diagram illustrates the detailed derivation process.

[Figure 510]

The rate() function in the Harvard code is an implementation of the following formula.

$\text{lrate} = d_{\text{model}}^{-0.5} \cdot \min\left(\text{step\_num}^{-0.5},\; \text{step\_num}\cdot\text{warmup\_steps}^{-1.5}\right)$

The specific code is as follows.

def rate(step, model_size, factor, warmup):
    """
    we have to default the step to 1 for LambdaLR function
    to avoid zero raising to negative power.
    """
    if step == 0:  # if the step is 0 (not provided yet), set it to 1
        step = 1
    return factor * (
        model_size ** (-0.5) * min(step ** (-0.5), step * warmup ** (-1.5))
    )

The specific usage method is as follows.

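import torch
from torch.optim.lr_scheduler import LambdaLR

# LambdaLR multiplies the base lr (0.5 here) by the value that lr_lambda returns.
optimizer = torch.optim.Adam(
    model.parameters(), lr=0.5, betas=(0.9, 0.98), eps=1e-9
)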
lr_scheduler = LambdaLR(
    optimizer=optimizer,
    lr_lambda=lambda step: rate(
        step, model_size=model.src_embed[0].d_model, factor=1.0, warmup=400
    ),
)
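To get a feel for the shape of the schedule, the formula can be evaluated without any optimizer. The sketch below re-implements the same formula in plain Python and uses d_model = 512 and warmup_steps = 4000, the values of the base model in the Transformer paper. The 20000-step horizon is an arbitrary choice for illustration.

```python
def rate(step, model_size, factor, warmup):
    """Noam schedule: linear warm-up, then inverse square root decay."""
    step = max(step, 1)  # avoid raising 0 to a negative power
    return factor * model_size ** (-0.5) * min(step ** (-0.5), step * warmup ** (-1.5))


d_model, warmup = 512, 4000
lrs = [rate(s, d_model, 1.0, warmup) for s in range(1, 20001)]  # lrs[i] is step i + 1

peak_step = max(range(len(lrs)), key=lambda i: lrs[i]) + 1
peak_lr = lrs[peak_step - 1]
```

The two branches of `min` meet at step = warmup_steps, so the peak is reached at step 4000 with a value of (d_model * warmup_steps)^{-0.5}, roughly 7e-4. Before that point the rate grows linearly with the step, and afterwards it halves every time the step count quadruples.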

In practical applications, different learning rates can be used to adjust each layer, or several layers can be grouped together and different learning rates can be applied to different groups. This is because different layers in the Transformer model typically capture different types of information. The lower layers usually encode general and basic information, while the upper layers usually encode information that is closer to the pre-training task. Therefore, a higher learning rate can be applied to the upper layers and a lower learning rate can be applied to the lower layers.
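One simple way to assign such per-layer rates is a geometric decay from the top layer downwards. This is only a sketch under assumptions: the decay factor 0.8, the top learning rate, and the helper name are illustrative choices, and the resulting dictionaries merely mimic the per-group `lr` option that optimizers such as those in torch.optim accept (the `params` entries are left out).

```python
def layerwise_learning_rates(num_layers, top_lr, decay):
    """Return one learning rate per layer; index 0 is the lowest layer."""
    return [top_lr * decay ** (num_layers - 1 - layer) for layer in range(num_layers)]


lrs = layerwise_learning_rates(num_layers=6, top_lr=1e-4, decay=0.8)

# One group per layer; a real optimizer would also need the layer's parameters.
param_groups = [{"layer": i, "lr": lr} for i, lr in enumerate(lrs)]
```

The top layer keeps the full rate of 1e-4 and each layer below it gets 0.8 times the rate of the layer above, so the lowest layers, which hold the most general features, change the least.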

1.5 Initialization

Weight initialization is a crucial step in neural network training. If the weights are initialized too large, gradient calculation during backpropagation can be significantly affected, potentially leading to gradient explosion. For example, if the weights are drawn from a standard normal distribution, whose standard deviation is 1, the gradient values may become extremely large after propagating through many layers.

Using appropriate weight initialization strategies can effectively control the magnitude of gradients and reduce the possibility of gradient explosion. Common weight initialization methods include Xavier initialization (also known as Glorot initialization) and He initialization. These methods set the scale of the initial weights based on the number of input and output units of each layer and the characteristics of the activation function, so that gradients change more smoothly during backpropagation.

For example, the Xavier initialization method adjusts the initial values of the weights based on the number of input and output neurons, so that the activations in forward propagation and the gradients in backward propagation have similar variances. The He initialization method is particularly suitable for ReLU activation functions because it accounts for ReLU zeroing out negative inputs and scales the initial weights accordingly.

The vanilla Transformer uses Xavier initialization.
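As a rough sketch of the uniform variant of Xavier initialization, weights are drawn from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)), which gives them a variance of 2 / (fan_in + fan_out). The code below uses nested Python lists instead of tensors, and the layer sizes are arbitrary.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Return a fan_out x fan_in weight matrix drawn from U(-bound, bound)."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)] for _ in range(fan_out)]


rng = random.Random(0)
fan_in, fan_out = 256, 128
weights = xavier_uniform(fan_in, fan_out, rng)

# Empirical variance of the sampled weights versus the theoretical target.
flat = [v for row in weights for v in row]
mean = sum(flat) / len(flat)
var = sum((v - mean) ** 2 for v in flat) / len(flat)
target_var = 2.0 / (fan_in + fan_out)
```

The empirical variance lands close to `target_var`, which is what keeps the scale of activations and gradients roughly constant from layer to layer.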

1.6 Teacher Forcing

Essentially, Teacher Forcing is a method that guides and accelerates learning by providing the correct output as input at each step of training, rather than letting the model generate the next step from its own previous outputs.

Problem

As mentioned earlier, autoregressive decoding has two problems when it is used for training:

Errors can easily accumulate, leading to poor training results. During training, we could use the same method as during inference, i.e., autoregressive decoding. However, if the decoder makes a wrong prediction in a certain round, this incorrect output is used as the decoder input in the next round. Continuing to decode from incorrect input means going further and further down the wrong path, which slows down the model’s convergence towards the global optimum.
It can only be done serially, which makes it difficult to train in parallel to improve efficiency. This is similar to the logic of human speech. A person may be able to conceive an entire sentence in their mind, but it must be spoken word by word, and later words are always influenced by earlier words.

Let’s look at the two problems above using the table below.

First, given the input, the model needs 5 time steps to complete inference because the decoder only predicts one word at a time. Training according to this process would be too slow; we should use a parallel (matrix computation) method for training.

Secondly, errors can occur in the inference process, and it is easy to go further and further down the wrong path.

Time step	Decoder input 1	Decoder input 2	Decoder output	Truth value	Notes
1	<bos>	The latent vector encoded as “I ate an apple”	I	I	Prediction correct
2	I	The latent vector encoded as “I ate an apple”	like	ate	Prediction error
3	I like	The latent vector encoded as “I ate an apple”	play	an	Prediction error
4	I like play	The latent vector encoded as “I ate an apple”	football	apple	Prediction error
5	I like play football	The latent vector encoded as “I ate an apple”	<eos>	<eos>	The prediction was correct, but it was of little use.

Concept

To improve training efficiency, we need to use parallel methods to ensure that the predicted results of all words in a sequence are output in a single training session. To achieve this ideal training method, researchers proposed Teacher Forcing, a technique that can decode all outputs in parallel at once by inputting the entire target sequence into the decoder during training.

Specifically, Teacher Forcing involves using the ground truth of the training labels as input for each inference step instead of the output of the previous inference step. This mechanism ensures that the Transformer can output all words in parallel during training without loops, significantly accelerating the training process. The diagram below simplifies the input; in reality, the decoder’s input is a concatenation, not simply a single label.

[Figure 511]

Providing the target sequence to the decoder essentially gives the model correct guidance. Even if the previous word prediction is incorrect, at the next time step, it can use the correct first word (i.e., the ground truth) to predict the second word. This avoids the continuous accumulation of errors and ensures that supervised training for each inference starts from the correct input, thus allowing us to expect correct results. The “Teacher” in its name refers to the ground truth. Autoregressive models train “on their own,” while Teacher Forcing is like having a teacher guiding the training. Even if we calculate the wrong answer, the teacher provides the correct answer. We can identify at which stage the problem occurred, making it easy to analyze our errors and learn better and faster, in other words, training guided “by the standard answer.”

Note: The opposite of Teacher Forcing mode is free-running mode, which directly uses the output of the previous state as the input of the next state.

Example

Let’s assume we want to translate “我吃一个苹果” into “I ate an apple”. “我吃一个苹果” is the encoder input, and “I ate an apple” is the truth label. We’ll examine how the model uses Teacher Forcing to correct errors during training, prevent error accumulation, and thus improve training effectiveness.

First, the ground-truth labels form the target sequence, which serves as the decoder input. To build that input, we shift all target tokens one position to the right (Shifted Right) and place a start token in the leftmost position. In a plain autoregressive model, the input at the current step is the value the model output at the previous step, which may be wrong. With Teacher Forcing, the input at the current step is the ground-truth label of the previous step, which is always correct, so the decoder’s current prediction is always based on a correct prefix.

Second, let’s examine the prediction steps during decoding. If the model fed its own output “like” back in after the second prediction, later steps would drift away from the intended sentence, which slows learning and makes training unstable. In Teacher Forcing mode, the model’s output is not fed back: the erroneous “like” is discarded and “ate” is used as the next input. In other words, during training, regardless of the decoder’s current output, its next input is always the ground-truth value for that position. This corrects the statistical properties of the model during training and raises the probability of predicting later words correctly, so the model learns to generate correct sequences more quickly.

Time step	Decoder input 1	Decoder input 2	Decoder output	Ground truth	Notes
1		The latent vector encoded as “I ate an apple”	I	I	Prediction correct
2	I	The latent vector encoded as “I ate an apple”	like	ate	Prediction error, corrected with the ground truth
3	I ate	The latent vector encoded as “I ate an apple”	an	an	Prediction correct
4	I ate an	The latent vector encoded as “I ate an apple”	orange	apple	Prediction error, corrected with the ground truth
5	I ate an apple	The latent vector encoded as “I ate an apple”			Prediction correct

The specific correspondence is shown in the image below.

512

Image source: Anatomy of Transformer Part 2: Can you assemble a Transformer using attention mechanisms?

Principle

Essentially, Teacher Forcing removes the sequential dependency between prediction steps during training, decoupling the autoregressive chain. Because the decoder’s input is the ground-truth label, all steps can be predicted in parallel. Conceptually, we can copy the sentence “I ate an apple” five times to form a matrix in which each row is the input for one time step. We then feed this matrix to the decoder as a batch and use the GPU’s parallelism to run the five predictions at once, obtaining the results for all time steps. Finally, we compute the loss between each element of the output sequence and its corresponding label. This is why Transformer training can be performed in parallel.

In practice, we prepend the start token <bos> to the target sequence, which is equivalent to shifting the entire sequence one position to the right.

513

In terms of data construction, the training code first expands the target sentence to <bos>I ate an apple<eos>. The decoder input is this sequence without its last token, <bos>I ate an apple, and the label is the same sequence shifted by one position, I ate an apple<eos>. A batch is then built as follows:

<bos>I ate an apple
<bos>I ate an apple
<bos>I ate an apple
<bos>I ate an apple
<bos>I ate an apple

Finally, this batch is sent to the decoder.
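As a rough illustration of the data construction described above, the sketch below builds the decoder input and the labels from one tokenized target sentence. The helper name make_decoder_io and the whitespace tokenization are assumptions made for this example; they are not part of the original training code.

```python
def make_decoder_io(target_tokens, bos="<bos>", eos="<eos>"):
    """Return (decoder_input, labels) for one target sentence.

    Hypothetical helper, for illustration only.
    """
    full = [bos] + list(target_tokens) + [eos]
    decoder_input = full[:-1]  # <bos> I ate an apple
    labels = full[1:]          # I ate an apple <eos>
    return decoder_input, labels


decoder_input, labels = make_decoder_io("I ate an apple".split())
for step, label in enumerate(labels, start=1):
    # At each position the model reads the prefix of decoder_input
    # up to that position and is supervised with `label`.
    print(step, decoder_input[:step], "->", label)
```

Each printed row pairs the prefix visible at one time step with the label the decoder is expected to produce there.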

Mask

While the parallel processing described above computes the output for all time steps at once, it introduces a problem: when predicting a particular word, the attention mechanism can attend to later words in advance, so the model learns to cheat. For example, in the diagram above, the input for every prediction step is the entire sentence “I ate an apple.” When predicting the first output “I,” the model can attend to the words that follow <bos> in the target sequence and simply copy “I” to satisfy the objective. This kind of “peeking” teaches the model to be lazy rather than to learn the underlying pattern. In other words, Shifted Right alone does not achieve Teacher Forcing: unless the attention module is masked self-attention, the target leaks into the input.

Therefore, a masking mechanism is introduced to hide future information. Specifically, a mask is applied during the attention calculation. The mask is a matrix with the same shape as the attention score matrix, and it blocks part of that matrix so that the model sees only a prefix of the target sequence. When outputting the i-th element, the model cannot see the i-th and later elements of the target sequence; it can only use the elements before it. This cuts off any path to future information and prevents leakage (the corresponding attention weights are forced to zero), so training simulates the conditions of actual inference. In other words, the mask makes it possible to control the attention strength between each pair of elements individually.

During training, if the current decoder input is <bos> I ate an apple, the word “an” should attend only to itself and to the tokens before it (<bos>, “I,” and “ate”). The word “apple,” which is about to be predicted, has not yet appeared and must not be considered. The image below shows what each of the four words attends to.

514

The image below shows a Teacher Forcing example with a mask added.

515
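To make the masking step concrete, here is a minimal, framework-free sketch of a causal mask applied to attention scores. It uses plain Python lists instead of tensors; the function subsequent_mask is named after the helper that appears in the inference code later in this article, but this list-based version, masked_softmax, and the uniform toy scores are assumptions made for illustration.

```python
import math


def subsequent_mask(size):
    """Lower-triangular mask: row i may attend to columns 0..i only."""
    return [[j <= i for j in range(size)] for i in range(size)]


def masked_softmax(scores, mask):
    """Row-wise softmax in which blocked positions receive zero weight."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        # Blocked positions become -inf, so exp() maps them to 0.
        masked = [s if keep else float("-inf") for s, keep in zip(score_row, mask_row)]
        peak = max(masked)  # finite, because the diagonal is never blocked
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


tokens = ["<bos>", "I", "ate", "an", "apple"]
n = len(tokens)
scores = [[0.0] * n for _ in range(n)]  # uniform toy scores
attn = masked_softmax(scores, subsequent_mask(n))

for token, row in zip(tokens, attn):
    print(f"{token:>6}", [round(w, 2) for w in row])
```

With uniform scores, each row spreads its attention evenly over the current and earlier positions and assigns exactly zero weight to every later position, which is the behavior the mask is meant to enforce.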

Implementation

Teacher Forcing is relatively simple to implement: the real target sequence and its mask are passed to the decoder as input to guide the generation process.

for i, batch in enumerate(data_iter):
    out = model.forward(
        batch.src, batch.tgt, batch.src_mask, batch.tgt_mask
    )
    loss, loss_node = loss_compute(out, batch.tgt_y, batch.ntokens)

The loss computation flattens the outputs at all positions and computes the loss over them in a single pass.

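class SimpleLossCompute:
    "A simple loss compute function."

    def __init__(self, generator, criterion):
        self.generator = generator  # maps decoder output to vocabulary log-probabilities
        self.criterion = criterion  # loss function, e.g. LabelSmoothing

    def __call__(self, x, y, norm):
        # x: decoder output for all positions; y: labels (batch.tgt_y);
        # norm: number of tokens in the batch (batch.ntokens)
        x = self.generator(x)
        sloss = (
            self.criterion(
                x.contiguous().view(-1, x.size(-1)), y.contiguous().view(-1)
            )
            / norm
        )
        return sloss.data * norm, sloss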
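The flatten-and-normalize idea can be shown without any framework. The sketch below sums the token-level loss over all positions of a toy output and divides by the token count. It uses plain cross-entropy as a stand-in for the label-smoothed criterion, and the function name token_cross_entropy and the toy logits are assumptions made for this example.

```python
import math


def token_cross_entropy(logits, labels):
    """Sum of -log p(label) over all positions of one sequence."""
    total = 0.0
    for position_logits, label in zip(logits, labels):
        peak = max(position_logits)
        # log of the softmax normalizer, computed stably
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in position_logits))
        total += log_norm - position_logits[label]
    return total


# Toy decoder output: 3 positions, vocabulary of 4 tokens.
logits = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
labels = [0, 1, 3]
ntokens = len(labels)

# All positions contribute to one loss value, normalized by the token count.
loss_per_token = token_cross_entropy(logits, labels) / ntokens
print(loss_per_token)
```

Because every position is scored against its own label in the same pass, no decoding loop is needed to obtain the training loss.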

Advantages and disadvantages

The advantage of Teacher Forcing is that the model makes predictions guided by the “correct answer,” which greatly enhances training stability and significantly improves convergence speed. Furthermore, we can input the entire target sequence at once and output the complete target sequence in parallel, significantly improving training efficiency.

However, Teacher Forcing also has its problems. Because training can rely on the teacher, while inference relies on the student’s own efforts, erroneous outputs encountered during inference become out-of-distribution inputs for subsequent inferences. This leads to discrepancies between the training and prediction phases of the model trained using Teacher Forcing. This phenomenon, where the model performs worse in deployment due to differences in data distribution between training and inference, is called exposure bias. Furthermore, because the model’s generated results must correspond one-to-one with the reference sentence, this constraint reduces model divergence and speeds up convergence during training. However, it also stifles the possibility of translation diversity.

Researchers have therefore proposed improvements to address exposure bias. One of them is Scheduled Sampling, a curriculum-learning strategy that offers a compromise: since relying only on the model’s own predictions (pure autoregression) and relying only on the ground truth (pure Teacher Forcing) are both undesirable, the two are mixed according to a schedule. At each step of training, either the model’s output or the ground truth is chosen at random, with a certain probability, as the next input. This probability is adjusted as training progresses: training starts with pure Teacher Forcing, and the frequency of feeding the ground truth is gradually reduced. In other words, the student starts as a beginner who can only learn under the teacher’s guidance; as the student progresses, the teacher gradually lets the student learn independently.
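The probabilistic mixing described above can be sketched in a few lines. This is only an illustration: the linear decay in teacher_forcing_prob is one possible schedule chosen for simplicity, and both function names are hypothetical.

```python
import random


def teacher_forcing_prob(step, total_steps):
    """Linearly decay the probability of feeding the ground truth from 1 to 0."""
    return max(0.0, 1.0 - step / total_steps)


def choose_next_input(truth_token, predicted_token, p_truth, rng):
    """Feed the ground truth with probability p_truth, else the model's own output."""
    return truth_token if rng.random() < p_truth else predicted_token


rng = random.Random(0)  # fixed seed so the run is repeatable
for step in (0, 500, 1000):
    p = teacher_forcing_prob(step, total_steps=1000)
    next_input = choose_next_input("ate", "like", p, rng)
    print(step, p, next_input)
```

At step 0 the ground truth is always fed (pure Teacher Forcing); at the final step the model always consumes its own prediction, which matches the conditions it will face at inference time.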

Summary

The training process is a one-step operation because the ground truth is known. By combining Teacher Forcing and masking, the context vector at the i-th time step can be calculated from the vectors of the previous i time steps. This ensures that the prediction of the i-th word only uses the information from the previous i time steps, which improves computational efficiency and conforms to the inherent rules of language models.

1.7 Parallelism

Next, let’s look at the parallel mechanisms during training. In general, the parallelization of Transformer is mainly reflected in the training phase, especially in self-attention and FFN.

During the inference phase, because the ground truth is unknown, we only know the words predicted in the first i-1 time steps. Obviously, we can only use autoregressive iterative operations to predict all words, so multiple steps are needed to predict all words in the sequence, which is difficult to parallelize. Although parallelization in the inference phase on the decoder side is challenging, this problem can be alleviated to some extent through some advanced techniques and model variants.

Logical Dimension

Let’s start by looking at it from the perspective of seq2seq models. The encoder-decoder architecture is autoregressive: it predicts the output of this step using the token generated in the previous step and the input of this step. Let’s see how the Transformer improves upon this to achieve parallelism.

Encoder

The encoder natively supports parallelism. The entire architecture has been transformed from a sequence model into a fully connected graph model, where each word can be directly associated with the others. This makes matrix calculations much easier, allowing the encoder to benefit from the improved performance brought by parallel computing or GPU acceleration.

The diagram below illustrates the information flow during Transformer computation using the example “specialties of the North”. The receptive field of the attention mechanism is the entire sentence. When computing the features of any word, the Transformer uses information from all words; for example, the feature L(North) of “North” is a weighted sum over all words. This eliminates the notion of distance and avoids long-range dependency problems: the distance between any two words in the sequence is a fixed constant. Furthermore, each word in the input sequence flows into the encoder via its own separate path, so all words enter the encoder simultaneously without queuing, which enables parallel processing.

516

Decoder

In autoregressive mode, the decoder requires two types of latent vectors:

The encoded latent vector generated by the encoder.
The latent vector generated by the decoder during the decoding process, which is the output of the previous state.

For the first type of latent vector, the encoder can compute it all at once through parallel operations and pass it to the cross-attention. For the second type, Teacher Forcing mode breaks the sequential relationship between prediction steps and removes the autoregressive dependency, so the latent vector produced during decoding is no longer needed as input. Therefore, with the help of a mask, we can feed the full input and the ground-truth labels into the decoder at once to complete training in parallel.

Model Dimension

Here, the model refers to both the encoder and the decoder, since both can be parallelized during training. From the model’s perspective, the following dimensions can be parallelized.

Q, K, and V generation can be parallelized. The three projections with the weight matrices W_Q, W_K, and W_V are independent of each other, so they can be computed in parallel.
Parallelization of the self-attention mechanism. In the self-attention layer, the model calculates attention scores between words at all positions in the input sequence, and these calculations are independent of each other. Therefore, they can be executed in parallel on different processing units.
Parallelization of multi-head attention mechanisms. In multi-head attention mechanisms, different attention heads can be computed in parallel on different processing units.
Parallelization of FFN. FFN performs the same operation at each position of the input sequence, and these operations are independent. Therefore, they can also be executed in parallel on different processing units.

Self-attention

The key to parallelizing encoders lies in the self-attention mechanism, where computing z_i depends on all elements x_1, ..., x_n rather than relying on z_{i-1}. See the diagram below. Taking the token “ate” as an example, the self-attention mechanism uses the pairwise correlation between input elements as weights, and then performs a weighted summation on each input element x_i to map to semantic vectors z_i. z_i is a product that takes global dependencies into account; it doesn’t require strict sequential iteration and can be parallelized. Therefore, we can use matrix operations to process all z_i at once.

517
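The dependency structure described above can be written in matrix form. The equations below follow the standard scaled dot-product attention of “Attention Is All You Need”; the symbols $X$ (the inputs $x_1, \dots, x_n$ stacked as rows), $Z$ (the outputs $z_1, \dots, z_n$ stacked as rows), $\alpha_{ij}$, and $d_k$ are notation assumed here for illustration.

$$Q = XW_Q,\quad K = XW_K,\quad V = XW_V$$

$$\alpha_{ij} = \frac{\exp\left(q_i \cdot k_j / \sqrt{d_k}\right)}{\sum_{l=1}^{n} \exp\left(q_i \cdot k_l / \sqrt{d_k}\right)},\qquad z_i = \sum_{j=1}^{n} \alpha_{ij}\, v_j$$

$$Z = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Each $z_i$ is a function of $x_1, \dots, x_n$ only and never of $z_{i-1}$, so all rows of $Z$ can be produced by a single matrix computation.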

FFN

In the input sequence, each word flows into the encoder along its own separate path; that is, each word flows into the encoder simultaneously, not in a queue.

In the self-attention layer, these paths are interdependent, but FFN does not have these dependencies, so these paths can be computed in parallel as they flow through the FFN.

The same transformation matrix is used to compute each position x in the input sequence, and different parameters are used for each sub-layer. After computing Multi-Head Attention, the input matrix of the FFN layer is:

$X \in \mathbb{R}^{d_{input} \times d_{model}}$

It can be seen as the attention results (d_model columns) of each input position (d_input rows) stacked together. Each row undergoes the same linear transformation, which changes its dimension, and the rows are then stacked again to form the output of the FFN layer. The rows are completely independent of each other and are transformed identically, position by position. This is the meaning of “Position-wise” in Section 3.3, Position-wise Feed-Forward Networks, of “Attention Is All You Need”, as shown in the figure below.

518
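The position-wise property can be checked with a small, framework-free sketch. It applies FFN(x) = max(0, xW1 + b1)W2 + b2, the form given in “Attention Is All You Need”, to each row separately. The toy sizes and weight values are assumptions chosen so that the arithmetic is easy to follow by hand.

```python
def ffn(x, w1, b1, w2, b2):
    """Apply FFN(x) = max(0, x W1 + b1) W2 + b2 to one position vector x.

    w1 and w2 are stored as lists of per-output-unit weight vectors.
    """
    hidden = [
        max(0.0, sum(xi * wi for xi, wi in zip(x, unit)) + b)
        for unit, b in zip(w1, b1)
    ]
    return [
        sum(hi * wi for hi, wi in zip(hidden, unit)) + b
        for unit, b in zip(w2, b2)
    ]


# Toy sizes (assumed for illustration): d_model = 2, hidden size = 3.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, -1.0]
w2 = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
b2 = [0.0, 0.0]

X = [[1.0, 2.0], [-1.0, 0.5]]  # two positions, one row each
outputs = [ffn(row, w1, b1, w2, b2) for row in X]

# Reordering the rows only reorders the outputs: no row influences another.
reversed_outputs = [ffn(row, w1, b1, w2, b2) for row in reversed(X)]
print(outputs, reversed_outputs)
```

Since every row is transformed with the same weights and without reference to any other row, the per-row calls can run in any order, or all at once as a single matrix multiplication.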

Tensor Dimension

The tokens input to the Transformer have three dimensions: batch_size, sequence_length, and embedding_dim. The multi-head mechanism of attention computation further splits the embedding_dim dimension into head_num and head_dim, so from a computational perspective, there are a total of five dimensions:

batch_size
sequence_length
token
head_num
head_dim
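As a small illustration of how embedding_dim is split into head_num and head_dim, the sketch below cuts one embedding vector into equal chunks. The helper split_heads is hypothetical and works on a plain list; real implementations reshape whole tensors instead.

```python
def split_heads(vector, head_num):
    """Split one embedding vector into head_num chunks of head_dim values each."""
    if len(vector) % head_num != 0:
        raise ValueError("embedding_dim must be divisible by head_num")
    head_dim = len(vector) // head_num
    return [vector[h * head_dim:(h + 1) * head_dim] for h in range(head_num)]


embedding = list(range(8))           # embedding_dim = 8
heads = split_heads(embedding, 2)    # head_num = 2, head_dim = 4
print(heads)
```

Each chunk is processed by its own attention head, which is why the head_num dimension can be computed in parallel.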

The batch_size, head_num, and token dimensions themselves support parallelism. Recently, parallelism has also been explored at the sequence dimension, known as sequence parallelism. Sequence parallelism was first proposed in the paper “Sequence Parallelism: Long Sequence Training from System Perspective” to address the problem of excessive memory usage caused by long sequence lengths. We know that LLM inference mainly has two stages: prefill and decode. The former’s bottleneck lies in computation, while the latter’s lies in bandwidth. Prefill already involves splitting the sequence length for calculation and then summing the results; sequence parallelism performs this process in parallel. Specifically, it divides the input sequence into multiple blocks, with each block computed on a different GPU to reduce the memory requirements of long input sequences. To merge the computational results, the paper also proposes a ring self-attention (RSA) mechanism.

Another term is Context Parallelism, which first appeared in NVIDIA Megatron-Core. Context Parallelism mainly optimizes self-attention (Linear, LayerNorm). It splits the original input according to the sequence length dimension, distributes it to different devices, calculates it separately, and then integrates the results calculated on other devices through all-gather and reduce-scatter communication operations.

1.8 Code

Training methods

The train_model() function will choose whether to perform distributed training or single-machine training based on the configuration.

def train_distributed_model(vocab_src, vocab_tgt, spacy_de, spacy_en, config):
    from the_annotated_transformer import train_worker

    ngpus = torch.cuda.device_count()
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12356"
    print(f"Number of GPUs detected: {ngpus}")
    print("Spawning training processes ...")
    mp.spawn(
        train_worker,
        nprocs=ngpus,
        args=(ngpus, vocab_src, vocab_tgt, spacy_de, spacy_en, config, True),
    )

def train_model(vocab_src, vocab_tgt, spacy_de, spacy_en, config):
    if config["distributed"]:
        train_distributed_model( # distributed training
            vocab_src, vocab_tgt, spacy_de, spacy_en, config
        )
    else:
        train_worker( # single-machine training on GPU 0
            0, 1, vocab_src, vocab_tgt, spacy_de, spacy_en, config, False
        )

Single-machine training code

The single-machine training code is as follows. For each epoch, it calls run_epoch(), which iterates over the batches, calls forward(), and then calls loss_compute() to obtain the loss used for the gradient computation and parameter update. The inputs to loss_compute() are the model’s predicted output, the ground-truth label sequence batch.tgt_y, and the number of tokens in the batch.

def train_worker(
    gpu,
    ngpus_per_node,
    vocab_src, # source-language vocabulary
    vocab_tgt, # target-language vocabulary
    spacy_de, # source-language tokenizer
    spacy_en, # target-language tokenizer
    config,
    is_distributed=False,
):
    print(f"Train worker process using GPU: {gpu} for training", flush=True)
    torch.cuda.set_device(gpu)

    pad_idx = vocab_tgt["<pad>"] # index of "<pad>" in the target-language vocabulary
    d_model = 512 # embedding size
    model = make_model(len(vocab_src), len(vocab_tgt), N=6) # build a 6-layer model
    model.cuda(gpu)
    module = model
    is_main_process = True
    if is_distributed:
        dist.init_process_group(
            "nccl", init_method="env://", rank=gpu, world_size=ngpus_per_node
        )
        model = DDP(model, device_ids=[gpu])
        module = model.module
        is_main_process = gpu == 0

    # Build the loss function
    criterion = LabelSmoothing(
        size=len(vocab_tgt), padding_idx=pad_idx, smoothing=0.1
    )
    criterion.cuda(gpu)

    # Build the data loaders
    train_dataloader, valid_dataloader = create_dataloaders(
        gpu,
        vocab_src,
        vocab_tgt,
        spacy_de,
        spacy_en,
        batch_size=config["batch_size"] // ngpus_per_node,
        max_padding=config["max_padding"],
        is_distributed=is_distributed,
    )

    # Build the optimizer
    optimizer = torch.optim.Adam(
        model.parameters(), lr=config["base_lr"], betas=(0.9, 0.98), eps=1e-9
    )
    # Build the learning-rate schedule; the warmup steps come from the configuration
    lr_scheduler = LambdaLR(
        optimizer=optimizer,
        lr_lambda=lambda step: rate(
            step, d_model, factor=1, warmup=config["warmup"]
        ),
    )
    train_state = TrainState()

    for epoch in range(config["num_epochs"]):
        if is_distributed:
            train_dataloader.sampler.set_epoch(epoch)
            valid_dataloader.sampler.set_epoch(epoch)

        model.train()
        print(f"[GPU{gpu}] Epoch {epoch} Training ====", flush=True)
        _, train_state = run_epoch(
            (Batch(b[0], b[1], pad_idx) for b in train_dataloader),
            model,
            SimpleLossCompute(module.generator, criterion),
            optimizer,
            lr_scheduler,
            mode="train+log",
            accum_iter=config["accum_iter"],
            train_state=train_state,
        )

        GPUtil.showUtilization()
        if is_main_process:
            file_path = "%s%.2d.pt" % (config["file_prefix"], epoch)
            torch.save(module.state_dict(), file_path)
        torch.cuda.empty_cache()

        print(f"[GPU{gpu}] Epoch {epoch} Validation ====", flush=True)
        model.eval()
        sloss = run_epoch(
            (Batch(b[0], b[1], pad_idx) for b in valid_dataloader),
            model,
            SimpleLossCompute(module.generator, criterion),
            DummyOptimizer(),
            DummyScheduler(),
            mode="eval",
        )
        print(sloss)
        torch.cuda.empty_cache()

    if is_main_process:
        file_path = "%sfinal.pt" % config["file_prefix"]
        torch.save(module.state_dict(), file_path)

Overall code

The overall code is shown below, which includes training and using the trained model for inference.

def example_simple_model():
    V = 11
    criterion = LabelSmoothing(size=V, padding_idx=0, smoothing=0.0)
    model = make_model(V, V, N=2)

    optimizer = torch.optim.Adam(
        model.parameters(), lr=0.5, betas=(0.9, 0.98), eps=1e-9
    )
    lr_scheduler = LambdaLR(
        optimizer=optimizer,
        lr_lambda=lambda step: rate(
            step, model_size=model.src_embed[0].d_model, factor=1.0, warmup=400
        ),
    )

    batch_size = 80
    for epoch in range(20):
        model.train()
        run_epoch(
            data_gen(V, batch_size, 20),
            model,
            SimpleLossCompute(model.generator, criterion),
            optimizer,
            lr_scheduler,
            mode="train",
        )
        model.eval()
        run_epoch(
            data_gen(V, batch_size, 5),
            model,
            SimpleLossCompute(model.generator, criterion),
            DummyOptimizer(),
            DummyScheduler(),
            mode="eval",
        )[0]

    model.eval()
    src = torch.LongTensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
    max_len = src.shape[1]
    src_mask = torch.ones(1, 1, max_len)
    print(greedy_decode(model, src, src_mask, max_len=max_len, start_symbol=0))

0x02 Inference

When we talk about the inference process of Large Language Models (LLMs), we mean using a trained model to process input text and generate the corresponding output. Because the basic processes of training and inference are similar, this section focuses on the differences between the two and on what is unique to inference.

2.1 Input and Output

First, the input (the sentence to be translated) and execution process of the encoder module are the same, whether in training or inference.

Secondly, the training and prediction processes differ for the decoder.

The target inputs are different. Although both build the decoder input by appending new words to the previous input, the source of the new words differs.
- Inference uses the previous output to construct the next input. That is, tgt starts from <bos> and appends the output of the previous iteration each time.
- Training appends the ground-truth values to form the next input. That is, tgt starts from <bos> and appends the next ground-truth token each time, and the entire sequence is passed to the decoder at once.
The outputs are different.
- Inference: one new token is output at a time. Decoder parallelization exists only in the training phase. During inference, since the correct target sentence is not available, the input at time t necessarily depends on the output at time t-1, just as in earlier seq2seq models.
- Training: the Transformer outputs the probability distributions for all positions at once. The loss can be calculated directly from these distributions; it is not necessary to convert them back into tokens.

2.2 Process

During inference, there is no answer text, so prediction can only start from <bos>, and the decoder must produce its output sequentially in a loop. At each subsequent step, the sequence generated so far, including the word just predicted, is fed to the decoder again, until the end-of-sentence marker is produced. The difference from earlier Seq2Seq models is that at each time step we re-input the entire output sequence generated so far, not just the last word.

The logical flow of the inference process is as follows:

Time step	Decoder input 1	Decoder input 2	Decoder output
1		The latent vector encoded as “I ate an apple”	I
2	I	The latent vector encoded as “I ate an apple”	ate
3	I ate	The latent vector encoded as “I ate an apple”	an
4	I ate an	The latent vector encoded as “I ate an apple”	apple
5	I ate an apple	The latent vector encoded as “I ate an apple”

The corresponding logic diagram is as follows.

519
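The loop in the table above can be sketched with a toy stand-in for the model. In this illustration, toy_next_token is a hypothetical replacement for the decoder and generator that simply replays a fixed sentence, and toy_greedy_decode is a simplified loop written for this example; neither is the article’s actual decoding function.

```python
def toy_next_token(memory, prefix):
    """Hypothetical stand-in for decoder + generator: replays a fixed sentence."""
    reference = list(memory) + ["<eos>"]
    return reference[len(prefix) - 1]


def toy_greedy_decode(next_token_fn, memory, bos="<bos>", eos="<eos>", max_len=10):
    """Generate tokens one at a time, feeding the whole prefix back in each step."""
    ys = [bos]
    for _ in range(max_len):
        next_token = next_token_fn(memory, ys)  # sees everything generated so far
        ys.append(next_token)
        if next_token == eos:
            break
    return ys


memory = ["I", "ate", "an", "apple"]  # stands in for the encoder output
result = toy_greedy_decode(toy_next_token, memory)
print(result)
```

Unlike training, each iteration depends on the tokens produced by the previous iterations, so the steps cannot be merged into one parallel pass.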

Training simply adds supervision to the new element predicted at each step of the inference process above, as shown in the figure below.

Note: The diagram below is for illustrative purposes only; in reality, all inputs are processed simultaneously, and predictions are performed in parallel.

520

Finally, we summarize the differences in the training and inference processes as shown in the table below.

Step		Training	Inference
Input		Source language sequence + target language sequence (ground truth)	Source language sequence + target language sequence (predicted output)
1	Encoder processing	Generate the encoded representation of the entire source language sequence	Generate the encoded representation of the entire source language sequence
2	Decoder input processing	The target sequence is prefixed with a start-of-sentence token, converted into embeddings, and sent to the decoder.	The first time step uses a sequence containing only the start-of-sentence token instead of the target sequence. Subsequent time steps take the entire output sequence generated so far as input. The sequence is then converted into embeddings and fed into the decoder.
3	Decoder decoding	The decoder processes the target embedding together with the encoder’s encoded representation to generate a decoded representation of the target sequence.	The decoder processes the target embedding together with the encoder’s encoded representation to generate a decoded representation of the target sequence.
4	Decoder output processing	The output layer converts the decoded representation of the target sequence into word probabilities and the final output sequence.	The output layer converts the decoded representation of the target sequence into word probabilities and the final output sequence.
5	Calculate loss	The loss function compares the output sequence with the target sequence in the training data and calculates the loss.	None
6	Iteration	No iteration; everything is completed in one pass.	Iterate through steps 2-4, outputting one token at a time.

2.3 Code

Below is the inference test code from the Harvard NLP source code (The Annotated Transformer).

# ## Inference:
#
# > Here we make a forward step to generate a prediction of the
# model. We try to use our transformer to memorize the input. As you
# will see the output is randomly generated due to the fact that the
# model is not trained yet. In the next tutorial we will build the
# training function and try to train our model to memorize the numbers
# from 1 to 10.
def inference_test():
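    # Build the model: source and target vocabulary sizes are both 11,
    # with 2 EncoderLayers and 2 DecoderLayers.
    test_model = make_model(11, 11, 2)
    test_model.eval()
    # The input has shape (1, 10): one sentence of 10 words.
    src = torch.LongTensor([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
    # Source mask: every word is valid, there is no padding.
    src_mask = torch.ones(1, 1, 10)

    # Feed the input to the encoder and keep its output as memory.
    memory = test_model.encode(src, src_mask)
    # Initialize ys to [[0]] to hold the predictions; 0 stands for '<bos>'.
    ys = torch.zeros(1, 1).type_as(src)

    # Call the decoder in a loop to predict the next token. For example, when
    # translating “我爱你” into “I love you”, `ys` is (<bos>) in the first
    # iteration and the output is “I”; in the second iteration `ys` is (<bos>, I)
    # and the output is "love"; and so on, until the decoder outputs “<eos>” or
    # the maximum sentence length is reached.
    for i in range(9):
        # Pass the encoder output (memory) and all previous decoder outputs
        # to the decoder so that it predicts the next token.
        out = test_model.decode(
            # ys holds all previous decoder outputs
            memory, src_mask, ys, subsequent_mask(ys.size(1)).type_as(src.data)
        )
        # Send the decoder output to the generator; only the last position is used.
        # The number of tokens in tgt changes per step: (<bos>), then (<bos>, I), ...
        # so the length dimension of out, shaped (batch_size, length, d_model), changes too.
        prob = test_model.generator(out[:, -1])
        # Take the token with the highest score; its vocabulary index is the prediction.
        _, next_word = torch.max(prob, dim=1)
        # Extract the predicted index.
        next_word = next_word.data[0]
        # Append this prediction to the previous ones to form the next decoder input.
        ys = torch.cat(
            [ys, torch.empty(1, 1).type_as(src.data).fill_(next_word)], dim=1
        )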

    print("Example Untrained Model Prediction:", ys)

def run_tests():
    for _ in range(10):
        inference_test()

show_example(run_tests)

0xFF Reference

A Contrastive Framework for Neural Text Generation (Su et al., 2022)

A Survey on Efficient Inference for Large Language Models

Attention Is All You Need (Vaswani et al., 2017)

Breaking the Sequential Dependency of LLM Inference Using Lookahead Decoding (Fu et al., 2023)

ChatGPT is the First Truly General Artificial Intelligence:

Fast Inference from Transformers via Speculative Decoding (Leviathan et al., 2022)

https://arxiv.org/abs/1801.06146

https://arxiv.org/abs/1803.05407

https://arxiv.org/abs/2006.05987

https://github.com/1311440131/deep_blue_writings/tree/main/2021_9_18_%E5%BE%AE%E8%B0%83Transformer%E7%9A%84%E9%AB%98%E7%BA%A7%E6%8A%80%E6%B3%95

https://medium.com/@plienhar/llm-inference-series-1-introduction-9c78e56ef49d

https://medium.com/@plienhar/llm-inference-series-2-the-two-phase-process-behind-llms-responses-1ff1ff021cd5

https://pytorch.org/blog/pytorch-1.6-now-includes-stochastic-weight-averaging/

https://pytorch.org/docs/stable/optim.html#stochastic-weight-averaging

LaViT: This works too. Microsoft proposed directly using the attention weights of the previous layer to generate the attention weights of the current layer | CVPR 2024 VincentLee

LLM Inference Unveiled: Survey and Roofline Model Insights

Several Parallel Mechanisms of LLM You’ll Really

Give Up LoRA Dropout as a Sparsity Regularizer for Overfitting Control

nn.KLDivLoss_GuluGuluday’s Blog - CSDN Blog_kldivloss pytorch

On the Effectiveness of Parameter-Efficient Fine-Tuning

Pytorch: Implementation of CrossEntropy Loss and LabelSmoothing_I am Dahuang’s Blog - CSDN Blog_Label Smoothing CrossEntropy

The Illustrated Word2vec Jay Alammar

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

Rethinking the Inception Architecture for Computer Vision

A 10,000-word line-by-line analysis and implementation of Transformer, with German-to-English translation practice (Part 1)

A 10,000-word line-by-line analysis and implementation of Transformer, with German-to-English translation practice (Part 3)

A 10,000-word line-by-line analysis and implementation of Transformer, with German-to-English translation practice (Part 2) iioSnail

Fine-tuning large model parameters: Sparsity and Dropout Chongjie

Is Dropout still needed in the era of large models? An analysis of GLM4-9B-Chat LeonYi:

Implementing torch.nn.CrossEntropyLoss_Agwave in PyTorch - CSDN Blog

Anatomizing Transformer Part 2: Can you assemble a Transformer using an attention mechanism?

Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research (JMLR), 3:1137–1155, 2003. [PDF]