Exploring the Transformer Series (15) --- Sampling and Output

0x00 Overview

The decoder consists of many Transformer layers, each ending with an “Add & Norm” module. In other words, the last module of the encoder’s final layer is an “Add & Norm” module. The output of this module is a floating-point vector representing the semantics. Our current problem is: how to convert this floating-point vector into a word? This is what the sampling and output sections do. In short, during the prediction phase, the sampling and output sections perform the following three steps:

Calculating Probabilities. After the decoder layer outputs the results, the final prediction is performed by the Generator linear layer, which is a standard classification network. The Generator linear layer increases the dimensionality of the feature vector corresponding to the last token to the vocabulary dimension through a linear layer, and then normalizes the new vector using softmax, ultimately outputting a probability distribution (each probability corresponds to a token in the vocabulary). This distribution represents the probability that each word in the vocabulary matches this feature vector, or the probability that each token in the vocabulary will be the next token. This part is actually a classification network.
Sampling. Based on this probability distribution, sampling is guided to find the vocabulary index corresponding to the highest probability.
New words. Select the next token from the vocabulary list based on its index as the final output.

The diagram below illustrates the process described above: starting from the bottom with the output vector generated by the decoder component, it eventually transforms into an output word.

1501

0x01 Generator

Deep neural networks, represented by Transformers, are universal function approximators. Their function is to learn the probability distribution of information from the external world, compress or extract it, and construct an internal probability model. The output of the encoder-decoder is still a real-number vector, but this vector is a high-dimensional probability vector representing the complex relationships between things woven together from the Transformer’s perspective. We need to perform classification training on this vector to identify the next token from these complex relationships.

The generator then performs this classification function. The generator takes a sequence of word vectors as input and outputs the probability distribution of the word at each position. It mainly consists of two parts:

Linear: Expands the output to Vocabulary Size, or in other words, maps words to a dictionary. The input of Linear is the word vectors processed by the encoder-decoder (during inference, the Generator does not use all the output tokens of the decoder, but the last token), and the output is logits (log odds/vocabulary features).
Softmax: Softmax converts the input logits into probabilities, and the output is the probability distribution of the words in the vocabulary corresponding to the last token. The word with the highest probability will then be selected as the prediction result.

1502

1.1 Linear

The linear layer primarily functions to transform dimensions. The purpose of this step is to map the vectors generated by the decoder to a predefined dictionary size, thus preparing them for word prediction. Its relevant characteristics are as follows:

A linear layer (often called a language model head or LM head) is a simple fully connected neural network that projects the vector generated by the decoder onto a much larger vector called logits.
The linear layer can also be viewed as a Token Embeddings matrix, with the number of rows representing the number of tokens in the model vocabulary and the number of columns representing the dimension of the embeddings, meaning each token corresponds to an embedding vector. The vector generated by the decoder is multiplied by the Token Embeddings matrix (that is, the inner product is taken with the embedding vector corresponding to each token) to obtain the similarity score (logits) with each token in the vocabulary, generating a numerical list (i.e., logits) related to the probability of each word in the model vocabulary appearing as the next word.
The size of the logits vocab_size corresponds to the size of the vocabulary. Assuming our model vocabulary is 10,000 words, the dimension of the logits vector is also 10,000, with each word in the vocabulary corresponding to a logit in the logits vector. If the encoder output has a shape of (batch size, 100, 512), then taking the last token of each sequence and passing it through a linear layer of size [512, 10000] yields a list of logits with a shape of (batch size, 1, 10000).
The logits vector contains the probability that each word will be the next word in the sequence, or in other words, the score vector of the candidate tokens. Each dimension of the logits vector represents a word in the target language vocabulary, specifically corresponding to that word’s score or score weight, indicating the probability that a particular word is the “correct” next word. The higher the logit value, the greater the probability that the corresponding word is the “correct” word. Specifically, assuming we are predicting the word at position i, each word in the target vocabulary has a score at position i, representing the probability score of each word appearing at position i. The larger the value of a certain dimension in the vector, the greater the probability that this word is at position i. Therefore, the linear layer acts as a classification head (treating each word in the vocabulary as a category), only this classification head has a relatively large number of categories.
The next step is to identify which words have the highest probability of being generated from these words. Why only a few words? This is because we need to use a sampling algorithm similar to top_k to adjust the model’s performance. Otherwise, if you tell the model “I love you,” the model’s response will always be “I love you too.”

For example, given the context “I am sleepy. I start a pot of”, the following figure shows the probability distribution of each word in the vocabulary (arranged in descending order) when predicting the next token.

1503

The figure below shows the mathematical representation of the prediction head, where $W_U$ is the linear layer (unembedding matrix), sometimes with a bias. The last residual stream state is transformed through this linear mapping, converting the representation into the next token distribution based on logits, which is then transformed into a probability distribution through the softmax function.

1504

The following diagram decomposes the forward propagation process of the Transformer. The characteristics of the four terms in the equations shown in the diagram are as follows:

The first item is the direct path, which connects the input embedding and the unembedding matrix, corresponding to the red route on the far left of the top of the diagram.
The second and fourth items are called full OV circuits, which flow through a single OV matrix, corresponding to the yellow route at the top of the diagram.
The third item is called virtual attention heads. Because of the sequential reading and writing of the two attention heads in this part, it is also called V-composition.

1505

1.2 softmax

The logits output by linear layers are difficult to interpret. Therefore, we will convert the logits into probabilities using softmax, which scales the numbers in the last dimension of the vector to a probability range of 0-1, ensuring that the sum of these numbers is 1. This is especially important in multi-class classification problems because the model’s predictions can be interpreted as the probability of each class (the probability distribution of words at each position). Subsequent sampling will follow this probability distribution.

Note: The Generator returns the log value of softmax. Here, log_softmax is used instead of softmax, although the effect should be the same. However, log_softmax can solve overflow problems, speed up computation, and improve data stability.

1.3 Implementation

The blue circle in the first diagram of this chapter corresponds to the Generator class below. The Generator class includes the Linear layer and the Softmax layer. From an intuitive perspective,

The role of the linear layer is to map words to the dictionary.
The role of the Softmax layer is to select the word with the highest probability.

The constructor parameters for the Generator class are:

d_model: The size of the Decoder output, i.e., the dimension of the word vectors.
vocab: the size of a dictionary.

The specific code is as follows.

# nn.functional工具包装载了网络层中那些只进行计算, 而没有参数的层
import torch.nn.functional as F

# 定义一个基于 nn.Module 的生成器类，其将线性层和softmax计算层一起实现, 因为二者的共同目标是生成最后的结构，因此把此类的名字叫做Generator
class Generator(nn.Module):
    "Define standard linear + softmax generation step."

    # 初始化方法，接收模型维度（d_model）和词汇表大小（vocab）作为参数
    def __init__(self, d_model, vocab):
        """初始化函数的输入参数有两个, d_model代表词嵌入维度, vocab_size代表词表大小."""
        super(Generator, self).__init__()  # 调用 nn.Module 的初始化方法
        # 这个线性层的参数有两个, 就是初始化函数传进来的两个参数: d_model, vocab_size
        self.proj = nn.Linear(d_model, vocab) # 定义一个线性层，将向量从模型的输出维度映射到词汇表大小

    # 前向传播方法，输入x是Decoder的输出，x的形状是[1, d_model]，因为x是序列中最后一个token对应的向量
    def forward(self, x):
        # 将输入 x 传入线性层，然后对输出应用 log-softmax 激活函数（在最后一个维度上）

        # 在函数中, 首先使用self.proj对x在最后一个维度上进行线性变化,
        # 然后使用F中已经实现的log_softmax进行的softmax处理.
        # log_softmax就是对softmax的结果又取了对数, 因为对数函数是单调递增函数, 因此对最终我们取最大的概率值没有影响. 最后返回结果即可
        return log_softmax(self.proj(x), dim=-1)

1.4 Use

How do I use the Generator class? And how do I use the generation probability?

reasoning

During inference, we only need to take the tensor corresponding to the last token output by the Decoder and feed it to the Generator to obtain the probability distribution of a word. The following is the inference code.

def inference_test():
    test_model = make_model(11, 11, 2)
    test_model.eval()
    src = torch.LongTensor([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
    src_mask = torch.ones(1, 1, 10)

    memory = test_model.encode(src, src_mask)
    ys = torch.zeros(1, 1).type_as(src)

    for i in range(9):
        out = test_model.decode(
            memory, src_mask, ys, subsequent_mask(ys.size(1)).type_as(src.data)
        )
        prob = test_model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.data[0]
        ys = torch.cat(
            [ys, torch.empty(1, 1).type_as(src.data).fill_(next_word)], dim=1
        )

    print("Example Untrained Model Prediction:", ys)

torch.max(prob, dim=1) is actually the greedy decoding algorithm we’ll learn next. next_token = vocabulary[np.argmax(probs)] will retrieve the tokens from the vocabulary.

train

During training, all outputs from the Decoder are fed to the Generator. For each word in the output, a probability distribution is obtained. At each position, we first find the index of the word with the highest probability (greedy search), and then map that index to the corresponding word in the vocabulary. These words constitute the output sequence of the Transformer.

1506

The specific example code is as follows.

def example_simple_model():
    V = 11
    criterion = LabelSmoothing(size=V, padding_idx=0, smoothing=0.0)
    model = make_model(V, V, N=2)

    optimizer = torch.optim.Adam(
        model.parameters(), lr=0.5, betas=(0.9, 0.98), eps=1e-9
    )
    lr_scheduler = LambdaLR(
        optimizer=optimizer,
        lr_lambda=lambda step: rate(
            step, model_size=model.src_embed[0].d_model, factor=1.0, warmup=400
        ),
    )

    batch_size = 80
    for epoch in range(20):
        model.train()
        run_epoch(
            data_gen(V, batch_size, 20),
            model,
            SimpleLossCompute(model.generator, criterion), # 调用Generator类的实例
            optimizer,
            lr_scheduler,
            mode="train",
        )
        model.eval()
        run_epoch(
            data_gen(V, batch_size, 5),
            model,
            SimpleLossCompute(model.generator, criterion), # 调用Generator类的实例
            DummyOptimizer(),
            DummyScheduler(),
            mode="eval",
        )[0]

    model.eval()
    src = torch.LongTensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
    max_len = src.shape[1]
    src_mask = torch.ones(1, 1, max_len)
    print(greedy_decode(model, src, src_mask, max_len=max_len, start_symbol=0))

# execute_example(example_simple_model)

Specifically, the loss calculation calls model.generator for prediction. Assuming a batch size of 2 and a sequence length of 100, the code performs a softmax operation on the last dimension to obtain the probability distribution of b×100 words. During training, the true values of these b×100 words are known, so the loss function can be directly used for training.

class SimpleLossCompute:
    "A simple loss compute and train function."

    def __init__(self, generator, criterion):
        self.generator = generator
        self.criterion = criterion

    def __call__(self, x, y, norm):
        x = self.generator(x)
        sloss = (
            self.criterion(
                x.contiguous().view(-1, x.size(-1)), y.contiguous().view(-1)
            )
            / norm
        )
        return sloss.data * norm, sloss

0x02 Sampling

After obtaining the logits, the next step is to select the next token based on them. This process is called sampling. The process of extracting tokens using the probabilities generated from the logits is accomplished through a heuristic approach known as a sampling method, search strategy, generation strategy, or decoding strategy. Due to the sequential structure of language, tokens not only need to be appropriate in context but also need to flow naturally to create coherent sentences and paragraphs. Sampling methods help select tokens that follow language patterns and structures. Furthermore, sampling methods help strike a balance between deterministic output and creative, diverse responses.

The basic principle of all sampling methods is to set a probability threshold $p(t) \in [0,1]$ . Tokens with probabilities greater than this threshold will form the sampling nucleus, and their cumulative probabilities determine the quality of the nucleus, as shown in the figure below.

1507

Various sampling methods are available for different use cases. Regardless of the strategy, the final decoding result should satisfy the condition that, given the input text, the output text scores the highest among all candidate texts, which means that the joint probability of the words at each position in the output text is maximized. The torch.max(prob, dim=1) in the inference example above is actually greedy decoding. Choosing the most likely token each time is called greedy decoding. This is not always the best approach, as it can lead to suboptimal results. We will delve deeper into decoding strategies next.

2.1 Sampling Method

The output of an LLM model is a probability distribution over a vocabulary, and the sampling method directly determines the output effect. Sometimes we want completely deterministic results, and sometimes we want richer and more interesting results. Below, we introduce two main categories of sampling methods: deterministic sampling and probabilistic sampling.

Deterministic sampling

Deterministic sampling, as the name suggests, means that the output result is deterministic; it is essentially a search process. Common examples include Greedy Search and Beam Search.

Greedy decoding is an efficient method for obtaining an approximation of the predicted sequence. It selects the next token with the highest probability (most similar to the logit value) each time and discards other tokens until a termination condition is met.
Beam search selects the best few candidates (also known as the beam width) based on historical data at the current step, always choosing the best based on historical information. Beam Search is an improvement on Greedy Search, expanding the search space at each step by retaining the K best candidates, thus alleviating some of the problems of Greedy Search. K is the beam size, representing the beam width. Beam Size is a hyperparameter that determines the size of the search space; a larger beam size results in a search that is closer to the optimum, but also increases the search complexity. When the beam size equals 1, Beam Search degenerates into Greedy Search.

probabilistic sampling

Probabilistic sampling samples based on a probability distribution, randomly selecting the next word with conditional probability, thus having the potential to generate tokens with low probability. In this sampling method, the model’s logits are treated as a multinomial distribution, and sampling is then performed based on this distribution. In other words, probabilistic sampling selects a token from the vocabulary by sampling, and we can adjust this sampling distribution beforehand using simple operations such as temperature scaling, top-k, and top-p.

There are three common types of probabilistic sampling: Multinomial sampling (which performs pure random sampling based directly on the probability distribution, easily capturing words with extremely low probabilities), Top-k sampling, and Top-p sampling. Both top-p and top-k sampling can be used to increase the diversity of the model’s generated results. Top-p and top-k sampling can be used in combination to produce more fluent text than Greedy Search and Beam Search in open-ended language generation.

2.2 Greedy Decoding

Greedy search is the simplest decoding method. It selects the token with the highest conditional probability from the vocabulary V. At each step, it selects the token with the highest probability and adds it to the sequence. It continues until it encounters an end token or reaches the maximum sequence length.

Let’s explain greedy decoding from a mathematical perspective. In the sequence-to-sequence problem, given an input sequence $x=(x_1, ..., x_n)$ , we want to predict the corresponding output sequence $y=(y_1, ..., y_m)$ . A common approach to solving this problem is to learn an auto-regressive scoring model $p(y|x)$ . This model generates a portion of the answer (such as a token) from left to right.

Assuming the vocabulary space is K (i.e., all possible outputs), then each prediction will have K possibilities, resulting in $K^m$ possible answers, which is computationally expensive. To reduce computation, we can select only the output with the highest score as the result for the current step, and repeat the above process until the final result is obtained. This method is called greedy decoding, as shown in the figure below.

1508

The advantage of greedy decoding is its simplicity, making it a fast implementation for decoding. It is also well-suited for scenarios where model performance must be strictly aligned, i.e., where deterministic output is required. Because the model’s weights are deterministic during the inference phase, and there is no dropout or other randomness (ignoring uncontrollable hardware computation errors, such as the accumulated error of parallel reduction summation), the model’s output should be completely consistent after multiple runs of greedy decoding for the same input.

The problem with greedy decoding is as follows:

The model’s performance may not be optimal. This is because greedy decoding only guarantees a local optimum at each step (considering only the information at the current step) and doesn’t consider whether the joint probability of the output sequence reaches its maximum (lacking historical information). Therefore, it fails to achieve global optimum and ignores potential long-term benefits. Furthermore, if the model’s optimal output at time t is not correct, the model will be affected by this erroneous output at every time step starting from t+1, resulting in an erroneous cumulative effect. Ultimately, this causes the model to “go further astray.”
Greedy search can also lack diversity. It always chooses the most likely word and tends to favor frequently used phrases, leading to predictable results and monotonous output.
It takes m steps to generate an output of length m. As the model size increases, the latency of each step also increases, and the overall latency will be amplified by at least m times.
Each time a token is generated, all model parameters and activation tensors need to be transferred, which severely limits the decoding process to memory bandwidth.
Greedy search cannot correct its errors. Once it makes a suboptimal choice, every subsequent decision will be affected.

The code for the greedy decoding is as follows.

def greedy_decode(model, src, src_mask, max_len, start_symbol):
    """
    进行模型推理，推理出所有预测结果。
    :param model: Transformer模型，即EncoderDecoder类对象
    :param src: Encoder的输入inputs，Shape为(batch_size, 词数) ，例如：[[1, 2, 3, 4, 5, 6, 7, 8, 0, 0]]代表一个句子，该句子有10个词
    :param src_mask: src的掩码，掩盖住非句子成分。
    :param max_len: 一个句子的最大长度。
    :param start_symbol: '<bos>' 对应的index，在本例中始终为0
    :return: 预测结果，例如[[1, 2, 3, 4, 5, 6, 7, 8]]
    """
    memory = model.encode(src, src_mask) # 将src送入Transformer的Encoder，输出memory
    ys = torch.zeros(1, 1).fill_(start_symbol).type_as(src.data) # 初始化ys为[[0]]，用于保存预测结果，其中0表示'<bos>'
    # 循环调用decoder，一个个的进行预测。例如：假设我们要将“I love you”翻译成“我爱你”，则第一次的`ys`为(<bos>)，然后输出为“I”。第二次`ys`为(<bos>, I) ，输出为"love"，依次类推，直到decoder输出“<eos>”或达到句子最大长度。
    for i in range(max_len - 1):
        # 将encoder的输出memory和之前Decoder的所有输出作为参数，让Decoder来预测下一个token
        out = model.decode(
            memory, src_mask, ys, subsequent_mask(ys.size(1)).type_as(src.data)
        )
        # 将Decoder的输出送给generator进行预测。这里只取最后一个词的输出进行预测。
        # 因为传的tgt的词数是变化的，第一次是(<bos>)，第二次是(<bos>, I)
        # 所以out的维度也是变化的，变化的就是(batch_size, 词数，词向量)中词数这个维度
        prob = model.generator(out[:, -1]) # 只取出最后一个向量来预测，即 从 seq_len 维度选择最后一个词
        # 取出数值最大的那个，它的index在词典中对应的词就是预测结果
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.data[0] # 取出预测结果
        # ys就是Decoder之前的所有输出
        ys = torch.cat( # 将这一次的预测结果和之前的拼到一块，作为之后Decoder的输入
            [ys, torch.zeros(1, 1).type_as(src.data).fill_(next_word)], dim=1
        )
    return ys # 返回最终的预测结果
"""out        torch.Size([1, 2, 512])  (batch_size, seq_len, dimension)
   out[:, -1] torch.Size([1, 512])
   prob       torch.Size([1, 11])  (1, vocab_size)
"""

2.3 Beam Search

In LLM (Large Language Modeling) tasks such as machine translation, users typically expect to obtain the most suitable first n translation results. Beam Search can achieve this effect. The core idea of Beam Search is that although each step of the greedy algorithm may not be the optimal solution, the product of the next few generated tokens maximizes the probability. The following figure shows a comparison between Beam Search and other solutions.

1509

question

Beam Search aims to solve the problem of constructing an ideal output in the actual prediction process. From a probabilistic perspective, the globally optimal output sequence refers to the combination of words in the entire word space that maximizes the joint probability of the output sequence. Let’s first analyze two extreme approaches:

Exhaustive search takes a holistic view. It iterates through all possible combinations of a word sequence and selects the one with the highest probability, thus obtaining the globally optimal output. Exhaustive search focuses on the global optimum of the joint distribution representation, and can be described as “not concerned with the gains or losses of any single instance”.
A greedy search that “only looks at the present”. When making a prediction at step t, the greedy search uses the word with the highest probability obtained at step t-1 as input.

Both of these methods have serious problems. Exhaustive search is computationally too expensive to be practical, while greedy search is short-sighted and prone to error accumulation. Since both extreme methods have issues, we must resort to compromises, and beam search is a typical example.

Ideas

Beam Search is an improved version of Greedy Search. Instead of always retrieving the token with the highest score, it consistently retains beam_size sequences with the highest scores. In word prediction at step t, beam search neither uses all combinations of words like exhaustive search nor only the highest-scoring predicted word like greedy search. Instead, it uses the top k word predictions from the previous step as input for the current step. The parameter k here is called the beam width or beam size, and it determines the number of top-ranking candidate sequences retained at each step. Beam search uses a greedy algorithm within a window size, optimizing the log probability of each branch from 0 to time step t. The vanilla Transformer uses beam search, and in the Transformer, the beam width is set to 4.

In fact, beam search is similar to how humans think about problem-solving. Because natural language is full of ambiguity, we don’t have enough information at time t to make a choice. So, in daily communication, our brains don’t make choices initially, but retain the ambiguity of the language at time t. Once we’ve accumulated enough contextual information, we return to time t to make a decision (perform a deambiguation operation).

Similar to greedy search, although beam search preserves multiple sequences, the final output is still the sequence with the highest score. Therefore, for the same input, using beam search, running the model multiple times will still result in a fixed output.

1510

In Beam Search, the decoder branches off at time t due to different interpretations of the decoding result. In each branch, we use different word vectors as the decoder input at time t+1. We can then continue to “branch” the decoder, reasoning on different hypotheses and backtracking to that time in the future to make a choice. After all branches of the decoder have output, we can evaluate the sequence output by each branch and select the optimal sequence.

As shown in the diagram below, assuming beam_size is 2, meaning the two sequences with the highest scores are always retained. During decoding, Beam Search retains the two highest-probability output words at one time step. Then, it calculates the probability distribution of the word at the second position based on the first word, and then selects the two words with the highest probabilities at the second position. This process is repeated for the third and fourth positions, but in subsequent steps, the joint probability of the historical prediction sequence and the next candidate word is calculated. For example, the second step calculates the joint probability of “你腾”, “你空”, “你风”, “账房”, “帐儿”, “账本”, etc. This process continues until decoding reaches the end or exceeds the maximum decoding step size, ultimately forming two complete candidate sequences. The optimal decoding sequence is obtained by taking the sequence with the maximum joint probability score. Furthermore, the first token is selected from V, choosing the k tokens with the highest probabilities; the remaining tokens are selected from kV candidate tokens, choosing the k tokens with the highest probabilities.

Finally, four sequences are obtained. The two sequences with the highest probabilities are selected and retained. Since the highest-scoring sequence “The culprit is you, Tengkan Daren” has already generated the terminator token EOS, there will be no other sequences with higher scores, so we can terminate here.

1511

efficiency

Because beam search retains multiple sequences simultaneously, it is easier to obtain sequences with higher scores, and the larger the beam_size, the higher the probability of obtaining a higher score.

Compared to exhaustive search and greedy search, Beam Search strikes a balance between computational complexity and accuracy by limiting the number of candidates retained at each step, thus reducing the computational complexity of fully traversing the sample space. Specifically, when k = |V|, Beam Search becomes exhaustive search, where |V| is the dictionary size; while when k = 1, Beam Search degenerates into greedy search. From a data structure perspective, exhaustive search uses a |V|-ary tree structure, greedy search uses a linked list, and Beam Search uses a k-ary tree structure. Let the decoding step size be T, the vocabulary length be N, and the beam width be K. Then, the time complexity of Beam Search is O(K * N * T), because at each step size T, it requires K inferences of the entire vocabulary length N and sorting N.

Because each step requires beam_size forward computations, beam search’s computational cost is several times greater than that of greedy search. Furthermore, LLM inference typically uses key-value caches, which further increases their memory footprint and management complexity. Additionally, beam search cannot guarantee finding the most likely sequence, especially if the beam width k is too small compared to the vocabulary size. This is why beam search is rarely used in LLM inference.

Miscellaneous

We will now discuss Beam Search further.

punish

The goal of decoding is to obtain the word sequence with the highest joint probability P. This formula is a product of probabilities, and the value is a negative number. The longer the sequence, the lower the joint probability, and the closer this value is to 0. Therefore, to simplify calculations and avoid errors, in practice, the logarithmic summation of probabilities is used instead of a product of negative numbers. However, a new problem arises: when using log-likelihood as the scoring function, Beam Search typically favors shorter sequences. This is because log-likelihood is negative; as the length of the decoded text increases, the sequence score becomes increasingly negative. Therefore, the algorithm assigns higher scores to shorter text results, causing a more reasonable translation to be rejected by a less reasonable shorter text result due to the longer text.

Beam Search can address this issue by employing a text length-based penalty to prevent long texts from receiving negative scores. For example, an “n-gram penalty” technique can be used. This technique ensures that any given n-gram appears only once: an n-gram sequence is generated and added to the sequence, and if the n-gram already exists in the sequence, its probability is set to zero.

The paper “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation” proposes to combine log-likelihood, length normalization, and coverage penalty to construct a new scoring function to solve this problem, as shown in the formula below, where lp is length normalization and cp is coverage penalty. Coverage penalty is mainly used when using attention mechanisms. It allows the decoder to focus evenly on each token in the input sequence x, preventing some tokens from receiving excessive attention.

1512

stop

Beam Search has two types of stopping conditions for a single candidate sequence:

Candidate sequence decoding stop. Specifically, this refers to decoding a single candidate sequence of a single translated text to EOS. Stop if other candidate sequences have not yet been decoded. If this happens, it will not affect the search for other candidate sequences.
Stopping too early means the candidate sequence score is already lower than the current best sequence that has been decoded. Since the longer the decoding length, the lower the sequence score, stopping too early would be counterproductive; waiting until all candidates are decoded would be detrimental. This would result in a very large set of decoded results. Since many results with very low scores are unnecessary, there should be a situation where, upon decoding a certain word, it can be determined that there is no need to continue searching based on that word. Beam Search introduces an early stopping mechanism to achieve this effect. The early stopping mechanism compares the current score with the scores of the sequences that have already been fully decoded. If the current score is significantly lower than the score of the optimal path, early stopping is performed. The degree to which this “significantly lower” is controlled by multiplying the maximum score by α.

So far, the discussion has focused on the stopping conditions for a specific candidate sequence. When does the search for the entire text to be translated end? The answer is when there are zero available candidates in the previous step; at this point, the Beam Search for that sample ends. If many texts are input into the Transformer in a batch, all samples construct their own candidates in parallel. The Beam Search decoding for the entire batch of text stops when all samples have no available candidates.

optimization

There are many optimization schemes for Beam Search, and here’s one for you to learn. The authors of the paper “Best-First Beam Search” provide a general pseudocode for Beam Search, which includes four replaceable key components. Traditional Beam Search, Best-First Beam Search, and A* Beam Search can all be obtained by modifying the replaceable components of the pseudocode.

1513

Furthermore, during the inference process of autoregressive models, when using beam search, the TopK operation typically follows Softmax, and TopK does not require calculating all y_i values. This allows for a significant performance improvement.

TopK: Each element of the input vector needs to be read at least once. If safe Softmax and TopK are run separately, each input element requires 5 memory accesses; if online Softmax is used instead of safe Softmax (but still run separately), it requires 4 memory accesses. We can achieve this by combining Softmax and TopK, resulting in only one memory access per input vector element.

In the softmax algorithm, not only is the maximum value m and the normalization term d maintained while traversing the input vector, but the vectors of the TopK input values u and their indices p are also saved.

1514

2.4 top-k

As can be seen from the above introduction, whether it is greedy search or beam search, the output of the model is fixed for a fixed input, which is rather monotonous. In order to increase the diversity of the model output, people have proposed the top-k sampling strategy.

Unlike greedy search, which always selects the highest-scoring sample, top-k sampling chooses TopK the samples with the lowest probabilities as candidates. This means that at each step, K candidate samples are retained, and the token is selected from this limited pool. This approach, to some extent, guarantees global optimality. Because Top-K sampling uses the scores of the top-k samples as weights for random sampling to obtain the next token, it introduces randomness, allowing tokens with relatively high probabilities, but not the highest, to be selected. This also addresses the issue of model generation diversity, and the larger K is, the richer the diversity.

The characteristics of top-k are as follows:

Based on the output probability distribution of the next token, the top k tokens are selected in descending order. To avoid sampling tokens with excessively low probability, the number of sampling candidates, k, remains constant.
Re-normalize the probabilities of these k tokens and sample and output tokens according to the new distribution.
If top_k = 1, then the top k algorithm degenerates into a greedy decoding algorithm.

1515

Let’s take the example above as an example. As shown in the diagram above (assuming k = 3), each step of the top-k series can be divided into two steps:

Determine the candidate set: The newly generated embedding corresponding to the last token “Yes” is used to calculate the similarity score (logits). The logits are processed using softmax to obtain probabilities, with each word (or token) having a probability. Select the 3 tokens with the highest probabilities: ["You", "Heaven", "I"], with corresponding weights of [0.6, 0.15, 0.1].
Sampling from the candidate set: Random sampling is performed using this weight to obtain a new Token “day”.

One of the more challenging issues in top-k algorithm optimization is choosing K to guarantee global optimum. For a relatively uniform probability distribution (where there are many equally good options for the next word), such as each number having a probability of 1/n, K in the top-k algorithm should be a relatively large number. This is because, in terms of accuracy, each number is equal, and a larger K increases diversity. Conversely, if the probability distribution is extremely uneven, such as the highest probability being 0.95, then a smaller K needs to be chosen, because only the number with the highest probability guarantees accuracy. In other contexts, certain tokens dominate the probability distribution. A small K might lead to generic text, while a large K might include inappropriate word candidates.

2.5 top-p

Top-p sampling (also known as nucleus sampling) is similar to top-k sampling, but differs in how the candidate token set is selected. Top-k sampling selects the k words with the highest probabilities, while top-p sampling is not limited to a fixed number of tokens (K), but dynamically selects the set of tokens whose cumulative probability exceeds a preset threshold P (e.g., 0.9).

The paper “The Curious Case of Neural Text Generation” proposes top-p sampling. In top-k, sampling is always done from k tokens, but special cases inevitably arise, such as a token with a very high score while others have very low scores. In these cases, there is still a certain probability of sampling those tokens with very low scores, leading to a deterioration in the quality of the generated output. If k is variable, then tokens with very low scores can be filtered out. To balance the diversity and quality of the generated text, in top-p, when generating the next_token at each step, random sampling is performed from the set of tokens whose cumulative probability exceeds a threshold p. That is, the algorithm does not select the K most likely words, but rather the smallest set of words whose combination probability exceeds the threshold p.

The characteristics of top-p are as follows:

Sort the probabilities of all words generated by the model at the current time step in descending order.
Among the sorted words, selections are made from the highest probability to the lowest probability (while accumulating the probabilities from largest to smallest) until the sum of the probabilities of the selected words is greater than or equal to p. This smallest set of tokens is then selected, denoted as V_p. For example, if p = 0.9, the first few words are selected so that the sum of their probabilities is at least 0.9.
Random sampling is performed only from the set of tokens whose cumulative probability exceeds a certain threshold p, ignoring other low-probability tokens. This means the number of candidate tokens varies each time due to the different distributions of token scores. Because the number of sampled candidates changes dynamically, this avoids sampling tokens with excessively low probability.
Re-normalize the probability of these tokens and randomly sample and output tokens according to the new distribution.
The top-p sampling method can dynamically adjust the number of candidate words, avoiding the problems that may arise from a fixed number of candidate words.
The smaller the top-p, the more low-probability tokens are filtered out, the fewer options are available during sampling, and the less diverse the generated results.

1516

Using the example above, as shown in the diagram (assuming p = 0.85), top-p can be divided into two steps:

Determine the candidate set:
- The newly generated embedding corresponding to the last token “Yes” is used to calculate the similarity score (logits).
- The logits are processed using softmax to obtain probabilities, with each word (or token) having a probability.
- Arrange the probabilities in descending order and gradually accumulate the probabilities until the accumulated probability reaches a certain threshold, such as 0.85.
- Select the tokens with a cumulative score of over 0.85: ["I", "You", "Day"], with corresponding probabilities of [0.1, 0.6, 0.15].
Sampling from the candidate set: Renormalize the probabilities in the candidate set, and perform random sampling based on the normalized probabilities to obtain a new Token “天”.

We will choose Llama3’s code for study.

Top-p algorithm

def sample_top_p(probs, p):
    """
    对概率分布进行 top-p (nucleus) 采样
    Args:
        probs (torch.Tensor): 概率分布张量，形状是(batch_size, vocab_size).
        p (float): 用于 top-p 采样的概率阈值

    Returns:
        torch.Tensor: 采样后的 token 索引，形状是 (batch_size, 1)

    Note:
        Top-p 采样首先获取累积概率超过阈值p的最小 token 集合。然后据选定的 token 重新规范化概率分布.
        该方法之所以可以控制生成的随机性, 是因为通过设置阈值p就可以控制采样得到的 token 集合中小权重 token 的数量.
        - 当 p 趋近 1 时, 采样的集合中会有更多小权重的 token, 生成的文本更加随机.
        - 当 p 趋近 0 时, 仅有权重较大的 token 被采样, 生成的文本更加确定.
    """
    # 对概率进行降序排序. 降序是因为要按概率从大到小选择 token 集合.
    # 假如probs是torch.tensor([0.1,0.2,0.3,0.25,0.15])
    # probs_sort里面是排序后的概率，形状和probs相同：[0.3000, 0.2500, 0.2000, 0.1500, 0.1000]
    # probs_idx是排序后的索引，用于映射回原始词汇表
    probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True)
    # 计算累积概率. 这是为了后续快速做差分然后判断 token 是否在 top-p 集合中.
    # probs_sum里面是累积到当前的概率和[0.3000, 0.5500, 0.7500, 0.9000, 1.0000]
    probs_sum = torch.cumsum(probs_sort, dim=-1)
    # 创建一个掩码, 排除累积概率超过阈值 p 的部分, 所以需要减去当前概率判断是否已经超过阈值.
    # 假如p是0.8，mask是[False, False, False, False,  True]
    mask = probs_sum - probs_sort > p
    # 使用掩码将超过阈值的 tokens 概率设置为 0.
    # probs_sorts是[0.3000, 0.2500, 0.2000, 0.1500, 0.0000]
    probs_sort[mask] = 0.0
    # 对筛选后的概率重新做归一化，确保总和为1。div_方法将规范化后的概率分布保存在 probs_sort 中.
    # probs_sort是[0.3333, 0.2778, 0.2222, 0.1667, 0.0000]
    probs_sort.div_(probs_sort.sum(dim=-1, keepdim=True))
    # 从规范化的后的概率分布中采样一个 token.
    # 用多项式采样，得到排序后的索引。probs_sort中概率大的元素被采样的几率就大
    # torch.multinomial基于输入的概率权重进行采样
    next_token = torch.multinomial(probs_sort, num_samples=1)
    # 根据采样的索引probs_idx映射回原始词汇表索引
    # torch.gather函数按照给定的索引张量index，从输入张量中收集 (获取) 数据，并返回一个与索引张量形状一致的张量
    next_token = torch.gather(probs_idx, -1, next_token)
    # 返回采样得到的 token 索引.
    return next_token

How to call

@torch.inference_mode()
def generate(
   self,
   prompt_tokens: List[List[int]],
   max_gen_len: int,
   temperature: float = 0.6,
   top_p: float = 0.9,
   logprobs: bool = False,
   echo: bool = False,
) -> Tuple[List[List[int]], Optional[List[List[float]]]]:
   """
   Generate text sequences based on provided prompts using the language generation model.

   Args:
       prompt_tokens (List[List[int]]): List of tokenized prompts, where each prompt is represented as a list of integers.
       max_gen_len (int): Maximum length of the generated text sequence.
       temperature (float, optional): Temperature value for controlling randomness in sampling. Defaults to 0.6.
       top_p (float, optional): Top-p probability threshold for nucleus sampling. Defaults to 0.9.
       logprobs (bool, optional): Flag indicating whether to compute token log probabilities. Defaults to False.
       echo (bool, optional): Flag indicating whether to include prompt tokens in the generated output. Defaults to False.

   Returns:
       Tuple[List[List[int]], Optional[List[List[float]]]]: A tuple containing generated token sequences and, if logprobs is True, corresponding token log probabilities.

   Note:
       This method uses the provided prompts as a basis for generating text. It employs nucleus sampling to produce text with controlled randomness.
       If logprobs is True, token log probabilities are computed for each generated token.
   """
   params = self.model.params
   bsz = len(prompt_tokens)
   min_prompt_len = min(len(t) for t in prompt_tokens)
   max_prompt_len = max(len(t) for t in prompt_tokens)
   total_len = min(params.max_seq_len, max_gen_len + max_prompt_len)
   pad_id = self.tokenizer.pad_id
   tokens = torch.full((bsz, total_len), pad_id, dtype=torch.long, device="cuda")
   for k, t in enumerate(prompt_tokens):
       tokens[k, : len(t)] = torch.tensor(t, dtype=torch.long, device="cuda")
   if logprobs:
       token_logprobs = torch.zeros_like(tokens, dtype=torch.float)

   prev_pos = 0
   eos_reached = torch.tensor([False] * bsz, device="cuda")
   input_text_mask = tokens != pad_id
   if min_prompt_len == total_len:
       logits = self.model.forward(tokens, prev_pos)
       token_logprobs = -F.cross_entropy(
           input=logits.transpose(1, 2),
           target=tokens,
           reduction="none",
           ignore_index=pad_id,
       )

   stop_tokens = torch.tensor(list(self.tokenizer.stop_tokens))

   for cur_pos in range(min_prompt_len, total_len):
       logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
       if temperature > 0:
           probs = torch.softmax(logits[:, -1] / temperature, dim=-1)
           next_token = sample_top_p(probs, top_p)
       else:
           next_token = torch.argmax(logits[:, -1], dim=-1)

       next_token = next_token.reshape(-1)
       # only replace token if prompt has already been generated
       next_token = torch.where(
           input_text_mask[:, cur_pos], tokens[:, cur_pos], next_token
       )
       tokens[:, cur_pos] = next_token
       if logprobs:
           token_logprobs[:, prev_pos + 1 : cur_pos + 1] = -F.cross_entropy(
               input=logits.transpose(1, 2),
               target=tokens[:, prev_pos + 1 : cur_pos + 1],
               reduction="none",
               ignore_index=pad_id,
           )
       eos_reached |= (~input_text_mask[:, cur_pos]) & (
           torch.isin(next_token, stop_tokens)
       )
       prev_pos = cur_pos
       if all(eos_reached):
           break

   if logprobs:
       token_logprobs = token_logprobs.tolist()
   out_tokens, out_logprobs = [], []
   for i, toks in enumerate(tokens.tolist()):
       # cut to max gen len
       start = 0 if echo else len(prompt_tokens[i])
       toks = toks[start : len(prompt_tokens[i]) + max_gen_len]
       probs = None
       if logprobs:
           probs = token_logprobs[i][start : len(prompt_tokens[i]) + max_gen_len]
       # cut to after eos tok if any
       for stop_token in self.tokenizer.stop_tokens:
           try:
               eos_idx = toks.index(stop_token)
               toks = toks[:eos_idx]
               probs = probs[:eos_idx] if logprobs else None
           except ValueError:
               pass
       out_tokens.append(toks)
       out_logprobs.append(probs)
   return (out_tokens, out_logprobs if logprobs else None)

Top-p addresses the challenges faced by top-p methods. This dynamic approach offers greater flexibility because the number of candidate tokens can vary based on the generation context. By adjusting the threshold P, the model can control the number of tokens considered in each step, achieving a balance between diversity and coherence in the generated output. For a fixed p, more points can be selected in the most extreme uniform distribution; for extremely uneven distributions, top-p will only select the one or a few values with the highest probability, avoiding the introduction of incorrect tokens. Of course, how to choose p within top-p remains a concern. Decreasing p tends to reduce diversity, while increasing p tends to introduce more low-probability tokens. Regardless of the choice of p, it’s possible to include tokens with very low probabilities in the candidate set. To address this issue, a temperature adjustment mechanism has been introduced.

Additionally, while top_p may seem more elegant than top_k in theory, both methods work well in practice. top_p can also be used in conjunction with top_k (when combined, the token set is first restricted to K candidates, and then further narrowed down to tokens that meet the probability accumulation threshold P), which avoids tokens with very low scores while providing some room for dynamic selection.

2.6 Performance

Researchers compared the results of applying each sampling method individually and found that Top-K had the highest cost, followed by Top-P, while the cost of duplicate penalty was the lowest. It’s worth noting that the cost of duplicate penalty is lower than that of Top-K and Top-P sampling because the former two require sorting algorithms.

Furthermore, the additional computational overhead of sampling techniques becomes more significant when computation is limited, such as with high request rates, datasets with heavy decoding tasks, or large batch sizes. In such cases, the sampling overhead must be carefully considered.

When request rates are low, sampling causes almost no performance degradation. This is because as request rates increase, the workload becomes more computationally constrained. Higher request rates result in larger running batch sizes and higher operation density.

0x03 Sampling Parameters

Common sampling parameters are shown below. Different models may have different parameters and thresholds.

parameter	default value	meaning
top_p	0.95	The top-p probability threshold is used. If top_p is less than 1, the probability is accumulated from high to low until top_p is reached, and the top N words are selected as candidates.
top_k	50	The top K result words are retained as candidates.
repetition_penalty	1.0	The parameter for repeated penalties. 1.0 means no penalty.
temperature	1.0	Temperature values are used to control the randomness and creativity of generated text in a generative language model. A smaller temperature value means a higher probability of selecting the most likely word, and the same input is more likely to produce the same output. A larger temperature value, on the other hand, indicates a higher probability of other results.

3.1 temperature

concept

While Softmax can yield a distribution, it also has drawbacks: it easily amplifies/contracts the differences between internal elements (degenerating into a max/mean). That is, for vectors with similar numerical values, the probabilities can differ significantly. For example, predicting the answer to “The boy _ to the market.”, possible answers are [goes, go, went, comes]. Assuming the classifier’s output values are [38, 20, 40, 39], the Softmax result obtained using the above formula is [0.09, 0.00, 0.6, 0.24]. If sampling according to this distribution, there’s a 60% probability of “went,” but the answer to the blank could also be “goes” or “comes” depending on the context. The initial values of the classifier’s words are relatively close, but the Softmax value can significantly widen the gaps.

The parameter Temperature is used to solve this problem. It adjusts Softmax to make its distribution more consistent with our expectations, thereby controlling the credibility, randomness, creativity, and diversity of the LLM generation results. Why is it called temperature? We know that the higher the temperature, the more intense the Brownian motion; similarly, the higher the temperature, the more random the sampling results, and the greater the diversity of the generated content.

Mathematically, setting the temperature is a very simple operation; temperature is the temperature coefficient used when performing softmax processing on the probability distribution of each word. The model output logits simply needs to be divided by this coefficient temperature. The specific formula is shown in the figure below. z_i is the i-th logit.

$softmax(z_i/T)=\frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$

T amplifies the differences between logits. The higher the T setting, the more random the generated results and the smoother the output distribution. The smaller the T setting, the weaker the randomness and the steeper the output distribution.

T < 1: Lower temperatures sharpen the probability distribution, making the model more confident and deterministic, thus producing more predictable outputs. When T \to 0, the wealth gap is amplified to the extreme, causing the probability of the element with the highest value to approach 1, and the others to become 0. At this time, the information entropy is 0, and the effect of softmax is similar to that of argmax.
When T = 1, dividing logits by 1 has no effect on the softmax output, and the output distribution will be the same as the standard softmax output.
T > 1: Higher temperatures smooth the probability distribution, making the model more likely to choose tokens with lower probabilities. This can generate more creative and diverse text, but it also increases the risk of generating incoherent results. When T \to \infty, it minimizes the gap between rich and poor, making all output probabilities approach the same value, turning the distribution into a uniform distribution, which is completely random. At this point, the information entropy is at its maximum. Some people refer to “higher temperatures” as the model’s “creativity.”

The application code is shown below:

# logits 是LLM的推理输出, 形状为 [batch_size, seq_len, vocab_size]
# logits[:, -1]表示选择的是最后一个token（seq_len 维度的最后一项）对应的logits，形状为[batch_size, vocab_size]
probs = torch.softmax(logits[:, -1] / temperature, dim=-1)

We also need to do some special analysis on the behavior when the temperature is 0.

When the temperature is 0, the model performs greedy decoding: at each step, it always chooses the next tag with the highest probability, essentially predicting the argmax choice in the probability distribution. Theoretically, greedy decoding should eliminate randomness in the generation process. If the model and input are fixed, the generated tag sequence should be identical each time. However, setting the temperature to zero does not guarantee 100% determinism in practice. For example, if two or more next tag options have (nearly) the same highest probability, the model or decoding library might break the tie in an arbitrary way. This is rare, but it can happen. In this case, even with temperature = 0, the choice between tags can be nondeterministic. Furthermore, modern LLM architectures and hardware behavior introduce a degree of variability. For instance, the complexity of expert hybrid model architectures (capacity limitations and batch races can cause output inconsistencies), the subtleties of floating-point operations on parallel hardware, and other implementation details mean that even without any “randomness” parameters, two calls to the same prompt for the same model can occasionally diverge.

Dynamic temperature coefficient

The temperature parameters of the vanilla Transformer are static. Next, we will look at the idea of dynamically adjusting the temperature during the LLM decoding process.

KL-Divergence Guided Temperature Sampling

The paper “KL-Divergence Guided Temperature Sampling” proposes a method for adjusting temperature based on KL divergence. KL divergence is used to measure the statistical distance between two distributions. The paper’s starting point is to adjust the temperature T during the current decoding based on the correlation between the current token and the prompt. Specifically, the authors introduce an additional LLM for decoding, whose input does not include the prompt but also contains x_{<t} (the already decoded response prefix).

The two models run simultaneously, then the KL divergence of the original LLM logit and the LLM logit is calculated, and the temperature is calculated based on the KL divergence. The more similar the two distributions are, the closer the value is to 0, and vice versa. The physical meaning is: how important is the decoding of the next token for the prompt?

1517

However, this method has obvious drawbacks: it requires an additional LLM, which uses twice the GPU memory and computational resources, representing a significant overhead for large models. From an efficiency standpoint, it is essentially unacceptable.

Hot or Cold

The paper “Hot or Cold? Adaptive Temperature Sampling for Code Generation with Large Language Models” proposes a fairly simple method. In the code generation task, the paper categorizes tokens into two classes: challenging (high loss) and confident. The former is difficult for LLMs to predict, so a larger value is needed to allow the LLM to explore further. The figure below illustrates the rules for determining whether a token is challenging and adjusting temperature (T). This approach is relatively simple because it is entirely rule-based.

1518

EDT

The paper “EDT (Entropy-based Dynamic Temperature)” uses entropy to dynamically calculate temperature. The entropy of an “n-state” system is defined as follows, where p_i is the probability of the i-th event occurring.

$Entropy=-\sum_{i=1}^{n} p_i \log(p_i)$

The paper uses the entropy of logit to measure the confidence of the LLM in the next token.

A higher entropy indicates that the large model is less confident about the tokens generated in this round; in the extreme case, every token has the same logit value. When the large model itself is unsure which token to generate, we should increase the diversity of the system, that is, we should use a large temperature range for exploration.
A lower entropy indicates that the large model is more certain about which token should be generated in this round. When the large model is very confident about which token should be generated in this round, we should use a small temperature to make the system even more certain about its choice. Doing so can also solve the problem of some low-probability values being selected incorrectly.

The diagram below illustrates the EDT decoding process. In each decoding step, the system first obtains logits (➀) and generates the probability distribution (➁) for the next token. Then, it calculates the entropy (➂) of all tokens based on the initial probability distribution. Next, the model selects the temperature (➃) based on the entropy, obtains a new distribution (➄) based on the temperature, and samples the next token (➅).

1519

3.2 repetition_penalty

The repetition_penalty parameter is used to prevent the model from continuously outputting repeated results. The larger the repetition_penalty, the less likely repetition will occur; the smaller the repetition_penalty, the more likely repetition will occur.

Reason for the problem

The paper “Learning to Break the Loop: Analyzing and Mitigating Repetitions for Neural Text Generation” systematically studies why LLMs tend to generate repetitive sentences during greedy decoding and answers the following questions through quantitative research:

Why does sentence-level repetition occur?
Why does the model get stuck in a loop?
What kinds of sentences are more likely to be repeated?

The paper analyzes and points out that the self-reinforcement effect is the core problem leading to repetition. The reason the model tends to generate repetitive results may be that the probability of generating the previous sentence based on the maximization-based decoding algorithm is relatively high, and the model tends to further increase the probability of that sentence being repeated. Specifically, once the model generates a repeated sentence, the probability of that sentence appearing later will further increase because more repetitions share the same sentence-level context to support this replication operation. As a result, due to the self-reinforcement effect, the model gets stuck in this sentence-level repetition. Let’s take the example of using the input “I love orange. I love” to predict the next token.

Why does sentence-level repetition occur? Because the model has already seen the pattern “I love orange” in the previous context, it assigns a higher probability to Pθ('oranges'|'I love oranges . I love') than to Pθ('oranges'|'I love'). Therefore, the model may be overconfident in the repetition of previous context and has learned a “cheap” shortcut: directly copying the next token “orange”.
Why does the model get stuck in a repetitive loop? This can be understood from an attention perspective. During the first repetition, a certain token has a weight of w in the attention, and the representation of this token is similar across different positions in the sentence. As this token appears multiple times, it will appear with weight w multiple times in the attention weights, effectively increasing its weight. This makes it easier to repeat the token.
What kind of sentences are more likely to be repeated? Sentences with a high initial sentence probability (average token probability) have a stronger self-reinforcing effect, and sentences generated by the maximization-based decoding algorithm are more likely to be repeated (the generated sentences have higher initial likelihood). TP_0

Another question is: why does the larger the model, the less repetition? LLM predicts the next token, and the next token can be divided into several categories for discussion:

true tokens, correct tokens.
Context tokens, the tokens mentioned earlier.
random tokens/other, other tokens.

Ideally, the model’s prediction probabilities for these three tokens should be in the order: P(true tokens) > P(context tokens) > P(random tokens). The model maps a prefix to a representation space and decodes this representation space to predict the next token. This representation space should be able to distinguish between different semantics and decode the corresponding correct token. However, there is a poor case where the representation space is not discriminatory enough against different semantics, easily decoding the wrong token. A model with poor capability and small capacity, and a small representable space, will find it difficult to map the prefix to a better representation space and decode the correct token. This can easily lead to P(true tokens) < P(context tokens), causing the model to repeat itself.

Parameter principle

The repetition_penalty parameter was first proposed in the paper “CTRL: A Conditional Transformer Language Model for Controllable Generation” to address the problem of repetitive generation in language models. The idea is to record previously generated tokens, and when predicting the next token, artificially reduce the score of already generated tokens, decreasing their probability of being sampled again (i.e., penalizing tokens already selected in previous steps). This balances text coherence and increases the diversity of generation.

As shown in the figure below, it is implemented directly based on the softmax with temperature coefficient T mentioned above. Here, g represents the list of tokens that have already been generated. If a token is already in the list of generated tokens g, then its corresponding temperature coefficient T is multiplied by a coefficient θ, where θ is any value greater than 0.

θ = 1 indicates that no penalty is imposed.
Setting θ > 1 is equivalent to minimizing repetition. Users can reduce degradation while maintaining sentence coherence. However, higher penalties can lead to decreased output coherence because they may over-penalize tokens crucial to sentence structure.
θ < 1 means we want repetition to occur.
only for tokens that have already been generated θ.

1520

Let’s learn from the code of MNN.

int Llm::sample(VARP logits, const std::vector<int>& pre_ids, int offset, int size) {
    std::unordered_set<int> ids_set(pre_ids.begin(), pre_ids.end());
    auto scores = (float*)(logits->readMap<float>()) + offset;
    if (0 == size) {
        size = logits->getInfo()->size;
    }
    // repetition penalty
    const float repetition_penalty = 1.1;
    for (auto id : ids_set) {
        float score = scores[id];
        // 小于0的数会乘以repetition_penalty, 否则是除以repetition_penalty
        scores[id]  = score < 0 ? score * repetition_penalty : score / repetition_penalty;
    }
    // argmax
    float max_score = scores[0];
    int token_id = 0;
    for (int i = 1; i < size; i++) {
        float score = scores[i];
        if (score > max_score) {
            max_score = score;
            token_id  = i;
        }
    }
    return token_id;
}

In addition, there are other methods to control repetitive output: frequency penalty and presence penalty. Both impose a penalty by subtracting a certain value from the logits. Furthermore, the frequency penalty applies the penalty based on the number of repetitions, while the other methods only apply the penalty based on whether the token exists.

0x04 Logits Analysis

In Transformer, tokens are actually representations of vectors in different dimensions and semantic spaces. The thought process is essentially a movement from one semantic space to another. The final hidden layer of Transformer, logits, is the result of this thought process. We will delve deeper into the properties of logits and also examine some solutions based on logits.

4.1 Compressed Information

Let’s first look at why we can predict the next token by using only the embedding corresponding to the last token during inference.

Example

We input the first part of the text from “The Fourth Lacquer Screen of Judge Dee’s Cases” into the model:

Teng Kan, the magistrate of Muping County, stood blankly behind the door of his study. He felt dizzy and disoriented, his vision blurred, and he couldn’t see anything clearly. He closed his eyes, slowly raised his hands to press on his temples, and the intense headache gradually subsided, and the ringing in his ears stopped. It was already summer, and the yamen runners, having finished their afternoon break, were busy again. He heard a familiar voice coming from the backyard and thought, “It must be the steward bringing me tea.”

…

The old steward led Di Renjie into Teng Kan’s study. Teng Kan had changed into his old blue robe, which he wore during his leisure time, and a soft gauze headscarf. When he saw Di Renjie enter the room, he hurriedly bowed and offered him a seat. The old steward brought out a tea tray and then respectfully withdrew. This scene reminded Di Renjie of the first time they met here.

Teng Kan poured tea for Di Renjie, who suddenly noticed that the four lacquered screens were missing. Teng Kan gave a bitter smile and said, “I don’t want to see them again. Brother Di, I’ve moved the lacquered screens upstairs and locked them. You know, they bring back many painful memories for me.”

Di Renjie suddenly put down his teacup and said sternly, “Minister Teng, please don’t repeat this lie about the lacquer screen to me again! Once is enough!”

Teng Kan was taken aback, staring blankly at Di Gong’s expressionless face, and asked, “What do you mean by that, Brother Di?”

Let’s add a new sentence:

狄公冷冷地对滕侃说：“罪魁祸首就是”

If you input all the above text into the model for prediction, the model will most likely output: “you”.

analyze

The processing procedure is as follows.

First, the model processes all the preceding text of “The Fourth Lacquer Screen of Judge Dee’s Cases in the Tang Dynasty,” resulting in a processing result. This result is then aggregated into the vector “是”. Initially, the “是” vector was simply obtained from a lookup table during training. After training, the Transformer integrates all the key semantics of all the text into the last vector of the sequence, namely “是”.

Secondly, the model operates on the “is” vector to obtain the probability distribution of all tokens, which is the probability of each token appearing in the next word in the vocabulary. In this example, we choose the word “you” with the highest probability as the next word.

1521

4.2 Changes

The paper “How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States” points out that the changes in intermediate hidden states of a language model are actually a process of layer-by-layer feature allocation. That is, the model determines whether the input belongs to the “answerable” category in the early layers (after the first few layers, the model assigns features sufficient to be separated by a hyperplane to the hidden states), associates shallow guesses of sentiment in the intermediate layers, and finally refines the output into the corresponding format. Jailbreak disrupts the early activations, causing changes in the intermediate ‘guessing associations,’ ultimately leading to the model responding to harmful questions.

4.3 Preprocessing logits

Background and Motivation

Traditional sampling strategies face numerous challenges in inference tasks. In particular, while sampling methods can provide diverse outputs, they often perform poorly in tasks requiring precise inference. For example, probabilistic sampling methods (such as temperature scaling, nucleus sampling, top k sampling, and min p sampling) tend to prioritize output diversity and reducing redundancy over inference accuracy. These techniques often struggle to effectively filter out irrelevant tokens, resulting in a seemingly unavoidable trade-off between diversity and inference precision.

To challenge this traditional view, the paper “Top-nσ: Not All Logits Are You Need” proposes the top-nσ method, a novel sampling strategy that directly applies a statistical threshold to pre-softmax logits. This method assumes that logits naturally distribute into a region containing Gaussian noise and a distinct information region, a characteristic that allows for efficient token filtering without complex probability calculations. Compared to existing methods, the top-nσ method maintains a stable sampling space even at high temperatures, a characteristic that gives it a significant advantage in inference tasks.

The paper mainly explores three key questions:

How to interpret probability-based sampling methods from the logit space.
The basic characteristics of the logit distribution in large language models.
How to effectively distinguish between noise regions and information regions using these distributions.

Insight

While existing methods primarily focus on handling probability distributions, this paper argues that examining pre-softmax logits can reveal deeper insights into the model generation process. The figure below shows the logits distribution and descending probabilities of the LLaMA3-8B-Instruct on AQuA samples. In Figure a, the leading tokens (higher probabilities) correspond to the right-hand region of the logits distribution. The maximum logit is approximately 10σ higher than the distribution’s mean. An interesting observation is that the pre-softmax logit distribution exhibits a highly regular pattern, typically consisting of two distinct parts: a Gaussian-like distribution of the background markers and a set of prominent outliers. Although most markers follow a Gaussian distribution, the outlier tail dominates the probability mass (as shown in Figure b). We refer to these components as the noise region and the information region, respectively. Notably, the maximum logit deviates from the mean by more than 5σ (standard deviation), significantly exceeding typical statistical standards for outlier identification.

1522

Noise area

While logits do exhibit characteristics of a noise distribution, the logits of most tokens typically exhibit a Gaussian distribution. When the probabilities corresponding to these logits are generally considered negligible, they are categorized as noisy regions. The characteristics of this noisy region align with statistical intuition: a Gaussian distribution generally indicates the presence of random noise in the system.

As shown in the figure above, when the boundary between the noise region and the information region becomes narrower, the probabilistic interference caused by noise affects the quality of model generation, especially in high-temperature sampling scenarios. Excessively high temperatures cause the performance of current nondeterministic sampling algorithms to degrade, mainly because the noise distribution dominates the probability space.

Information area

Conversely, the region where only a small number of tokens occupy the majority of the probability set is called the information region. As shown in the diagram above, although the number of tokens in this region is limited, it carries a great deal of information.

This phenomenon can be exploited by the min-λ sampling method to improve generation quality. This method sets a baseline probability threshold λ and eliminates all probability values below p_max · p.

Core idea

Determine the boundary

To effectively distinguish between informative and noisy regions, this paper proposes that informative regions can be considered as outliers in the noise distribution. However, based on empirical observations, traditional methods based on the α + 3α rule are not suitable for this task. Therefore, this paper proposes the concept of α-distance, defined as the standard deviation interval between the maximum probability and the distribution mean, as shown below:

$\sigma\text{-distance}=\frac{M-\mu}{\sigma}$

Through analysis, the authors found that the distance between the maximum probability and the mean was consistently greater than 10k, and exhibited significant fluctuations during the generation process. This finding suggests that information tokens should not be considered outliers of noise tokens. Instead, a higher k-distance indicates strong confidence in the model’s generation results.

In summary, we should eliminate tokens in noisy regions and retain tokens in distinct information regions. To this end, the top-bearing algorithm proposed in this paper expands downwards from the highest probability and uses standard deviation to dynamically adjust the boundary to effectively distinguish between information tokens and noisy tokens.

algorithm

The core idea of the top-nσ algorithm is to start with the maximum value of logits and combine it with the standard deviation to determine the boundary, thereby effectively distinguishing signal tokens from noise tokens. The specific steps are as follows:

Generate logits: Given an input sequence, the model first generates a logit vector l = (l_1, l_2, …, l_V) ∈ R^V, where (V) is the vocabulary size.
Calculate the maximum value and standard deviation.
Determine the filtering threshold: The algorithm expands downwards from the maximum value by nσ, where the parameter n is usually taken as 1.0. Therefore, the threshold is: [M - nσ].
Filter candidate tokens: Select all tokens that meet the following condition: [l'_i \geq M - n\sigma]. This condition ensures that only tokens with high information content are selected based on the statistical characteristics of logits.
Sampling: Randomly select from the filtered tokens and generate text.

1523

4.2 Implicit Thinking Chains

Next, let’s look at the approach of allowing large models to think in the latent space.

pause tokens/Filler Token

The paper “Think before you speak: Training language models with pause tokens” proposes a unique method for training language models by introducing pause tokens. The core idea is to simulate the pauses humans make during the thinking process, allowing the model a “thinking” gap before generating an answer, thereby improving the quality and logic of the response. This pause token approach also has the advantage of eliminating the need to spend time generating a CoT during reasoning; simply input the question and then concatenate a string of pause tokens to improve the accuracy of the answer.

The diagram below illustrates this approach: inserting a Pause Token during the Pretraining/SFT phase and also during inference.

1524

The paper “Let’s think dot by dot: Hidden computation in transformer language models” discovered an interesting phenomenon: some meaningless filler tokens can significantly improve the performance of LLM, much like pause tokens. However, for a model to learn to use filler tokens, specific, intensive supervised training is required; models trained on standard CoT cannot utilize filler tokens.

CoT

The paper “Chain-of-Thought Reasoning without Prompting” found that LLMs can demonstrate reasoning ability without any prompts, simply by changing the decoding process.

The figure below illustrates this phenomenon: given a reasoning problem, LLM generates incorrect answers through the standard greedy decoding path, while CoT decoding achieves better results by explicitly encouraging diversity in the first decoding step.

LLM does indeed face inference difficulties when relying solely on greedy decoding paths. The poor sampling results stem from the model’s strong tendency to directly provide the answer during decoding, resulting in low diversity for the first token. The CoT decoding model, however, uses a different workflow. Instead of directly selecting the best answer (without greedy decoding), it chooses from a list of best answers. Specifically, CoT decoding uses the top k tokens (top-k) in the first step and continues with greedy decoding in subsequent steps. This approach gives the model greater confidence in the final answer. The CoT inference pattern naturally emerges in the LLM’s decoding trajectory when considering alternative paths among the top k tokens. Decoding paths 2 and 4 in the diagram below represent more accurate answers. This decoding modification skips the hint process, is implemented entirely unsupervised, and requires no model fine-tuning.

1525

Workflow

Let’s see how CoT decoding works:

Input format: First, we need to use the standard question-and-answer (QA) format to input: Q: [Question]\nA:, where [Question] is the actual question.
Decoding process: Instead of using only greedy decoding, we need to explore the first k candidate tokens at the first decoding position. By default, this paper uses k = 10.
Path exploration: After considering the first k tokens at the first position, we can continue greedily decoding the rest of the sequence. This creates multiple potential response paths.
Path Selection: Based on this confidence pattern, we can develop a method to filter the top k paths and select the most reliable output. We propose a weighted aggregation method, which selects the answer that maximizes \tilde{\Delta}_a = \sum_k \Delta_{k,a}. Here \Delta_{k,a} is the path with the answer X in the X-th decoding path. We find that this method enhances the stability of the results.

Branching occurs in other decoding steps.

Another natural question is whether branching can occur in later decoding stages, rather than only in the first decoding step. In the figure below, the paper highlights the impact of considering alternative terms in subsequent decoding steps. It is clear that early branching (e.g., in the first decoding step) significantly enhances the diversity of potential paths. Conversely, later branching is significantly influenced by previously generated terms. For example, if decoding starts with the term “5,” the likelihood of correcting incorrect paths is greatly reduced. However, the optimal branching point may vary depending on the task; in year parity tasks, mid-course branching can effectively generate correct CoT paths.

1526

coconut

The paper “Training Large Language Models to Reason in a Continuous Latent Space” introduces Coconut. Coconut involves a simple modification to the traditional Chain of Thought (CoT) process: instead of mapping hidden states to language tokens through a language model head and embedding layers, Coconut directly embeds the last hidden state (i.e., continuous thought) as the input to the next token (as shown in the diagram below) into the LLM. This solves the problem of poor generalization when using Pause Tokens/Filler Tokens.

The image below compares the Coconut Mind Chain (Coconut) and the CoT Mind Chain (CoT).

In CoT, the model runs as a standard language model, using autoregression to generate the next token. The model generates a sequence of word tokens for the inference process. In other words, CoT outputs the intermediate inference steps in plain natural language.
In CoconUT, the model uses the last hidden state as a representation of the current inference state (called “continuous thinking”) and embeds it as the next input. That is, the CoconUT method allows the model to perform intermediate inference steps within the “hidden layer” or “latent vector” without having to convert them into visible linguistic symbols. This approach frees the model from the constraints of language generation, allowing for more flexible and efficient thinking.

1527

This paradigm leads to an efficient reasoning pattern, as detailed below.

This liberates reasoning from the linguistic space, and because continuous thought is fully differentiable, the system can be optimized end-to-end using gradient descent.
It supports the potential for parallel exploration (similar to breadth-first search). Unlike language-based reasoning, inference in the latent space, Coconut’s continuous thinking can encode multiple potential next steps simultaneously, achieving a reasoning process similar to BFS (breadth-first search). The authors emphasize that the model doesn’t necessarily make a “hard decision” in the first step, but rather retains multiple possible reasoning paths in the hidden layers simultaneously. As the reasoning process unfolds, it “eliminates” some seemingly unreasonable candidate paths, retaining more feasible branches. This approach differs from ordinary autoregressive generation, which typically makes “irreversible” decisions step by step.
Compared to the input token, the input hidden states are not affected by the information loss/error accumulation caused by the argmax operation.

Specifically, the training process gradually replaces Discrete Tokens in CoT with Continuous Tokens. This is a multi-stage training process (inspired by the ideas of “progressive” or “segmented” approaches). This process begins with explicit linguistic reasoning, gradually guiding the model to “internalize” the reasoning steps into the continuous vector space. Furthermore, the Next-Token-Prediction Loss on Continuous Tokens is masked during training, ensuring that Continuous Tokens are not merely a compression of the original CoT. To enhance the training of latent reasoning, the paper employs a multi-stage training strategy that effectively utilizes the linguistic reasoning chain to guide the training process.

4.3 Entropy-based sampling

When trying to reduce model illusions, engineers might set sampling hyperparameters like temperature to 0. However, this doesn’t necessarily increase the probability of the model not producing illusionary outputs; it merely forces it to assign a higher probability to a word relative to other words. Let’s look at some entropy-based sampling or decoding approaches.

SED

An extreme case in token generation is where all tokens have very low probabilities, and the probabilities of the top-k tokens are almost identical. EDT argues that in this situation, choosing any token is acceptable, and thus increases the temperature to enhance the diversity of the LLM. Conversely, the paper “SED: Self-Evaluation Decoding Enhances Large Language Models for Better Generation” argues that in this case, incorrect answers can be generated without careful attention. This error cannot be resolved using methods like beam search, and the PPL metric is also low. To address this issue, new methods are needed to identify and correct this situation.

The paper refers to the situation where the token probabilities are similar as a “chaotic point,” and the figure below illustrates two methods for determining this.

1528

Once we identify that the current step is a chaotic point, we must take corrective measures. The specific process is shown in the figure below.

1529

Combining the two yields the complete processing algorithm:

1530

entropix

Is it necessary to generate a CoT for every token? Obviously not. For example, if the current token is “Apple”, the probability of the next token being “Fruit” is extremely high, requiring almost no thought. Is there a simple way to determine which tokens to insert a CoT for? Entropix has provided its own approach.

Entropy: In information theory, entropy quantifies the uncertainty and randomness of a probability distribution. For language models, entropy measures the uncertainty of predicting the probability distribution of the next word. Let’s look at the implications of entropy for models.
- Low entropy indicates that the model is very certain in predicting the next word.
- Entropy spikes occur when a model faces multiple equally probable next-word options. These spikes indicate the challenges of high uncertainty.
- Stable behavior at moderate entropy levels. When the entropy value is in the moderate range of 1 to 2.5, the model’s behavior is more stable and predictable. This stable state helps the model generate creative and coherent outputs.
- Entropy collapse is a phenomenon where a model’s output becomes highly certain when its uncertainty is significantly reduced. While this can lead to confident and accurate predictions, it can also cause the model to produce overconfident but incorrect responses.
Variance entropy: Variance entropy is the variance of entropy, measuring how the uncertainty of a model varies across different words or model layers. Varentropy measures the variance of the information content of different possible next words. Unlike entropy, which measures average uncertainty, varentropy captures the variation in this uncertainty, helping the model adapt to complex situations. When both entropy and varentropy are high, the model can identify complex situations and adjust its sampling strategy, generating deeper inference steps.

Core idea

The core idea of Entropix is to quantify the uncertainty of the model by using entropy and variance entropy to guide the sampling strategy during the decoding process. When the model is confident (low entropy and low variance entropy), normal sampling continues; when the model is uncertain (high entropy and/or high variance entropy), different words or inference paths are explored.

This adaptive approach simulates the thought process by adjusting the sampling strategy, allowing the model to dynamically adjust in more complex scenarios, enabling the model to “think” more, improving its decision-making ability, and potentially producing more accurate and coherent outputs.

Before decoding each token, the Entropix authors calculate the logit’s entropy and varentropy (variance of entropy), providing a measure of uncertainty across different tokens. Low varentropy means the model’s uncertainty is constant across tokens, while high varentropy indicates significant uncertainty across tokens. The model then uses the entropy to determine whether to insert CoT and adjust the temperature coefficient. Specific rules are illustrated below:

With small entropy and entropy variance in logit, LLM was confident and proceeded with greedy decoding using conventional methods.
Logit has high entropy and low entropy variance, so LLM is not very confident. By inserting CoT and adjusting it with the entropy of attention to get a larger T, it will explore alternative labels or inference paths.

In large models, parameter sharing techniques allow the same set of weights to be reused in different parts of the network. This not only helps reduce the number of parameters but also improves model efficiency while maintaining performance. A common approach is head sharing in embedding-lm, where the word embedding layer shares the same weights with the head layer of the final language model. Another example is cross-layer parameter sharing (Cross-layer attention/FFN), which uses the same weights across multiple layers of the model. This sharing technique can be seen in models such as Gemma and Qwen, where the inter-layer parameter sharing mechanism effectively prevents the number of parameters from increasing with network depth, significantly improving the training and inference efficiency of the model.

5.1 vanilla Transformer

The vanilla Transformer shares weights in two places:

Weight sharing of the embedding layer between the Encoder and Decoder;
The embedding layer and the fully connected (FC) layer share weights in the decoder.

The original Transformer paper said

In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to (cite). In the embedding layers, we multiply those weights by \sqrt{d_{model}}.

The Harvard author wrote that

Shared Embeddings: When using BPE with shared vocabulary we can share the same weight vectors between the source / target / generator. See the (cite) for details. To add this to the model simply do this:

The specific code example is as follows, where weight is the word embedding matrix of shape [vocab, d_model].

if False:
    # Encoder和Decoder间的Embedding层权重共享
    model.src_embed[0].lut.weight = model.tgt_embeddings[0].lut.weight
    # Decoder中Embedding层和FC层权重共享
    model.generator.lut.weight = model.tgt_embed[0].lut.weight

5.2 Shared vocabulary weights

principle

The reason for sharing word embedding weights is that in machine translation tasks, if the source and target languages are very similar (for example, English and German both belong to the Germanic language family and share many common roots or subwords), they can share a single word embedding matrix. Furthermore, for both the encoder and decoder, only the embeddings of the corresponding language are activated during embedding, so a single word matrix can be shared for weight sharing. It’s even possible to share the encoder’s input embedding, the decoder’s input embedding, and the decoder’s output embedding. This allows for better representation of words that appear in both languages (such as numbers, punctuation marks, etc.). However, for language families that differ significantly, such as English and Chinese, there is absolutely no need to share a word embedding matrix.

However, sharing a vocabulary increases the size of the vocabulary and the computation time of softmax. Therefore, whether to share a vocabulary in practice may need to be weighed on a case-by-case basis.

Historical basis

Let’s look at a reference for vanilla Transformers that allow weight sharing. The paper “Using the Output Embedding to Improve Language Model” is an early work on shared vocabulary. This paper binds the model’s input and output word vectors together, sharing a single word vector space. The same word is combined into a single word vector in both the input and output vectors. In a common neural network language model, the vector flow is as follows:

The current input word is first represented as a vector c ∈ R^C.
The word embedding matrix U is used to project c into a dense representation.
Perform some calculations on the word vector U^T c to get h_2.
Use a second matrix V to project h_2 onto vector h_3, i.e., h_3 = V h_2. Vector h_3 contains many scores, with each word in the vocabulary corresponding to one of the scores.
The softmax function is used to convert the score vector into a probability vector p, which represents the model’s prediction of the next word.

Based on the above process, the paper presents the derivation. The paper refers to U as the input embedding and V as the output embedding. After model training, U is typically used as the pre-trained word vector for other upstream models, while V is ignored. The paper also compares the quality of the input and output word vectors, and its main results are as follows:

In the Word2Vec Skipgram model, the output word vectors perform worse than the input word vectors.
In RNN-based language models, the output word vectors outperform the input word vectors.
By combining two word vectors, i.e., forcing U = V, the combined word vector is more similar to the output word vector in an untied model than the input word vector in an untied model.
Combining input and output word vectors can improve the perplexity of various language models, regardless of whether dropout is used.
When not using dropout, it is recommended to add an extra projection P before V and apply regularization to P.
Weight tying in neural translation models can reduce the model size (number of parameters) to less than half its original size without affecting performance.

The relevant information is shown in the figure below. The top of the figure is an example from “Using the Output Embedding to Improve Language Model”, and the bottom shows the comparison of subwords between English and French, and between English and German.

1531

The practice of reusing embedding weights at the output of a language model is called “Tied Embeddings” or “Coupled Embeddings.” The idea is that the embedding matrix and the projection matrix from the output to logits are the same size (only differing by a transpose). Since this parameter matrix is relatively large, the same weights are shared to avoid unnecessary waste. In a sense, we can view the embedding layer and the linear layer in decoding as the inverse process.

Before decoding begins, the model uses one-hot encoding to obtain the embedding vector corresponding to the one-hot encoding from the embedding layer.
After the decoder generates the latent vector, the word in the row of the FC layer (the projection matrix of the output to logits) that is closest to the latent vector will have a higher prediction probability.

Because the embedding matrix and the FC matrix are the same size (they differ only in transpose), and because this parameter matrix is relatively large, the FC and embedding layers share the same weights to avoid unnecessary waste. This reduces the number of parameters and speeds up convergence.

1532

When the language model is relatively small, the number of parameters in the embedding layer can be considerable when the backbone of the model is small and the vocabulary is large. Therefore, reusing embedding weights at the output is a common practice. However, if the embedding layer accounts for a small proportion of the model parameters, it is necessary to consider whether it is necessary to share weights.

Additionally, sharing embeddings can have some negative effects, such as leading to very large initial losses during pre-training. This is because we often use techniques like DeepNorm to reduce training difficulty, which initialize the model’s residual branches close to zero. In other words, the model approximates an identity function in the initial stage, making the initial model equivalent to a 2-gram model with shared embeddings.

0xFF Reference

CTRL: A Conditional Transformer Language Model for Controllable Generation

ByteDance’s Big Data Model Interview: “What is the worst-case time complexity of Beam Search?” (Learn through graphs)

A new, and possibly groundbreaking, method to enhance language model reasoning with entropy-based sampling and parallel

Beam Search and its optimization methods are not afraid of being attacked.

Beam search algorithm, related concepts and C++ code iyayaai

Best-First Beam Search

Chain-of-Thought Reasoning without Prompting

Dola: decoding by contrasting layers improves factuality in large language models

EDT: Improving Large Language Models’ Generation by Entropy-based Dynamic Temperature Sampling https://arxiv.org/abs/2403.14541v1 )

EDT: Dynamically Adjusting Temperature in LLM (Du Lingxiao [Tanzhixuan])

EDT: Dynamically adjust the temperature in LLM (Leaf)

Entropix has finally found a real solution to hallucinations. (Python programming by Jiege)

Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Hot or Cold? Adaptive Temperature Sampling for Code Generation with Large Language Models

How Alignment and Jailbreak Work:Explain LLM Safety through Intermediate Hidden States

Kl-divergence guided temperature sampling. arXiv preprint arXiv:2306.01286

KL-Divergence Guided Temperature Sampling

[learning to break the loop: analyzing and mitigating repetitions](https://arxiv.org/pdf/2206.02369.pdf）

LLM Inference Common Parameter Analysis AI Talk

LLM Decoding Strategy and Code Implementation by Alex [Algorithm Dog]

penalty decoding: well suppress the self-reinforcement effect in open-ended

SED: Self-Evaluation Decoding Enhances Large Language Models for Better Generation

SED: Enhancing the Correctness of the LLM Decoding Stage (Du Lingxiao [Tanzhixuan])

Top-nσ: Not All Logits Are You Need (Chenxia Tang, Jianchun Liu, Hongli Xu, Liusheng Huang)

Training Large Language Models to Reason in a Continuous Latent Space

“Rethinking embedding coupling in pre-trained language models”

Can large-scale model thinking be stimulated without a prompt? Google DeepMind’s new work proposes a new paradigm of CoT.

Dynamic temperature coefficient T and the recently popular Entropix machine learning platform [Notes on 10,000 research papers]

Solving the “illusion” problem of large models during decoding [

How to generate text: Generating text using Transformers with different decoding methods

Temperature Coefficient and Top-P Sampling Strategy Explained by Zhang

Generating Repetitions: Learning to Break the Loop

A Re-exploration of Shared Embeddings in Language Model Output - Zhihu (zhihu.com)

A Re-exploration of Shared Embeddings in Language Model Outputs (by Su Jianlin)

What exactly is the “temperature” parameter in large language models? How do we set it correctly? (AI agent)

Some Thoughts on Deepseek’s Use of the EP Inference Method (Yang Pengcheng)

Is it possible to completely determine the LLM output if the temperature of the large model is set to 0? The truth is revealed! Alex [Algorithm Dog](javascript:void(0)😉

The Origins of Latent Space Inference | What is Meta’s Cocoon? Tensorlong Sees the World

| [Transformer Series] (15) Sampling and Output

Exploring the Transformer Series (15) --- Sampling and Output

Exploring the Transformer Series (15) --- Sampling and Output

0x00 Overview

0x01 Generator

1.1 Linear

1.2 softmax

1.3 Implementation

1.4 Use

reasoning

train

0x02 Sampling

2.1 Sampling Method

Deterministic sampling

probabilistic sampling

2.2 Greedy Decoding

2.3 Beam Search

question

Ideas

efficiency

Miscellaneous

punish

stop

optimization

2.4 top-k

2.5 top-p

2.6 Performance

0x03 Sampling Parameters

3.1 temperature

concept

Dynamic temperature coefficient

KL-Divergence Guided Temperature Sampling

Hot or Cold

EDT

3.2 repetition_penalty

Reason for the problem

Parameter principle

0x04 Logits Analysis

4.1 Compressed Information

Example

analyze

4.2 Changes

4.3 Preprocessing logits

Background and Motivation

Insight

Noise area

Information area

Core idea

Determine the boundary

algorithm

4.2 Implicit Thinking Chains

pause tokens/Filler Token

CoT

coconut

4.3 Entropy-based sampling

SED

entropix

Related information

Core idea

0x05 Weight Sharing

5.1 vanilla Transformer

5.2 Shared vocabulary weights

principle

Historical basis

5.2 FC and embedding sharing

0xFF Reference