Exploring the Transformer Series (32) --- Lookahead Decoding

0x00 Overview

The speculative sampling paradigm is predict + verify, while another approach is based on Jacobi decoding and its evolutionary branches, which are built on the Jacobi iteration.

Jacobi iteration transforms the N iterations of autoregression into N equations, which are then solved jointly. Jacobi Decoding, on the other hand, uses the output of the previous iteration as the input for the next iteration; essentially, it treats the output of each token as a 2-gram and uses this as the draft model. Assume (\mathbf{y}_i) Given a sequence of length m to be predicted, Jacobi Decoding predicts from random data (\mathbf{y}_0). Initially, autoregressive iterations were performed repeatedly, achieving a perfect hit rate in at most m iterations. The authors of the paper “Break the Sequential Dependency of LLM Inference Using Lookahead Decoding” realized that recording more historical information would allow them to create an N-gram as a draft model, thus improving the accuracy of speculative decoding. This is Lookahead Decoding. In short, Lookahead = N-gram + Jacobi iteration + parallel verification. It utilizes the Jacobi iteration method to simultaneously extract and verify n-grams, breaking the sequential dependency of autoregressive decoding, thereby reducing the number of decoding iterations and accelerating inference. Compared to previous parallel decoding methods, Lookahead Decoding requires neither a draft model nor head fine-tuning like Medusa. The authors consider Jacobi Decoding a special case of Lookahead Decoding in the 2-gram case.

3201

0x01 Jacobi decoding

The earliest work on the Jacobi decoding algorithm comes from the paper “Accelerating Transformer Inference for Translation via Parallel Decoding”. Jacobi decoding is similar to the iteration of an RNN, starting from an initial sequence (\mathbf{y}_0). Initially, after at most m iterations, a sequence of length m is predicted (\mathbf{y}_i).

1.1 Motivation

The conventional autoregressive decoding process is shown in the figure below. The decoding process is equivalent to solving the previous equation each time and then substituting it into the next equation.

3202

Jacobi decoding is based on the Jacobi iterative formula, which transforms the N iterations of autoregression into N equations related to the input and output (x,y). Thus, the starting point becomes to view the autoregressive decoding process as a problem of solving a system of equations.

3203

1.2 Approach

Jacobi decoding directly uses an iterative method to find solutions to a system of equations. First, it randomly assigns an initial solution y, then updates y based on the autoregressive equation and the initial solution, repeating this process until the iteration stops. An exact solution to an m-variable system of equations can be found in at most m iterations. The specific principle is shown in the diagram below. The LLM output (in Greedy Decoding) is a fixed point; through continuous self-iteration, the LLM can find the solution (fixed point) to the system of equations in fewer iterations.

3204

The comparison between autoregressive decoding and Jacobi decoding is shown in the figure below. The autoregressive decoding on the left is sequential, while the Jacobi algorithm on the right decodes m tokens at once, but allows for multiple iterations. Compared to autoregressive decoding, each Jacobi decoding step requires more computation because it needs to call a large model for forward computation on >1 tokens simultaneously. However, due to the parallel processing capabilities of GPUs, this usually does not result in a slowdown.

3205

A major drawback of Jacobi decoding is that it is only applicable to Greedy decoding. Greedy decoding guarantees that at least one stable token will be obtained in each iteration, so the number of steps required is no more than the number of steps required for autoregressive decoding.

1.3 Algorithm

The paper employs the following iterative algorithm: First, an output token of length m is randomly initialized. Then, with each input x, this output token is continuously updated (selecting the output with the highest probability) until the output tokens of two consecutive iterations are identical. Each step of the loop iteration is similar to a prefill operation, and the speed is similar to the decoding speed of a single token.

3206

Although the actual speedup effect of Jacobi decoding is not very significant, it has provided great inspiration for subsequent work.

Next, we will learn Lookahead Decoding.

0x02 Principle

2.1 Approach

Jacobi decoding has the following problems compared to previous models:

Autoregressive decoding takes time that is proportional to the number of decoding steps and cannot make good use of the parallel capabilities of accelerators.
Speculative decoding and its variant algorithms cannot guarantee the acceptance rate of draft tokens, and improving the acceptance rate of draft tokens by training a draft model is costly and has poor universality.
Although Jacobi decoding can decode multiple tokens through many steps, its random generation of initial solutions results in a low acceptance rate during iteration, thus leading to poor speedup.
Sometimes, Jacobi decoding predicts the correct token sequence fragments, but these sequences appear in the wrong positions, requiring several cycles to correct and move the sequences to the correct positions, thus affecting the walltime acceleration.

Lookahead Decoding attempts to solve the above problem by using the memory property of n-gram pools to model subsequence fragments and send them as candidate subsequences into the verification stage. It also extends to N-step dependencies, which can generate multiple tokens at once.

3207

2.1.1 Starting Point

The core of the Lookahead Decoding algorithm is to generate N-grams based on the Jacobi Trajectory generated during the Jacobi Decoding process.

For a target sequence of length m (\mathbf{y}_i) The worst-case scenario requires experiencing (\mathbf{y}_1) arrive (\mathbf{y}_m) Only when the complete trajectory is generated can all the tokens be finally generated. At this point, it degenerates into autoregressive decoding, which may even be slightly slower because m tokens are inferred simultaneously at each step. Although the randomly initialized tokens in Jacobi decoding may not be accepted, each new token at each position in the sequence forms a Jacobi Trajectory during the decoding process. We can use the initialized tokens and the Jacobi Trajectory to construct a series of n-grams, which may be used in later decoding steps, thereby accelerating the decoding process.

In Vanilla Jacobi Decoding (\mathbf{y_i}) The generation only depends on (\mathbf{y}_{i-1}). Lookahead Decoding attempts to extend the dependency to the first N-1 steps. For example, if we backtrack 3 iterations after the current decoding, we will have a set of 3-grams for each position. Lookahead Decoding caches these n-grams during iterations and verifies the cached n-grams in parallel while performing Jacobi decoding. Accepting an N-gram allows us to advance N tokens at a time. This overcomes the limitations of Jacobi decoding by generating N-grams in parallel.

2.1.2 Parallelism

To accelerate the decoding process, each Lookahead Decoding step is divided into two parallel branches: the lookahead branch for generating n-grams and the verification branch for validating n-grams, both of which are executed in a forward propagation process.

Lookahead branch: This is the original Jacobian decoding process. Due to inconsistencies, this process is not used as the primary speculative verification mechanism, but rather as a parallel decoding process for sampling or generating n-grams. The purpose of the Lookahead branch is to generate new N-Grams, which, along with the newly generated tokens, can be used to construct the candidate sequence for the next Verify branch.
Verification branch: This branch uses multiple candidates matched from the n-gram set as speculative verification inputs to complete the specific speculative sampling process. The verification branch selects and verifies promising n-grams and uses them to update the sequence of the next Lookahead branch.

2.1.3 Data Structures and Hyperparameters

2D Window

Unlike most Jacobi decoding (which uses only the historical tokens from the last step, or equivalently generates 2-grams), Lookahead Decoding effectively utilizes more information in the trajectory by generating many n≥2 n-grams in parallel using historical tokens from n-1 past steps. Therefore, the key design of Lookahead Decoding is to track the Jacobi iterative trajectory of the Jacobi decoding and generate new n-grams from that trajectory. This is achieved by maintaining a 2D window of variable size. The 2D window is defined by two important parameters: the window size W representing the sequence dimension and the n-gram size N representing the time dimension, to generate multiple disjoint n-grams in parallel from the Jacobi iterative trajectory.

W: W is used for looking ahead, which is the number of tokens that we hope to generate in the future. It is how far ahead of the future token position can we decode in parallel (the amount of computation increased by lookahead decoding is proportional to W, so we need to set an upper limit for W to control the computation cost).
N: N is used for backtracking, i.e., looking at how many steps back in the Jacobi iterations to retrieve the N-Gram. When N=2, Lookahead Decoding degenerates into Jacobi decoding.

Therefore, the 2D window consists of W columns and N-1 rows. During decoding, W additional tokens are decoded, which, together with these N-1 rows, form W N-grams. Furthermore, according to the diagram on the paper’s authors’ blog, the first window size includes the last input_ids. The 2D window is shown in the diagram below, where the horizontal axis W represents the length of the Jacobi Decoding window at each step, and the vertical axis represents the historical window size (\mathbf{y}_i). The generation results are as follows. Each iteration generates W N-grams.

3208

n-gram pool

To improve efficiency, Lookahead Decoding introduces an n-gram pool to cache all historical n-grams generated along the trajectory so far. These n-gram candidates are later validated via a validation branch to maintain the output distribution of the LLM; if they pass validation, disjoint n-grams are integrated into the sequence. In this way, Lookahead Decoding can significantly reduce the latency of LLM inference by utilizing the computational resources not used by autoregressive decoding.

Guess set size

To limit the size of the N-gram pool, the authors introduced a third hyperparameter G, representing the Guess set size, meaning that each key corresponds to at most G N-grams, and updates are performed using an LRU strategy.

2.1.4 Overview Diagram

The image below provides an overview of Lookahead. The blue 0 represents the prompt and the last previously determined output, i.e., the input for the current step t. Here, the window size W=5, N-gram size N=4, and verification count V=2. Orange corresponds to the results of the previous t-3 steps, green to the results of the previous t-2 steps, and red to the results of the previous t-1 steps. The number on each token indicates its relative position to the current input token (blue 0).

For the Lookahead branch, at this stage, we follow the trajectory formed by the first three steps, executing the modified Jacobi iterative algorithm to generate new tokens for all five positions. After generation, we collect and cache them in an n-gram pool (n=4) for example, a 4-gram consists of the orange token at position 1, the green token at position 2, the red token at position 3, and the newly generated tokens. The most outdated tokens in both dimensions (time and sequence) are removed, and the newly generated tokens are appended to the Lookahead branch to maintain a fixed window size for each step. For example, we remove all orange tokens and the green token at position 1 from the graph. Then, we form a new Lookahead branch with the green tokens at indices 2, 3, 4, and 5, all red tokens, and all newly generated tokens for the next step. Note the dependencies here; for example, red 6 depends on green 5 and all orange tokens.

The Verification Branch selects samples in a simple way: it directly selects N-grams from the N-gram Pool where the first digit is the blue token and the last digit is the N-gram. Those that are verified and accepted are then used as the output for this step.

3209

2.2 Example

Let’s use an example to demonstrate Lookahead Decoding: given the input “ABC”, predict the English alphabet.

If it is an autoregressive decoding scheme, the generation process will be as follows: (ABC \rightarrow ABCD \rightarrow ABCDE \rightarrow ABCDEF).

If it’s Lookahead Decoding, the process is as follows:

Suppose that the n-gram pool is now a 4-gram pool, which includes three 4-grams: CDEF, CDFE, and CDFG.
The n-gram candidate is added to the existing sequence for verification. That is, the input sequence is [ABC] [DEF, DFE, DFG], which is concatenated to obtain ABCDEFDFEDFG, and the output ABCDEF is calculated.
We’ve now achieved parallel decoding by verifying the n-grams to obtain the DEF (decoder error) in one pass. However, where do these n-grams that need verification come from? Or rather, how are they generated? Therefore, we need to add a 2D window data structure to generate the n-gram sequence. The 2D window contains three values: FGH, FGE, and FGJ. These three values are also added to the input sequence.
Therefore, the input sequence consists of three parts: [the existing sequence, the sequence used to generate and maintain n-grams, and the sequence used to verify n-grams], which enables parallel iterative loops. The new input sequence is: [ABC][FGH, FGE, FGJ][DEF, DFE, DFG], which, when concatenated, becomes ABCFGHFGEFGJDEFDFEDFG. [FGH, FGE, FGJ] will generate new n-grams in parallel for subsequent verification, while [DEF, DFE, DFG] will verify the n-grams.
Assuming DEF is accepted, the next new input sequence is [ABCDEF] [XXX, i.e., the rows corresponding to the new n-grams generated by FGH, FGE, FGJ] [FGH, FGE, FGJ]. The calculation will predict ABCDEFGH.

We will explain this process in more detail next.

0x03 Implementation

Next, we’ll examine some specific technical details of Lookahead Decoding based on llama.cpp and the original code provided by the paper’s authors (hereinafter referred to as the original code). In our example, we’ll set the hyperparameters as follows:

N=4, so the 2D Window has 3 rows.
W=5, so the 2D Window has 5 columns, meaning that 5 n-grams can be collected in each step.
With G=2, at most two matching sequences can be found from the N-gram pool.

In the image below, the top image is the original figure from the paper, and the bottom image is the author’s interpretation based on the original figure.

3210

3.1 mask

Since LLM decoding is primarily limited by memory bandwidth, we merge the lookahead and validation branches in the same step, leveraging the parallel processing capabilities of the GPU to hide the overhead. The mask is key to parallel decoding. The mask in this example is shown in the diagram below. The blue token marked 0 represents the input at the current step t. Orange corresponds to the result of the previous t-3 steps, green to the result of the previous t-2 steps, and red to the result of the previous t-1 steps. The number on each token indicates its position relative to the current input token (blue 0).

This mask follows two rules:

The tokens in the Lookahead Branch and Verification Branch are not visible to each other. For example, for the last 3 in the verification sequence, it can only see the blue input token 0, and the light blue tokens 2 and 3 that precede it.
Each token can only see the tokens preceding it and itself, like a causal mask. For example, green token 5 (purple circle in the image) is only visible to red token 6 (sky blue circle in the image); while red token 6 (sky blue circle in the image) can only see green token 5 (purple circle in the image) and blue token 1 (red circle in the image), and orange tokens 1 to 4 (green circles in the image).

At each time t, we use the trajectory from the previous N-1 steps to perform Jaccobi iterations to generate tokens at window size = 5 positions, thus obtaining multiple N-grams at the same position, such as blue 0-green 1-red 2 at the current input token position. During verification, we first identify the n-gram that matches the first token with the last input token through string matching. After adding the identified n-gram to the current input, we verify it through LLM forward propagation, thereby generating N tokens at once.

3211

3.2 Reasoning

3.2.1 Inference sequence

The inference sequence refers to the batch sent to the LLM for inference during the decoding process. The inference sequence consists of three parts:

Enter the token. Or, more specifically, the prompt, which is the token that has already been generated.
A lookahead sequence. All rows extracted from a 2D window constitute a sequence.
The verification sequence. The concatenated guess tokens are G matching sequences extracted from N-grams and concatenated together. Each N-gram is (g_i), in (g_i^{k}) it represents (g_i) the token at the k-th position.

In summary, the final inference sequence input to the model in each decoding step is: [input token, lookahead sequence, verification sequence]. However, the specific implementation differs slightly between the original code and llama.cpp, which will be explained using variables from both below.

Original code

The image below is an interpretation combining the paper and the original source code. The top of the image shows the original figure from the paper, and the bottom shows the author’s secondary annotations based on that original figure.

3212

enter

For the original code, the final inference sequence input to the model in each decoding step is [input_ids, past_tokens, guess_tokens].

input_ids is the input token, and its shape is (batch_size, sequence_length).
past_tokens is a lookahead sequence.
guess_tokens is a verification sequence.

As can be seen from the code, the final concatenation is [input_ids, past_tokens, guess_tokens] for reasoning.

for ll in range(fill_level + 1):
	all_past += past_tokens[ll]

if guess_tokens is not None:
    # 此处会拼接
    input_ids = torch.cat((input_ids, torch.tensor(all_past + guess_tokens, device=input_ids.device, dtype=input_ids.dtype).unsqueeze(0)), dim=1)
    guess_ids = list(range(lst_id + 1, lst_id + 1 + guess_size)) * (len(guess_tokens) // guess_size)
    position_ids = torch.cat((position_ids, torch.tensor(ids_list + guess_ids, device=input_ids.device, dtype=input_ids.dtype).unsqueeze(0)), dim=1)
    attention_mask = torch.cat((attention_mask, torch.ones(1, attn_size + len(guess_tokens), \
            device=input_ids.device, dtype=input_ids.dtype)), dim=1)

Forward propagation

Performing forward decoding will generate output, which contains the following:

out_logits: This refers to the logits of the next token output during the normal decoding process;
inp_logits: The logits of W tokens generated from the 2D Window. These W tokens will be concatenated with the 2D Window to form W N-grams.
guess_logits: If there are matching guess tokens, logits for those guess tokens will be generated for verification.

The corresponding abridged code is as follows.

How to generate

lguess = len(guess_tokens)
ret.out_logits = ret.logits[:,prefill_size - 1,:].to(input_ids.device) #decode logits
if lguess > 0:
    window = len(past_tokens[fill_level])
    start = ret.logits.size(1)-window-lguess
    end = ret.logits.size(1)-lguess
    ret.inp_logits = ret.logits[:,start:end,:] #lookahead branch logits
    ret.guess_logits = ret.logits[:,-lguess:,:] #verification branch logits

How to use

next_token_logits = outputs.out_logits #outputs.logits[:, -1, :]
if past_tokens[1] is None:
    past_tokens[1] = torch.argmax(outputs.inp_logits, dim=-1)[0].tolist() #fill window with argmax
elif past_tokens[LEVEL - 2] is None: 
    past_tokens[fill_level + 1] = torch.argmax(outputs.inp_logits, dim=-1)[0].tolist()[1:] #fill window with argmax
 else:
    guess_logits = logits_warper(input_ids, outputs.guess_logits[0])

llama.cpp code

The image below is an interpretation combining the paper and the lookahead.cpp source code of llama.cpp. The top of the image is the original figure from the paper, the middle is the comment from the llama.cpp source code, and the bottom is the author’s secondary commentary based on the original figure from the paper.

3213

As can be seen from the interpretation of Logits (representing the output logits) in red in the middle of the image above, the total input inference sequence that llama.cpp ultimately provides to the model is (W+G+1). Specifically, regarding the above diagram:

The input tokens are a sequence: blue 0.
Lookahead has 5 sequences: blue 0 + green 1 + red 2; blue 0 + orange 1 + green 2 + red 3; blue 0 + orange 2 + green 3 + red 4; blue 0 + orange 3 + green 4 + red 5; blue 0 + orange 4 + green 5 + red 6;
The verification consists of two sequences. The first n-gram is: blue 0 + sky blue 1 + sky blue 2 + sky blue 3. The second n-gram is: blue 0 + sky blue 1 + sky blue 2 + sky blue 3.

The model’s input consists of three parts: [existing input token, 2D window, guess token]. The prompt generates the next token; the 2D window generates the next token for each n-gram branch; and the guess generates and validates the token. These three parts can be executed in parallel without interfering with each other.

There is an input token.
- Input one token: a blue 0, corresponding to batch 0 in the code diagram.
- The next token of the current sequence is generated directly based on autoregressive decoding. The next token generated in this process is compared with the sky-blue 1 in the guess token.
Rows extracted from a 2D window.
- The lookahead sequence consists of 5 sequences: blue 0 + green 1 + red 2; blue 0 + orange 1 + green 2 + red 3; blue 0 + orange 2 + green 3 + red 4; blue 0 + orange 3 + green 4 + red 5; blue 0 + orange 4 + green 5 + red 6; corresponding to batch 1 to batch 14 in the code diagram.
- The sequences described above correspond to various n-grams. Different n-gram sequence branches generate different next tokens, thus creating new n-gram combinations. The purpose of this process is to maintain and update the n-gram pool, and it is unrelated to the tokens currently being verified.
n-gram, also known as guess token.
- The first n-gram of the verification sequence is: blue 0 + sky blue 1 + sky blue 2 + sky blue 3. This corresponds to batches 15 through 17 in the code diagram.
- The second n-gram in the verification sequence is: blue 0 + sky blue 1 + sky blue 2 + sky blue 3. This corresponds to batches 18 to 20 in the code diagram.
- For guess tokens, the process merges them with existing sequences to generate logits for each position. These logits are then compared one by one with the guess tokens, and those that meet the sampling requirements are added to the existing sequence. The purpose of this process is to verify whether there are any eligible tokens in the existing n-gram pool, thereby adding new tokens to the existing sequence.

Additionally, for better illustration, the following image is a screenshot of comments from llama.cpp. The terminology in the image is explained below.

Batch: The original inference sequence executed in parallel. The number represents the position of the token in the original inference sequence.
T: If the current step is t, then 0 represents t-1 steps, and -1 represents t-2 steps. That is, 0 is the latest token of the newly generated N-gram in the previous time step.
Info: I represents the input token, L represents the lookahead branch, and V represents the verify branch.
Pos: Used for mask setting. T determines the time step order, and the relative order of tokens in the same T is determined by pos. Therefore, each token can only see the mask before its current position.
Logits: Logits generated during inference. llama.cpp optimizes the original code, resulting in fewer inference sequences compared to the original.
Seq: W+G+1=8, there are 8 branches, each reasoning independently and without interference. Seq is the mask corresponding to these 8 branches.
j_tokens and id: variables in the actual code.

3214

The corresponding code is as follows, where tokens_j is a 2D window.

// the input token belongs both to all sequences
std::vector<llama_seq_id> seq_id_all(W + G + 1);
for (int i = 0; i < W + G + 1; i++) {
    seq_id_all[i] = i; // W+G个序列都有prompt
}

// current token - first token of the first level
// 输入token对应的推理，共一个推理序列。n_past代表位置，输入token需要输出logits
common_batch_add(batch, id, n_past, seq_id_all, true);

// verification分支对应的推理
// verification n-grams - queue this before the lookahead tokens for less KV cache fragmentation
{
    const int g_cur = ngrams_observed.cnt[id];
    ngrams_cur.resize(g_cur);
    for (int g = 0; g < g_cur; g++) {
        ngrams_cur[g].active = true;
        ngrams_cur[g].tokens.resize(N);
        ngrams_cur[g].i_batch.resize(N);
        ngrams_cur[g].seq_id = W + 1 + g;
        ngrams_cur[g].i_batch[0] = 0;
        ngrams_cur[g].tokens [0] = id;
    }

    // 一共最多G个推理
    for (int j = 0; j < N - 1; j++) {
        for (int g = 0; g < g_cur; g++) {
            const int idx = id*(N - 1)*G + g*(N - 1);
            const llama_token t = ngrams_observed.tokens[idx + j];
            ngrams_cur[g].tokens [j + 1] = t;
            ngrams_cur[g].i_batch[j + 1] = batch.n_tokens;
            // 放到prompt后j+1处；对应的序列是第{ W + 1 + g }个；这些token需要输出logits，对应上图，就是天蓝色的1，2，3都需要输出logits
            common_batch_add(batch, t, n_past + j + 1, { W + 1 + g }, true);
        }
    }
}

// 依然是输入token对应的推理，填补余下W-1给token。n_past + i代表位置
// fill the remaining W - 1 tokens for the first level
for (int i = 1; i < W; i++) {
    seq_id_look.resize(W - i);
    for (int j = 0; j < W - i; j++) {
        seq_id_look[j] = i + j + 1;
    }
    // tokens_j[0][i]代表从2d window第一行提取W-1个token，塞到prompt后面1~i处，这些token不需要输出logits，对应上图，就是橙色和绿色不需要输出logits；对应序列是 i + j + 1
    common_batch_add(batch, tokens_j[0][i], n_past + i, seq_id_look, false);
}

// lookahead分支对应的推理
// fill the rest of the levels
for (int j = 1; j < N - 1; j++) {
    for (int i = 0; i < W; i++) {
        // tokens_j[0][i]代表从2d window第j行提取W-1个token，塞到prompt后面(1~i)xj处，如果是第N-2行（N-2就是2d window的最后一行的下标）就需要输出logits，对应序列是{ i + 1 }
        common_batch_add(batch, tokens_j[j][i], n_past + j + i, { i + 1 }, j == N - 2);
    }
}

The code for common_batch_add() is shown in the image below.

void common_batch_add(
                 struct llama_batch & batch,
                        llama_token   id,
                          llama_pos   pos,
    const std::vector<llama_seq_id> & seq_ids,
                               bool   logits) {
    GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded");

    batch.token   [batch.n_tokens] = id; // 新加入token的id
    batch.pos     [batch.n_tokens] = pos; // 新加入token的位置
    batch.n_seq_id[batch.n_tokens] = seq_ids.size(); 
    for (size_t i = 0; i < seq_ids.size(); ++i) {
        batch.seq_id[batch.n_tokens][i] = seq_ids[i]; // 新加入token属于哪个序列
    }
    batch.logits  [batch.n_tokens] = logits; // 是否要输出logits

    batch.n_tokens++; // 本batch有多少个token
}

3.3 Overall Process

A Decoding Step typically includes the following steps:

Parallel Decoding: Generates a token for each position in the lookahead branch, that is, after one forward inference, generates a sequence of tokens to be verified corresponding to the candidate tokens.
Verify: Compare the token to be verified generated in the previous step with the candidate tokens to determine the longest correct sequence. The n-gram pool stores all historical n-grams (G were actually selected). (g_i^k) The first position (g_i^1) It is exactly all the n-grams of the last token in the current output sequence.
Collect N-Grams: Collect and cache newly generated n-grams from the lookahead branch trajectory. Specifically, use unverified candidate tokens and their corresponding generated tokens to form an N-gram sequence and add it to the N-gram Pool.
Update: Update the candidate sequence (lookahead branch) with the generated sequence of tokens to be verified to maintain a fixed window size, i.e., the 2D window slides to the right as a whole.
Match N-Grams: Use the tokens in the candidate sequence to match the corresponding tokens from the N-Grams in turn, and replace the candidate sequence.

The following diagram illustrates the LOOKAHEAD decoding process with W=5, N=3, and G=2.

3215

Each decoding step performs the following operations. The input for each decoding step should include: the currently generated tokens, the compressed 2D Windows, and the concatenated guess tokens:

Concatenate the token from the 2D Window into the input.
If the token generated in the previous step (the last token) has a matching N-gram in the N-gram pool, append these guess tokens to the input. For example, if the token generated in the previous step is “机” and the N-gram pool is {“机”: [“机学习”,“关枪！”]}, then append the following to the input: [“机学习关枪!”].
Construct an Attention Mask with the following characteristics: each token is only visible to other tokens whose position index is greater than its own; tokens in the Lookahead Branch and Verification Branch are not visible to each other.
Perform forward decoding to generate output, which contains the following:
- out_logits: That is, the logits of the next token output in the normal decoding process;
- inp_logits: The logits of W tokens generated from the 2D Window are concatenated with the 2D Window to form W N-grams.
- guess_logits: If matching guess tokens are found, logits for those guess tokens will be generated for verification.

The algorithm is as follows.

3216

3.4 Initialization

The official blog doesn’t explain where the initial Guess Tokens (“who”, “is”, “the”, “he”, “just”, “great” in the diagram) come from, but the author provides an answer in a GitHub issue. The first level can be randomly selected from the input Prompt, or even randomly selected from the vocabulary. Then, through (N - 2) warmups, a multi-level n-Gram Pool can be generated, or even completely randomized. The author chose to randomly select the first level from the list of tokens corresponding to the Prompt, and then warm up the subsequent levels. Since N is relatively small compared to the number of steps in the entire generation process, it generally becomes effective after several iterations.

3.4.1 Warm up

Before decoding, the n-gram pool and 2D window are empty and need to be initialized. This requires constructing the N-gram pool and filling the 2D window. Let’s continue by assuming W=5 and N=4.

3.4.2 Filling a 2D Window

At time T=0, the first position of the 3-gram is randomly sampled from the prompt, and the second position of the 3-gram comes from the 2-gram parallel decoding prefill. In each subsequent step, the last position of the 3-gram is decoded in parallel. Furthermore, at the next step, the 3-gram positions are rolled (as decoding progresses, the oldest token in the trajectory is removed, thus discarding the first position and leaving the last two positions as input for parallel decoding). The concept of N-gram follows the same logic. The specific operations are as follows.

At time T=0, W+N-3=6 tokens are randomly selected from the prompt to fill the first row of the 2D Window. At this time, the 2D Window contains only one row, let’s assume it is:

$[E,F,G,H,I,J]$

E is offset by 1 relative to the last token of the prompt, F is offset by 2 relative to the last token of the prompt, and so on.

At time T=1, E, F, G, H, I are filled into the input. Assuming the input is A, the concatenation result is AEFGHI. Executing a forward operation will not only yield the next token B from the normal decoding step, but also the next set of tokens corresponding to EFGHI, which is assumed to be KLMNO.

Update the 2D Window. The B obtained from normal decoding will replace E, so E needs to be removed. Additionally, the 2D Window needs to be populated with newly generated tokens. The final 2D Window looks like this:

$[F,G,H,I,J]\\ [K,L,M,N,O]$

At time T=2, if F, G, H, I, J, K, L, M, N, O are filled into the input, the concatenation result is ABFGHIJKLMNO. Executing a forward operation will not only yield the next token C from the normal decoding step, but also the next set of tokens corresponding to FGHIJKLMNO, let’s say it’s PQRST.

Update the 2D Window: The C obtained from normal decoding will replace F, so F needs to be removed; additionally, the 2D Window needs to be populated with newly generated tokens. The final 2D Window looks like this:

$[G,H,I,J]\\ [K,L,M,N,O]\\ [P,Q,R,S,T]$

At this point, the 2D Window is fully populated. It’s important to note that initially, the first row of the 2D Window should be filled with W + N - 3 tokens. This is because each time a row is filled, the first token of the previous row needs to be removed; a total of N - 2 filling operations are required. After filling, the first row will ultimately have W + N - 3 - (N - 2) = W - 1 tokens, and the remaining rows will each have W tokens. N - 2 means: a total of N operations, one for the prompt, one for random filling, and the remaining N - 2 operations.

3.4.3 Filling n-grams

If configured POOL_FROM_PROMPT, an N-gram pool will be constructed from the prompt. You can iterate through the prompt, using the current token ‘t’ as the key, and store the N-grams starting with ‘t’ in a list. For example, if the prompt is “BOOK, BUS!”, then the value corresponding to “B” in the N-gram pool would be: {“B”: [“OOK”,“US!”]}

If not configured, a 2D window will have been generated after N-2 inference iterations. A separate forward propagation using the input + 2D window is needed to generate a complete n-gram, which is then used to initialize the pool. Therefore, the generalized warm-up includes N-1 forward propagations.

The Lookahead Branch requires N−2 forward propagations to be fully constructed. Before this, the N-gram Pool is empty, and there is no Verification Branch at this point.

3.5 Lookahead Branch

The lookahead branch maintains a fixed-size 2D window to generate new n-grams based on the Jacobi iterative trajectory.

Specifically, it involves iteratively generating past tokens with different fill levels; the desired final shape is [WINDOW-1, WINDOW, …, WINDOW], with a length of LEVEL-1. The reason there are only LEVEL-1 tokens instead of LEVEL tokens is because these LEVEL-1 tokens are considered as input; during decoding, there are an additional set of IDs of length WINDOW, totaling LEVEL tokens, forming a LEVEL-gram.

Using the paper’s illustration, in the current step t, a Jacobi iteration is performed using the trajectory formed in the previous steps, generating new tokens for all 5 positions. Then, 4-grams are collected (e.g., a 4-gram can include the orange token at position 1, the green token at position 2, the red token at position 3, and the newly generated yellow token 4 in the current step). As decoding progresses, the oldest token in the trajectory is removed to maintain the constancy of N and W.

3.6 Verification Branch

n-grams are stored in an n-gram pool, and the verification branch identifies and determines promising n-grams from this pool. In this branch, the authors use the first token in the n-gram to match the last token of the input; this step is determined through simple string matching. Once identified, these n-grams are added to the current input token and verified through forward propagation of the LLM. As the n-gram cache increases, multiple n-grams starting with the same token will appear, and this happens more and more frequently, increasing the verification cost. To reduce this cost, the authors set the upper limit on the number of candidate n-grams in the verification branch to G. This is typically set to be proportional to W, for example, G=W.

If matching guess tokens are found, proceed to the Verification Branch to verify acceptance of the guess tokens. Currently, Lookahead Decoding supports Sampling and Greedy Search.

The Greedy Search solution verifies guess_logits all candidate N-grams and then performs the longest match.

The sample algorithm is shown in the figure below, and we summarize it as follows:

Suppose there are K candidate N-grams that match the token decoded in the current step, and that prob = Softmax(out_logits);
Traverse these K N-grams along the dimensions of the N-gram. Suppose we are currently traversing to the i-th position of the N-gram, and there are k candidate N-grams remaining. Traverse these k candidate N-grams:
- Take the token at its i-th position (t_i);
- Sample r∼U(0,1);
- like (r < prob(t_i)) Accept the token, remove all N-grams whose i-th position is not the token from the candidate set; and update the list prob = Softmax(guess_logits[i]).
- Otherwise, continue to verify the next N-gram;
After the traversal is complete, a new token is sampled prob from it.

3217

3.7 Prepare for next iteration

After the current iteration ends, preparations are made for the next iteration. Specifically:

Update the 2D Window. Replace the previous layer with the next (the last layer is filled with the logits from the latest output), and truncate each layer based on the number of accepted tokens. Randomly sample from the current sequence to fill the truncated portions.
Update n-grams. How to generate new n-grams? Essentially, it involves searching from top to bottom within a 2D window.
Update the Attention Mask and KV Cache for the next forward pass. Assuming k tokens are received, expand the Attention Mask accordingly and append the caches of these k tokens to the KV Cache.
The result is returned when the exit condition is met, such as when the length of the generated tokens reaches the required value max_length.

3.7.1 Original Code

The original code uses the next layer to update the previous layer.

if past_tokens[1] is None: #filling multi-level window, the very first step is different
    past_tokens[0] = past_tokens[0][1:] 
    past_tokens[1] = torch.argmax(outputs.inp_logits, dim=-1)[0].tolist()
    fill_level += 1
elif past_tokens[LEVEL - 2] is None: #filling multi-level window
    for level in range(fill_level + 1):
        past_tokens[level] = past_tokens[level][1:] 
    current_past_tokens = torch.argmax(outputs.inp_logits, dim=-1)[0].tolist()
    past_tokens[fill_level + 1] = current_past_tokens[1:]
    fill_level += 1
else: 
    # 用后一层来更新前一层
     if ALWAYS_FWD_ONE:
        past_tokens[0] = past_tokens[1][1:]
        for level in range(1, LEVEL - 2):
            past_tokens[level] = past_tokens[level + 1][:]
        past_tokens[LEVEL - 2] = new_results             
    else:
        past_tokens[0] = past_tokens[1][1 + max_hit:]
        for level in range(1, LEVEL - 2):
            past_tokens[level] = past_tokens[level + 1][max_hit:]
        past_tokens[LEVEL - 2] = new_results[max_hit:]

3.7.2 llama.cpp

llama.cpp updates the last line with logits. v is a loop that verifies each n-gram, encapsulating both the update of the 2D window and the verification of the n-gram.

// update lookahead tokens
{
    for (int i = 0; i < W; i++) {
        tokens_j_prev[i] = tokens_j[0][i];
    }

    // 用后一层来更新前一层
    for (int j = 0; j < N - 2; j++) {
        tokens_j[j] = tokens_j[j + 1];
    }

    if (v == 0) {
        // sample from the last level
        // 用logits更新最后一行。v是逐个校验n-gram的循环，此循环把对2d window的更新和对n-gram的校验都封装在一起
        for (int i = 0; i < W; i++) {
            tokens_j[N - 2][i] = common_sampler_sample(smpl, ctx, ngrams_cur.size()*(N-1) + W*(N - 2) + i);
        }
    } else {
        for (int i = 0; i < W; i++) {
            // there are different ways to init these tokens
            if (0) {
                // random init
                tokens_j[N - 2][i] = all[1 + rand() % (all.size() - 1)];
            } else {
                // init from the previous level
                tokens_j[N - 2][i] = tokens_j[0][i];
            }
        }
    }
}

0xFF Reference

https://github.com/ggerganov/llama.cpp

Lookahead Decoding

Lookahead Decoding: Understanding Megatron’s Meow

A quick look at Medusa and Lookahead’s speculative reasoning is done by A-Yuan.

Lookahead Decoding: Illustrated Guide by Sjrrr the Great Serpent

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

hao-ai-lab/LookaheadDecoding - Github

A 10,000-word overview of 10+ LLM speculative sampling inference acceleration solutions; AI casual discussion.

Jacobi decoding, paper speed reading, Bruce, wandering the world with a sword.

[2401.07851] Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

Lookahead Decoding & Comparison of 6 LLM Accelerated Decoding Methods: A Casual Discussion

https://lmsys.org/blog/2023-11-21-lookahead-decoding/

https://github.com/hao-ai-lab/LookaheadDecoding/issues/8

https://github.com/hao-ai-lab/LookaheadDecoding

https://jalammar.github.io/illustrated-gpt2/

https://jalammar.github.io/how-gpt3-works-visualizations-animations/

https://arxiv.org/abs/1811.03115

https://arxiv.org/abs/2211.17192

[Transformer 101 Series] In-depth Study of LLM Speculative Sampling (Part 1) aaronxic

[Transformer 101 Series] In-depth Study of LLM Speculative Sampling (Part 2) aaronxic