Exploring the Transformer Series (30) --- Decoding Speculation

0x00 Overview

Speculative decoding, also known as predictive decoding/speculative sampling, uses a small model to predict the behavior of a large model, thereby improving the decoding efficiency of the model and accelerating the execution of the large model. Its core idea is shown in the diagram below. First, multiple candidate tokens (serial sequences, trees, multi-head trees, etc.) are quickly generated in a low-cost manner (primarily using a small model, but also including multi-head, retrieval, and Early Exit methods). Then, the correctness of multiple tokens is quickly verified through a parallel verification stage. As long as the average number of tokens verified per step is greater than 1, multiple tokens can be generated at once, thus reducing the total number of decoding steps and achieving the goal of acceleration.

The left side of the image below shows the autoregressive decoding model, and the right side shows the speculative decoding mechanism.

3001

Essentially, speculative decoding aims to “speculate” in parallel by making better use of redundant computing power during the inference phase, without significantly altering the model. In contrast, another approach combines multiple models of different sizes and performance through routing. Routing determines which model to call before the call begins and continues until the call is complete. Speculative decoding, however, repeatedly calls models of varying sizes within a single query.

Note:

The complete list of articles is here. It’s estimated to eventually have around 35 articles. This list will be updated after each subsequent article is published. (Cnblogs Exploring Transformer Series: Article List)
This series is a study and interpretation of papers, blogs, and code, drawing on many articles from online friends, to whom I express my gratitude and will list them in the references. Because there are so many references in this series, there may be omissions in the citations. If the original authors or other friends find any omissions, please point them out, and I will add them to the references.

0x01 Background

1.1 Problem

As we all know, generative LLMs are mostly decoder-only structures. On the one hand, the models are relatively large, and the storage space and computational load required during inference are relatively large. On the other hand, when decoding large models, tokens are generated serially one by one. When the batch size is 1, matrix multiplication in the Transformer block degenerates into matrix-vector multiplication operations. For GPU inference, this is a very obvious IO bound, which makes it impossible to fully utilize the computing power of the GPU.

1.2 Autoregressive Decoding

Most current mainstream LLMs are Decoder-Only Transformer models, which use autoregressive sampling in the inference phase, and have the following characteristics:

The model uses a prefix as input, processes and normalizes the output to a probability distribution, and then samples to generate the next token.
After generating the first token, an autoregressive method is used to generate one token at a time, that is, the current round’s output token is concatenated with the historical input tokens as the input tokens for the next round, and then decoded.
Repeat step 2. In subsequent executions, the inputs of the two rounds differ by only one token.
The process will not end until a special Stop Token is generated (or a user condition is met, such as exceeding a certain length).

3002

The algorithm corresponding to autoregressive decoding is shown in the figure below.

3003

The disadvantages of autoregressive sampling are as follows:

Because autoregressive sampling generates tokens one by one during text generation, the generation of the next token depends on previously generated tokens. This serial mode results in slow generation speed and low efficiency. See the diagram below for details. Assuming there are a total of N tokens in the output, the Decoding stage needs to perform N-1 Forwards, which can only be performed serially.
During the generation process, the number of tokens that need to be considered increases (the generation of each token requires attention calculation with the previous tokens), and the amount of computation also increases accordingly.
The inference process of large models is often constrained by memory access speed. This is because inferring the next token requires relying on previous results. Therefore, when using GPUs for computation, all model parameters and key-value cache need to be moved to on-chip memory for processing. However, on-chip memory bandwidth is generally two orders of magnitude lower than computational performance, making large model inference memory-bandwidth-bound, and memory access bandwidth a serious bottleneck.

Furthermore, the capabilities of large models follow the scaling law, meaning that the more parameters a model has, the more powerful it is; however, larger models naturally require more computational resources. The scaling law tells us that we cannot reduce memory access by directly reducing the number of model parameters.

To address the issue of slow reasoning speed, researchers have implemented numerous engineering optimizations for reasoning, such as:

Improved computational core implementation, multi-GPU parallel computing, batch processing strategies, etc. The simplest approach is to increase the batch size during inference, such as using dynamic batching to merge multiple requests and transforming matrix-vector multiplication back into matrix multiplication operations. With a small batch size, this can almost achieve a linear increase in QPS. However, these methods do not fundamentally solve the problem that the LLM decoding process is constrained by memory access bandwidth.
Quantizing the model and KV Cache reduces the total number of bits required to read model parameters during the generation of each token, thus alleviating I/O pressure.
Increasing the arithmetic intensity, that is, increasing the ratio of “floating-point calculation volume/data transfer volume”, so that data transfer does not become a bottleneck.
Reducing the number of decoding steps, i.e., shortening the decoding process. Speculative decoding falls into this category.

0x02 Definition & History

2.1 Decoding Speculation

Speculative decoding allows us to process multiple tokens within the same user request together. Its purpose is similar to dynamic batching, which is to transform matrix-vector multiplication back into matrix multiplication operations. This is well-suited for scenarios where larger batch sizes are not feasible or where the goal is simply to reduce end-to-end latency. Speculative decoding typically uses two models: a Draft Model to quickly generate multiple candidate results, and a Target Model to validate and modify them in parallel, ultimately yielding a satisfactory answer. Specifically:

The draft model is used for guessing. The draft model reasoning is faster and takes on the work of serial processing. It generates K tokens in an autoregressive manner, allowing the target model to be computed in parallel.
The target model is used to evaluate the sampling results and review corrections. The target model samples from the autoregressive model by computing multiple tokens in parallel, and uses the inference results to decide whether to use these tokens generated by the draft model.

The algorithm for speculative decoding is shown in the figure below.

3004

Speculative decoding requires no changes to the output and guarantees the same sampling distribution as using the original model, making it equivalent to decoding directly with a large model. In the diagram on the right, the draft model first generates five prediction tokens, then inputs all five tokens into the target model. Using this prefix as input, the target model generates several tokens and then verifies them. Green indicates that the tokens generated by the draft model and the target model are consistent, and the prediction token passes “verification” - this token is a result that the LLM would generate itself. Red tokens are “speculation” tokens that fail verification. The first “speculation” token that fails verification, along with all subsequent “speculation” tokens, are discarded. Because this red token is not a result that the LLM would generate itself, the prefix correctness assumption is broken, and the verification of these subsequent tokens cannot guarantee that the prefix input is “correct.”

3005

2.2 Development History

The following is a history of the development of speculative decoding.

3006

Two articles deserve special mention. Both are considered pioneering works in the field of speculative analysis, and the mysteries surrounding them are difficult to unravel.

Speculative Decoding

The paper “Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation” was the first to use the term Speculative Decoding and established the paradigm of using the draft-then-verify method to accelerate Auto-Regressive generation.

Speculative Decoding aims to address the slow inference speed of existing Autoregressive models. Figure (a) below shows Blockwise Decoding, which introduces k-1 FFN heads to the target autoregressive model. These heads use shared attention to predict the following k tokens. (b) shows the Spec-Drafter model, an independent model for predicting draft tokens, using different queries to predict each draft token. The yellow portion in the figure above represents the autoregressive AR model, and the red portion represents the newly added modules.

3007

Speculative Sampling

The paper “Fast Inference from Transformers via Speculative Decoding” first proposed Speculative Sampling. This paper and the previous one were from the same period and are considered pioneering works in Speculative Decoding, with many subsequent studies building upon them. In this paper, the target model refers to the large model to be accelerated, and the approximation model refers to the small model used to help accelerate the large model.

From now on, we will use the term “speculative decoding” consistently.

Next, we will first conduct a brief analysis of “Blockwise Parallel Decoding,” a pioneering work in this field, and then study it in conjunction with two groundbreaking works.

0x03 Blockwise Parallel Decoding

The paper “Blockwise Parallel Decoding for Deep Autoregressive Models” proposes Blockwise Parallel Decoding, a pioneering work in this field, or rather, the first work on parallel decoding. Therefore, studying it carefully will help us understand subsequent developments. Blockwise Parallel Decoding (BPD) uses a multi-head approach to generate candidate sequences (a serial sequence) and then performs parallel verification.

3.1 Motivation

BPD aims to solve the low computational efficiency problem of serial greedy decoding in Transformer-based Decoders: during sequence generation, tokens are generated serially one by one, and the computational cost and the time required for the generation result are proportional to the number of tokens generated.

Next, let’s look at the starting point and approach of BPD.

3008

The image above illustrates the greedy decoding algorithm. While highly efficient, greedy decoding may not find the global optimum and has several drawbacks, as detailed below.

Assuming the length of the output sequence is m, then Autoregressive Decoding needs to perform m steps to obtain the final result. As the model size increases, the latency of each step also increases, and the overall latency will be amplified by at least m times.
Because each time a token generation calculation is performed, all model parameters and activation tensors need to be transferred, the decoding process is severely limited by memory bandwidth.

To overcome the above limitations, the motivation for improving BPD is as follows.

The authors expect to complete the entire prediction in n steps, where n is much smaller than m.
But how can we break the serial decoding curse and generate the next k tokens in parallel? Since language models predict the next token, if we have k-1 auxiliary models, each model can predict the tokens at positions 2 to k based on the input sequence. Then, the auxiliary models and the original model can potentially run independently, thus generating the next k tokens in parallel.

3.2 Approach

This paper proposes a parallel decoding technique for deep autoregressive models - the Blockwise Parallel Decoding (BPD) scheme. This scheme trains an auxiliary model (by adding a small number of parameters to the decoder of the original model) that enables the model to predict outputs at future positions (predicting and verifying the last k tokens in parallel). These predictions are then used to skip some greedy decoding steps, thus accelerating the decoding process. Specifically, BPD proposes a draft-then-verify paradigm using special drafting heads, with three stages: Predict, Verify, and Accept.

The Predict phase uses the original model plus k-1 auxiliary models to predict tokens at k positions. The paper transforms the original single head (the MLP used to predict the token distribution) into multiple heads. The first head is the head of the original model, used to predict the next token. The newly added heads predict the next token after that, and the next token after that, which is equivalent to predicting multiple tokens at once.
The Verify phase uses the original model to verify several possibilities formed by candidate words at these k positions in parallel. Since multiple tokens have already been generated, these token sequences can be verified in parallel using the original model in the next inference step (because model computation is I/O bounded, the increased computation from parallel verification has almost no impact on inference latency). The Verify process groups these tokens into batches, implements a suitable attention mask, and obtains the vocabulary probabilities for these k positions at once. Because the first head is the same as the original model’s head, the result is guaranteed to be correct, ensuring that the actual number of tokens generated in each decoding step is >= 1, thus reducing the number of decoding iterations. Additionally, new tokens to be predicted can be generated simultaneously during verification.
The Accept phase accepts the longest verified prefix and appends it to the original sequence. This phase greedily selects the token with the highest probability; if the verified token matches the token predicted in the Predict phase, it is retained. If they are different, all subsequent token predictions are incorrect.

It’s important to note that this paper only supports greedy decoding and is not suitable for other decoding algorithms (while Speculative Sampling can be adapted to Beam Search). Without sacrificing performance, the number of effective tokens may be limited. Furthermore, the model requires fine-tuning using training data. Therefore, Blockwise Parallel Decoding = multi-draft model + top-1 sampling + parallel verification. Inspired by this, subsequent Speculative Sampling methods also address the same problem by using a small model for parallel prediction and a large model for verification.

3.3 Architecture

BPD proposes a multi-head parallel decoding mechanism. Besides the original model p, several auxiliary models p2, …, pk are used in the Predict stage to assist in prediction. However, we face a problem: if these auxiliary models adopt the same structure as the original model p and are trained separately, the computational cost in the Predict stage is K times that of generating a single token. Even ignoring the Verify stage, the overall computational cost of the training task is not reduced in the ideal case. Moreover, the memory consumption of these K models would be staggering. Therefore, the paper does not actually construct k-1 auxiliary models; that is, p2, …, pk are not independent copies of the original model. The paper slightly modifies the original model, allowing these auxiliary models to share the backbone with the original model p1, and then adds a hidden layer, giving each model p1, …, pk an independent output layer. This allows the new model to predict the next k tokens, ensuring that the actual computational cost of the Predict stage is roughly equivalent to the computational cost of predicting a single token.

The specific model architecture is shown in the figure below, which adds three layers (from bottom to top) on top of the original model:

After the last Transformer Decoder layer in the original model, a hidden layer is added. Its input is (batch_size, sequence_length, d_model), and its output is (batch_size, sequence_length, k* d_model).
Several additional heads, p2, …, pk, are added after the hidden layer. The logits output from the Transformer Decoder layer are first passed to the hidden layer for projection, and the projected output is then passed to these heads. The calculation results of these heads are then residually concatenated with the logits of the original model. Each head is responsible for predicting a token, and the outputs of these k heads are the logits of tokens at k different positions. Head 1 is responsible for predicting the next token, Head 2 is responsible for predicting the next-next token, and so on.
Finally, the results are fed into a vocabulary projection layer (which includes a linear transformation and a softmax function) to predict the probability distribution of each word, and ultimately generate a token using a sampling method. This vocabulary projection layer is shared among multiple heads.

The backbone network plus Head 1 (red in the image below) is the original model, or the base model, which is the pre-trained model. The other Heads are what the paper calls auxiliary models (blue and green represent two auxiliary networks). Since the next token can be predicted based on the input sequence, it is also possible to predict the next token after that, and the token after that, based on the same sequence, although the accuracy may be lower. This allows an additional candidate sequence to be generated during the Decoding step, which the base model can then validate in the next Decoding step.

3009

3.4 Training

The modified model still needs to be trained using training data. Due to memory limitations during training, the paper cannot use the average of the k cross-entropy losses corresponding to the outputs of the k project layers as the loss. Instead, it randomly and uniformly selects one layer output for each minibatch as the loss.

The parameters of an FFN can be trained in the following ways:

Frozen Parameters: Freezes the original model parameters and only updates the parameters of newly added FFN layers. This ensures accurate prediction of the next token, but may affect the accuracy of the auxiliary model’s predictions.
Finetuning: Fine-tuning all parameters using the original parameters as initial values may improve the internal consistency of the model, but may result in a loss in final performance.
Distillation is well-suited for parallel decoding because both the teacher and student models have the same structure. Distilled data is generated by beam search of the original model using the same hyperparameters but different random seeds.

3.5 Steps

The diagram below illustrates the three stages of blockwise decoding: Predict, Verify, and Accept.

3010

Based on the above diagram, we will provide a detailed interpretation, assuming that the sequence length to be generated is m and the number of parallel Heads is k.

In the Predict phase.

The prediction involves using the original model plus k-1 auxiliary models to predict tokens at k positions. The original model p1 and the auxiliary models p2, …, pk are independent and can be executed in parallel. Therefore, the time to generate these k words is basically the same as the time to generate a single word, which reduces the overall number of steps and thus helps to reduce the overall latency.
In the above diagram, the original model and the two auxiliary models independently and in parallel predict the last three tokens, namely “in”, “the” and “bus”.

In the Verify phase, we need to select the longest prefix that meets the requirements from the K words generated in the previous step.

The original sequence and the generated k tokens are concatenated to form a batch (with a corresponding mask added). Pair<sequence_input,label> This Pair<sequence_input,label> batch is sent to the first batch in parallel to verify the k positions (to see if the token generated by the first batch is consistent with label).
Regarding the above diagram, the three tokens generated in the previous step are scored. Specifically, we concatenate the generated ‘in the bus’ with the prefix and feed it into the original model for a forward inference operation. In the Verify stage of the diagram, the black box contains sequence_input, the blue one is label to be verified, and the red arrow points to the prediction result. By performing only one forward inference operation, we can obtain the probability distribution of the vocabulary for the last three output positions.
- The first input to the batch is “I saw a dong ride”, and the output is “in”.
- The second input to batch is “I saw a dong ride in”, and the output is “the”.
- The third input to the batch is “I saw a dong ride in the”, and the output is “car”.

During the Accept phase, the longest K tokens whose predicted result matches that of Head1 will be selected as the acceptable result.

We can greedily select the token with the highest probability as the verification result. Looking from left to right, if the verified token is the same as the token predicted in the Predict stage, we keep that token. If they are different, then that token and all subsequent token predictions are incorrect.
Because it only accepts words before the first inconsistent word, and the original model p1 is used for validation, this ensures that the final result is completely consistent with the result predicted by the original sequence.
Regarding the image above, since “car” and “bus” are inconsistent, only “in” and “the” are retained.

Assume the sequence length to be generated is m, and the number of parallel Headers is k. In the autoregressive generation method, a total of m steps are required. In BDP, the above three-stage process is executed once for each k tokens: the predict stage performs one step, producing multiple Headers; the verify stage performs one step in parallel; and the accept stage is time-efficient. Therefore, in ideal conditions (where each generation of K tokens is acceptable), the total number of decoding operations is reduced from m to 2*m/K. Since the original model is used in both the predict stage (p1) and the verify stage, the original model is only used twice.

3.6 Optimization

Because of the existence of two stages, Predict and Verify, even in the ideal scenario, the total number of decoding iterations is 2m/K, not the optimal m/K. In fact, since the models in the Predict stage share a common backbone, and the Verify stage uses the original model p1, the Predict result in the (n+1)th step can be directly generated using the Verify result. Therefore, the authors further optimized the algorithm by simultaneously predicting the last k tokens during the original model verification. This allows the Predict and Verify stages to be merged, and the verification process also yields candidate tokens for the last k tokens.

After optimization, the model’s first inference only executes the predict phase (1 step), calling the original model once. Then it enters the overlapping verify and predict phase, processing the sequence forward by a length k each time until a termination token is generated (a total of m/k steps, calling the original model m/k times). That is, except for the first iteration, each iteration only needs to call the model forward once, instead of twice, thus halving the number of model calls required for decoding. This further reduces the number of model calls from 2m/k to m/k + 1.

3011

As shown in the image above, let’s take the previous example again:

In the Predict phase, the input word “I saw a dog ride in the” is used to perform an inference on the original model, generating new words “in”, “the”, and “bus”.
Verify phase:
- Group 1: Input “I saw a dog ride”, the word to be verified is “in”, the actual predictions are “in”, “the”, “car”, “last”, the top 1 word is “in”, the result is the same, so we accept the word “in”.
- Group 2: Input: I saw a dog ride in. The word to be verified is “the”. The actual predictions are “the”, “car”, “this”, and “week”. The top-ranked word for the first word is “the”, so the result is the same. We accept the word “the”.
- Group 3: Input “I saw a dog ride in the”, the word to be verified is “car”, the actual predictions are “bus”, “last”, “week”, “when”, the first word in the Top 1 is “bus”, the results are different, so the word “car” is not accepted.
Accept phase. Because the bus and car in the third group are different, the result of the third group is not accepted, and the result of the second group is accepted. Therefore, car, this, and week can be used as the new Predict result to continue the Verify process.

3.7 Revenue

Let’s take a look at the returns next.

This approach accelerates decoding because the Verify phase can use the base model p1 to decode the k predicted tokens simultaneously in parallel. Since each iteration of the Predict phase generates k tokens, which can be considered a block, this method is called blockwise parallel decoding. The inference result obtained by this method is the same as that of the autoregressive decoding method, thus avoiding any accuracy loss due to generation effects.

The speed of blockwise decoding depends on the number of model forward operations. Under memory-constrained conditions, the time to forward “I saw a dog ride” is approximately the same as the time to forward “I saw a dog ride in the car,” because both require access to model parameters and the KV cache, and the memory access overhead from a few more tokens is negligible.

0x04 Principle

Having reviewed BPD, this foundational work, let’s now take a look at Speculation Decoding.

4.1 Motivation

The motivation for speculative decoding comes from several observations and one reference.

4.1.1 Observation

Let’s first look at some key observations:

Difficult tasks often contain easier subtasks. In difficult language modeling tasks, there are usually some relatively easy subtasks. For example, when predicting certain tokens, the probability distribution of the softmax output tends to concentrate on those tokens, indicating that the model has a high degree of confidence in determining the next output token. This means that not all decoding steps are equally difficult. If we use a small model to answer these simple questions and then call a larger model when encountering difficult ones, we can improve the overall generation efficiency. In other words, most easily generated tokens can actually be generated using a model with fewer parameters.
Memory bandwidth and communication are bottlenecks for large model inference. For LLM inference, the bottleneck is usually not mathematical computation, but rather memory bandwidth, communication volume, and communication speed. Most of the inference time used in each decoding step of an LLM is not for forward computation of the model, but is consumed in migrating the massive amount of LLM parameters from GPU High-Bandwidth Memory (HBM) to the cache (for computational operations). This means that in some cases, appropriately increasing the computational load will not affect inference speed and can be used to improve concurrency.
When large models perform inference tasks (decoding phase), they often have a batch size of 1, generating only one token at a time, which prevents parallel computation and leads to significant computational redundancy. In fact, with limited increases in the number of tokens, the computation latency for inputting multiple tokens and inputting a single token in a single round is essentially the same. If we can enable large models to process a batch of tokens at once, we can utilize computational power and achieve a balance between computation and memory access.

4.1.2 Reference

“Speculative execution” is an optimization technique commonly used in processors (CPUs).

The basic idea is to execute a task in advance when it’s uncertain whether it truly needs to be executed, and then verify whether the task is indeed necessary. This increases concurrency and performance; a typical example is branch prediction. In processors, “speculative execution” is typically used to handle branch instructions. When a processor encounters a branch instruction, it doesn’t know the specific outcome of the branch condition, so it chooses a path to execute. If the branch condition ultimately meets the expectation, everything is normal, and the program continues execution. However, if the condition does not meet the expectation, the processor rolls back to the state before the branch, discards the previous operations, and then chooses the correct path to execute.

4.2 Approach

As mentioned above, speculative decoding was first proposed in two papers. Based on the above observations and the mechanism of speculative execution, the authors of the two papers extended the optimization technique of “speculative execution” to the decoding process of autoregressive models.

Speculative decoding uses two models: an original target model and a draft model, which is much smaller than the original model. The draft model and target model perform joint inference. The draft model generates $\gamma$ tokens, while the target model verifies whether these $\gamma$ tokens are the final required tokens. Essentially, a small model generates multiple draft tokens, and a larger model performs parallel verification, correction, and optimization on these draft tokens. This allows multiple tokens to be generated in a time approaching that of a single token generated by a large-parameter model. Let’s analyze this in detail.

“Speculative decoding” refers to using the output of a small model for speculation.
- First, a more efficient approximate small model is used to predict several subsequent tokens (some possible inference results, which are called “speculative prefixes”), which makes full use of the advantage of the fast decoding speed of the small model.
- During the decoding process, some tokens are relatively easy to decode, while others are very difficult. Therefore, simple token generation can be handled by smaller models, which should also be able to obtain correct predictions. Difficult tokens, on the other hand, should be handled by larger models. If the current problem is relatively simple, smaller models are more likely to guess multiple tokens correctly.
In the paper, parallelism refers to the large model calculating multiple tokens at once, saving transmission overhead. In other words, the large model is used in parallel to verify whether these tokens match the output of the large model; the approach is as follows.
- In a single forward propagation, multiple draft tokens are validated simultaneously. The process truncates the first draft token that does not match the original model output, discarding all subsequent draft tokens. This is the discarding behavior in “Speculative execution”.
- This approach leverages the higher computational efficiency of the prefill phase compared to the decoding phase. Large models can prefill the input from several smaller models’ decoding steps at once, arbitrating the results and improving inference speed. Replacing the decoding phase with the prefill phase of the large model saves memory access and fully utilizes tensor cores to accelerate matrix multiplication. This is not a purely algorithmic or purely hardware-system-centric acceleration solution, but rather a solution that considers both algorithms and hardware systems.
- Then, a novel sampling method is used to maximize the probability of these speculative tasks being accepted. This “speculative decoding” verification and resampling process is theoretically equivalent to sampling directly from the target LLM, thus ensuring that the final generated text distribution is consistent with the target LLM.

In summary, “speculative decoding” makes inference from large autoregressive models faster and more efficient by fully leveraging the complexity differences between models and employing parallel computing methods. It also maintains the same output distribution as the target model (accelerating inference on the target LLM without sacrificing decoding quality) without changing the model architecture, training process, or output. The execution flow is shown in the figure below.

3012

4.3 Comparison

The following is a comparison between speculative execution and speculative decoding.

category	Speculative execution	Decoding Speculation
Execute in advance	When encountering a branch instruction, the CPU does not know the specific result of the branch condition, so it will choose a path to execute.	The draft model performs sequential inference to generate draft tokens. This is equivalent to using the draft model to decode each token individually.
verify	Verify execution results	The target model performs parallel inference based on the serial output of the draft model, and then verifies and optimizes it. This is equivalent to using the results of several decoding steps of the smaller model as prefill input from the larger model for arbitration.
Verification successful	If the branch condition ultimately meets the expectation, then everything is normal, and the program will continue to execute.	Accept tokens generated by small models
Verification failed	If the conditions are not met, the processor will roll back to the state before the branch, discarding the previous operations.	Truncate the first draft token where it does not match the target model output, and discard all subsequent draft tokens.
Repair after failure	Select the correct path to execute.	Adjusting the probability distribution

The following is a comparison between speculative decoding and previous methods.

category	Previous plan	Decoding Speculation
Should the model architecture be changed?	Many previous methods required modifications to the model structure to make the inference process more efficient.	unnecessary
Should the training program be changed?	Some methods may require modifications to the training process so that the model can run more efficiently during the inference phase.	No modification to the training process is required; it can be applied directly to existing models.
Should we retrain?	Previous methods may have required retraining the model to adapt to the new architecture or training procedure.	unnecessary
Should the output distribution be changed?	Previous methods may cause changes in the model’s output distribution when accelerating the inference process.	By using the “speculative sampling” method, it is ensured that the results generated from the model have the same distribution as the original model.

Furthermore, the main difference between blockwise parallel decoding and speculative decoding lies in their model usage. Speculative decoding requires additional small models to regressively generate speculative tokens. These small models are constrained and more efficient than the target model, so the speedup can cover their cost.

In summary, the authors’ proposed method avoids many of the changes in model structure and training involved in previous methods while maintaining the same output characteristics, thus accelerating the inference process.

4.4 Classification & Design

The key to accelerating speculative decoding lies mainly in the following two points:

The efficiency and accuracy of “speculation”: How to quickly and accurately “speculate” the generation results of multiple decoding steps in the future of LLM.
The choice of “verification” strategy: How to ensure quality while allowing as many “speculated” tokens as possible to pass verification, thereby improving decoding parallelism.

Therefore, researchers typically categorize the implementation and research of speculative decoding based on these two points. Of course, the categorization methods may vary slightly. The following diagram shows a formal classification of speculative decoding techniques given in the paper “Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding,” which includes:

The draft model strategy covers how to design the model, the termination conditions, and how to manage multiple models (if any). The design of the “speculation” phase focuses on the trade-off between “speculation accuracy” and “speculation latency.” Generally, the larger the model used for speculation, the higher the speculation accuracy (i.e., the more tokens are verified), but the longer the speculation phase takes. The main concern of the speculation phase is how to achieve a balance between these two factors to ensure a relatively high overall speedup for speculation and decoding.
Validation Strategy. This category involves the design of validation schemes and acceptance criteria. The validation model is usually the target model, and its primary purpose is to ensure the quality of the decoding results. The acceptance criteria aim to determine whether a draft token should be (partially) accepted, i.e., whether the length of the accepted token is less than k. In each decoding step, the validation model validates the draft tokens in parallel to ensure that the output is aligned with the target LLM. This process also determines the number of tokens accepted at each step, which is an important factor affecting speedup. Sampling methods are specifically divided into lossless sampling and lossy sampling. (a) Lossless sampling mainly refers to using the original sampling method, such as greedy sampling or temperature sampling, for the original LLM, and then checking whether there are tokens that meet the requirements in the draft. The core of this method is that drafting is completely transparent to the original LLM and will not lose model performance. (b) Lossy sampling mainly refers to evaluating the quality of the draft during the validation phase, and then filtering out some high-quality drafts for acceptance based on some prior thresholds. The core of this method is to improve the draft acceptance rate and achieve higher speedup with some acceptable quality loss. Common verification standards include Greedy Decoding, Speculative Sampling, and Token Tree Verification. Because not all tokens with the highest probability are the most suitable decoding results, some works have proposed appropriately relaxing the “verification” requirements to allow more high-quality “speculated” tokens to be accepted, further improving the speedup ratio.

3013

The following diagram further details the classification content in the paper. The strategy for the draft model corresponds to number 1 in the diagram. The validation strategy corresponds to number 2 in the diagram. The specific speculative decoding method corresponds to number 3 in the diagram.

3014

4.4.1 Strategies for the Speculation Phase

The strategies in the speculation phase mainly consist of the following parts.

Draft

To some extent, a draft model is often a causal language model that generates speculative tokens. A draft model can be an additional small model outside the target model, such as generating candidate tokens in speculative decoding, or it can be several lightweight prediction heads connected to the target model, such as predicting upcoming tokens in blockwise parallel decoding. Recent advances suggest that draft models can also be retrievers that retrieve tokens from large corpora to complete the preceding context.

3015

The specific features of these draft models are as follows.

Independent Drafting. The main idea is to use a smaller model from the same family as the target LLM for “inference”. Because they are from the same family, the smaller model inherently possesses a certain degree of “behavioral similarity” to the target LLM, making it suitable as an efficient “inference” model. It’s crucial that the smaller model has the exact same vocabulary as the target model. Current optimizations focus on enhancing the “behavioral similarity” between the smaller and larger models, making the smaller model “more similar” to the target model. For example, knowledge distillation. The advantage of this approach is its ease of implementation and deployment. The disadvantages are: not all LLMs have readily available smaller models; integrating two different models into a single system introduces additional computational complexity, especially unfavorable for distributed deployment scenarios; and often requires training a draft model from scratch, a pre-training process that demands significant additional computational resources. Furthermore, separate pre-training may introduce distributional variations between the draft and original models, leading to sequence results that the original model may not like.
Self-Drafting. Due to the aforementioned disadvantages, related research proposes leveraging the target LLM itself for “efficient speculation,” that is, using the validation model itself as the drafting model. For example, reusing some intermediate results or parameters from the original LLM, and using hidden layer states to better predict future sequences. This approach naturally avoids the problem of model performance consistency, reduces additional computational overhead, and is also friendly to distributed inference. In terms of latency, Self-Drafting uses strategies to reduce the average number of parameters in the validation model, thereby achieving efficiency. For example, Blockwise Decoding and Medusa introduce multiple additional FFN Heads above the last decoder layer of the target LLM, allowing the model to generate multiple tokens in parallel at each decoding step as “speculation” results. However, these FFN Heads still require additional training. In addition to these two works, some studies have proposed using Early-Existing or Layer-Skipping for “efficient speculation,” or even simply inserting multiple [PAD] tokens at the end of the model input to achieve parallel “speculation.” Early-Existing is based on saturation observation: when generating a token, if the output tokens before and after the i-th layer are completely identical, we consider it to have reached the saturation point, and subsequent layers do not need to continue processing; we can directly return the token generated by the i-th layer. Because layers after the i-th layer are removed, the number of model parameters is reduced. Layer-Skipping determines which tokens, if skipped, do not significantly affect the generation of most tokens; these layers are skipped during token generation to reduce the number of parameters in the drafting model.
The method is based on retrieval. The idea is that the word groups in most common sentences can be statistically analyzed. Therefore, after generating a token, this token can be used to retrieve statistical data from the database to obtain which tokens are likely to be associated with this token, and then these tokens can be retrieved for verification.

Furthermore, draft models are not limited to a single small model. Some argue that, driven by ensemble learning, staged or cascaded small models of different scales can further improve performance. For example, the paper “Cascade Speculative Drafting for Even Faster LLM Inference” proposes Vertical Cascade and Horizontal Cascade. Vertical Cascade uses speculative decoding to accelerate speculative decoding. Horizontal Cascade refers to using a larger draft model for the first few tokens with high acceptance rates, and using a smaller model to “fool” the later tokens with lower acceptance rates.

Termination conditions

Sequences of speculative tokens that are too short or too long are suboptimal, but finding a truly suitable criterion is also difficult. Therefore, researchers have conducted in-depth studies on termination conditions, which can be broadly categorized into several types.

Static Setting: The simplest solution is to set the length k to a static value that can be iterated over and manually reset.
Adaptive Thresholding: While static settings can satisfy most use cases, the constant manual adjustment can be cumbersome. To address this, an adaptive thresholding method has been proposed, aiming to stop draft generation based on per-token consistency as early as possible. If the consistency falls below the threshold, draft model generation will cease. The threshold can be adaptively adjusted based on certain optimization objectives, such as the quality of draft tokens.
Heuristic Rules: Some heuristic rules can also be used to determine termination conditions. For example, if the previous guess is fully accepted during verification, the token length is presumably increased; otherwise, it is decreased. Another approach might be to adjust the length based on the batch size from the perspective of the system service.

Although various methods have been developed to automatically detect ideal values for termination conditions, it remains difficult to determine whether they are good enough. Given this need, we should develop more robust methods for searching and setting these parameters to achieve more stable and attractive performance.

4.4.2 Strategies for the Verification Phase

The verification phase, also known as the large model validation phase, is divided into the verification scheme (how to organize the input of multiple sequences, such as token tree verification) and the design of the acceptance criteria (sampling methods, such as greedy sampling, nucleus sampling, and typical sampling).

Verification scheme

The simplest way to organize multiple sequences of input is to directly form all possible inputs into multiple batches.

If only one token needs to be verified, then a chain-based validator (a general validator that uses the token as a sequence or link) should be sufficient. However, if multiple tokens are used, verifying them one by one sequentially will result in redundant computation and become too time-consuming. For example, consider two sequences, “maching learning is a” and “machine learning is the,” where the only difference is whether the last token is “a” or “the,” even though the prefixes are the same.

Therefore, researchers have proposed a tree-based verification method that enables the target LLM to verify multiple draft sequences in parallel. This method first constructs a trie from multiple candidate token sequences by sharing a prefix and prunes less frequent nodes from the trie tree. Then, it verifies it in parallel using tree attention in a single run (i.e., a child token can only see its parent token through an attention mask), which facilitates parallel verification of potential multi-token pairs. In contrast, a single token requires only one attention chain. The tree-based verification method relies on causal and lower-triangular masks, as shown in the figure below.

3016

3017

Acceptance Standards

Once the draft token is input into the target model, we can obtain the corresponding output probability. By aligning the speculative token and the probability, we can infer whether each token is valid in the draft.

Exact match

The simplest acceptance criterion is exact matching, which checks whether the speculative token has the highest probability accordingly. This strategy is based on a greedy algorithm. The verification of greedy sampling mainly ensures consistency in results when both the drafting model and the verification model use the greedy strategy. That is, it is necessary to verify whether each generation of the verification model is exactly the same as the generation of the drafting model.

Note: These are two groundbreaking works. $M_p$ , $M_q$ It’s the opposite; please be sure to pay attention when reading.

3018

While exact matching is simple, clear, and direct, and can guarantee consistency between the validated output and the target model’s output at a relatively low cost, it has some drawbacks:

While exact matching can guarantee that the validated output matches the output of the target model at a relatively low cost, this equation only holds true when greedy decoding is used.
When the target model uses sampling decoding, exact matching may struggle to accept tokens from the draft model, potentially slowing down the decoding process rather than speeding it up.
Overly strict matching requirements often lead to the rejection of high-quality tokens simply because they differ from the first prediction of the target LLM, thus limiting the paradigm’s speedup.

Rejection Sampling

To address the aforementioned issues, numerous studies have proposed various approximate verification criteria. Compared to lossless criteria, these methods slightly relax the matching requirements to place greater trust in drafts, thereby increasing the acceptance of draft tokens. For example, researchers proposed an acceptance criterion modified from rejection sampling to alleviate this problem (the two seminal papers). Theoretically, this acceptance criterion can be applied to both greedy decoding and sampling decoding.

3019

Typical Acceptance

The two acceptance criteria mentioned above provide a rigorous guarantee of quality. However, overly stringent acceptance criteria may offset the efforts of parallel verification and reduce the burden of speculative execution, especially when temperature parameters are applied. Therefore, in some cases, it is necessary to moderately relax the acceptance criteria to achieve a more significant speedup. Typical Acceptance can do this: if the speculative probability of a token exceeds a hard threshold, then the token in the draft is accepted. Furthermore, the threshold can be dynamically adjusted via top-k constraints. When multiple tokens are provided, Typical Acceptance will consider the token that forms the longest sequence and discard the others.

0x05 Algorithm

5.1 Overall Process

The diagram below illustrates the overall algorithm flow for speculative decoding. This algorithm first uses a more efficient approximation model $M_q$ to generate multiple guessed tokens, then use the target model $M_p$ . The algorithm evaluates the probabilities of these guessed tokens in parallel and decides which guesses are acceptable based on the evaluation results (accepting those that lead to the same distribution in parallel). If necessary, the algorithm also adjusts the distribution of the target model to maintain consistency. Finally, the algorithm returns a result from $M_p$ and $M_q$ . The generated results were obtained from this process. This process effectively utilizes the strengths of both models, accelerating the generation process.

Here we assume $p_i(x)$ , $q_i(x)$ . These are the distributions of the target and draft models, respectively.

3020

We’ll illustrate how random sampling works with an example. In the diagram below, each row represents one iteration. Green tokens are suggestions proposed by the approximate model and accepted by the target model. Red tokens are suggestions proposed by the approximate model but rejected by the target model; blue tokens are corrections made by the target model to the red tokens, i.e., rejecting the red tokens and resampling to obtain the blue tokens.

In the first line, the approximate model generates 5 tokens. The target model uses these 5 tokens and the prefix concatenated sentence “[START] japan’s benchmark bond” as input to verify the small model’s generation effect through one inference execution. Because the last token “bond” is rejected by the target model, “n” is resampled and generated. Thus, the four tokens in the middle, “japan”, “‘s”, and “benchmark”, are all generated by the small model. Similarly, because the large model executes the input sequence in parallel, it only forwards 9 times and generates 37 tokens. Although the total computational cost of the large model remains the same, the latency of the large model inferring one token is similar to the latency of the small model generating 5 tokens (parallel processing is always faster than generating them one by one), thus significantly improving the generation speed.

3021

5.2 Key Steps

Next, we will analyze the key steps and operations of the SpeculativeDecodingStep algorithm.

5.2.1 Prerequisites

The algorithm takes three parameters as input: the target model. $M_p$ Draft model $M_q$ and the known prefix.

target model

The target model refers to the original large autoregressive model, such as a large Transformer model. It is the primary model for inference and is responsible for generating accurate output. Target models typically have more parameters and computational resources, but this also results in slower single-step inference speeds.
Assumption $M_p$ For the target model, model inference involves obtaining the corresponding distribution p(xt|x<t) from the model given a prefix input x<t. Speculative decoding aims to accelerate this inference process.

draft model

The draft model is a more efficient approximation model designed to generate the next token faster given a prefix. Compared to the target model, it may have fewer parameters and higher computational efficiency to improve overall inference speed. The draft model can adopt the same structure as the original model but with fewer parameters, or simply use an n-gram model.
Assumption $M_q$ For a more efficient approximation model for the same task, given a prefix input x<t, the corresponding distribution q(xt|x<t) can be obtained from the model.

The paper “Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation” establishes two principles for draft models: the Capability Principle (as accurate as possible) and the Latency Principle (as fast as possible). It’s also important to note that the number of parameters in the smaller model should be significantly smaller than the original model for noticeable results; the draft model and the original model need to use the same tokenizer, otherwise, additional decoding and encoding time will be incurred.

5.2.2 Step 1 - Sampling

For the input prefix, while performing inference using the LLM (target model), the draft model is run serially in an autoregressive manner based on the input prefix in parallel. Next, generate $\gamma$ each token (called guesses or draft tokens). That is, for each $i \in \gamma$ , calculate $q_i(x)$ , in $q_i$ yes $M_q$ exist $prefix+[x_1, ..., x_{i-1}]$ The distribution on, and then from this distribution $q_i$ sample a token $x_i$ .

5.2.3 Second Step - Parallel Execution of the Target Model

Generate $\gamma$ each token and its prefix are concatenated together and fed into the target model. $M_p$ Perform one forward propagation (the verification model samples the candidate tokens as input and computes them in parallel using a prefill method). That is, for each guessed token $x_i$ attach it to the prefix $prefix+[x_1, ..., x_{i-1}]$ Then run the target model in parallel. $M_p$ They received $M_p$ Output $p_1(x), ..., p_{\gamma+1}(x)$ . Compared to autoregression, although the computational cost is the same, $\gamma$ multiple tokens can participate in the computation simultaneously, significantly improving the computation-to-memory ratio.

5.2.4 Step 3 - Calculate the number of accepted guessed tokens

Now that we have the outputs of the target model and the draft model, we can calculate the number of accepted guessed tokens. However, how do we evaluate the quality of a token generation? This is done by comparing $M_q$ the generated guesses and $M_p$ the probability of these guesses is determined by the evaluation probability (p, q represent the sampling probability in the size model, which is the probability distribution after logits normalization). That is, the distribution corresponding to each token is compared in turn to find the number of tokens n that meet the requirements, or in other words, rejection sampling is performed for each token based on p/q to determine the number of accepted tokens.

if $q(x) \leqslant p(x)$ This indicates that the larger model has higher confidence in the token x sampled by the smaller model, and therefore does not need to further reduce the probability distribution on this token, so the result is accepted.
Conversely, if q(x) > p(x), it means the larger model has lower confidence on token x than the smaller model. Therefore, some sampling probabilities from the larger model on this token need to be allocated to other tokens on which the larger model has higher confidence. $\frac{p(x)}{q(x)}$ The probability of accepting the current token, in order to $1-\frac{p(x)}{q(x)}$ The probability of discarding x is.

5.2.5 Step Four - Adjusting the Probability Distribution

This step will be adjusted as needed. $M_p$ The distribution of.

If the number of accepted guess tokens n is less than $\gamma$ , then the distribution generated by Mp may need to be adjusted to ensure that the accepted guesses are consistent with the output distribution of the target model.
Specifically, this is achieved by creating a new distribution.

$p'(x) = \mathrm{norm}(\max(0, p_{n+1}(x) - q_{n+1}(x)))$

To adjust.

It’s important to note that if a sample is rejected, it means we need to recalculate, which wastes computational resources. Therefore, during algorithm execution, we aim to accept as many samples as possible generated by Mq to minimize computational costs.

We will explain the significance of adjusting the probability distribution in detail later.

5.2.6 Step 5 - Returning the generated results

After adjusting the probability distribution, the generated result will be returned.

If the results generated by the smaller model are satisfactory, then the next token is sampled using the larger model. That is, using $M_p$ sample the next token, and add $M_q$ Return all n generated tokens together.
If a token x is unsatisfactory and is rejected, all tokens after token x are discarded. This is because the distribution of Mp has been adjusted in step four, and the tokens will be distributed according to this new probability distribution.

$p'(x) = \mathrm{norm}(\max(0, p(x) - q(x)))$

A new token is sampled as a correction.

Because of the addition of the large model rejection sampling process and the supplementary sampling process based on the probability distribution difference of the large model, the above sampling process is equivalent to sampling directly from p(x).

What is the maximum number of tokens that can be generated? If the verification process is considered to have an acceptance probability of $\alpha$ The algorithm involves $\gamma$ consecutive determinations. From the above algorithm flow, we know that the length of the output token ranges from [1, $\gamma$ +1], and there are the following 3 cases.

Scenario 1: If the first token is rejected by the large model, then the sampled output of the large model is used directly to generate a token of length $\gamma$ =1.
Case 2: When the t-th token is accepted by the large model, but the (t+1)-th token is rejected by the large model, the generated length is L = t+1. Note that in this case, t $\leq$ $\gamma$ -1.
Scenario 3: When all k tokens are accepted by the large model, the maximum generation length L = $\gamma$ should be reached. However, if the $\gamma$ tokens generated by the draft all pass validation, an additional token can be sampled from the logits of the already calculated $\gamma$ +1th token. Since this token is generated by the target model, it doesn’t need validation. Therefore, the final generation length L = $\gamma$ +1.

5.3 Key Analysis

Next, let’s look at some key points in decoding speculation.

5.3.1 Parallel Verification

3022

Let’s look at an example to see how to perform parallel verification.

In the diagram below, the input is: Our technique illustrated in the case of . The small model generates three tokens sequentially, each time accepting an input of (1, vocal_size). See number 1 in the diagram below for details.

In the first inference, the small model generates an unconditional one.
In the second inference, the small model generates the language.
The third inference step involves generating a small model.

3023

There are two methods to verify these tokens.

Scheme 1 is the scheme proposed in the paper, as shown by number 2 in the diagram above. Parallelism in the paper refers to calculating multiple tokens at once, saving transmission overhead. However, the paper $M_p$ Parallel computing is a form of acceleration that disregards computational resources. It attempts to compute large models in parallel at every step to achieve speed optimization, but this places extremely high demands on parallel computing capabilities. For example, with r = 3, four large models need to be computed simultaneously. Under extreme parallelism, theoretically optimal speed can be achieved, but at the cost of wasted computing power, which is unacceptable in engineering.

Scheme 2 is the practical solution, leveraging the higher computational efficiency of the prefill phase (parallel processing of multiple tokens) compared to the decoding phase (serial generation of multiple tokens) to accelerate the process. The target model’s task is not generation, but verification. Due to the parallel capabilities of modern computers, we can approximate that the time taken for a large model to process one token is almost the same as the time taken to process multiple tokens in parallel. This ensures that the verification process can be implemented in parallel; that is, calling the target model to perform the prefill operation once can complete the verification of multiple draft models (multiple decoding steps) in one go, thereby reducing the number of inference steps. Simultaneously, depending on how well Mq approximates Mp, multiple new tokens may be generated, up to a maximum of $\gamma$ + 1. Figure 3 above illustrates this process. The large model receives (3, vocal_size) inputs at once, meaning it directly checks the three new tokens: “unconditional”, “language”, and “modeling”. This is why it’s called parallel processing. The approach is similar to cross-entropy verification during the LLM training phase. Through the parallelism of misalignment and matrix computation, a single computation can verify the correctness of the three results generated by the small model, thus completing the verification. Four verifications need to be executed in parallel (using argmax as an example):

Prefix “Our technique illustrated in the case of”, generates “unconditional”, which is the same as the first token “unconditional” generated by the approximate model, and is accepted.
Prefix “Our technique illustrated in the case of unconditional”, generates “language”, which is the same as the second token “language” generated by the approximate model, and is accepted.
The prefix “Our technique illustrated in the case of unconditional language” generates “method”, which is different from the third token “modeling” generated by the approximate model and is therefore unacceptable.
Prefix “Our technique illustrated in the case of unconditional language modeling” generates “of” as a candidate. If all the previous ones are accepted, then this token is accepted.

5.3.2 Acceleration Effect

What is the principle behind this acceleration? In short, speculative decoding speeds up inference compared to autoregressive sampling because it reduces the number of serial calls to the original model. Therefore, speculative decoding requires combining the following two steps to achieve faster inference.

Draft generation. Mq generates $\gamma$ completions. Because the draft model has fewer parameters, it generates tokens faster than the target model, making it a more efficient model, thus reducing the time spent generating completions.
Draft validation. All guesses from Mq and their corresponding probabilities are evaluated in parallel using the target model Mp. Guesses that lead to the same distribution are accepted, and an additional token is drawn from the adjusted distribution to fix the first rejected token, or an additional token is added if all tokens are accepted. That is, through the parallelism of misalignment and matrix computation, a single computation can verify the correctness of the $\gamma$ results generated by the small model.

3024

The example below shows different numbers of tokens (verified), where purple represents the decoder executing the target model Mp, blue represents the decoder executing the approximate model Mq, and yellow and orange represent the encoder calls. Here, the number of tokens that can be received in one iteration is defined as generated tokens. The speedup effect is related to $\gamma$ , p, and q. Intuitively, the larger $\gamma$ is, the closer the distributions of p and q are, and the larger the number of generated tokens.

To explain in layman’s terms.

At the bottom, the large model directly predicts the new token, which takes too long.
The middle and top sections first use a small model to predict 12 tokens. Then, the large model, leveraging the parallel nature of matrix computation, can verify which of the first 12 tokens are correct in one go. If any are correct, it saves a lot of time (because the small model is much smaller than the large model, the time consumed by the small model is negligible).

3025

The factors affecting acceleration ratio are:

The size of the small model and the number of tokens in one inference.
The latency of the small model in generating candidate tokens.
The acceptance rate of large models for inference tokens from small models, or the degree of alignment between small and large models.

Therefore, if the acceptance rate of the small model’s output draft is high enough and the latency of generating candidate tokens is short, speculative decoding can achieve a higher speedup. Suppose we guess n tokens at a time, and on average m tokens are ultimately accepted. In this process, we call the small model D n times, the large model T once, and generate m tokens. As long as nD is significantly less than (m-1)T, a good speedup can be achieved.

Having understood the principle, we can see the limitations of this speedup method: whether the distribution generated by the small model is consistent with that of the large model. The acceptance rate of the validation will greatly affect the final speedup ratio; the higher the acceptance rate, the more decoding steps are reduced, and the less computation is wasted due to unaccepted data.

5.3.3 Adjusting the distribution

We raise the following question: In the fourth step of the algorithm, when n < $\gamma$ , why is it necessary to adjust the distribution obtained from the target model (Mp)? What is the purpose of this adjustment?

This leads to another core aspect of speculative decoding: how to ensure that the probability of tokens obtained through speculative decoding is the same as that sampled directly from the large model. In fact, both the paper on speculative decoding and the paper on speculative decoding provide proofs that this verification and resampling process is theoretically equivalent to sampling directly from the target LLM. Therefore, it can be guaranteed that the final generated text distribution is consistent with the target LLM. That is, for any distributions p(x) and q(x), the distribution of tokens obtained through speculative decoding from p(x) and q(x) is the same as the distribution of tokens obtained by sampling only from p(x).

Let’s first outline how to prove it. Essentially, we want to examine what $p(x=\tilde{x})$ After using the speculative decoding strategy, is the probability still equal to our original probability? $q(x=\tilde{x})$ , Right now $q(\tilde{x})$ The probability decomposition approach is as follows: there are two possibilities for sampling $\tilde{x}$ . It can be proven that after resampling, the overall probability is consistent with the original probability.

Path 1: The small model p(⋅|⋅) sampled $\tilde{x}$ And it was successfully accepted. Note that if at this point $\tilde{x}$ If a rejection occurs, it is impossible to obtain the result through resampling. $\tilde{x}$ The reason is that a refusal indicates $q(\tilde{x})$ Less than $p(\tilde{x})$ Therefore, in resampling $\max(q(\tilde{x}) - p(\tilde{x}), 0)$ If the value is 0, it is impossible to resample. $\tilde{x}$ .
Path 2: The small model p(⋅|⋅) sampled other values. $x \neq \tilde{x}$ And a rejection occurred, at which point resampling was performed. $\tilde{x}$ .

Secondly, the detailed derivation process is shown in the figure below. We have organized and annotated the formula based on the paper “Accelerating Large Language Model Decoding with Speculative Sampling”.

3026

Deviation

When n < $\gamma$ , it means that fewer tokens are sampled from the more efficient approximation model Mq than $\gamma$ , meaning that some of these guesses are rejected by the target model Mp. This may be because $M_q$ The generated guesses and the target model $M_p$ The actual distribution has some deviation.

When using this approximation model $M_q$ The probability of the generated token is less than or equal to that of the target model. $M_p$ When generating the probability of this token, we will retain this token. When using an approximate model $M_q$ The probability of the generated token is greater than that of the target model. $M_p$ When generating the probability of this token, we cannot simply accept it, as this might lead to a result inconsistent with the distribution of the target model. Therefore, in this case, we will reject the token with a certain probability and resample from the adjusted probability distribution.

Note: For quick understanding, if $M_p$ The probability of generating a certain token is 0.5. $M_q$ The probability of generating this token is 0.6, which means $M_q$ It’s even more unreliable than a large model; it’s unbelievable.

Compensation for deviation

Adjust the target model $M_p$ The purpose of the distribution is to compensate for the shortcomings of the approximate model. $M_q$ The guesses and target models obtained in $M_p$ The differences between distributions are analyzed to ensure that the final generated result conforms to the true distribution of the target model. This ensures that the results obtained during the speculative decoding process maintain a certain level of accuracy and consistency.

Adjusting the distribution operation compensates for the small model $M_q$ and large models $M_p$ The gap in probability distributions between them. The idea is: for small models $M_q$ Each guess, based on the large model $M_p$ and small models $M_q$ We use the probability distribution to determine how likely this guess is to be correct. This is essentially using a small model $M_q$ Sampling to large model $M_p$ A mapping was established between the samples. This allows for the mapping of small models. $M_q$ and large models $M_p$ The probabilities are treated as several random events, and then the small model is $M_q$ Random events and large models $M_p$ We map random events onto each other; if the outcomes of the random events on both sides are consistent, we consider the guess correct. Specifically, if the two probability distributions are identical, the probability of a correct guess is 1. If, at some step, we consider the smaller model $M_q$ If the initial guess is incorrect, then all subsequent results are invalid. In this case, a larger model is used. $M_p$ The final step involves sampling the probability distribution obtained and then exiting. This step ensures that the output is identically distributed and that at least one token is output each time.

Specifically, the author needs to define a new distribution. $p'(x)$ It is based on the target model $M_p$ The original output distribution $p_{n+1}(x)$ This is derived from adjustments. If n < $\gamma$ (i.e., the target model rejects some guesses), the authors used an adjustment function to modify it. $p_{n+1}(x)$ This adjustment function is $\max(0, p_{n+1}(x) - q_{n+1}(x))$ Its function is to ensure $p_{n+1}(x)$ Not less than $q_{n+1}(x)$ The purpose of this is to keep the distribution generated by the target model as consistent as possible with the distribution of the approximate model.

Here’s a more intuitive explanation. This adjusted probability distribution $p'(x)$ It is obtained by subtracting the probability distribution (p(x)) of the target model from the probability distribution (q(x)) of the approximate model, taking the maximum value of the result, and then normalizing it. This adjusted distribution ensures that the results sampled from the target model have the same distribution characteristics, while also handling rejected tokens, ensuring consistency in the final generated results.

If p(x’) > q(x’), it means the larger model has a higher probability of generating token x’ than the smaller model. Therefore, the larger model is more confident in generating token x’, indicating that the smaller model’s generation is less problematic and x’ can be retained. If p(x’) ≤ q(x’), the smaller model is more confident, and the larger model rejects the token with a probability of 1-p(x)/q(x) and resamples. Because the probability of acceptance is biased towards positions with larger q(x), the probability of resampling should be biased towards positions with larger p(x), hence the norm(max(0, p(x)-q(x))).

Make up for the result

An additional token is generated from the adjusted distribution (based on the tokens preceding the first erroneous token) to correct the first erroneous token. If all tokens are accepted, a newly generated token is added (this token is generated by the target model and therefore does not need verification), ensuring that at least one new token is generated each time. Thus, even in the worst case, where the target model runs entirely serially, the number of runs will not exceed the number of times the target model would run serially in the normal mode (each parallel run of the target model generates at least one new token). Of course, it is also possible to generate more tokens, up to a maximum of q+1, depending on how closely the approximation model Mq approximates the target model Mp.

5.3.4 Optimization

In speculative decoding methods, the acceptance rate of draft tokens is significantly affected by the consistency between the output distribution of the draft model and the output distribution of the original large model. Therefore, a large amount of research has focused on improving the draft model.

DistillSpec extracts smaller draft models directly from the target large model. SSD includes automatically identifying sub-models (subsets of model layers) from the target large model as draft models, eliminating the need for separate training of draft models. OSD dynamically adjusts the output distribution of the draft models to match the user query distribution in the online large model service. It achieves this by monitoring rejected draft tokens from the large model and using this data to improve the draft models through distillation. PaSS proposes using the target large model itself as a draft model, taking trainable tokens (lookahead tokens) as input sequences to simultaneously generate subsequent tokens. REST introduces a retrieval-based speculative decoding method, using non-parametric retrieval data storage as the draft model. SpecInfer introduces a collective boosting tuning technique to align the output distribution of a set of draft models through the target large model. Lookahead decoding involves the large model generating n-grams in parallel to generate draft tokens. Medusa fine-tunes several heads of the large model specifically for generating subsequent draft tokens. Eagle employs a lightweight Transformer layer called an autoregressive head to generate draft tokens in an autoregressive manner, integrating rich contextual features from the target large model into the input of the draft model.

Another study focuses on designing more efficient drafting strategies. Traditional methods typically generate a single draft token sequence, which poses a challenge for validation. To address this, Spectr advocates generating multiple draft token sequences and employing a k-sequential draft selection technique to concurrently validate k sequences. This method utilizes speculative sampling to ensure the consistency of the output distribution. Similarly, SpecInfer employs a similar approach. However, unlike Spectr, SpecInfer merges the draft token sequences into a “token tree” and introduces a tree-like attention mechanism for validation. This strategy is called a “token tree verifier.” Due to its effectiveness, the token tree verifier is widely adopted in numerous speculative decoding algorithms. In addition to these efforts, Stage Speculative Decoding and Cascade Speculative Drafting (CS Drafting) propose accelerating drafting by directly integrating speculative decoding into the token generation process.

0x06 Implementation

We will use https://github.com/huggingface/transformers/src/transformers/generation/utils.py for learning.

6.1 Global Loop

Speculative decoding is performed inside the while loop of the _assisted_decoding() function.

 def _assisted_decoding(
     self,
     input_ids: torch.LongTensor,
     candidate_generator: CandidateGenerator,
     logits_processor: LogitsProcessorList,
     stopping_criteria: StoppingCriteriaList,
     generation_config: GenerationConfig,
     synced_gpus: bool,
     streamer: Optional["BaseStreamer"],
     **model_kwargs,
 ) -> Union[GenerateNonBeamOutput, torch.LongTensor]:
     # init values
     do_sample = generation_config.do_sample
     output_attentions = generation_config.output_attentions
     output_hidden_states = generation_config.output_hidden_states
     output_scores = generation_config.output_scores
     output_logits = generation_config.output_logits
     return_dict_in_generate = generation_config.return_dict_in_generate

     # init attention / hidden states / scores tuples
     scores = () if (return_dict_in_generate and output_scores) else None
     raw_logits = () if (return_dict_in_generate and output_logits) else None
     decoder_attentions = () if (return_dict_in_generate and output_attentions) else None
     cross_attentions = () if (return_dict_in_generate and output_attentions) else None
     decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None

     # if model is an encoder-decoder, retrieve encoder attention weights and hidden states
     if return_dict_in_generate and self.config.is_encoder_decoder:
         encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None
         encoder_hidden_states = (
             model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None
         )

     # keep track of which sequences are already finished
     batch_size = input_ids.shape[0]
     unfinished_sequences = torch.ones(batch_size, dtype=torch.long, device=input_ids.device)
     model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs)

     this_peer_finished = False
     is_first_iteration = True  # to preserve the same API in the output as other generation methods
     # while循环里面进行投机解码
     while self._has_unfinished_sequences(this_peer_finished, synced_gpus, device=input_ids.device):

6.2 Outer Logic

This includes obtaining the output of the draft model, calling the algorithm in the paper, and adjusting the token based on the algorithm results.

while self._has_unfinished_sequences(this_peer_finished, synced_gpus, device=input_ids.device):
    cur_len = input_ids.shape[-1]

    #  1. Fetch candidate sequences from a `CandidateGenerator` and move to the correct device
    candidate_input_ids, candidate_logits = candidate_generator.get_candidates(input_ids)

    candidate_input_ids = candidate_input_ids.to(self.device)
    if candidate_logits is not None:
        candidate_logits = candidate_logits.to(self.device)

    candidate_length = candidate_input_ids.shape[1] - input_ids.shape[1]
    is_done_candidate = stopping_criteria(candidate_input_ids, None)

    # 2. Use the original model to obtain the next token logits given the candidate sequence. We obtain
    # `candidate_length + 1` relevant logits from this process: in the event that all candidates are correct,
    # we use this forward pass to also pick the subsequent logits in the original model.

    # 2.1. Prepare the model inputs
    candidate_kwargs = copy.copy(model_kwargs)
    candidate_kwargs = _prepare_attention_mask(
        candidate_kwargs, candidate_input_ids.shape[1], self.config.is_encoder_decoder
    )
    candidate_kwargs = _prepare_token_type_ids(candidate_kwargs, candidate_input_ids.shape[1])
    if "cache_position" in candidate_kwargs:
        candidate_kwargs["cache_position"] = torch.cat(
            (
                candidate_kwargs["cache_position"],
                torch.arange(cur_len, cur_len + candidate_length, device=input_ids.device, dtype=torch.long),
            ),
            dim=0,
        )

    model_inputs = self.prepare_inputs_for_generation(candidate_input_ids, **candidate_kwargs)
    if "logits_to_keep" in model_inputs:
        model_inputs["logits_to_keep"] = candidate_length + 1

    # 2.2. Run a forward pass on the candidate sequence
    # prepare variable output controls (note: some models won't accept all output controls)
    model_inputs.update({"output_attentions": output_attentions} if output_attentions else {})
    model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {})

    outputs = self(**model_inputs)

    # 2.3. Process the new logits
    # .float() is needed to retain precision for later logits manipulations
    new_logits = outputs.logits[:, -candidate_length - 1 :].float()  # excludes the input prompt if present
    new_logits = new_logits.to(input_ids.device)
    next_token_logits = new_logits.clone()
    if len(logits_processor) > 0:
        for i in range(candidate_length + 1):
            new_logits[:, i, :] = logits_processor(candidate_input_ids[:, : cur_len + i], new_logits[:, i, :])

    # 3. Select the accepted tokens. There are two possible cases:
    # Case 1: `do_sample=True` and we have logits for the candidates (originally from speculative decoding)
    # 👉 Apply algorithm 1 from the speculative decoding paper (https://arxiv.org/pdf/2211.17192.pdf).
    if do_sample and candidate_logits is not None:
        valid_tokens, n_matches = _speculative_sampling(
            candidate_input_ids,
            candidate_logits,
            candidate_length,
            new_logits,
            is_done_candidate,
        )

    # Case 2: all other cases (originally from assisted generation) 👉 Compare the tokens selected from the
    # original model logits with the candidate tokens. We can keep the candidate tokens until the first
    # mismatch, or until the max length is reached.
    else:
        if do_sample:
            probs = new_logits.softmax(dim=-1)
            selected_tokens = torch.multinomial(probs[0, :, :], num_samples=1).squeeze(1)[None, :]
        else:
            selected_tokens = new_logits.argmax(dim=-1)

        candidate_new_tokens = candidate_input_ids[:, cur_len:]
        n_matches = ((~(candidate_new_tokens == selected_tokens[:, :-1])).cumsum(dim=-1) < 1).sum()

        # Ensure we don't generate beyond max_len or an EOS token
        if is_done_candidate and n_matches == candidate_length:
            n_matches -= 1
        valid_tokens = selected_tokens[:, : n_matches + 1]

    # 4. Update variables according to the number of matching assistant tokens. Remember: the token generated
    # by the model after the last candidate match is also valid, as it is generated from a correct sequence.
    # Because of this last token, assisted generation search reduces to a normal greedy search/sample if there
    # is no match.

    # 4.1. Get the valid continuation, after the matching tokens
    input_ids = torch.cat((input_ids, valid_tokens), dim=-1)
    if streamer is not None:
        streamer.put(valid_tokens.cpu())
    new_cur_len = input_ids.shape[-1]

    # 4.2. Discard past key values relative to unused assistant tokens
    new_cache_size = new_cur_len - 1
    outputs.past_key_values = _crop_past_key_values(self, outputs.past_key_values, new_cache_size)

    # 5. Update the candidate generation strategy if needed
    candidate_generator.update_candidate_strategy(input_ids, new_logits, n_matches)

6.3 Implementing the Algorithm

The notes state that Algorithm 1 from the paper “Fast Inference from Transformers via Speculative Decoding” has been implemented, which is the following algorithm.

3027

The code is as follows.

def _speculative_sampling(
    candidate_input_ids,
    candidate_logits,
    candidate_length,
    new_logits,
    is_done_candidate,
):
    """
    Applies sampling as in the speculative decoding paper (https://arxiv.org/pdf/2211.17192.pdf, algorithm 1). Returns
    the selected tokens, as well as the number of candidate matches.

    NOTE: Unless otherwise stated, the variable names match those in the paper.
    """
    new_candidate_input_ids = candidate_input_ids[:, -candidate_length:]
    # Gets the probabilities from the logits. q_i and p_i denote the assistant and model probabilities of the tokens
    # selected by the assistant, respectively.
    q = candidate_logits.softmax(dim=-1)
    q_i = q[:, torch.arange(candidate_length), new_candidate_input_ids].squeeze(0, 1)
    p = new_logits.softmax(dim=-1)
    p_i = p[:, torch.arange(candidate_length), new_candidate_input_ids].squeeze(0, 1)
    probability_ratio = p_i / q_i

    # When probability_ratio > 1 (i.e. q_i(x) < p_i(x), or "assistant probability of the candidate token is smaller
    # than the model probability for the same token"), keep the token. Otherwise reject with p = 1 - probability_ratio
    # (= keep with p = probability_ratio). Keep all the tokens until the first rejection
    r_i = torch.rand_like(probability_ratio)
    is_accepted = r_i <= probability_ratio
    n_matches = ((~is_accepted).cumsum(dim=-1) < 1).sum()  # this is `n` in algorithm 1

    # Ensure we don't generate beyond max_len or an EOS token (not in algorithm 1, but needed for correct behavior)
    if is_done_candidate and n_matches == candidate_length:
        # Output length is assumed to be `n_matches + 1`. Since we won't generate another token with the target model
        # due to acceptance on EOS we fix `n_matches`
        n_matches -= 1
        valid_tokens = new_candidate_input_ids[:, : n_matches + 1]
    else:
        # Next token selection: if there is a rejection, adjust the distribution from the main model before sampling.
        gamma = candidate_logits.shape[1]
        p_n_plus_1 = p[:, n_matches, :]
        if n_matches < gamma:
            q_n_plus_1 = q[:, n_matches, :]
            p_prime = torch.clamp((p_n_plus_1 - q_n_plus_1), min=0)
            p_prime.div_(p_prime.sum())
        else:
            p_prime = p_n_plus_1
        t = torch.multinomial(p_prime, num_samples=1).squeeze(1)[None, :]

        # The selected tokens include the matches (if any) plus the next sampled tokens
        if n_matches > 0:
            valid_tokens = torch.cat((new_candidate_input_ids[:, :n_matches], t), dim=-1)
        else:
            valid_tokens = t

    return valid_tokens, n_matches

0x07 Token Tree Verification

Because of the importance of Token Tree Verification, we will explain it in a separate section.

As mentioned earlier, Token Tree Verification enables the target LLM to verify multiple draft sequences in parallel. The idea is to have the draft model output k candidate tokens at each time step, then build a trie from the multiple candidate token sequences using a shared prefix, and prune less frequent nodes from the trie tree. Finally, in a single run, it is verified in parallel using tree attention (child tokens are masked by attention and can only see their parent tokens).

3028

7.1 Problem

7.1.1 Sampling Multiple Sequences

The paper “SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification” found that when large model validation fails, the actually generated tokens are mostly the top-k tokens of the small parameter model. The figure below shows the validation success rate of using greedy and stochastic decoding methods with k from 1 to 5 in the top K on various datasets. It can be seen that although the top-1 accuracy for predicting the next next token hovers around 60%, the final validation success rate is greatly improved when the small parameter model retains the top-5 at each step. If necleus sampling is used, the success rate for the top-3 tokens already exceeds 90%.

3029

Therefore, instead of sampling a single sequence of tokens, we should sample a tree-like token tree. We can guess multiple tokens not just in the first step, but at each step, increasing the probability of success at each step. As long as the additional computational cost is less than the speedup gained from the higher probability, guessing more tokens is acceptable.

7.1.2 Validating Multiple Sequences

However, how do we validate this token tree? That is, how do we organize multiple input sequences? The simplest way to organize multiple input sequences is to directly combine all the tokens from each leaf node to the root node into a sequence and then validate it. This approach has several problems:

Verifying these tokens one by one would involve redundant calculations and would be too time-consuming.
Some studies have found that predicting one chain at a time results in a very rapid decay of probability, making it impossible to predict very long chains and thus failing to fully utilize the parallelism required for large-scale model validation.

Another approach is to create a sequence from each leaf node to the root node, resulting in n sequences from n leaf nodes. These n sequences are then used as prefill for a batch size of n. However, the problem with this approach is that the computation of the root node cannot be reused.

Let’s see how the researchers solved the above problems.

7.2 Approach

7.2.1 SpecInfer, the pioneering work

To address the aforementioned issues, SpecInfer designed a Tree-Based Parallel Decoding mechanism. Its core idea is to jointly predict the LLM output using a series of small Speculative Models (SSMs), and organize these predictions into a Token Tree, where each branch represents a unique candidate Token sequence. Finally, the LLM uses a tree-based parallel decoding mechanism to verify the correctness of all Tokens in the Token Tree in parallel. Here, the tree decoding algorithm can also reuse intermediate results shared between these sequences. SpecInfer uses the LLM as the Token Tree validator rather than an incremental decoder, which significantly reduces the end-to-end latency of generative LLMs while maintaining model quality.

The specific process of SpecInfer is as follows.

First, an output tree is generated for each SSM, that is, a tree is formed by taking several possibilities for each token. Then, these trees are merged into a larger tree. After generating the larger tree, the tree is expanded into several token sequences.

3030

The generated tree is then validated. The tree structure introduces complex dependencies between tokens. If every path from the root to a leaf in the tree is validated using a large model, the large number of leaf nodes can cause the algorithm to degenerate into a primitive scenario of predicting one token at a time. To address this, SpecInfer proposes tree attention to accelerate decoding. The method transforms the ancestor relationships in the tree into visibility relationships of attention-masks, allowing the model to validate multiple sequences simultaneously. As shown in the figure, for such a tree, if a conventional masking method is used, t6 can see t5. However, under the mask matrix in the graph, each token can only see its own prefix, allowing LLM to perform non-interfering validation of multiple sequences in one go.

3031

7.2.2 How to organize a tree

There are various methods for creating tissue trees; see the diagram below for details.

Taking Sequoia in the lower right corner of the diagram as an example, the acceptance vector is p = (p1, p2, …, pk, …), where the probability that the validation algorithm accepts a token at sub-position k is pk. The specific tree construction method is based on the positional acceptance assumption: assuming that token t is the k-th sub-token that has already been accepted, the probability that the validation algorithm accepts token t depends only on the value of k. The score of each child node is the score of all tokens from the root node to that node. $p_k$ (The probability of the algorithm accepting the token at subposition k) is multiplied. The final goal is to maximize the sum of the scores of all nodes in the entire tree, given the number of nodes. The solution to this problem can be represented by the solution to smaller subproblems, and therefore can be solved using dynamic programming. The resulting tree structure satisfies the condition that child nodes with higher predicted probabilities have more descendants.

3032

For example, the diagram below illustrates EAGLE-2’s Token Tree Verification. The numbers on the tree edges represent the confidence scores of the draft model, and the numbers in parentheses within blocks represent the node values. In the expansion phase, we select the top two nodes with the highest values from the current layer (orange blocks) as input to the draft model and connect the generated tokens (green blocks) to the draft tree. In the reordering phase, we select the top eight nodes with the highest values from all nodes (blue blocks), flattening them into a one-dimensional sequence to form the final draft. Then, we construct an attention mask based on the tree structure to ensure that each token can only see its ancestor nodes.

3033

7.3 Attention Mask

The Attention Mask matrix in Medusa is shown in the figure below. The left side shows the candidate sequences, and the corresponding Attention Mask matrix is shown on the right side. In the figure, Head 1 generates 2 possible tokens (It and I) at the next position, and Head 2 generates 3 possible tokens (is, ’, and the) at the next-next position. Because any prediction from the first head can be paired with any prediction from the second head, there are 2 x 3 = 6 possible candidate sequences for the next and next-next positions, ultimately forming a multi-layered tree structure. Each layer of this tree corresponds to a prediction from a Medusa Head. Within this tree, the Attention Mask only restricts the attention of a token to the tokens preceding it.

3034

0xFF Reference

Accelerating Speculative Decoding for Large Models ( Du Lingxiao )

Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation]

[paper reading]

A New Paradigm for Accelerated LLM Inference! A Latest Review of Speculative Decoding by hemingkx

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

Are there any reviews on speculative decoding? (Konoha)

LLM(18): An Overview of LLM Inference Optimization Techniques

A 30,000-word detailed analysis of Tsinghua University’s latest review work: A review of efficient inference using large models .

LLM Inference Acceleration: Speculative Decoding Overview zssloth

A Review of Speculative Decoding Methods—Accelerating Large Model Inference (Part 1) — Jin Yi

Accelerating Speculative Decoding for Large Models ( Du Lingxiao )

DeepSeek Technology Explained (2) - The Past and Present of MTP (Multi-Token Prediction) (by Jiang Fuchun )

[Reading Notes] Multi-token prediction (A Lost Little Bookboy)

Deepseek v3 Technical Report - Step-by-Step Analysis of the Graph - 3 - MTP That’s Hard to Understand - Formulas Contain Spelling Errors - Lost Little Bookboy

A 10,000-word overview of 10+ LLM speculative sampling inference acceleration solutions; AI casual discussion.

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

https://github.com/hemingkx/SpeculativeDecodingPapers

[Tearing Down LLM-Speculative Decoding] Large Models Move Towards the Era of Parallel Decoding ( by Xiaodonggua AIGC)

[Tearing Apart LLM-Medusa] Parallel Decoding Paradigm: Medusa is Here, Get Out of the Way!! (by Xiaodonggua AIGC)

https://zhuanlan.zhihu.com/p/684217993

https://mp.weixin.qq.com/s/PyAKiFzbQNq6w7HmaTnSEw

https://zhuanlan.zhihu.com/p/690504053

https://zhuanlan.zhihu.com/p/699166575

https://zhuanlan.zhihu.com/p/658298728

LLM Inference Acceleration: Medusa - An Inheritance and Development of Blockwise Parallel Decoding ( by Fang Jiarui)

Fang Jiarui: The Renaissance of LLM Reasoning Acceleration: Noam Shazeer and Blockwise Parallel Decoding

Accelerating Large Language Model Decoding with Speculative Sampling Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre and John Jumper All authors from DeepMind

A collection of papers on Speculative Decoding; Gray Pupil Sextant

[Natural Language Processing] [Large Models] Speculative Sampling Accelerates LLM Decoding ( Bai Qiangwei)

A quick understanding and code implementation of speculative decoding algorithms iyayaai

Want to learn about speculative sampling? Let’s take a look at this paper! (Time-Space Cat’s Question and Answer Box)

Equivalence proof of speculative decoding: paperplanet

Big Model Inference Techniques—Speculative Decoding ( by Fang Jiarui )

Why can LLM speculative sampling accelerate model inference ? Venda

A 10,000-word overview of 10+ LLM speculative sampling inference acceleration solutions; AI casual discussion.

Let’s talk about the speculative reasoning service of large-scale model reasoning, Dao Dao Ning.

A 30,000-word detailed analysis of Tsinghua University’s latest review work: A review of efficient inference using large models .

[2401.07851] Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

Stern, Mitchell, Noam Shazeer, and Jakob Uszkoreit. “Blockwise parallel decoding for deep autoregressive models.”Advances in Neural Information Processing Systems31 (2018)

Xia, Heming, et al. “Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding.”arXiv preprint arXiv:2401.07851(2024).

Agrawal, Amey, et al. “Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills.”arXiv preprint arXiv:2308.16369(2023).

Cai, Tianle, et al. “Medusa: Simple llm inference acceleration framework with multiple decoding heads.”arXiv preprint arXiv:2401.10774(2024).

Li, Yuhui, et al. “Eagle: Speculative sampling requires rethinking feature uncertainty.”arXiv preprint arXiv:2401.15077(2024).

Chen, Charlie, et al. “Accelerating large language model decoding with speculative sampling.” arXiv preprint arXiv:2302.01318 (2023).

Leviathan, Yaniv, Matan Kalman, and Yossi Matias. “Fast inference from transformers via speculative decoding.” International Conference on Machine Learning. PMLR, 2023.

Sun, Ziteng, et al. “Spectr: Fast speculative decoding via optimal transport.” Advances in Neural Information Processing Systems 36 (2024).

Miao, Xupeng, et al. “Specinfer: Accelerating generative llm serving with speculative inference and token tree verification.” arXiv preprint arXiv:2305.09781 (2023).

Chen, Zhuoming, et al. “Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding.” arXiv preprint arXiv:2402.12374 (2024).

https://arxiv.org/abs/2401.07851

https://arxiv.org/abs/2308.16369

https://github.com/openppl-public

https://arxiv.org/abs/1811.03115

https://arxiv.org/abs/2211.17192

https://huggingface.co/blog/assisted-generation

https://arxiv.org/abs/2305.09781

https://arxiv.org/abs/2401.10774

https://arxiv.org/abs/2311.08252

https://lmsys.org/blog/2023-11-21-lookahead-decoding/

https://arxiv.org/abs/2401.15077