Exploring the Transformer Series (33) --- DeepSeek MTP

0x00 Overview

The general idea behind MTP (Multi-token Prediction) is to have the model use n independent output heads to predict the next n tokens, with all n heads sharing the same model backbone. By changing the decoding stage so that each step produces several tokens instead of one, MTP can improve both training and inference performance.

Before DeepSeek, there were several MTP solutions, each with its own focus.

Some focus on accelerating decoding during inference. Examples include the papers “MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads” and “EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty.” These solutions speed up inference by drafting multiple tokens per step.
Others focus on improving efficiency during training. For example, the paper “Better & Faster Large Language Models via Multi-token Prediction” addresses this. By predicting multiple subsequent tokens at once, this approach learns from labels at multiple positions simultaneously. This provides richer and denser training signals, improving sample utilization efficiency, training speed, and overall model performance. DeepSeek MTP also falls into this category.

We will now begin our study.

Note: The paper “Blockwise Parallel Decoding for Deep Autoregressive Models” is an early MTP work. It was introduced earlier in this series, so it is not repeated here.

0x01 EAGLE

The paper “EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty” proposes a speculative sampling framework called EAGLE. This framework performs autoregression at the feature level and feeds in a token sequence shifted one time step ahead to resolve the uncertainty in feature prediction. The core innovation of EAGLE is that it passes the last hidden state (the feature) of the large model to the small model to improve the small model’s capabilities. Specifically, during training the small model takes as input the [feature, token] pair that the large model produced at position t-1, and its predicted [feature, token] is required to align with that of the large model.

1.1 Research Background

An LLM generates text token by token, so generating the next token depends on the previously generated tokens. This serial approach makes autoregressive decoding computationally intensive and time-consuming, and it is the main bottleneck in LLM applications. Numerous solutions have therefore been developed to improve it. The EAGLE paper uses the following diagram to compare EAGLE with some other speculative solutions. $t_i$ represents the $i$-th input token. $f_i$ represents the output of the LLM’s second-to-last layer (i.e., the hidden state just before the LM Head). Let’s start from here and look at its research background.

3301

Speculative sampling uses a small draft model to quickly generate multiple tokens, which are then validated in parallel using the original target LLM. Its drawback is that it requires a suitable draft model, and the quality of the draft model directly impacts the speedup.
Lookahead uses n-grams and Jacobi iterations to predict tokens. Its drawbacks are lower draft quality, limited speedup, and suitability only for greedy decoding.
Medusa uses multiple MLPs to predict tokens based on the penultimate layer features of the target LLM, as shown in the diagram. It uses $f_2$ to predict $t_4$ and $t_5$ . The drawbacks are that the draft quality is still not high, the speedup effect is limited, and the output distribution cannot be guaranteed to be consistent with the target LLM under non-greedy decoding.
The EAGLE authors believed that predicting the next token using the feature vectors of the target model itself was more accurate. Therefore, the draft model used a structure essentially the same as the target model, utilizing the feature vectors output by the target model as input to the draft model. That is, EAGLE innovatively chose to perform Autoregressive Decoding on $f$ , moving Speculative Decoding forward to the feature layer (i.e., the penultimate layer). The corresponding diagram shows the use of $(f_1,f_2)$ to predict $f_3$ . At the same time, the token sequence $(t_2,t_3)$ takes another step forward, utilizing $p_4 = LM\ Head(f_3)$ to get $t_4$ .

1.2 Approach

The authors of EAGLE put forward two core points:

Performing autoregressive prediction at the feature level and then obtaining the token through the LM Head is simpler and more effective than directly predicting the token. The feature refers to the embedding of the output of the penultimate layer of the LLM, which is the hidden state before entering the LM Head. The hidden state is more regular than the token layer and possesses more hidden knowledge than the final result. Methods that only sample tokens obviously ignore this hidden knowledge.
Uncertainty in the sampling process limits the performance of feature prediction. Because LLM samples the probability distribution of tokens, the output of LLM is random. This randomness makes the prediction of feature sequences uncertain. For example, given the same input “I”, the output may be “always” or “am” sampled according to probability. Different choices at this step will create two completely different meanings and two completely different logics, which leads to uncertainty in feature prediction.

3302

Therefore, the core idea of EAGLE is as follows:

Autoregression is performed at the feature layer. A lightweight autoregressive model is used to predict the feature sequence of the target LLM, instead of directly predicting the token.
Preserving the feature layer can better overcome uncertainties in the sampling process. By introducing the token sequence from the previous time step to address uncertainties in feature prediction, the model can accurately predict the features of the penultimate layer with minimal additional computational cost. That is, when predicting the current feature, not only the previous feature sequence but also the previously sampled token sequence is considered. As shown in the figure above, after outputting I, either am or always is sampled according to probability. When further searching for subsequent outputs of always, if the feature layer output of I can be preserved, the information about am lost during the sampling process can be retained.

1.3 Architecture

EAGLE requires training a small, purpose-built draft model, which consists primarily of an embedding layer, a language model head (LM Head), and an autoregression head composed of a fully connected layer and a decoder layer. The authors concatenate the token embedding and the feature $f$ as input to the draft model. The fully connected layer reduces the concatenated vector to the feature dimension, and the decoder layer predicts the next feature. This preserves information that would otherwise be lost in the final output token. To validate multiple sequences simultaneously, the paper employs Tree Attention to generate a tree-structured draft, allowing multiple tokens to be generated in a single forward propagation. The only part of the draft model that needs training is the autoregression head; the embedding layer and LM Head reuse the parameters of the target LLM and do not require additional training.

The technical details are roughly shown in the diagram below. EAGLE uses a single transformer layer plus a frozen LLM head (the output head of the large model). Green blocks represent token embeddings, orange blocks represent features, red boxes represent the predictions of the draft model, and blue modules with snowflake-shaped icons represent the parameters used in the target LLM, which are not trained. The three forward propagations in the diagram below use the same model and can share a single KV cache.

Because the first forward propagation cannot be accelerated, a second forward propagation is needed to obtain the features required for subsequent EAGLE operations. This explains why EAGLE needs to start from $t_2$ in the comparison diagram above.

3303

The following figure shows the prediction results for each step.

3304

1.4 Process

EAGLE also follows the basic draft-and-verify paradigm. Its basic process is as follows.

The drafting phase takes two inputs, the previous feature sequence and the token sequence shifted one time step ahead, and outputs a draft tree composed of multiple tokens. The processing in this phase is as follows:

Convert the token sequence into an embedding sequence.
The embedding sequence and the feature sequence are concatenated together.
Use an autoregression head to predict the next feature.
The LM Head is used to convert the predicted features into a probability distribution of tokens, and the next token is sampled from it.
The predicted features and sampled tokens are added to the input sequence to continue autoregressive prediction.
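
The five drafting steps above can be sketched as a small loop. This is only a toy illustration in pure Python, not EAGLE’s implementation: `embed`, `autoregression_head`, and `lm_head` are hypothetical stand-ins for the frozen embedding layer, the trained FC-plus-decoder head, and the frozen LM Head, and the loop drafts a single chain rather than a tree.

```python
import math
import random

VOCAB = ["I", "am", "always", "here"]
DIM = len(VOCAB)  # toy choice: feature size equals vocabulary size

def embed(token_id):
    """Toy one-hot embedding, standing in for the target LLM's frozen embedding layer."""
    return [1.0 if i == token_id else 0.0 for i in range(DIM)]

def autoregression_head(features, embeddings):
    """Hypothetical stand-in for the FC layer plus decoder layer.

    A real decoder layer attends over the whole sequence; this toy version
    only looks at the last position. It concatenates the last feature with the
    last embedding (length 2*DIM) and maps back to DIM by averaging the halves.
    """
    concat = features[-1] + embeddings[-1]
    return [(concat[i] + concat[i + DIM]) / 2.0 for i in range(DIM)]

def lm_head(feature):
    """Toy LM Head: a softmax taken directly over the feature vector."""
    peak = max(feature)
    exps = [math.exp(x - peak) for x in feature]
    total = sum(exps)
    return [e / total for e in exps]

def draft(features, tokens, steps, rng):
    """Run `steps` drafting iterations; `tokens` is already shifted one step ahead."""
    features, tokens = list(features), list(tokens)
    for _ in range(steps):
        embeddings = [embed(t) for t in tokens]                   # 1. tokens -> embeddings
        next_feature = autoregression_head(features, embeddings)  # 2-3. concatenate, predict feature
        probs = lm_head(next_feature)                             # 4. feature -> token distribution
        next_token = rng.choices(range(len(VOCAB)), weights=probs)[0]
        features.append(next_feature)                             # 5. extend both sequences
        tokens.append(next_token)
    return features, tokens
```

Calling `draft([[0.1, 0.2, 0.3, 0.4]], [1], 3, random.Random(0))` extends both sequences by three entries; because each predicted feature is fed back in, the draft model never needs the target LLM during drafting.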

The Verification Phase takes a draft tree as input and outputs a sequence of accepted tokens. EAGLE employs the same strategy as speculative sampling in the Verification Phase. Tokens generated in the drafting phase must pass the target LLM’s verification: each token is accepted with its acceptance probability, and otherwise it is rejected and resampled. This mechanism ensures that the final token distribution is consistent with the target LLM. The processing in this phase is as follows:

Forward propagation: Perform a forward propagation on the draft tree using the target LLM to obtain the probability distribution of each token.
Validation: Starting from the root node, tokens in the draft tree are recursively validated level by level. For each token, its acceptance probability is calculated. The acceptance probability depends on the probability prediction of the token by the draft model and the probability prediction of the token by the target LLM. The acceptance probability is typically min(1, p_target(t) / p_draft(t)). This formula means that when the predicted probability of the target model is greater than that of the draft model, the token is accepted; when the predicted probability of the target model is less than that of the draft model, the token is accepted with a certain probability, and the acceptance probability is equal to the ratio of the two probabilities.
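Acceptance: If a token is accepted under its acceptance probability, it is added to the final output sequence.
Rejection/Resampling: If a token is rejected, it is discarded and a new token is resampled using the target LLM’s distribution p_target. In standard speculative sampling the replacement is drawn from the normalized residual max(0, p_target - p_draft), which is what keeps the output distribution identical to that of the target LLM.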
Merging: Finally, the accepted tokens will be merged into a single sequence, which will be the final output.
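
The accept/reject rule can be sketched for a single root-to-leaf chain of the draft tree. This is a minimal sketch, not EAGLE’s code: the function name and the dictionary-based distributions are assumptions made for illustration, and the resampling step uses the normalized residual max(0, p_target - p_draft) from standard speculative sampling.

```python
import random

def verify_chain(draft_tokens, p_draft, p_target, rng):
    """Verify one chain of draft tokens (hypothetical helper).

    p_draft[i] and p_target[i] map token -> probability at draft position i.
    Returns the accepted prefix, plus one resampled token if a rejection occurred.
    """
    output = []
    for i, token in enumerate(draft_tokens):
        q = p_draft[i][token]               # draft probability, > 0 because it was sampled
        p = p_target[i].get(token, 0.0)     # target probability of the same token
        if rng.random() < min(1.0, p / q):  # accept with probability min(1, p/q)
            output.append(token)
            continue
        # Rejected: resample from the normalized residual max(0, p_target - p_draft).
        residual = {t: max(0.0, pt - p_draft[i].get(t, 0.0))
                    for t, pt in p_target[i].items()}
        candidates = list(residual)
        output.append(rng.choices(candidates,
                                  weights=[residual[t] for t in candidates])[0])
        break  # everything after a rejected token is discarded
    return output
```

When the two distributions are identical, every draft token is accepted; when they disagree completely, the first token is rejected and replaced by a token the target model prefers.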

For a detailed comparison, please refer to the image below.

3305

Because the small model is completely identical to the large model except for the number of transformer layers, including the autoregressive generation process, EAGLE’s biggest advantage is its extreme ease of deployment.

1.5 Training

Two loss terms are used during training.

Classification Loss uses Cross Entropy Loss to measure the difference between the predicted token distribution (the autoregression head’s output features passed through the LM Head) and the true token distribution (the original model’s features passed through the LM Head). This requires the probability distributions of the two models to be aligned.
Regression Loss uses Smooth L1 Loss to measure the difference between predicted and true features. This requires feature alignment between the two models. Since the smaller model is autoregressive during inference, constraining features makes the input of the smaller model more stable, thus avoiding out-of-distribution (OOD) during testing to some extent. This results in higher accuracy for the smaller model in long-range autoregressive generation.
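
The two terms can be combined as in the sketch below. It is written in pure Python for readability and is not the authors’ training code: `eagle_loss`, the `lm_head` callable, and the weight `w_cls=0.1` are assumptions chosen for illustration.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 distance: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)

def cross_entropy(target_probs, pred_probs, eps=1e-12):
    return -sum(q * math.log(p + eps) for q, p in zip(target_probs, pred_probs))

def eagle_loss(pred_feature, true_feature, lm_head, w_cls=0.1):
    """Regression loss on features plus weighted classification loss on tokens.

    `lm_head` is any callable mapping a feature to logits (the frozen LM Head);
    `w_cls` is an illustrative weight, not a value taken from this article.
    """
    regression = smooth_l1(pred_feature, true_feature)
    classification = cross_entropy(softmax(lm_head(true_feature)),
                                   softmax(lm_head(pred_feature)))
    return regression + w_cls * classification
```

With an identity `lm_head`, a predicted feature equal to the true feature gives zero regression loss, and any deviation raises both terms.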

3306

If only the output tokens are constrained to be consistent without constraining the features, the large model’s features are used during training, but the small model uses its own features from the previous time step during inference. These features deviate from the large model’s features, and during testing, situations may arise where “although the token is predicted correctly, the feature is different from the large model,” thus affecting long-range inference accuracy. See the diagram below for details: EAGLE uses the target model’s features for training, while the draft model uses its own features during inference. In the diagram, $f$ represents a feature, $e$ represents an embedding. Superscripts indicate the source of the variable, and $t$ and $d$ represent the target model and draft model, respectively. Subscripts are used to index the location of features or embeddings. For example, $f_2^t$ represents the features from the target model at position 2. Therefore, other researchers have subsequently improved upon this, such as HASS and CORAL. HASS uses multi-step training to expose the model to the data distribution during the inference phase and increases the amount of data.

3307

Furthermore, the EAGLE authors also mentioned in their paper that the MoE model and Speculative Decoding do not work well together. This is because in the Vanilla Inference stage, each token only requires the weights of two experts. However, the verification stage of Speculative Decoding requires verifying multiple tokens simultaneously, which may lead to activating more expert models and reading more expert weights. This weakens the advantages of MoE, resulting in a decrease in speedup. Moreover, the expert selection process in the MoE model introduces additional dependencies, potentially making parallel computation more difficult.
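
A back-of-the-envelope calculation shows why verifying several tokens at once touches more experts. The sketch assumes 8 experts with top-2 routing and, as a simplification, that each token picks its experts uniformly and independently; real routers are neither uniform nor independent, so treat the numbers as illustrative only.

```python
def expected_active_experts(num_experts, top_k, num_tokens):
    """Expected number of distinct experts touched when `num_tokens` tokens
    each pick `top_k` of `num_experts` experts uniformly at random."""
    p_miss = 1.0 - top_k / num_experts   # chance one token skips a given expert
    return num_experts * (1.0 - p_miss ** num_tokens)

single = expected_active_experts(8, 2, 1)   # vanilla decoding: one token per step
batch = expected_active_experts(8, 2, 4)    # verifying four draft tokens at once
```

Under these assumptions a single token reads 2 experts, while verifying four tokens reads about 5.5 on average, so the weight traffic per verification step more than doubles.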

1.6 Upgrade

1.6.1 EAGLE-2

The authors also upgraded EAGLE, resulting in EAGLE-2. The paper is titled “EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees”. EAGLE-2 proposes dynamic draft tree speculative sampling: the structure of the draft tree is dynamically adjusted based on the confidence level of the draft model, which can improve the inference speed of large language models by up to 5 times, while not changing the output distribution of the large language model, ensuring lossless performance.

Ideas

Methods such as EAGLE and Medusa use static draft trees, implicitly assuming that the acceptance rate of draft tokens is context-independent.

When the preceding text is “10+2”, the next token is difficult to predict. Therefore, EAGLE adds two candidate tokens at this position to increase the draft hit rate; either “10+2=” or “10+2+” is correct.
When the preceding text is “10+2=”, the next token should obviously be “1”. However, EAGLE uses a static draft structure and still adds two candidate tokens, “1” and “3”. “10+2=3” cannot pass the check of the large language model, which is wasteful. EAGLE-2 aims to solve this problem. As shown on the right side of the diagram below, when the preceding text is “10+2=”, EAGLE-2 only adds one candidate token, “1”. The saved token is used to make the draft tree deeper, so that “10+2=12” can pass the check of the large language model, and thus more tokens can be generated at once.

3308

Plan

To ensure lossless operation, a draft token is accepted only if all its ancestor nodes are accepted. Therefore, EAGLE-2 defines the value of a node as the product of its acceptance rate and those of its ancestors, approximated by the product of their confidence scores. EAGLE-2 consists of two phases: expansion and reranking.

The expansion phase deepens and enlarges the draft tree. During this phase, EAGLE-2 selects the m nodes (tokens) with the highest value at the last level of the draft tree for expansion. These tokens are fed into the draft model, and the output of the draft model is then connected as child nodes to the input nodes, thus deepening and enlarging the draft tree.
During the reranking phase, the draft tree is pruned, discarding some nodes (tokens). In this phase, EAGLE-2 reranks the entire draft tree by value and retains the top n nodes (tokens). Because the confidence of a draft token is between 0 and 1, a node’s value never exceeds that of its ancestors; when two nodes have the same value, the shallower node is preferred. Therefore, the reranked draft tree is always connected, ensuring semantic coherence. Furthermore, the reranked draft tree is smaller, reducing the computational cost of verification by the original large language model.
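
The two phases can be sketched on a tiny tree stored as a list of dictionaries. This is a toy illustration, not EAGLE-2’s implementation: `toy_draft_model` and its confidence table are made up, and a node’s value is the product of the confidences along its path.

```python
TOY_CONFIDENCES = {  # hypothetical draft-model outputs: token -> [(child, confidence)]
    "It": [("is", 0.6), ("has", 0.3)],
    "is": [("a", 0.5), ("the", 0.4)],
    "has": [("a", 0.7), ("to", 0.2)],
}

def toy_draft_model(tree, node_id):
    return TOY_CONFIDENCES.get(tree[node_id]["token"], [])

def expand(tree, frontier, draft_model, m):
    """Expand the m highest-value frontier nodes and return the new frontier."""
    chosen = sorted(frontier, key=lambda i: tree[i]["value"], reverse=True)[:m]
    new_frontier = []
    for parent in chosen:
        for token, confidence in draft_model(tree, parent):
            tree.append({"token": token, "parent": parent,
                         "depth": tree[parent]["depth"] + 1,
                         "value": tree[parent]["value"] * confidence})
            new_frontier.append(len(tree) - 1)
    return new_frontier

def rerank(tree, n):
    """Keep the n highest-value non-root nodes, preferring shallower nodes on ties."""
    ids = range(1, len(tree))
    kept = sorted(ids, key=lambda i: (-tree[i]["value"], tree[i]["depth"]))[:n]
    return sorted(kept)

tree = [{"token": "It", "parent": None, "depth": 0, "value": 1.0}]
frontier = expand(tree, [0], toy_draft_model, m=1)
frontier = expand(tree, frontier, toy_draft_model, m=2)
kept = rerank(tree, n=4)
```

Here the low-value branches under “has” are pruned while “is”, “has”, and the two children of “is” survive, and every kept node still has its parent in the kept set.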

Below is a simple example. In the diagram, the yellow boxes in the Expand stage represent the nodes selected for expansion, and the green boxes represent the predictions of the draft model when these nodes are input. The blue boxes in the Rerank stage represent the nodes that are retained and are then flattened into one dimension as input to the original large language model. To ensure the correctness of the computation results, EAGLE-2 also adjusts the attention mask according to the tree structure, ensuring that each token can only see its ancestor nodes and is not affected by other branches. For example, “a” can only see its ancestors “It” and “is”, but not the “has” from another branch. EAGLE-2 also adjusts the positional encoding to ensure consistency with standard autoregressive decoding.
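
The ancestor-only visibility rule can be expressed as a boolean mask built from parent pointers. The sketch below is a generic tree attention mask, assuming the flattened tree is given as a list of parent indices; the function name is hypothetical.

```python
def tree_attention_mask(parents):
    """Build mask[i][j] == True when node i may attend to node j.

    parents[i] is the index of node i's parent in the flattened tree,
    or None for the root. A node sees itself and its ancestors only.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j is not None:   # walk up to the root, marking each ancestor
            mask[i][j] = True
            j = parents[j]
    return mask

tokens = ["It", "is", "has", "a"]
parents = [None, 0, 0, 1]      # "a" hangs under "is", not under "has"
mask = tree_attention_mask(parents)
```

In this layout the row for “a” is `[True, True, False, True]`: it attends to “It”, “is”, and itself, but not to “has” on the other branch.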

3309

The diagram below shows the drafting process. With prompt P = "It", the beam width is 2 and the search depth is 3. EAGLE-2 selects the K = 8 tokens with the highest probability (purple) as the draft tree.

3310

The following diagram illustrates the verification process.

3311

1.6.2 EAGLE-3

The authors of EAGLE later upgraded the scheme, resulting in EAGLE-3. While speculative sampling methods like EAGLE and Medusa reuse the last layer features of the target model as cues for the draft model, the authors of EAGLE-3 discovered a flaw in this approach. The last layer features of a large language model, after a linear transformation, can yield the distribution of the next token. Since the last layer features only contain information about the next token, they lose the global properties of the target model. Therefore, EAGLE-3 no longer uses the last layer features of the target model as auxiliary information; instead, it mixes low-, mid-, and high-level information from the target model as input to the draft model.

1.7 HASS

To address the inconsistency between the training and decoding phases mentioned above, the paper “LEARNING HARMONIZED REPRESENTATIONS FOR SPECULATIVE SAMPLING” proposes HArmonized Speculative Sampling (HASS), which aims to solve the problem by learning harmonized representations during the training phase. This method consists of two parts:

To enable the draft model to perceive the decoding objective during the training phase, HASS extends the idea of ranking distillation in recommendation systems to speculative sampling, namely, harmonized objective distillation.
To address the context inconsistency between training and decoding, HASS proposes a multi-step alignment training strategy, namely, harmonized context alignment.

Combining these two parts, HASS significantly improves the inference speed of LLM. It also maintains the high efficiency of training draft models without requiring additional inference overhead.

1.7.1 Motivation

The actual performance of speculative sampling depends on two factors: the decoding cost of the draft model and its alignment with the target LLM. To obtain an efficient draft model highly aligned with the target LLM, previous work has proposed utilizing the contextual information of the target LLM. For example, EAGLE uses the hidden states of the target LLM as input features for the draft model. However, these methods introduce inconsistent contexts during training and decoding, as shown in the figure below. During training, the draft model can always access the hidden states of the target LLM at previous time steps. But during decoding, the draft model cannot access the hidden states of the target LLM at unvalidated time steps, leading to inconsistencies in context between training and decoding. This problem can be viewed as exposure bias at the feature level in speculative sampling.

There is also an inconsistency in objectives between the training and decoding phases. In the decoding phase, the goal of the draft model is to generate tokens that the target LLM will assign high probabilities to. In this case, the draft model should focus more on recalling these high-probability tokens, while the specific order between them can be slightly relaxed. Furthermore, most LLMs employ nucleus (top-p) sampling or top-k sampling in practice. In these decoding strategies, high-probability tokens play a more significant role in the output. Therefore, to obtain an efficient draft model, its training objective should take these characteristics of the decoding phase into account. However, existing speculative sampling methods that train a draft model generally neglect these decoding objectives.

3312

1.7.2 Scheme

Harmonized Objective Distillation

HASS introduces the concept of ranking distillation from recommender systems, prioritizing tokens that are more important during the decoding of the draft model. Specifically, the goal of ranking distillation is to train the student model to assign higher rankings to items that rank higher in the teacher model. In speculative sampling, the draft model is the student model, while the target LLM is the teacher model. Draft models with similar properties will achieve higher acceptance rates during the decoding phase. Let the set of the K most probable tokens be $\hat \Omega \subset \Omega$ , where $\Omega$ represents the entire vocabulary. HASS uses the following Top-K distillation loss during training:

3313

q and p represent the conditional next-token probability distributions of the target LLM and the draft model, respectively. When combined with EAGLE, $\hat \Omega$ can be obtained during training from the hidden states of the target LLM. Training with the Top-K loss therefore has the same training efficiency as EAGLE.
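
A Top-K distillation loss of this kind can be sketched as a cross entropy restricted to the target’s K most probable tokens. The form $-\sum_{x \in \hat\Omega} q(x) \log p(x)$ used below is an assumption based on the description here; the exact expression is the one shown in the figure.

```python
import math

def top_k_distillation_loss(q, p, k, eps=1e-12):
    """Cross entropy between target q and draft p over the target's top-k tokens.

    q and p are probability lists over the same vocabulary. Only the k tokens
    that the target model ranks highest contribute to the loss.
    """
    top = sorted(range(len(q)), key=lambda i: q[i], reverse=True)[:k]
    return -sum(q[i] * math.log(p[i] + eps) for i in top)

q = [0.7, 0.2, 0.05, 0.05]            # target LLM distribution
aligned = top_k_distillation_loss(q, q, k=2)
misaligned = top_k_distillation_loss(q, [0.05, 0.05, 0.2, 0.7], k=2)
```

A draft distribution that puts its mass on the target’s top tokens gets a much lower loss than one that ranks them last, regardless of how the tail tokens are ordered.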

Harmonized Context Alignment

HASS employs a multi-step alignment training strategy to ensure that the draft model maintains contextual consistency between the training and decoding phases. Specifically, HASS divides the training process into n steps, enabling the draft model to utilize contextual features consistent with those in the decoding phase. The process is as follows:

The first step is the same as EAGLE training. For time step t+1, the draft model takes the target LLM’s features $f_t^{(l)}$ as input and is trained to generate the draft model features $f_{t+1}^{(s_1)}$ . In this step, the attention mask is the ordinary causal mask and is not modified.
The second step uses the features from the first step. In the self-attention mechanism at time step t+1, $f_t^{(s_1)}$ generates the query. The key and value are generated from $f_{:t}^{(l)} \oplus f_t^{(s_1)}$ , where $\oplus$ denotes concatenation and $f_{:t}^{(l)}$ represents the target features earlier than time step t. The attention mask is modified to ensure that $f_t^{(s_1)}$ always sees $f_{t-1}^{(l)}$ as its preceding feature, as shown in the “HASS Training Step 2” diagram below.
For step j (j ≥ 3), the features generated in the previous step, $f_t^{(s_{j-1})}$ , are used to generate a query for time step t+1, where the key and value are generated by $f_{:t-j+2}^{(l)} \oplus f_{t-j+2}^{(s_1)} \oplus \cdots \oplus f_t^{(s_{j-1})}$ .

The training cost of HASS is n times that of EAGLE, but the decoding cost remains the same.
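
To make the index pattern of the key/value context concrete, the sketch below lists, for each position up to t, which feature is used at training step j. It follows the formula $f_{:t-j+2}^{(l)} \oplus f_{t-j+2}^{(s_1)} \oplus \cdots \oplus f_t^{(s_{j-1})}$ as written here, with positions numbered from 1; the function name is hypothetical.

```python
def hass_context_sources(t, j):
    """Return [(position, source)] for positions 1..t at training step j.

    "l" marks a target-LLM feature; "s_k" marks the draft feature produced
    in training step k. Step 1 uses target features only.
    """
    if not 1 <= j <= t + 1:
        raise ValueError("step j must satisfy 1 <= j <= t + 1")
    first_draft_position = t - j + 2   # positions before this still use target features
    sources = []
    for position in range(1, t + 1):
        if position < first_draft_position:
            sources.append((position, "l"))
        else:
            sources.append((position, f"s_{position - first_draft_position + 1}"))
    return sources
```

For example, `hass_context_sources(5, 3)` gives target features at positions 1 to 3, then `s_1` at position 4 and `s_2` at position 5, mirroring what the draft model would see after two unverified drafting steps.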

3314

The training objective function is shown in the figure below.

3315

0x02 Multi-token Prediction

The core idea of the paper “Better & Faster Large Language Models via Multi-token Prediction” is to enable the model to predict multiple future tokens at once during training, rather than just predicting the next token.

2.1 Research Background

The paper argues that traditional token-by-token generation schemes have several problems during the training phase:

The goal of “predicting the next token” is to learn the probability of a single token. This is a locally focused training method that struggles to learn long-distance dependencies and global semantics.
During training, the loss is calculated and the model is updated only using the prediction result of the next token each time, which is inefficient.
To overcome the locality problem, the model requires a large amount of training data, which leads to low training efficiency.

Therefore, this paper aims to provide the model with richer supervision signals by predicting multiple tokens during the training phase. This forces the model to learn longer token dependencies, thereby better understanding the context and avoiding getting stuck in a local decision-making learning pattern. Simultaneously, predicting multiple tokens at once is equivalent to generating multiple <predict, label> samples in a single estimation, allowing for the collection of multiple losses to update the model. This helps accelerate model convergence and significantly improves sample utilization efficiency. Multiple tokens can also be estimated in parallel during the inference phase, improving inference speed.

As shown in the figure below, the model in the paper is basically the same as that in Blockwise Parallel Decoding, with multiple Heads sharing a Backbone. Head 1 predicts the next token, Head 2 predicts the second token ahead, Head 3 predicts the third, and Head 4 predicts the fourth. For example, with inputs 1, 2, 3, 4, Head 1’s outputs are 2, 3, 4, 5, and Head 4’s outputs are 5, 6, 7, 8. Compared to BPD, this paper makes the model structure more specific by using transformer blocks, and it also retains the dependencies between tokens during training.

One point to note here: when the input is [1], the next four tokens are predicted simultaneously, which are the pink [2,3,4,5] in the diagram. When the input is [1,2], the next four tokens are predicted simultaneously, which are the pink [3,4,5,6] in the diagram. That is, when predicting [3,4,5,6], the input is [1,2], not just [2]. The diagram simplifies this step. Moreover, the diagram actually groups these inputs ([1], [1,2], [1,2,3], [1,2,3,4]) into a batch and computes them in parallel.
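
The target layout described above can be reproduced with a few lines of Python. The helper name `mtp_targets` is hypothetical; it simply shifts the sequence by one extra position per head.

```python
def mtp_targets(sequence, num_inputs, n_heads):
    """targets[k][i] is the token that head k+1 must predict at input position i.

    Head k+1 looks k+1 tokens ahead, so its target at position i is
    sequence[i + k + 1]. The prefix sequence[:i+1] is the model input there.
    """
    return [[sequence[i + k + 1] for i in range(num_inputs)]
            for k in range(n_heads)]

sequence = [1, 2, 3, 4, 5, 6, 7, 8]
targets = mtp_targets(sequence, num_inputs=4, n_heads=4)
# Targets of all heads at the second position, i.e. for the prefix [1, 2].
second_position = [head[1] for head in targets]
```

Reading the result by head gives the rows of the diagram (Head 1: 2, 3, 4, 5), while reading it by position gives the four tokens predicted from each prefix.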

3316

2.2 Approach

When humans understand language, they typically consider the relationships between multiple words rather than focusing on individual words. This inspired the authors to explore multi-token prediction methods. They extended next-token prediction into a multi-token prediction mechanism. Given the same input sequence, the model predicts in a single forward propagation the n tokens that follow the current position, from $x_{t+1}$ to $x_{t+n}$ . Note that this does not mean selecting n tokens simultaneously from a single probability distribution of the Softmax output. Softmax is designed for categorical distributions, modeling the probability of a single discrete event among multiple mutually exclusive options, and does not support selecting multiple tokens from a single probability distribution. Therefore, Softmax can only generate a single token at each time step. To predict multiple tokens, multiple Softmax layers are needed, each responsible for generating one token.

The loss function for the above multi-token prediction should first be decomposed into multiple token prediction heads, and then each token prediction head will run an independent Softmax to select the corresponding token.

3317

The paper then introduces the concept of intermediate latent representation, $z_{t:1}$ , to represent hidden representations in large language models. This method decouples the input sequence $x_{t:1}$ from the output sequence, allowing the model to encode $x_{t:1}$ as $z_{t:1}$ in a single forward propagation. This representation is reused in all subsequent generation processes.

3318

Subsequently, the prediction of $x_{t+n:t+1}$ given $z_{t:1}$ is further decomposed into n independent single-step conditional probabilities (shown in blue), each representing one token-generation step.
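
Putting the steps above together, the factorization can be written out as follows. This is a sketch reconstructed from the description in this section, with $P_\theta$ denoting the model’s predicted probability (a symbol assumed here); the figures remain the authoritative form.

$$
L_n = -\sum_t \log P_\theta(x_{t+n:t+1} \mid x_{t:1})
$$

$$
L_n = -\sum_t \log P_\theta(x_{t+n:t+1} \mid z_{t:1}) \cdot P_\theta(z_{t:1} \mid x_{t:1})
    = -\sum_t \sum_{i=1}^{n} \log P_\theta(x_{t+i} \mid z_{t:1}) \cdot P_\theta(z_{t:1} \mid x_{t:1})
$$

The first line is the joint loss over the next n tokens, the second inserts the shared latent representation $z_{t:1}$, and the final sum has one term per output head.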

3319

2.3 Principle

The paper analyzes why MTP is effective from two perspectives.

Lookahead reinforces choice points. In text, some tokens are not crucial, allowing for some variations without affecting the meaning of the rest of the text. However, some tokens possess high-level semantic properties, determining the correctness of the text (answer) (the paper refers to these tokens as choice points). In effect, MTP implicitly assigns higher weights to these key tokens that influence subsequent decisions. This mechanism enables the model to better capture these key tokens during text generation, resulting in more coherent and meaningful text.
Information-theoretic argument. MTP increases the model’s focus on the relative mutual information between consecutive tokens. This change makes the model more inclined to focus on long-sequence dependencies, thereby improving its ability to model complex contexts, learning the global structure of language more quickly, and improving sample efficiency.

Regarding the first point, the paper provides an example diagram as follows. Suppose we need to predict three tokens at once, with some “discontinuity” between them, which introduces differences in prediction difficulty. For example, predicting 2345 and BCDE is relatively easy, but the transition from 5 to A is a turning point and is more difficult to predict. These important decision points should implicitly have more weight in the loss function. In the diagram below, the loss term involves three key points: 3→A, 4→B, and 5→C. On average, MTP implicitly assigns a weight of $\frac{n(n+1)}{2}$ to decision points, which is more than single-step prediction assigns.
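
One way to see where a figure like $\frac{n(n+1)}{2}$ can come from is to count the loss terms that depend on the hard transition. The sketch below assumes that a loss term is affected whenever its input lies before the decision point and its target lies at or after it; this counting convention is an illustration, not a derivation taken from the paper.

```python
def count_affected_loss_terms(n):
    """Count (input position, head) pairs whose prediction crosses a decision point.

    `offset` is how many tokens before the decision point the input ends;
    head h predicts the token h steps ahead, so it crosses when h >= offset.
    """
    count = 0
    for offset in range(1, n + 1):
        for head in range(1, n + 1):
            if head >= offset:
                count += 1
    return count
```

For n = 3 the count is 6 (3→A, 4→A, 4→B, 5→A, 5→B, 5→C in the example above), while ordinary next-token prediction with n = 1 has a single affected term.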

3320

Regarding the second point, the paper argues that during teacher-forced training the model is given the ground-truth tokens, but during the prediction phase it is not. As a result, teacher forcing performs well on short predictions but neglects the long-range structure of the generated sequence, potentially causing error accumulation. Given input $C$ , let $X$ represent the next token and $Y$ represent the second token in the future. Then, single-step prediction considers $H(X)$ , and multi-step prediction with $n=2$ considers $H(X) + H(Y)$ , as detailed below:

3321

Because $H(Y|X)$ appears again as the next-token term when predicting at the following position, it can be set aside. We can then observe that 2-token prediction increases the importance of the mutual information $I(X;Y)$ , giving it a weight of 2. Therefore, MTP has a greater advantage when the predicted $X$ is more relevant to the subsequent text.
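
The weights can be checked with the standard identities $H(X) = H(X \mid Y) + I(X;Y)$ and $H(Y) = H(Y \mid X) + I(X;Y)$, with all quantities conditioned on the input $C$:

$$
H(X) = H(X \mid Y) + I(X;Y)
$$

$$
H(X) + H(Y) = H(X \mid Y) + 2\, I(X;Y) + H(Y \mid X)
$$

Single-step prediction carries $I(X;Y)$ once, whereas the 2-token objective carries it twice, alongside the extra term $H(Y \mid X)$.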

2.4 Scheme

The image below shows the network structure of the model. During training, MTP uses the same model body and four independent output heads, allowing the model to predict four tokens simultaneously. Specific details are as follows:

The backbone network is a pre-trained decoder-only multi-layer Transformer network used to extract feature representations of the input text. There are $t$ input tokens, $x_{t:1} = x_t, ..., x_1$ . The backbone encodes $x_{t:1}$ into the hidden representation $z_{t:1}$ and outputs it.
The output $z_{t:1}$ is connected to multiple independent output heads, which work in parallel. Each head is responsible for estimating one token: $Head_i$ maps the intermediate hidden representation $z_{t:1}$ to $x_{t+i}$ . Head1 predicts the next token, Head2 predicts the next-next token, and so on. In the diagram below, during training, a shared transformer trunk is connected to four parallel prediction heads, which together predict $t_{i+1}$ , $t_{i+2}$ , $t_{i+3}$ , and $t_{i+4}$ from the input up to token $t_i$ . Specifically, the input consists of four tokens: 1, 2, 3, and 4. The first head predicts the next token at each of the four positions: 2, 3, 4, and 5. The second head predicts the token two steps ahead at each position: 3, 4, 5, and 6. During inference, you can either keep only the first head and perform ordinary next-token prediction, or use the other three heads as well to speed up inference.
The Head is a Transformer layer, and each Head’s Transformer layer is independent and not shared. The result after processing by this layer is denoted as $f_{h_i}(z_{t:1})$ .
Finally, $f_{h_i}(z_{t:1})$ is fed into a word projection layer (comprising one embedding projection matrix and one softmax layer) to predict the probability distribution of each word. Finally, a token is generated using a sampling method (e.g., greedy, beam search). Note that this word projection layer uses the projection matrix of the original pre-trained network (original model), which is shared across multiple heads.

[Figure: network structure with a shared backbone and four parallel output heads]
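
The alignment between positions and head targets can be sketched as follows. This is an illustrative helper, not code from the paper: the function name `head_targets` and the token ids 1 to 8 are assumptions chosen to match the four-token example above.

```python
def head_targets(tokens, num_positions, num_heads):
    """Map each head index i (1-based) to its targets.

    At 0-based position t, head i is trained to predict tokens[t + i],
    i.e. the token i steps ahead of the token at that position.
    """
    return {
        i: [tokens[t + i] for t in range(num_positions)]
        for i in range(1, num_heads + 1)
    }


tokens = list(range(1, 9))  # hypothetical token ids 1..8
targets = head_targets(tokens, num_positions=4, num_heads=4)
for head, tgt in targets.items():
    print(f"head {head}: {tgt}")
```

With inputs 1, 2, 3, 4, head 1 is paired with 2, 3, 4, 5 and head 2 with 3, 4, 5, 6, which is the shifted-label layout described in the list above.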

Let’s take a look at the design considerations behind shared components and independent components:

Shared backbone $f_s$ . A single forward pass yields $z_{t:1}$ , from which all n tokens are predicted. This is more computationally efficient than running traditional next-token prediction n times.
Shared unembedding matrix $f_u$ . This matrix is very large, with dimensions d×V (d is the dimension of the hidden layer, and V is the size of the vocabulary, usually 50,000 to 200,000). Sharing it greatly reduces the number of parameters and has a limited impact on performance.
Independent output heads. These are the only independent part of the architecture. Each token requires its own softmax, so not every component can be shared. Using independent output heads decouples the generation of the n tokens. On the one hand, this design supports parallel token prediction and improves training efficiency; on the other hand, generating tokens independently may reduce the coherence or consistency of the output. Furthermore, the model may experience mode collapse, tending to generate generic, high-frequency words rather than nuanced responses, thereby reducing the diversity and richness of the output.
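
To give a feel for why the d×V matrix is worth sharing, here is a back-of-the-envelope calculation. The values d = 4096, V = 128,000 and four heads are illustrative assumptions, not figures from the paper.

```python
def unembedding_params(hidden_dim, vocab_size):
    """Parameter count of one d x V projection matrix (no bias)."""
    return hidden_dim * vocab_size


d, V, n_heads = 4096, 128_000, 4  # assumed sizes, for illustration only

shared = unembedding_params(d, V)   # one matrix reused by every head
unshared = n_heads * shared         # one private matrix per head
saved = unshared - shared

print(f"shared:   {shared:,} parameters")
print(f"unshared: {unshared:,} parameters")
print(f"saved:    {saved:,} parameters")
```

Under these assumed sizes, giving every head its own projection would add roughly 1.5 billion parameters, which is why only the per-head Transformer layer is kept independent.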

2.5 Training

The training phase does not incur additional overhead. Only an auxiliary loss function is added. Furthermore, multiple heads compute the loss in parallel, which improves sample utilization efficiency and accelerates model convergence. The derivation of the loss function is as follows.

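$$L_n = -\sum_t \log P_\theta(x_{t+n:t+1} | x_{t:1}) = -\sum_t \sum_{i=1}^{n} \log P_\theta(x_{t+i} | z_{t:1}) \cdot P_\theta(z_{t:1} | x_{t:1})$$

$$P_\theta(x_{t+i} | x_{t:1}) = softmax(f_u(f_{h_i}(f_s(x_{t:1}))))$$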

2.6 Discussion

The authors conducted extensive experiments to demonstrate the benefits of using multiple tokens for simultaneous prediction during training, with the following specific conclusions:

Training with multi-token prediction is only effective for language models with a large number of parameters.
Predicting multiple tokens at once during inference speeds up decoding.
Training with multi-token prediction improves the model's understanding of global context.
The specific design of the added parallel token-prediction heads depends on the dataset.
Predicting multiple tokens simultaneously is useful for summarization tasks, but performs poorly on multiple-choice and math problems.

0x03 DeepSeek MTP

MTP implementations before DeepSeek had a problem: the n tokens were generated independently, which could cause the model to overemphasize local patterns and ignore long-range dependencies, ultimately leading to inconsistent output or even mode collapse. To address this issue, DeepSeek performs multi-token prediction while keeping the complete causal chain for each token's prediction. This approach improves prediction efficiency and also gives the model better contextual understanding, since it attends to more tokens.

3.1 Architecture

The following diagram shows the network structure of DeepSeek MTP. Some specific structures and labels in the diagram are as follows:

The arrows represent the causal chain, that is, the flow of causal dependencies between token predictions.
$t_i$ represents the i-th token.
The main model predicts the next token, and MTP Module 1 predicts the token after that, $Next^2\ Token\ Prediction$ . MTP Module 2 predicts the third token, $Next^3\ Token\ Prediction$ .
The embedding layer and the output head are shared.
The Main Model has a total of L Transformer blocks, while each MTP Module has only one Transformer block (which may lead to insufficient information extraction during prediction).
The blue text in the image indicates that the output comes from the last Transformer Block. This way, the loss from predicting later tokens can be backpropagated to all Transformer blocks, covering as many neurons of the main model as possible.

[Figure: network structure of DeepSeek MTP]

Note: DeepSeek MTP is primarily used during the training phase; remember this, otherwise it will cause difficulties in understanding the illustrations in the paper. The above figure illustrates parallel training.

3.2 Process

The network structure diagram above is rather difficult to understand, so let’s take a single token as an example to see the specific process.

When the Main Module inputs a token $t_1$ , the main model will predict $t_2$ . MTP Module 1 will rely on the latent vector $h_1^0$ (the first token in the output of the Main Model) and the input token $t_2$ to predict $t_3$ . MTP Module 2 will rely on the latent vector $h_1^1$ (the first token in the output of MTP Module 1) and the input token $t_3$ to predict $t_4$ . That is, for the entire model, the input token $t_1$ can be used to predict $t_2$ , $t_3$ , and $t_4$ .

[Figure: prediction flow when the Main Module input is $t_1$]

When the Main Module inputs tokens $t_1$ and $t_2$ , the main model will predict $t_3$ . MTP Module 1 will rely on the latent vector $h_2^0$ (the second token in the output of the Main Module) and the input token $t_3$ to predict $t_4$ . MTP Module 2 will rely on the latent vector $h_2^1$ (the second token in the output of MTP Module 1) and the input token $t_4$ to predict $t_5$ .

[Figure: prediction flow when the Main Module input is $t_1$ , $t_2$]

When the Main Module inputs tokens $t_1$ , $t_2$ , and $t_3$ , the main model will predict $t_4$ . MTP Module 1 will rely on the latent vector $h_3^0$ (the 3rd token in the output of the Main Module) and the input token $t_4$ to predict $t_5$ . MTP Module 2 will rely on the latent vector $h_3^1$ (the third token in the output of MTP Module 1) and the input token $t_5$ to predict $t_6$ .

[Figure: prediction flow when the Main Module input is $t_1$ , $t_2$ , $t_3$]

Therefore, for the i-th token input to the Main Module, $t_i$ , and the k-th prediction depth, MTP Module k takes the hidden output of the previous depth, $h_i^{k-1}$ (produced by the $(k-1)$ th MTP Module, or by the Main Model when k = 1), together with the token at position $(i+k)$ input to this module, $t_{i+k}$ , and predicts the token at position $(i+k+1)$ , that is, $Next^{k+1}$ Token Prediction. It can also be seen that DeepSeek's implementation adds causal chain connections compared to previous methods, and adds residual links at the embedding layer.
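
The index bookkeeping above can be sketched as follows. The helper name `mtp_alignment` is hypothetical (it appears in neither the paper nor vLLM); it assumes 1-based positions and that position i is valid for depth k only when i + k ≤ T.

```python
def mtp_alignment(seq_len, depth):
    """List (hidden_pos, input_token_pos, target_pos) triples for MTP Module `depth`.

    Module k combines h_i^{k-1} with Emb(t_{i+k}) and is supervised with t_{i+k+1},
    for i in [1, T - k].
    """
    return [(i, i + depth, i + depth + 1) for i in range(1, seq_len - depth + 1)]


depth1 = mtp_alignment(seq_len=5, depth=1)
depth2 = mtp_alignment(seq_len=5, depth=2)
print("MTP Module 1:", depth1)
print("MTP Module 2:", depth2)
```

For a sequence of length 5, MTP Module 1 pairs the hidden state at position 1 with input token $t_2$ to predict $t_3$ , matching the single-token walkthrough above.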

Next, let’s look at the case where the entire sample sequence is input simultaneously. In the figure below, each input token in the main module is marked with a different color, and its target token, or prediction token, as well as the auxiliary input tokens of the two MTP Modules, are also marked with the same color.

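[Figure: the whole sample sequence fed in at once, with each input token and its targets marked in the same color]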

As shown in the diagram above, the input tokens of the i-th MTP Module are tokens $(i+1)$ through n, where n is the length of the sequence so far. Each module needs not only the token embeddings but also the hidden states computed by the preceding model. For example, the input to MTP Module 1 is the embeddings of tokens 2 to 5 together with the last-layer hidden states of the main model at positions 1 to 4 (each $h_i^0$ is paired with $Emb(t_{i+1})$ ). This gives MTP the following characteristics:

Dense supervision signals. MTP extends the supervision signal for each token from single-step prediction to multi-step prediction, so each token participates in k+1 predictions (once for the main model and k times for the MTP modules), thus improving data utilization by a factor of k.
Explicit long-range dependency learning. Through sequential modules (MTP Module 1 predicts one step beyond the main model, Module 2 two steps beyond), the model is forced to plan ahead over contexts of different spans (such as the association between the current token and a token 10 steps later), thereby strengthening its ability to model long-range dependencies.
The causal chain expands layer by layer. In the decoder-only architecture, each MTP module uses Masked Self-Attention to ensure that a prediction depends only on preceding information, forming a causal chain that expands from the short term to the long term.
In practice, generation is serial. When DeepSeek-V3 runs prefill, the main model must output its last-layer hidden states before the first MTP module can be prefilled; the first MTP module must output its hidden states before the second can be prefilled, and so on. The prefill passes of multiple MTP modules are therefore serial. Each additional MTP module requires an extra round of serial prefilling and an extra key-value cache in every inference iteration. Inference with a main model and N smaller models can severely impact computational scheduling efficiency, which may explain why DeepSeek-V3 ships only one MTP module.

Let’s take a look at the comparison between DeepSeek MTP and EAGLE.

EAGLE focuses on accelerating the inference stage, while DeepSeek MTP aims to improve training quality and thereby the capabilities of the LLM, especially long-range prediction, with inference speedup as an additional benefit. DeepSeek MTP therefore needs few special techniques during training; instead, the small MTP modules are trained together with the main model on the same massive amounts of data, an advantage that post-training strategies like EAGLE cannot match.
EAGLE uses the same draft model for all three forward passes and can share a single key-value cache. In the diagram above, MTP Module 1 and MTP Module 2 are not the same model and cannot share a key-value cache. That is, EAGLE uses only one draft model autoregressively during multi-step inference; DeepSeek MTP, on the other hand, uses multiple draft models in sequence, with each MTP layer acting as a draft model. Introducing dependencies between MTP modules reduces parallelism but also makes text generation more coherent and better suited to scenarios such as dialogue and reasoning.

3.3 Formula

We will then analyze a single MTP Module using the formulas in the paper. The Transformer input for the i-th token is as follows:

$$h_i^{'k} = M_k[RMSNorm(h_i^{k-1}); RMSNorm(Emb(t_{i+k}))]$$

where:

T is the total length of the sequence.
k is the index of the MTP module.
$h_i^{'k}$ is a combination of two parts:
- $h_i^{k-1}$ is the representation from the previous depth, specifically the representation of the i-th token at the (k−1)-th prediction depth (for k = 1 it is the output of the main model). $RMSNorm(h_i^{k-1})$ carries the context of the preceding tokens.
- $Emb(t_{i+k})$ is the embedding layer output of the $(i+k)$ th token. $RMSNorm(Emb(t_{i+k}))$ brings in the ground truth of the original input, that is, $t_{i+k}$ . This is similar to the Teacher Forcing pattern, which provides better prediction results for the next token.

By concatenating the previous layer’s representation with the current token embedding, and gradually integrating short-term semantics and long-term dependencies, richer intermediate representations can be generated. Furthermore, the MTP Module’s Transformer Block has only one layer and relatively few parameters. It is precisely because of the aid of richer intermediate representations that accurate prediction of some tokens can be achieved with only a few parameters.

The RMSNorm operator normalizes the two representation vectors, making their values more comparable. A 2d-dimensional representation is then generated using the concatenation operator [·;·].
Finally, the linear projection matrix $M_k$ maps the dimension from 2d back to d for use by the Transformer block.

For the i-th token $t_i$ and the k-th prediction depth, the token's path through MTP Module k is as follows:

The input token is first fed into the shared embedding layer. After the preceding stages (the Main Model and the earlier MTP Modules), the hidden output of depth k−1, $h_i^{k-1}$ , is obtained. This is the representation of the i-th token at prediction depth k−1, corresponding to number 1 in the figure.
The depth k−1 hidden output $h_i^{k-1}$ is normalized, corresponding to label 2.1 in the figure.
The auxiliary input token of MTP Module k, namely the token at position i+k, is embedded (corresponding to number 2.2 in the figure). Then, the embedding is normalized (corresponding to number 2.3 in the figure).
Because the Transformer block accepts only one vector per token, the two normalized results are concatenated (corresponding to the ; at number 2.4 in the figure) and fed into the linear layer $M_k$ . This linear transformation maps the concatenated 2d-dimensional vector to a d-dimensional vector, which is the input dimension of the TRM layer, giving $h_i^{'k}$ .
$h_{1:T-k}^{'k}$ is then fed into the Transformer layer to obtain the output at the k-th prediction depth, $h_{1:T-k}^{k}$ (corresponding to number 3 in the figure). The subscript slice can be read in two ways:
- Since T is the total sequence length, an input position i at prediction depth k must satisfy i+k≤T. Therefore, the range of i that the k-th module can accept is i≤T−k, that is, i∈[1,T−k], which is the slice range.
- Subscript 1:T-k indicates that it contains the representations from token 1 to token T-k.
The i-th element, $h_i^k$ , is extracted, mapped to the V vocabulary dimensions by the projection matrix OutHead shared by all modules, and passed through softmax(.) to obtain the probability distribution over the vocabulary. The label of $h_i^k$ is the token at position i+1+k.
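
The normalize, concatenate and project steps can be sketched in plain Python for a single token. This is a toy illustration under stated assumptions: the RMSNorm here has no learned weight, the vectors are made up, and `proj` is an arbitrary d x 2d matrix standing in for $M_k$ . The concatenation order follows the formula above ([hidden; embedding]); the vLLM code in section 3.4 concatenates in the opposite order, and the order only needs to match the trained weights.

```python
import math


def rms_norm(vec, eps=1e-6):
    """RMSNorm without a learned scale: divide by the root mean square."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def mtp_module_input(prev_hidden, token_embedding, proj):
    """h'_i^k = M_k [RMSNorm(h_i^{k-1}); RMSNorm(Emb(t_{i+k}))]."""
    combined = rms_norm(prev_hidden) + rms_norm(token_embedding)  # length 2d
    return matvec(proj, combined)                                 # length d


d = 4
prev_hidden = [0.5, -1.0, 2.0, 0.0]      # stand-in for h_i^{k-1}
token_embedding = [1.0, 1.0, -1.0, 3.0]  # stand-in for Emb(t_{i+k})
# Toy M_k (d x 2d): output slot j averages hidden slot j and embedding slot j.
proj = [[0.5 if col in (row, row + d) else 0.0 for col in range(2 * d)]
        for row in range(d)]

h_prime = mtp_module_input(prev_hidden, token_embedding, proj)
print(len(h_prime), h_prime)
```

The output has length d again, so it can be passed to a standard Transformer block without changing the block's input dimension.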

[Figure: MTP Module annotated with the numbered labels referenced above]

3.4 Implementation

We will study the implementation using the vLLM code.

3.4.1 MTP Module

The class DeepSeekMultiTokenPredictorLayer is an implementation of the MTP Module. Its formula and architecture correspond to the diagram below.

[Figure: formula and architecture of the MTP Module]

The code for DeepSeekMultiTokenPredictorLayer is as follows.

class DeepSeekMultiTokenPredictorLayer(nn.Module):

    def __init__(
        self,
        config: PretrainedConfig,
        prefix: str,
        model_config: ModelConfig,
        cache_config: Optional[CacheConfig] = None,
        quant_config: Optional[QuantizationConfig] = None,
    ) -> None:
        super().__init__()
        self.embed_tokens = VocabParallelEmbedding( # label 2.2 in the figure
            config.vocab_size,
            config.hidden_size,
        )

        self.enorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps) # label 2.3 in the figure (normalizes the token embedding)
        self.hnorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps) # label 2.1 in the figure (normalizes the previous hidden state)
        self.eh_proj = nn.Linear(config.hidden_size * 2, # label 2.5 in the figure
                                 config.hidden_size,
                                 bias=False)
        self.shared_head = SharedHead(config=config, quant_config=quant_config) # label 4 in the figure
        self.mtp_block = DeepseekV2DecoderLayer(config, prefix, model_config, # label 3 in the figure
                                                cache_config, quant_config)

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        previous_hidden_states: torch.Tensor,
        inputs_embeds: Optional[torch.Tensor] = None,
        spec_step_index: int = 0,
    ) -> torch.Tensor:
        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids)
        assert inputs_embeds is not None
        # masking inputs at position 0, as not needed by MTP
        inputs_embeds[positions == 0] = 0
        inputs_embeds = self.enorm(inputs_embeds)
        previous_hidden_states = self.hnorm(previous_hidden_states)

        hidden_states = self.eh_proj(
            torch.cat([inputs_embeds, previous_hidden_states], dim=-1))

        hidden_states, residual = self.mtp_block(positions=positions,
                                                 hidden_states=hidden_states,
                                                 residual=None)
        hidden_states = residual + hidden_states
        return hidden_states

3.4.2 Output Head

The SharedHead class is an implementation of Output Head.

class SharedHead(nn.Module):

    def __init__(
        self,
        config: PretrainedConfig,
        quant_config: Optional[QuantizationConfig] = None,
    ) -> None:
        super().__init__()
        self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.head = ParallelLMHead(config.vocab_size,
                                   config.hidden_size,
                                   quant_config=quant_config)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.norm(hidden_states)

The specific usage is in the compute_logits() function of the DeepSeekMultiTokenPredictor class.

logits = self.logits_processor(mtp_layer.shared_head.head,
                               mtp_layer.shared_head(hidden_states),
                               sampling_metadata)

The code for ParallelLMHead is as follows.

class ParallelLMHead(VocabParallelEmbedding):
    """Parallelized LM head.

    Output logits weight matrices used in the Sampler. The weight and bias
    tensors are padded to make sure they are divisible by the number of
    model parallel GPUs.

    Args:
        num_embeddings: vocabulary size.
        embedding_dim: size of hidden state.
        bias: whether to use bias.
        params_dtype: type of the parameters.
        org_num_embeddings: original vocabulary size (without LoRA).
        padding_size: padding size for the vocabulary.
    """

    def __init__(self,
                 num_embeddings: int,
                 embedding_dim: int,
                 bias: bool = False,
                 params_dtype: Optional[torch.dtype] = None,
                 org_num_embeddings: Optional[int] = None,
                 padding_size: int = DEFAULT_VOCAB_PADDING_SIZE,
                 quant_config: Optional[QuantizationConfig] = None,
                 prefix: str = ""):
        super().__init__(num_embeddings, embedding_dim, params_dtype,
                         org_num_embeddings, padding_size, quant_config,
                         prefix)
        self.quant_config = quant_config
        if bias:
            self.bias = Parameter(
                torch.empty(self.num_embeddings_per_partition,
                            dtype=params_dtype))
            set_weight_attrs(self.bias, {
                "output_dim": 0,
                "weight_loader": self.weight_loader,
            })
        else:
            self.register_parameter("bias", None)

    def tie_weights(self, embed_tokens: VocabParallelEmbedding):
        """Tie the weights with word embeddings."""
        # GGUF quantized embed_tokens.
        if self.quant_config and self.quant_config.get_name() == "gguf":
            return embed_tokens
        else:
            self.weight = embed_tokens.weight
            return self

    def forward(self, input_):
        del input_
        raise RuntimeError("LMHead's weights should be used in the sampler.")

3.4.3 Transformer Block

DeepseekV2DecoderLayer is a Transformer Block, corresponding to number 3 in the diagram above.

class DeepseekV2DecoderLayer(nn.Module):

    def __init__(
        self,
        config: PretrainedConfig,
        prefix: str,
        model_config: ModelConfig,
        cache_config: Optional[CacheConfig] = None,
        quant_config: Optional[QuantizationConfig] = None,
    ) -> None:
        super().__init__()
        self.hidden_size = config.hidden_size
        rope_theta = getattr(config, "rope_theta", 10000)
        rope_scaling = getattr(config, "rope_scaling", None)
        max_position_embeddings = getattr(config, "max_position_embeddings",
                                          8192)
        # DecoderLayers are created with `make_layers` which passes the prefix
        # with the layer's index.
        layer_idx = int(prefix.split(sep='.')[-1])
        self.layer_idx = layer_idx
        if model_config.use_mla:
            attn_cls = DeepseekV2MLAAttention
        else:
            attn_cls = DeepseekV2Attention
        self.self_attn = attn_cls(
            config=config,
            hidden_size=self.hidden_size,
            num_heads=config.num_attention_heads,
            qk_nope_head_dim=config.qk_nope_head_dim,
            qk_rope_head_dim=config.qk_rope_head_dim,
            v_head_dim=config.v_head_dim,
            q_lora_rank=config.q_lora_rank
            if hasattr(config, "q_lora_rank") else None,
            kv_lora_rank=config.kv_lora_rank,
            rope_theta=rope_theta,
            rope_scaling=rope_scaling,
            max_position_embeddings=max_position_embeddings,
            cache_config=cache_config,
            quant_config=quant_config,
            prefix=f"{prefix}.self_attn",
        )

        if (config.n_routed_experts is not None
                and layer_idx >= config.first_k_dense_replace
                and layer_idx % config.moe_layer_freq == 0):
            self.mlp = DeepseekV2MoE(
                config=config,
                quant_config=quant_config,
                prefix=f"{prefix}.mlp",
            )
        else:
            self.mlp = DeepseekV2MLP(
                hidden_size=config.hidden_size,
                intermediate_size=config.intermediate_size,
                hidden_act=config.hidden_act,
                quant_config=quant_config,
                prefix=f"{prefix}.mlp",
            )
        self.input_layernorm = RMSNorm(config.hidden_size,
                                       eps=config.rms_norm_eps)
        self.post_attention_layernorm = RMSNorm(config.hidden_size,
                                                eps=config.rms_norm_eps)
        self.routed_scaling_factor = config.routed_scaling_factor

    def forward(
        self,
        positions: torch.Tensor,
        hidden_states: torch.Tensor,
        residual: Optional[torch.Tensor],
    ) -> torch.Tensor:
        # Self Attention
        if residual is None:
            residual = hidden_states
            hidden_states = self.input_layernorm(hidden_states)
        else:
            hidden_states, residual = self.input_layernorm(
                hidden_states, residual)
        hidden_states = self.self_attn(
            positions=positions,
            hidden_states=hidden_states,
        )

        if hidden_states.dtype == torch.float16:
            # Fix FP16 overflow
            # We scale both hidden_states and residual before
            # rmsnorm, and rmsnorm result would not affect by scale.
            hidden_states *= 1. / self.routed_scaling_factor
            if self.layer_idx == 0:
                # The residual is shared by all layers, we only scale it on
                # first layer.
                residual *= 1. / self.routed_scaling_factor

        # Fully Connected
        hidden_states, residual = self.post_attention_layernorm(
            hidden_states, residual)
        hidden_states = self.mlp(hidden_states)

        if isinstance(self.mlp,
                      DeepseekV2MLP) and hidden_states.dtype == torch.float16:
            # Fix FP16 overflow
            # Scaling the DeepseekV2MLP output, it is the input of
            # input_layernorm of next decoder layer.
            # The scaling of DeepseekV2MOE output would be done in the forward
            # of DeepseekV2MOE
            hidden_states *= 1. / self.routed_scaling_factor

        return hidden_states, residual
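
The FP16 comments in the code rely on the fact that scaling a vector before RMSNorm does not change the normalized result. A minimal check of that property follows; it assumes an RMSNorm without a learned weight and with eps set to zero (real implementations use a small nonzero eps, so the invariance is only approximate there), and the scale 1/16 is an arbitrary illustrative value rather than the model's `routed_scaling_factor`.

```python
import math


def rms_norm(vec, eps=0.0):
    """RMSNorm without a learned scale."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]


x = [3.0, -4.0, 12.0, 0.5]
scale = 1.0 / 16.0                     # illustrative down-scaling factor
scaled = [v * scale for v in x]

normed = rms_norm(x)
normed_scaled = rms_norm(scaled)
max_diff = max(abs(a - b) for a, b in zip(normed, normed_scaled))
print("max difference after normalization:", max_diff)
```

Because the normalized values agree, the activations can be shrunk to stay inside the FP16 range without changing what the next RMSNorm outputs.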

3.4.4 MTP Function

3332

DeepSeekMultiTokenPredictor can be understood as a collection of several MTP Modules.

class DeepSeekMultiTokenPredictor(nn.Module):

    def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
        super().__init__()
        config = vllm_config.model_config.hf_config
        self.mtp_start_layer_idx = config.num_hidden_layers
        self.num_mtp_layers = config.num_nextn_predict_layers
        # to map the exact layer index from weights
        self.layers = torch.nn.ModuleDict({
            str(idx):
            DeepSeekMultiTokenPredictorLayer(
                config,
                f"{prefix}.layers.{idx}",
                model_config=vllm_config.model_config,
                cache_config=vllm_config.cache_config,
                quant_config=vllm_config.quant_config,
            )
            for idx in range(self.mtp_start_layer_idx,
                             self.mtp_start_layer_idx + self.num_mtp_layers)
        })

        self.logits_processor = LogitsProcessor(config.vocab_size)

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        previous_hidden_states: torch.Tensor,
        inputs_embeds: Optional[torch.Tensor] = None,
        spec_step_idx: int = 0,
    ) -> torch.Tensor:
        current_step_idx = (spec_step_idx % self.num_mtp_layers)
        return self.layers[str(self.mtp_start_layer_idx + current_step_idx)](
            input_ids,
            positions,
            previous_hidden_states,
            inputs_embeds,
            current_step_idx,
        )

    def compute_logits(
        self,
        hidden_states: torch.Tensor,
        sampling_metadata: SamplingMetadata,
        spec_step_idx: int = 0,
    ) -> torch.Tensor:
        current_step_idx = (spec_step_idx % self.num_mtp_layers)
        mtp_layer = self.layers[str(self.mtp_start_layer_idx +
                                    current_step_idx)]
        logits = self.logits_processor(mtp_layer.shared_head.head,
                                       mtp_layer.shared_head(hidden_states),
                                       sampling_metadata)
        return logits
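
The layer lookup in `forward` and `compute_logits` can be isolated into a small sketch. The function name `mtp_layer_key` is hypothetical, and the values 61 hidden layers and 1 or 2 MTP layers are illustrative assumptions used only to show how the modulo wraps around.

```python
def mtp_layer_key(spec_step_idx, num_hidden_layers, num_mtp_layers):
    """Reproduce the ModuleDict key chosen for a given speculative step."""
    current_step_idx = spec_step_idx % num_mtp_layers
    return str(num_hidden_layers + current_step_idx)


# With a single MTP layer, every speculative step reuses the same module.
keys_one = [mtp_layer_key(step, 61, 1) for step in range(3)]
# With two MTP layers, the steps alternate between the two modules.
keys_two = [mtp_layer_key(step, 61, 2) for step in range(4)]
print(keys_one)
print(keys_two)
```

The keys start at `num_hidden_layers` because the MTP layers are stored in the checkpoint right after the main model's layers, which is what the "map the exact layer index from weights" comment refers to.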

DeepSeekMTP encapsulates DeepSeekMultiTokenPredictor, presenting MTP functionality to the outside world.

class DeepSeekMTP(nn.Module):

    def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
        super().__init__()
        self.config = vllm_config.model_config.hf_config
        self.model = DeepSeekMultiTokenPredictor(vllm_config=vllm_config,
                                                 prefix=maybe_prefix(
                                                     prefix, "model"))

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        previous_hidden_states: torch.Tensor,
        intermediate_tensors: Optional[IntermediateTensors] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        spec_step_idx: int = 0,
    ) -> torch.Tensor:
        hidden_states = self.model(input_ids, positions,
                                   previous_hidden_states, inputs_embeds,
                                   spec_step_idx)
        return hidden_states

    def compute_logits(
        self,
        hidden_states: torch.Tensor,
        sampling_metadata: SamplingMetadata,
        spec_step_idx: int = 0,
    ) -> Optional[torch.Tensor]:
        return self.model.compute_logits(hidden_states, sampling_metadata,
                                         spec_step_idx)

    def load_weights(self, weights: Iterable[Tuple[str,
                                                   torch.Tensor]]) -> Set[str]:
        stacked_params_mapping = [
            ("gate_up_proj", "gate_proj", 0),
            ("gate_up_proj", "up_proj", 1),
        ]

        expert_params_mapping = FusedMoE.make_expert_params_mapping(
            ckpt_gate_proj_name="gate_proj",
            ckpt_down_proj_name="down_proj",
            ckpt_up_proj_name="up_proj",
            num_experts=self.config.n_routed_experts)

        params_dict = dict(self.named_parameters())
        loaded_params: Set[str] = set()
        for name, loaded_weight in weights:
            if "rotary_emb.inv_freq" in name:
                continue
            spec_layer = get_spec_layer_idx_from_weight_name(self.config, name)
            if spec_layer is None:
                continue
            name = self._rewrite_spec_layer_name(spec_layer, name)
            for (param_name, weight_name, shard_id) in stacked_params_mapping:
                # Skip non-stacked layers and experts (experts handled below).
                if weight_name not in name:
                    continue
                # We have mlp.experts[0].gate_proj in the checkpoint.
                # Since we handle the experts below in expert_params_mapping,
                # we need to skip here BEFORE we update the name, otherwise
                # name will be updated to mlp.experts[0].gate_up_proj, which
                # will then be updated below in expert_params_mapping
                # for mlp.experts[0].gate_gate_up_proj, which breaks load.
                if (("mlp.experts." in name) and name not in params_dict):
                    continue
                name = name.replace(weight_name, param_name)
                # Skip loading extra bias for GPTQ models.
                if name.endswith(".bias") and name not in params_dict:
                    continue

                param = params_dict[name]
                weight_loader = param.weight_loader
                weight_loader(param, loaded_weight, shard_id)
                break
            else:
                for mapping in expert_params_mapping:
                    param_name, weight_name, expert_id, shard_id = mapping
                    if weight_name not in name:
                        continue
                    name = name.replace(weight_name, param_name)

                    param = params_dict[name]
                    weight_loader = param.weight_loader
                    weight_loader(param,
                                  loaded_weight,
                                  name,
                                  shard_id=shard_id,
                                  expert_id=expert_id)
                    break
                else:
                    # Skip loading extra bias for GPTQ models.
                    if name.endswith(".bias") and name not in params_dict:
                        continue

                    param = params_dict[name]
                    weight_loader = getattr(param, "weight_loader",
                                            default_weight_loader)
                    weight_loader(param, loaded_weight)
            loaded_params.add(name)
        return loaded_params

    def _rewrite_spec_layer_name(self, spec_layer: int, name: str) -> str:
        """
        Rewrite the weight name to match the format of the original model.
        Add .mtp_block for modules in transformer layer block for spec layer
        """
        spec_layer_weight_names = [
            "embed_tokens", "enorm", "hnorm", "eh_proj", "shared_head"
        ]
        spec_layer_weight = False
        for weight_name in spec_layer_weight_names:
            if weight_name in name:
                spec_layer_weight = True
                break
        if not spec_layer_weight:
            # treat rest weights as weights for transformer layer block
            name = name.replace(f"model.layers.{spec_layer}.",
                                f"model.layers.{spec_layer}.mtp_block.")
        return name
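
To see what `_rewrite_spec_layer_name` does to concrete names, here is a standalone re-implementation of the same logic. The weight names in the example are hypothetical strings built from attribute names that appear in the code above; they are not read from a real checkpoint.

```python
SPEC_LAYER_WEIGHT_NAMES = ("embed_tokens", "enorm", "hnorm", "eh_proj",
                           "shared_head")


def rewrite_spec_layer_name(spec_layer, name):
    """Standalone copy of the renaming rule, for illustration only."""
    if any(weight_name in name for weight_name in SPEC_LAYER_WEIGHT_NAMES):
        return name  # weights owned directly by the MTP layer keep their name
    # Everything else belongs to the Transformer block inside the MTP layer.
    return name.replace(f"model.layers.{spec_layer}.",
                        f"model.layers.{spec_layer}.mtp_block.")


examples = [
    "model.layers.61.input_layernorm.weight",
    "model.layers.61.eh_proj.weight",
    "model.layers.61.shared_head.head.weight",
]
rewritten = [rewrite_spec_layer_name(61, name) for name in examples]
for before, after in zip(examples, rewritten):
    print(before, "->", after)
```

Only the first name changes, because `input_layernorm` lives inside `mtp_block` in the module hierarchy, while `eh_proj` and `shared_head` are attributes of the MTP layer itself.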

3.5 Training

In the DeepSeek model, multi-token prediction is primarily used during the training phase. A cross-entropy loss is computed at the head of each MTP module, and the losses of all MTP modules are then combined into an additional training objective. Specifically:

$t_i$ represents the ground-truth token at position i, while $p_i^k[t_i]$ is the probability that the k-th MTP module assigns to $t_i$ .
2+k:T+1 is the range of the labels. 2+k is the starting index; see the following diagram as an example. When the input is $t_1$ , the Main Model predicts $t_2$ and MTP Module 1 predicts the token after next, that is, $t_3$ . By the same reasoning, the first token predicted by MTP Module k is $t_{2+k}$ . T+1 is the end index: by default an eos token is appended to every sample sequence, so the last token index is the sequence length plus one, T+1.

[Figure: inputs and prediction targets of the main model and the MTP modules]

Note also that MTP is still trained in teacher forcing mode. In principle, the MTP module would take the output of the previous step (the predicted token $t'_2$ in the diagram) as its input, but in sequence-model training, feeding the ground-truth token from the sample gives better results: if the prediction $t'_2$ were used as input, prediction errors would accumulate over time and degrade performance. The analogy with teacher forcing is only approximate, however. Under standard teacher forcing, the effective inputs for MTP Module 1 when predicting $t_6$ would be $t_2$, $t_3$, $t_4$, and $t_5$; in MTP, the only effective token is $t_5$.
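
The alignment between ground-truth inputs and labels can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the training code of any framework: `teacher_forced_pairs` and `mtp_loss` are hypothetical helpers, the hidden states passed between modules are ignored, the probabilities are made up, and the $1/T$ normalisation and the weight (0.3 here) are assumptions borrowed from the DeepSeek-V3 technical report.

```python
import math


def teacher_forced_pairs(tokens, depth):
    """Return {k: (input_tokens, labels)} for k = 0 (main model) .. depth.

    `tokens` holds t_1 .. t_{T+1} (the last one being EOS), stored 0-based,
    so tokens[i - 1] is t_i. Depth k reads ground-truth tokens
    t_{1+k} .. t_T and is scored against labels t_{2+k} .. t_{T+1}.
    """
    pairs = {}
    for k in range(depth + 1):
        inputs = tokens[k:-1]      # t_{1+k} .. t_T
        labels = tokens[k + 1:]    # t_{2+k} .. t_{T+1}
        pairs[k] = (inputs, labels)
    return pairs


def mtp_loss(label_probs, seq_len, weight):
    """Average the per-module cross-entropy losses and scale by `weight`.

    label_probs[k - 1] lists the probability module k gave to each of its
    labels; each module's sum is divided by the sequence length T.
    """
    per_module = [-sum(math.log(p) for p in probs) / seq_len
                  for probs in label_probs]
    return weight * sum(per_module) / len(per_module)


tokens = ["t1", "t2", "t3", "t4", "t5", "<eos>"]  # T = 5, plus EOS
pairs = teacher_forced_pairs(tokens, depth=2)
for k, (inputs, labels) in pairs.items():
    print(k, inputs, labels)

# Made-up probabilities for the 4 labels of module 1 and 3 labels of module 2.
loss = mtp_loss([[0.5, 0.5, 0.5, 0.5], [0.25, 0.25, 0.25]],
                seq_len=5, weight=0.3)
print(loss)
```

For module 1, the last pair is input `t5` with label `<eos>` (that is, $t_6$), which matches the observation that only $t_5$ is fed as a ground-truth token when $t_6$ is predicted.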

3.6 Inference

MTP is designed primarily to accelerate convergence during training and to make fuller use of the training samples. Predicting 4 tokens at once, for example, gives each position four training labels instead of one, which is roughly comparable to enlarging the training data several times over. It is best suited to the pre-training stage, where fast convergence matters most. At inference time, DeepSeek-V3 can be used in two ways:

Only the main model is retained for prediction. The MTP modules are discarded entirely, which turns the model back into an ordinary next-token-prediction model. It is then deployed like any other LLM, with no inference speedup.
MTP is combined with speculative decoding to accelerate inference. The MTP modules are retained and used for self-speculative decoding, so the multi-token prediction capability is put to use at inference time as well.
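
The second option can be illustrated with a toy greedy draft-and-verify loop. This is only a sketch of the general idea of speculative decoding, not how DeepSeek-V3 or any serving framework implements it: `draft_next` and `target_next` are hypothetical stand-ins for the MTP modules and the main model, and tokens are plain integers.

```python
def speculative_step(context, draft_next, target_next, num_draft):
    """One greedy draft-and-verify step; returns the tokens accepted."""
    # 1. Draft: the cheap predictor proposes `num_draft` tokens in a row.
    draft, ctx = [], list(context)
    for _ in range(num_draft):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    # 2. Verify: keep drafted tokens while the main model agrees with them.
    #    A real system scores all drafted positions in one forward pass;
    #    this toy version calls the main model position by position.
    accepted, ctx = [], list(context)
    for token in draft:
        expected = target_next(ctx)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch
            return accepted
        accepted.append(token)
        ctx.append(token)
    accepted.append(target_next(ctx))  # bonus token when all drafts match
    return accepted


# Toy stand-ins: the "main model" counts upward, and the "draft head" does
# the same except that it makes a mistake right after any multiple of 4.
def target_next(ctx):
    return ctx[-1] + 1


def draft_next(ctx):
    return ctx[-1] + 2 if ctx[-1] % 4 == 0 else ctx[-1] + 1


sequence = [1]
while len(sequence) < 10:
    sequence.extend(speculative_step(sequence, draft_next, target_next, 2))
print(sequence)
```

Because every accepted token is either confirmed or supplied by the main model, the output is identical to what the main model would produce on its own with greedy decoding; the gain comes from accepting several tokens per verification step whenever the draft is right.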

0xFF Reference

DeepSeek Technology Explained (2) - The Past and Present of MTP (Multi-Token Prediction) (by Jiang Fuchun )
Better & Faster Large Language Models via Multi-token Prediction
A collection of papers on Speculative Decoding; Gray Pupil Sextant
[Paper Interpretation] EAGLE: A speculative sampling framework for autoregression at the feature layer (tomsheep)
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
[Reading Notes] Multi-token prediction (A Lost Little Bookboy)
From EAGLE to MTP: A Visual Guide to DeepSeek-V3 Multi-Token Prediction Implementation and Thinking (Marginal Effect, solrex)
DeepSeek V3
Blockwise Parallel Decoding for Deep Autoregressive Models
Deepseek v3 Technical Report - Step-by-Step Analysis of the Graph - 3 - MTP That’s Hard to Understand - Formulas Contain Spelling Errors - Lost Little Bookboy
[Deepseek v3 Technical Report Study] 3. Multi-Token Prediction Duludulu
Does DeepSeek-V2 MLA KV Cache really save money? (2) pika-jy
Does DeepSeek-V2 MLA KV Cache really save money? pika-jy
DeepSeek-V3 MTP Engineering Implementation Thoughts by Geek Boge
Speculative Reasoning Extra Chapter 1: Feature Layer Speculative Decoding CV Algorithm and MLSys
Speculative Decoding – What makes efficient speculative decoding? (Hexagonal Close-packed Orange)
LEARNING HARMONIZED REPRESENTATIONS FOR SPECULATIVE SAMPLING
CORAL: Learning Consistent Representations across Multi-step Training with Lighter Speculative Drafter
EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
Lossless acceleration up to 5x, EAGLE-2 allows the RTX 3060 to outpace the A100 in build speed ( Machine Heart Pro).
“DeepSeek-V3 Technical Analysis”: Multi-Token Prediction (MTP) Technology (Baihai IDP )
Xiaohongshu proposes HASS, a new state-of-the-art algorithm for accelerating large-scale model inference.
Why is DeepSeek-MTP such a well-designed, high-performance platform that doesn’t require all-nighters?
Multi-Token Prediction in DeepSeek
Understanding the Shadow of the Multi-Token Prediction (MTP) Broom in DeepSeek
[Deepseek Technology Principles] Part 2: The Most Detailed Illustrated Explanation of the MTP Model Structure and Inspirational Insights - Luo Ji
[Paper Interpretation] MTP: Enabling LLM to Predict Multiple Tokens at Once (Tomsheep)
LLM Inference Acceleration (1): Multi-Token Prediction YueDa
[LLM Speculative Reasoning] Speculative Sampling Beyond Medusa – An Interpretation of the EAGLE 1/2 Paper by Ajay
