Exploring the Transformer Series (28) --- DeepSeek MLA
DeepSeek MLA: low-rank KV compression, weight absorption, decoupled RoPE, resource tradeoffs, implementation details, and conversions from GQA and MHA.
0x00 Overview
The basic idea of MLA (Multi-head Latent Attention) is to compress the attention input h_t into a low-dimensional latent vector c_t^{KV} whose dimension is d_c, and d_c is much smaller than the original dimension. When attention needs to be calculated, this latent vector can be mapped back to a higher-dimensional space. Therefore, only the latent vector c_t^{KV} needs to be stored. This can significantly reduce memory usage.
This process can be described more formally. c_t^{KV} is the latent vector. W^{DKV} is the compression matrix; the superscript D stands for “down-projection”, that is, a dimensionality reduction. It compresses h_t from its original dimension to d_c. W^{UK} and W^{UV} are up-projection matrices that map the shared latent vector c_t^{KV} back to a higher-dimensional space. Only the latent vector c_t^{KV} needs to be stored: the key and value of each token can be recomputed from it, so they no longer have to be stored for every token.
Similarly, we can map the query vector to a low-dimensional latent vector and then map it back to the original high-dimensional space. Moreover, MLA uses a weight absorption technique to reduce computational overhead.
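Written out in the symbols defined above, the down- and up-projections read as follows. This is a sketch that omits RoPE; the superscript C on k and v marks the result of the compressed path and matches the k_t^C and v_t^C used later in this article.

```latex
\begin{aligned}
c_t^{KV} &= W^{DKV} h_t      && \text{(down-projection, cached)} \\
k_t^{C}  &= W^{UK} c_t^{KV}  && \text{(up-projection to keys)} \\
v_t^{C}  &= W^{UV} c_t^{KV}  && \text{(up-projection to values)}
\end{aligned}
```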

Note:
- The complete list of articles is here. It’s estimated to eventually have around 35 articles. This list will be updated after each subsequent article is published. (Cnblogs Exploring Transformer Series: Article List)
- This series is a study and interpretation of papers, blogs, and code, drawing on many articles from online friends. I would like to express my gratitude to them, and their names will be listed in the references. Because there are so many references in this series, there may be omissions in the citations. If the original authors or other friends find any omissions, please point them out, and I will add them to the references.
0x01 Principle
1.1 Problem
A major obstacle for standard Transformers is the memory consumed by the key-value (KV) cache: multi-head attention requires each attention head to store its own previously generated key and value vectors. The KV cache grows linearly with sequence length, so memory consumption rises quickly for long contexts. GPU memory is limited, and a large KV cache means that fewer requests can be processed simultaneously, that is, a smaller batch size. To reduce KV cache requirements, researchers have proposed methods such as Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). These methods shrink the cache, but they also hurt model quality. In addition, during attention computation the whole KV cache is read while each element is used only once or a few times, which results in extremely low MFU (Model FLOPs Utilization). Since each request has its own KV cache, this problem cannot be solved by increasing the batch size.
Therefore, a key issue is how to reduce the key-value cache during inference, thereby enabling inference of longer contexts on fewer devices, or increasing the batch size with the same context length to achieve faster inference speed or greater throughput, ultimately reducing inference costs.
1.2 Current Situation
Let’s first summarize the current status of various solutions to see if there’s room for improvement. The following diagram illustrates the approaches for MHA, GQA, MQA, and MLA.

The diagram shows MHA, GQA, MQA, and MLA from left to right. The shaded bars represent results cached in GPU memory. MHA, GQA, and MQA all need to keep the KV cache in GPU memory. The characteristics of these schemes are as follows.
- MHA: The KV cache is “one-to-one” with the query heads. MHA splits one attention calculation into multiple heads, each with independent Q, K, and V projections, and both K and V must be stored for every head. The number of elements cached for each token is 2 n_h d_h l. GQA and MQA, by contrast, use fewer K/V heads than Q heads.
- MQA: All query heads share a single key head and a single value head, so only the shared K and V need to be stored. The number of elements cached for each token is 2 d_h l. When attention is calculated, the shared K head and V head are broadcast to every query head, and each head is then computed individually.
- GQA: The Q heads are divided into g groups. Q heads in the same group share one K head and one V head, so the number of elements cached for each token is 2 n_g d_h l. When attention is calculated, each key-value head is copied to all Q heads of its group.
Here n_h is the number of attention heads, n_g is the number of GQA groups (written g above), d_h is the dimension of a single head, and l is the number of model layers. h_t ∈ R^d is the input of the t-th token to an attention layer.
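The three cache sizes above differ only in how many KV heads are stored. A minimal sketch of that arithmetic follows; the function name and the configuration values are hypothetical and chosen only for illustration, not taken from any real model.

```python
def kv_cache_elements_per_token(n_kv_heads, d_h, n_layers):
    """One key vector and one value vector of size d_h per KV head, per layer."""
    return 2 * n_kv_heads * d_h * n_layers


# Hypothetical configuration, for illustration only.
n_h, n_g, d_h, l = 32, 8, 128, 32

mha = kv_cache_elements_per_token(n_h, d_h, l)  # one KV head per query head
gqa = kv_cache_elements_per_token(n_g, d_h, l)  # one KV head per group
mqa = kv_cache_elements_per_token(1, d_h, l)    # a single shared KV head

print(f"MHA {mha}, GQA {gqa}, MQA {mqa} elements per token")
```

With these assumed numbers GQA stores a quarter of what MHA stores, and MQA stores 1/32 of it.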
1.3 Improvement Ideas
MLA is an improvement on the MHA, GQA, and MQA schemes. Its idea is to enhance information compression capabilities, corresponding to number 1 in the diagram below, and enrich information expression capabilities, corresponding to number 2 in the diagram below. In fact, the two numbers also correspond to two key points in the data flow from input to Q, K, and V, which are two sides of the same coin: while enhancing the matrix’s expressive capabilities, it also makes the compression capabilities greater.

This leads to a common dilemma for researchers: they need both stronger compression, to reduce the KV cache overhead during inference, and stronger expressiveness, to alleviate the quality loss caused by MQA and GQA. In other words, the new solution should have expressiveness as close as possible to MHA.
1.3.1 Enhance information compression capabilities
Ideas
From a certain perspective, MQA and GQA are also forms of low-rank compression. MQA compresses the 2 n_h cached heads to 2, and GQA compresses them to 2g. However, compression trades off against quality: the more heads are merged, the more quality is lost, which is why GQA performs better than MQA.
Therefore, we need to consider whether compression can be pushed further while quality is maintained. MQA has already reduced the number of key-value heads to one, so the head count cannot be reduced any further. We therefore have to look at the keys and values themselves. Currently, both GQA and MQA cache K and V separately, and the two are different. Is it possible to merge them into one? Is it possible to make each cached entry smaller than before? Inspired by LoRA, an M × N matrix can be approximated by the product of an M × k matrix and a k × N matrix. If the K or V projection is split into the product of two smaller matrices, the memory usage of the KV cache can be reduced.
Plan
The core of MLA is to perform low-rank joint compression of attention keys and values to reduce the size of the key-value (KV) cache during inference, thereby improving inference efficiency. Unlike GQA and MQA, which directly compress the KVCache head dimension, MLA uses a down-projection matrix W^{DKV}. The key and value of multiple attention heads are projected into a low-dimensional shared latent vector space, replacing the traditional head-by-head storage method.
Specifically, MLA transforms the KV matrix into a low-rank form: representing the original matrix as the product of two smaller matrices, equivalent to latent vectors.
- The input hidden state first goes through a low-rank down-projection, which compresses a hidden state of shape [S,H] into a latent vector c_t^{KV} of shape [S,CH], where CH ≪ H and H is the token dimension.
- The compressed KV vector c_t^{KV} is stored in GPU memory as the KV cache, which achieves the goal of reducing the KV size. In the V2 paper, K_t changes from W^K h_t to W^{UK} W^{DKV} h_t. What was originally cached was W^K h_t; what is cached now is the W^{DKV} h_t part of K_t.
Question
However, there is a problem: if a large K/V matrix is simply split into two smaller matrices for caching, the complete K matrix still needs to be calculated during inference, which defeats the purpose of caching, since the purpose of caching is to reduce computation.
The question then becomes: Is there a way to reduce cache size without increasing computation during inference?
1.3.2 Enriching Information Expression
Ideas
We can observe that MQA and GQA use nothing more than a simple broadcast or copy to hand each key-value head to its query heads. Take GQA as an example. GQA aims to reduce KV cache usage; it stores the keys and values as C^{KV}. The steps and the formula below show how k is obtained; the operation for v is analogous and omitted here.
- First, the vector is split in half, and the two halves are designated K and V respectively.
- Then each half is further divided into g parts.
- Each part is replicated h/g times to obtain enough K and V values for the h attention heads.
Here, W^{UK} is a combination of simple linear changes such as simple replication, and its expressive power is limited, so its compression dimension is not large.
k = W^{UK} C^{KV} = W^{UK} [k^1, ..., k^g, v^1, ..., v^g]
= [k^1, ..., k^1, k^2, ..., k^2, ..., k^g, ..., k^g]
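To make the point about limited expressiveness concrete, the replication above can be written as multiplication by a fixed 0/1 matrix. The sketch below is illustrative only: the function names are hypothetical, and scalars stand in for the key and value vectors to keep the output readable.

```python
def gqa_key_selection_matrix(n_heads, n_groups):
    """Fixed 0/1 matrix playing the role of W^{UK} in GQA.

    C^{KV} holds g keys followed by g values, so the matrix has 2*g columns.
    Row i selects the key of the group that head i belongs to; the value
    columns are always zero.
    """
    heads_per_group = n_heads // n_groups
    width = 2 * n_groups
    return [[1 if col == row // heads_per_group else 0 for col in range(width)]
            for row in range(n_heads)]


def apply(matrix, cached):
    # With 0/1 weights every output entry is a plain copy of one cached entry.
    return [sum(w * x for w, x in zip(row, cached)) for row in matrix]


c_kv = [10, 20, 70, 80]  # stand-ins for [k^1, k^2, v^1, v^2]
W_UK = gqa_key_selection_matrix(n_heads=6, n_groups=2)
per_head_keys = apply(W_UK, c_kv)
print(per_head_keys)  # [10, 10, 10, 20, 20, 20]
```

Nothing in this matrix is learned, which is the limitation that a trainable up-projection is meant to remove.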
Since MQA and GQA have limited information representation capabilities, could we introduce a matrix transformation to replace these simple linear transformation operations, splitting and copying? For example, by adaptively learning for each q, we could enrich the information representation of this layer.
Plan
We have obtained the latent vector c_t^{KV}. During inference, the per-head up-projection matrices, W^{UK} for the key and W^{UV} for the value, can then reconstruct K and V from this latent vector.
Specifically, MLA will:
- Load the compressed KV cache latent vectors c_t^{KV}.
- Apply the up-projection matrices W^{UK} and W^{UV} to convert them into K and V matrices of shape [S,H]. This recovers the key and value of each head by mapping the latent vector back to the high-dimensional space. Because W^{UK} and W^{UV} are learned projections, these two up-projections are far more expressive than the combination of simple linear operations used in GQA.
- Perform the MHA calculation.
In this way, MLA caches only the latent vectors during inference, instead of the complete keys and values.
MLA is essentially a lossy compression of key-value information, but it can be trained to learn how to improve the density of stored information while preserving key details as much as possible. This avoids the information loss associated with grouped query attention and multi-query attention, thus achieving better performance while reducing key-value caching. From the computational characteristics of the MLA operator, both of these issues are addressed simultaneously.
- On the one hand, low-rank compression significantly reduces the KV cache resource overhead during inference. Reducing the KV cache during inference enables inference of longer contexts on fewer devices, or increases the batch size with the same context length, achieving faster inference speed or greater throughput, ultimately reducing inference costs.
- On the other hand, the multi-head attention that follows MLA decompression offers high computational intensity, proportional to the number of heads, which helps to fully utilize the GPU's computing resources and alleviates the quality loss caused by MQA and GQA. MLA compresses the KV cache through a low-rank transformation, which, according to the formula, introduces additional up-projection computation and requires storing the activations of those up-projections. However, thanks to the associative property of matrix multiplication, the up-projection weights can be fused with other weights, and the attention calculation can then be completed directly in the attention kernel without the extra computation and storage.
1.3.3 Resolving the Positional Encoding Conflict
However, compression and RoPE positional encoding conflict: after matrix absorption, c_t^{KV} carries no position information, whereas RoPE is position-sensitive for both keys and queries. Relying on c_t^{KV} alone to compress the KV cache is therefore not viable, and additional information is needed to represent the positional relationship between q and k. To overcome this challenge, DeepSeek proposes a compromise: two extra matrices, W^{QR} and W^{KR}, extract the RoPE-related features, adding d_h^R extra dimensions to both q and k that carry the RoPE encoding. The original d_h dimensions do not use RoPE, and the total per-head length becomes d_h + d_h^R. In other words, MLA borrows the idea of MQA and builds cached variables that are shared by all heads: c_t^{KV} and k_t^R. This significantly reduces the KV cache size. c_t^{KV} is the low-dimensional vector of the low-rank decomposition, taken after the down-projection and before the up-projection, and k_t^R can be viewed as an MQA version of the RoPE key.
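The conflict can be seen in a toy setting. In the sketch below, a single 2×2 rotation stands in for RoPE, and the weights, the frequency THETA, and all function names are hypothetical. If RoPE were applied after the up-projection, the matrix that would have to be absorbed is W_UQ^T R_{n-m} W_UK, which changes with the relative position n - m, so no single pre-merged matrix exists.

```python
import math


def rot(angle):
    """2x2 rotation matrix, the block RoPE applies to each pair of dimensions."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


# Hypothetical up-projection weights for one head (column-vector convention).
W_UQ = [[1.0, 2.0], [0.5, -1.0]]
W_UK = [[0.3, 1.0], [-2.0, 0.7]]
THETA = 0.5  # assumed rotation frequency


def merged_matrix(offset):
    """W_UQ^T R_{offset} W_UK: the matrix to absorb for relative position `offset`."""
    return matmul(matmul(transpose(W_UQ), rot(offset * THETA)), W_UK)


def score_direct(c_q, c_kv, m, n):
    """q = R_m W_UQ c_q, k = R_n W_UK c_kv, score = q . k."""
    q = matmul(rot(m * THETA), matmul(W_UQ, [[x] for x in c_q]))
    k = matmul(rot(n * THETA), matmul(W_UK, [[x] for x in c_kv]))
    return sum(qi[0] * ki[0] for qi, ki in zip(q, k))


def score_merged(c_q, c_kv, m, n):
    """The same score computed from the latents via the offset-dependent matrix."""
    row = matmul([c_q], merged_matrix(n - m))[0]
    return sum(r * x for r, x in zip(row, c_kv))


c_q, c_kv = [0.2, -1.0], [1.5, 0.4]
print(score_direct(c_q, c_kv, 3, 7), score_merged(c_q, c_kv, 3, 7))
print(merged_matrix(1) == merged_matrix(2))  # False: the matrix depends on the offset
```

Keeping a small RoPE-carrying part separate, as the decoupled strategy does, leaves the remaining part position-independent and therefore absorbable.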
See the image below for details.

1.4 Architecture Diagram & Process
In contrast, the following diagram shows the mathematical formula for MHA, which requires caching 2 n_h d_h l elements for each token. For Qwen 72B, it needs 2 × 80 × 64. Here q_{t,i}, k_{t,i}, and v_{t,i} are all column vectors. t is the index of the current token, j ranges over tokens 1 to t, and i is the index of the head.

The following diagram shows the architecture of MLA and its formulas.

In the diagram, the yellow area is primarily used to calculate Q, the Q matrix in Attention. The green area is mainly used to calculate the position-insensitive part of K. The purple area calculates the position-sensitive part of K. The gray area aggregates K, and the red area calculates V. The specific process is as follows:
- Down-projection of the query Q: the t-th token of the input sequence, h_t, is compressed by the down-projection matrix W^{DQ} into the latent vector c_t^{Q}, whose dimension is much smaller than that of the input token. This corresponds to number 37 in the diagram.
- Joint compression of the key K and value V: the t-th token of the input sequence, h_t, is compressed by the down-projection matrix W^{DKV} into the latent vector c_t^{KV}. Its dimension d_c is much smaller than the dimension d of the input token. During inference, MLA only needs to cache c_t^{KV} (plus the shared RoPE key k_t^R introduced in the next step), so the compressed part of the KV cache has only d_c × l elements, where l is the number of model layers. This corresponds to number 41 in the diagram.
- Decoupled RoPE strategy: to keep the model sensitive to positional information in the sequence, MLA applies decoupled rotary position encoding. Because RoPE is incompatible with the low-rank KV compression matrices, MLA introduces an additional query vector q_t^R and a shared key vector k_t^R to carry the RoPE information. This avoids the coupling between RoPE and the low-rank compression matrices and resolves the tension between positional information and inference efficiency. This roughly corresponds to labels 39 and 43 in the diagram.
- Recovery: when attention is calculated, c_t^{KV} passes through the up-projection matrices W^{UK} and W^{UV} to reconstruct the keys and values, corresponding to numbers 42 and 45 in the diagram. The key of each attention head is then concatenated with the shared RoPE-carrying key vector k_t^R to form the key input of MHA, which corresponds to number 44. c_t^Q passes through W^{UQ} and W^{QR} to produce the query vector q_t^C (number 38) and the RoPE-carrying query vector q_t^R (number 39). The two are concatenated to form the query input of MHA, which corresponds to number 40.
- Attention calculation. This corresponds to number 46 in the diagram.
- Finally, the outputs of the heads are concatenated and passed through the linear mapping W^O to obtain the final output. This corresponds to number 47 in the diagram.
The characteristics of MLA can be seen from the image:
From a qualitative perspective, it can save memory because:
- Before entering the standard MHA algorithm, the longer KV vector is replaced with a compressed vector. Previously, both K and V vectors were cached. Now, only the compressed vector is stored.
- The compressed vector can be expanded back into per-head K and V, although these are reconstructions learned through the low-rank bottleneck, not the K and V of standard MHA.
Quantitatively speaking, each Transformer layer only caches the vector shown in the blue box in the above formula: c_t^{KV} and k_t^R. The rest can be recovered using “matrix absorption”. The sizes of these two vectors are:
- c_t^{KV} has dimension d_c = 4 × d_h, where d_h is the vector dimension of a single head. c_t^{KV} is shared by all heads.
- k_t^R has dimension d_h^R = d_h / 2. k_t^R is also shared by all heads.
MQA caches one d_h-dimensional k and one d_h-dimensional v per layer, 2 d_h in total. MLA caches 4 d_h + d_h / 2 = 4.5 d_h, which is 2.25 times the storage of MQA. MHA caches 2 n_h d_h per layer, so MLA is smaller whenever n_h is greater than 2.25, which is always the case in practice.
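The comparison can be checked with a few lines of arithmetic. This is a sketch under the sizing stated above (d_c = 4 d_h and d_h^R = d_h / 2); the values d_h = 128 and n_h = 128 and the function name are assumptions made only for the example.

```python
def mla_elements_per_token_layer(d_h):
    d_c = 4 * d_h        # compressed latent c_t^{KV}
    d_rope = d_h // 2    # shared RoPE key k_t^R
    return d_c + d_rope


d_h, n_h = 128, 128  # assumed head dimension and head count

mla = mla_elements_per_token_layer(d_h)  # elements cached per token per layer
mqa = 2 * d_h                            # one shared K head and one shared V head
mha = 2 * n_h * d_h                      # K and V for every head

print(f"MLA / MQA = {mla / mqa}")
print(f"MHA / MLA = {mha / mla:.1f}")
```

Under these assumptions MLA stores 2.25 times as much as MQA per layer, yet still far less than MHA.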
1.5 Code
The code below shows the definition of MLA in the DeepSeek V3 source code.
class MLA(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        # Hidden dimension of the model
        self.dim = args.dim
        # Total number of attention heads
        self.n_heads = args.n_heads
        # Number of attention heads handled locally by each parallel process
        self.n_local_heads = args.n_heads // world_size
        # Dimension d'_c of the compressed query latent vector
        self.q_lora_rank = args.q_lora_rank  # low-rank compression dimension for q
        # Dimension d_c of the compressed key-value latent vector
        self.kv_lora_rank = args.kv_lora_rank  # low-rank compression dimension for kv
        # Per-head dimension of the query/key part that does not use RoPE, $d_h$
        self.qk_nope_head_dim = args.qk_nope_head_dim
        # $d_h^R$: per-head dimension of the query/key part that uses RoPE
        self.qk_rope_head_dim = args.qk_rope_head_dim
        # $d_h + d_h^R$: the head size is the non-RoPE part plus the RoPE part
        self.qk_head_dim = args.qk_nope_head_dim + args.qk_rope_head_dim
        # Per-head dimension of value
        self.v_head_dim = args.v_head_dim
        if self.q_lora_rank == 0:
            # No low-rank decomposition: fall back to the conventional MHA query projection
            self.wq = ColumnParallelLinear(self.dim, self.n_heads * self.qk_head_dim)
        else:
            # This is $W^{DQ}$, which produces $c_t^Q$
            # Down-projection matrix: yields the compressed q vector
            self.wq_a = Linear(self.dim, self.q_lora_rank)
            self.q_norm = RMSNorm(self.q_lora_rank)
            # $W^{UQ}$
            # Up-projection matrix: recovers the q vector
            self.wq_b = ColumnParallelLinear(self.q_lora_rank, self.n_heads * self.qk_head_dim)
        # $ [W^{DKV}; W^{KR}] $
        # Down-projection matrix: yields the compressed kv vector
        self.wkv_a = Linear(self.dim, self.kv_lora_rank + self.qk_rope_head_dim)
        self.kv_norm = RMSNorm(self.kv_lora_rank)
        # Up-projection matrix: recovers the kv vectors
        # $ [W^{UK}; W^{UV}] $
        self.wkv_b = ColumnParallelLinear(self.kv_lora_rank, self.n_heads * (self.qk_nope_head_dim + self.v_head_dim))
        # Output weight matrix
        self.wo = RowParallelLinear(self.n_heads * self.v_head_dim, self.dim)
        # Compute 1/sqrt(d_k)
        self.softmax_scale = self.qk_head_dim ** -0.5
        if args.max_seq_len > args.original_seq_len:
            mscale = 0.1 * args.mscale * math.log(args.rope_factor) + 1.0
            self.softmax_scale = self.softmax_scale * mscale * mscale
        # In naive mode the KV cache stores uncompressed data, with keys of size
        # d_h + d_h^R per head, which increases GPU memory use instead of saving it
        if attn_impl == "naive":
            self.register_buffer("k_cache", torch.zeros(args.max_batch_size, args.max_seq_len, self.n_local_heads, self.qk_head_dim), persistent=False)
            self.register_buffer("v_cache", torch.zeros(args.max_batch_size, args.max_seq_len, self.n_local_heads, self.v_head_dim), persistent=False)
        else:
            # In non-naive mode the compressed c is stored, of size d_c
            self.register_buffer("kv_cache", torch.zeros(args.max_batch_size, args.max_seq_len, self.kv_lora_rank), persistent=False)
            self.register_buffer("pe_cache", torch.zeros(args.max_batch_size, args.max_seq_len, self.qk_rope_head_dim), persistent=False)
Clearly, the MLA operator is an attention mechanism tailored to the characteristics of modern GPU hardware. By rebalancing storage and computation, it can fully leverage the advantages of modern GPUs. We will now analyze several core implementation points of MLA in detail.
0x02 Key Points
The core elements of MLA are as follows:
- Low-rank joint compression of keys and values reduces the resource consumption of the KV cache. During attention computation, the compressed vectors go through an up-projection, which enhances the model’s expressive power.
- Weight absorption reduces the computational cost of the up-projection.
- The conflict between RoPE and weight absorption is resolved by using the decoupled RoPE strategy.
2.1 Low-rank KV joint compression
2.1.1 Low-rank decomposition
Low-rank matrix factorization is a matrix factorization method that is particularly effective at discovering low-dimensional structure in data. Its core idea is to decompose a large matrix into the product of two or more smaller, simpler matrices, which typically have lower rank.
Using low-rank decomposition in neural network layers generally saves memory at the price of extra computation. Variations of this approach are popular in scenarios like LoRA fine-tuning, where the constraint is total memory cost rather than computational overhead or inference speed. The advantage is that the factor matrices use fewer parameters and add depth to the network. Their product approximates, or in some cases equals, a larger matrix, so in theory we can multiply the factors together to recover an approximation of the original matrix.
The downside is that we now have to perform the operation twice every time we use these matrices, that is, for each compression and decompression layer, we double the total number of matrix multiplications in exchange for making them smaller. And because we restrict them to matrices of rank r or lower, we obviously lose some of the representational power of the original matrices.
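The tradeoff can be put into numbers. The sketch below counts parameters and multiply-add FLOPs for one input vector, comparing a dense m × n matrix with a rank-r factorization into an m × r and an r × n matrix. The sizes and function names are hypothetical.

```python
def dense_cost(m, n):
    """(parameters, FLOPs per input vector) for a single m x n matrix."""
    return m * n, 2 * m * n


def low_rank_cost(m, n, r):
    """Same quantities for the product of an m x r and an r x n matrix."""
    return r * (m + n), 2 * r * (m + n)


m, n, r = 4096, 4096, 64  # assumed sizes, for illustration only

dense_params, dense_flops = dense_cost(m, n)
lr_params, lr_flops = low_rank_cost(m, n, r)

print(f"dense:    {dense_params} params")
print(f"low rank: {lr_params} params, {dense_params // lr_params}x fewer")
```

Two multiplications are now needed instead of one, but each is small; the factorization only pays off while r stays below m·n / (m + n).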
2.1.2 Approach
Traditional attention mechanisms map the input X directly to the head dimensions of Q, K, and V. MQA and GQA indirectly compress the head dimension of the KV cache through a sharing mechanism. The core idea of MLA is to represent KV in a LoRA-like manner. Specifically, a compressed space is built during prefilling by compressing the HiddenSize dimension of the input matrix. That is, the input X is first mapped to a latent vector c, which is stored. Simply put, an n × n matrix can be approximated by the product of an n × d matrix and a d × n matrix, where d ≪ n. This reduces storage requirements. Before attention is calculated in the decoding stage, c is restored to the original QKV dimension through an up-projection matrix. This reduces the caching of attention keys and values during inference, thereby improving inference efficiency.
There is another issue here: under this low-rank scheme, W^Q, W^K, and W^V are all replaced by low-rank products. Replacing full-rank matrices with low-rank ones could hurt model quality. The fact that DeepSeek's substitution works well suggests that W^Q, W^K, and W^V may be redundant and have significant low-rank structure.

In implementation, the weight matrices of Q, K, and V are typically fused to improve the computational and memory efficiency of the GPU. Unlike performing independent projections, using a combined weight matrix optimizes the computation process.
2.1.3 Projecting downwards
The diagram above illustrates the down-projection, in which h_t is the input vector. W^{DKV} and W^{DQ} are the compression matrices used for dimensionality reduction. c_t^{KV} and c_t^Q are the compressed KV and Q latent vectors, respectively. The dimension of the latent vectors is much smaller than the product of the per-head dimension and the number of attention heads. c_t^{KV} carries no head index i: it is shared by all heads, and it is the quantity that gets cached. In other words, we no longer cache the keys and values directly, each as large as h_t; we cache c_t^{KV} and recover k_t and v_t from it on the fly.

- For KV, construct a shared down-projection matrix W^{DKV}, which reduces the dimensionality of the model input. W^{DKV} projects the input h_t, the hidden state, onto the latent vector c_t^{KV}, which is the joint latent vector of the key and value. It compresses a hidden state of shape [S,H] to shape [S,d_c], where d_c is much smaller than the original dimension of the multi-head key and value. MLA does not retain the full hidden dimension but reduces it.
- The compressed key-value vectors are stored in GPU memory as the KV cache. During inference, only c_t^{KV} needs to be cached for each layer, because all attention heads of that layer share it. Because the dimension of c_t^{KV} is much smaller than that of K and V, the number of KV cache elements produced for each token changes from 2 n_h d_h l to d_c l in MLA. This greatly reduces the memory footprint of the KV cache.
- For Q, use a down-projection matrix W^{DQ}, which likewise reduces the dimensionality of the model input. This is unrelated to reducing the KV cache. Its main purpose is to reduce the number of parameters and the GPU memory occupied by the corresponding activations during training.
2.1.4 Upward projection
When attention is computed during the decode phase, the KV cache is loaded and W^{UK} and W^{UV} project c_t^{KV} back up to a larger size. This size can match the original input h_t, or it can be adjusted to the attention head configuration. DeepSeek expands the keys and values back to d = d_h n_h. As the diagram shows, the new k_t^C and v_t^C are each divided into n_h equal parts, so every attention head has its own K and V vector, the same number of KV vectors as in MHA.
See the image below for details. W^{UK}, W^{UV}, and W^{UQ} are all projection matrices used for dimensionality increase. Note that RoPE is omitted here. It will be expanded and updated later in conjunction with RoPE.

Combining the down- and up-projections, we can see that W^Q, W^K, and W^V are each split into two factors, a LoRA-like form. In this form, MLA is MQA extended with LoRA-style factors, and the cost of each projection drops from d × d to 2 × d × d_c. Compressing the information and then restoring the original dimension, instead of using a single matrix, helps the network learn more effective information. It achieves better results at the same rank, which is the fundamental reason why MLA can compress the KV cache further than GQA.
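The down-then-up path can be traced on a toy example. The sketch below uses tiny, made-up weights (d = 4, d_c = 2, two heads of dimension 2) and omits RoPE and the normalization layer; it only shows which vector is cached and how per-head keys and values are recovered from it.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by column vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


d, d_c, n_h, d_h = 4, 2, 2, 2  # toy sizes, assumed for illustration

# Hypothetical weights: W_DKV is d_c x d, W_UK and W_UV are (n_h * d_h) x d_c.
W_DKV = [[0.5, -1.0, 0.0, 2.0],
         [1.0, 0.0, 0.5, -0.5]]
W_UK = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
W_UV = [[0.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [0.5, -2.0]]

h_t = [1.0, 2.0, 3.0, 4.0]

c_kv = matvec(W_DKV, h_t)     # the only vector that is cached (d_c elements)
k_flat = matvec(W_UK, c_kv)   # recomputed keys for all heads
v_flat = matvec(W_UV, c_kv)   # recomputed values for all heads

# Split the flat vectors into one slice per head.
k_heads = [k_flat[i * d_h:(i + 1) * d_h] for i in range(n_h)]
v_heads = [v_flat[i * d_h:(i + 1) * d_h] for i in range(n_h)]

print("cached:", c_kv)
print("keys per head:", k_heads)
print("values per head:", v_heads)
```

Each head ends up with a different key and value even though all of them are derived from the same cached latent vector.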
The diagram below shows how to split the data: the top part is MLA, and the bottom part is MQA for comparison.

In fact, the paper “TransMLA: Multi-Head Latent Attention Is All You Need” analyzes the expressive power of MLA. The paper points out that traditional GQA models share the same key-value pairs within the same group when calculating attention, which limits their expressive power. MLA, however, overcomes this limitation through low-rank decomposition and a unique projection matrix design.
See the image below for details. In MLA, take W_K^b as an example: if its vectors are orthogonal, then multiplying X W_K^a by each of them produces a different output for each channel. In GQA, by contrast, all heads within the same group produce the same output. This structural difference gives MLA stronger expressive power at the same KV cache size.

2.1.5 Complete Process
The complete comparison process is shown in the diagram below. The top of the diagram shows the overall approach. The bottom shows a comparison between MLA and GQA, which is divided into two parts: the upper part uses formulas to see how MLA enhances expressiveness, and the lower part shows the complete process.

2.2 Weight Absorption
2.2.1 Current Status
We have already stored the compressed latent vectors through down-projection, which reduces the memory footprint of the KV cache. We have also enhanced expressiveness through the up-projection matrices. However, MLA emphasizes that the activations should be reduced as well, and so far we have not seen how. Although the compressed KV cache occupies less memory, every inference step still needs the up-projection: W^{UK} and W^{UV} must recompute k_{t,i} and v_{t,i} from the cached c_t^{KV}. In number and dimension, the reconstructed keys and values are on par with MHA and larger than those of GQA and MQA. The up-projected keys and values are large enough to cause out-of-memory (OOM) errors: k_{t,i} and v_{t,i} still have to be materialized, which costs a significant amount of memory, and recomputing them introduces new computational overhead that can become a bottleneck.
2.2.2 Weight Absorption
Since the computational cost is too high each time, DeepSeek wondered if it could reduce this computational cost, which also reduces the memory usage of new key-value pairs, while preserving compressed latent vectors. They then introduced the weight absorption technique. That is, the authors optimized these formulas using the properties of matrix associativity, avoiding recalculating the key and value for each query. Below is the original text from their paper:

Note: Matrix absorption computation refers to the process of using linear algebra techniques such as the associative law of matrix multiplication or low-rank decomposition to change the order of matrix multiplication and recombine certain matrix factors, so that matrix products that originally needed to be calculated independently are merged together, avoiding the generation of large matrices, thereby reducing computational complexity and memory overhead.
For example, given three matrices A ∈ R^{m,k}, B ∈ R^{k,p}, and C ∈ R^{p,n}, as can be seen from matrix multiplication, (A × B) × C = A × (B × C). However, their computational complexity is different. (A × B) × C has computational complexity 2 × m × k × p + 2 × m × p × n = 2 × m × p × (k + n), while A × (B × C) has computational complexity 2 × m × k × n + 2 × k × p × n = 2 × n × k × (m + p). When n is significantly smaller than both m and p, the second computation order performs much better than the first.
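The two FLOP counts can be compared directly. The following sketch simply evaluates the formulas above; the sizes are assumed for illustration, with n chosen much smaller than m and p.

```python
def flops_left_first(m, k, p, n):
    """(A @ B) @ C: an m x k by k x p product, then m x p by p x n."""
    return 2 * m * k * p + 2 * m * p * n


def flops_right_first(m, k, p, n):
    """A @ (B @ C): a k x p by p x n product, then m x k by k x n."""
    return 2 * k * p * n + 2 * m * k * n


m, k, p, n = 1024, 512, 1024, 8  # assumed sizes

left = flops_left_first(m, k, p, n)
right = flops_right_first(m, k, p, n)

print(f"(A x B) x C: {left} FLOPs")
print(f"A x (B x C): {right} FLOPs")
print(f"ratio: {left / right}")
```

Both orders give the same result matrix; only the amount of work and the size of the intermediate matrix differ.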
2.2.3 Derivation
KQ merger
Let’s examine the specific form of dot-product attention to see how a simple yet ingenious identity transformation circumvents this problem. The training phase proceeds as usual, since there is little room for optimization there. During inference, the following formula (written without positional encoding) shows that if we merge {W^{UQ}}^⊤ W^{UK} into a single position-independent matrix W and use it as the projection matrix of Q, then c_t^{KV} can stand in for the original k_t. This avoids repeatedly calculating the intermediate results q and k.
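For reference, the identity can be written out as follows. This is a sketch in the column-vector convention used earlier, without positional encoding; the per-head subscript i on the weight matrices is notation assumed here.

```latex
q_{t,i}^{\top} k_{j,i}
  = \left(W_i^{UQ} c_t^{Q}\right)^{\top} \left(W_i^{UK} c_j^{KV}\right)
  = {c_t^{Q}}^{\top}
    \underbrace{\left({W_i^{UQ}}^{\top} W_i^{UK}\right)}_{W_i}
    \, c_j^{KV}
```

Only the cached latent c_j^{KV} appears on the key side, so k_{j,i} never has to be materialized.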

Here, the transpose ⊤ swaps the last two dimensions of the tensor shape. The shapes of the tensors are listed below. Note that num_heads is kept as a separate dimension because the final attention weights are independent between heads.
- C^Q: [batch_size, 1, q_len, q_lora_rank]
- W^{UQ}: [num_heads, q_lora_rank, qk_nope_head_dim]
- W^{UK}: [num_heads, kv_lora_rank, qk_nope_head_dim]
- C^K: [batch_size, 1, kv_len, kv_lora_rank]
Every cached c_t^{KV} can then take part in the computation directly, without explicitly computing K. Furthermore, because the learned weights are fixed at inference time, the matrix W = {W^{UQ}}^⊤ W^{UK} can be computed once in advance.
The code is described as follows:
"""Source: https://mathmach.com/8b428574/"""
import tensorflow as tf

# Absorb W_UK into W_UQ
W_UQ = tf.reshape(W_UQ, [q_lora_dim, num_head, head_dim])
W_UQ = tf.transpose(W_UQ, perm=[1, 0, 2])  # [num_head, q_lora_dim, head_dim]
W_UK = tf.reshape(W_UK, [kv_lora_dim, num_head, head_dim])
W_UK = tf.transpose(W_UK, perm=[1, 2, 0])  # [num_head, head_dim, kv_lora_dim]
# Batched matrix product over the head dimension
W_UQUK = tf.matmul(W_UQ, W_UK)  # [num_head, q_lora_dim, kv_lora_dim]
# Compute the q·k inner products
c_Q = tf.reshape(c_Q, [batch_size, q_seq_len, q_lora_dim])
c_KV = tf.reshape(c_KV, [batch_size, kv_seq_len, kv_lora_dim])
c_KV = tf.transpose(c_KV, perm=[0, 2, 1])  # [batch_size, kv_lora_dim, kv_seq_len]
c_Q_product_W_UQUK = tf.einsum('bij,hjk->bhik', c_Q, W_UQUK)  # [batch_size, num_head, q_seq_len, kv_lora_dim]
q_product_k = tf.einsum('bhik,bkj->bhij', c_Q_product_W_UQUK, c_KV)  # [batch_size, num_head, q_seq_len, kv_seq_len]
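To see that the merged matrix really gives the same attention scores, here is a minimal pure-Python sketch for a single head, ignoring RoPE. The toy sizes, the random values, and the row-vector convention (q = c_q W_UQ) are assumptions made for illustration and are not taken from any DeepSeek code.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def rand(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
q_lora, kv_lora, head_dim = 6, 4, 3   # toy sizes, not the real configuration

c_q = rand(1, q_lora)                 # latent query of the current token
c_kv = rand(5, kv_lora)               # cached latent vectors of five tokens
w_uq = rand(q_lora, head_dim)         # per-head up-projection of the query
w_uk = rand(kv_lora, head_dim)        # per-head up-projection of the key

# Naive path: decompress q and k first, then take dot products.
naive = matmul(matmul(c_q, w_uq), transpose(matmul(c_kv, w_uk)))

# Absorbed path: precompute W once, then score directly against the cached latents.
w_absorbed = matmul(w_uq, transpose(w_uk))            # [q_lora, kv_lora]
absorbed = matmul(matmul(c_q, w_absorbed), transpose(c_kv))

max_diff = max(abs(x - y) for x, y in zip(naive[0], absorbed[0]))
print(max_diff)  # differs only by floating-point rounding
```

In the absorbed path the keys are never materialized; the only per-token data touched is c_kv.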
VO merge
In addition, the traditional approach first computes the value vector v_t^C, then computes attention, and finally projects the result through the output layer. Instead, W^{UV} can be absorbed directly into W^O, which simplifies the final output computation. The absorption formula is as follows.
(p · (c_kv · W^{UV})) · W^O
= (p · c_kv) · (W^{UV} · W^O)
= (softmax(q_nope · c_kv + q_pe · k_pe) · c_kv) · W^{UV} · W^O
This can be described in code as follows:
q_pe = W_QR(c_q)
q_nope = W_UQ_UK(c_q)
output = W_UV_O(MQA(q_pe, q_nope, c_kv, k_pe))
Note that mathematical equivalence must be carefully preserved, through transposes and similar operations. See the diagram below. The weights of each attention head can be absorbed into a single matrix, so in actual code one higher-dimensional tensor can hold the absorbed matrices of all heads, as illustrated in the code below.

Code description:
"""Source: https://mathmach.com/8b428574/"""
import tensorflow as tf

# Absorb W_UV into W_O
W_O = tf.reshape(W_O, [num_head, head_dim, hidden_dim])
W_UV = tf.reshape(W_UV, [kv_lora_dim, num_head, head_dim])
W_UV = tf.transpose(W_UV, perm=[1, 0, 2])  # [num_head, kv_lora_dim, head_dim]
W_OUV = tf.matmul(W_UV, W_O)  # [num_head, kv_lora_dim, hidden_dim]
# Compute the output u
# RoPE(...) and the `*` inside it are shorthand for "project, then apply rotary embedding"
q_R = RoPE(c_Q * W_QR)  # [batch_size, q_seq_len, num_head, rope_dim]
k_R = RoPE(h * W_KR)  # [batch_size, kv_seq_len, rope_dim]
q_product_k_rope = tf.einsum('bijk,bhk->bijh', q_R, k_R)  # [batch_size, q_seq_len, num_head, kv_seq_len]
q_product_k_rope = tf.transpose(q_product_k_rope, perm=[0, 2, 1, 3])  # [batch_size, num_head, q_seq_len, kv_seq_len]
scale = tf.sqrt(tf.cast(head_dim + rope_dim, q_product_k.dtype))
attention_weight = tf.nn.softmax((q_product_k + q_product_k_rope) / scale)  # [batch_size, num_head, q_seq_len, kv_seq_len]
# c_KV is expected here in its [batch_size, kv_seq_len, kv_lora_dim] layout
c_KV = tf.transpose(c_KV, perm=[0, 2, 1])  # [batch_size, kv_lora_dim, kv_seq_len]
attention_weight_product_c_KV = tf.einsum('bijk,bhk->bijh', attention_weight, c_KV)  # [batch_size, num_head, q_seq_len, kv_lora_dim]
u = tf.einsum('bijh,ihd->bjd', attention_weight_product_c_KV, W_OUV)  # [batch_size, q_seq_len, hidden_dim]
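The output-side identity (p · (c_kv · W^{UV})) · W^O = (p · c_kv) · (W^{UV} · W^O) can be checked the same way. The sketch below handles one head with toy sizes and random values, all of which are illustrative assumptions.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def rand(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(1)
kv_len, kv_lora, head_dim, hidden = 5, 4, 3, 7   # toy sizes

raw = [random.random() for _ in range(kv_len)]
p = [[w / sum(raw) for w in raw]]    # attention weights of one head, summing to 1
c_kv = rand(kv_len, kv_lora)         # cached latent vectors
w_uv = rand(kv_lora, head_dim)       # per-head value up-projection
w_o = rand(head_dim, hidden)         # this head's slice of the output projection

# Naive path: decompress the values, attend, then project.
naive_out = matmul(matmul(p, matmul(c_kv, w_uv)), w_o)

# Absorbed path: attend over the latents, then apply the precomputed W_UV W_O.
w_uv_o = matmul(w_uv, w_o)                        # [kv_lora, hidden], computed once
absorbed_out = matmul(matmul(p, c_kv), w_uv_o)

vo_max_diff = max(abs(x - y) for x, y in zip(naive_out[0], absorbed_out[0]))
print(vo_max_diff)  # differs only by floating-point rounding
```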
Combination
Combining the current mergers, we get the following:
O = A W^O
= ϕ(QK^T) V W^O
= ϕ[H W^Q (C^{KV} W^{UK})^T] C^{KV} W^{UV} W^O
= ϕ[H (W^Q {W^{UK}}^T) {C^{KV}}^T] C^{KV} (W^{UV} W^O)
Thus, during inference, W^{UK} can be merged into the query projection (W^{UQ}, or W^Q in the simplified formula above), and W^{UV} can be merged into W^O. After merging, all KV-side computation happens in the low-dimensional latent space, so C^{KV} never has to be decompressed back into the higher-dimensional space. Furthermore, all of these matrices are model weights, which stay fixed during inference and can be treated as constants. When deploying an inference service, the two products can be computed once when the model is loaded, which saves two matrix multiplications on every subsequent forward pass and adds no runtime overhead.
2.2.4 Discussion
Training
The paper repeatedly mentions the use of weight absorption during the inference phase, which is easy to understand because the weight matrices are fixed at that point.
So why not merge W^{UK} and W^{UV} into their neighboring matrices directly during the training phase? The reasons are roughly as follows:
- From the perspective of gradient updates, keeping the matrices separate (no weight absorption) makes optimization simpler.
- From a projection perspective, having K and V share W^{DKV} imposes a constraint on the latent space. This form of weight tying helps the model converge, improves its generalization ability, and makes training more stable.
Therefore, MLA is similar to MHA during the training phase. Apart from the additional low-rank projection step and applying RoPE to only some dimensions, the only difference is that the head dimension of Q and K changes from d_k to d_k + d_R.
MHA
Next, if weight absorption is so good, why doesn’t MHA use it?
Let’s first look at the characteristics of the inference stage.
The calculation formula in MHA is as follows. In the standard MHA implementation, the query, key, and value embeddings are calculated separately. Then, the self-attention weight matrix is calculated from the query embedding and key embedding. Finally, this weight matrix is multiplied by the value embedding to obtain the final result.
Z = softmax( q_t^T k_i / √d_k ) v_i W^O
= softmax( h_t^T (W^Q)^T W^K h_i / √d_k ) h_i W^V W^O
It looks like (W^Q)^T W^K and W^V W^O both have the potential to be absorbed. Secondly, during the decoding computation, the input Q often contains only one token, which naturally provides an opportunity to simplify the calculation.

Currently, it seems that MHA offers many advantages for matrix absorption. However, the reality is not as simple as that. We use q_t^T k_i as an example to analyze why it is not suitable for absorption and why MLA can improve efficiency.
For a single head, n_h = 1, corresponding to matrix multiplication [1,d] × [d,d_h] × [d_h,d] × [d,1]. There are several possibilities:
- Standard KV Cache.
- Bundle (W^Q)^T W^K together and apply the combined weights to x.
- Bundle (W^Q)^T W^K together, but cache only x rather than k and v.
Based on the above analysis, the standard key-value cache is already optimal in terms of space overhead and computation. Although we can reduce the key-value cache by half by caching only x, the combined matrix will increase the computational load when calculated at runtime, which is not a good solution in the long run.
Let’s take a look at MLA. W^K performs a low-rank transformation. Compared with MHA, the storage and computing perspectives are both improved, because the key-value cache becomes the compressed latent vector plus the RoPE component, and W^{UK} can be merged into W^Q, while W^{UV} can be merged into W^O.
No merger
In the specific implementation process, choices need to be made based on the actual situation. For example, Li Weihua has a wonderful discussion on this topic in https://developnotes.readthedocs.io/zh-cn/latest/deepseek.html#id1.
Consider the following operation: Y = X A B, C = A B. Here X ∈ R^{m×d} are the input hidden states. A ∈ R^{d×d_c} and B ∈ R^{d_c×n} are weight matrices. C ∈ R^{d×n} is the matrix after absorption.
Direct calculation of Y = X A B costs 2 m d d_c + 2 m n d_c = 2 m d_c (d + n) FLOPs. After merging, C = A B and Y = X C costs 2 m d n FLOPs. When d_c is small enough that d n > d_c (d + n), the merged form needs more computation than the two-step form, so weight absorption is not always worthwhile.
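Alternatively, let’s look at the actual MLA code. The known configuration is as follows:
"hidden_size": 5120, # size of the hidden layer
"kv_lora_rank": 512, # KV compression dimension
"q_lora_rank": 1536, # query compression dimension
"qk_rope_head_dim": 64, # per-head dimension of the decoupled (RoPE) query and key
"qk_nope_head_dim": 128 # per-head dimension of the query and key part without RoPE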
As you can see, merging W^{UQ} and W^{UK} significantly increases the computation: per head, the merged matrix has shape q_lora_rank × kv_lora_rank = 1536 × 512, compared with 1536 × 128 for W^{UQ} alone. During prefill, you should not actually perform the “absorption”. Absorb is therefore best understood as applying the associative law of matrix multiplication, that is, choosing which matrices to multiply first, while caching only the compressed latent vectors c_t^{KV}.
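Plugging the configuration into the two FLOP formulas makes the difference concrete. This is a rough per-token, per-head sketch in which X is taken to be the query latent c_q, A the per-head W^{UQ}, and B the transposed per-head W^{UK}; that mapping is an assumption made for illustration.

```python
def two_step_flops(m, d, d_c, n):
    """FLOPs of Y = (X @ A) @ B with X: m x d, A: d x d_c, B: d_c x n."""
    return 2 * m * d * d_c + 2 * m * d_c * n

def merged_flops(m, d, n):
    """FLOPs of Y = X @ C with the pre-multiplied C = A @ B of shape d x n."""
    return 2 * m * d * n

q_lora_rank, qk_nope_head_dim, kv_lora_rank = 1536, 128, 512

separate = two_step_flops(1, q_lora_rank, qk_nope_head_dim, kv_lora_rank)
merged = merged_flops(1, q_lora_rank, kv_lora_rank)
print(separate, merged, merged / separate)
```

With these numbers the pre-multiplied matrix costs three times as much per token as applying the two factors in sequence, which is why materializing the merged matrix does not pay off when many query tokens are processed at once.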
2.3 Decoupling RoPE
So far our analysis has skipped a crucial component: position encoding. RoPE is incompatible with the low-rank KV compression matrices and conflicts with weight absorption, so the two cannot simply be combined. To address this issue, while keeping the model sensitive to positional context in the sequence, MLA employs decoupled rotary position embedding (decoupled RoPE): it introduces an additional query vector q_t^R and a shared key vector k_t^R to carry the RoPE information. As can be seen from the architecture diagram, DeepSeek’s q and k each have two parts, namely [q_t^R, q_t^C] and [k_t^R, k_t^C].
- One part is the compressed part: [q_t^C] and [k_t^C].
- The other part carries the RoPE position encoding, that is, a separate RoPE path: [q_t^R] and [k_t^R].
Finally, the two parts are concatenated to form the Q and K matrices. This decouples RoPE from the low-rank compression matrices, resolving the conflict between position information and inference efficiency.
2.3.1 RoPE Background
The code below is a condensed version of Llama 3’s attention calculation. With RoPE, both the query and the key are position-dependent. Before attention is calculated, the code first applies W^Q, W^K, and W^V to obtain Q, K, and V. Then RoPE, a multiplication by a rotation matrix, is applied to Q and K to inject relative position information into them.
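import math
from typing import Optional
import torch
import torch.nn.functional as F
from torch import nn

# apply_rotary_emb and repeat_kv are helper functions defined in the same Llama module
class Attention(nn.Module):
    def forward(self, x: torch.Tensor, start_pos: int, freqs_cis: torch.Tensor, mask: Optional[torch.Tensor]):
        bsz, seqlen, _ = x.shape
        # Project to Q, K, and V
        xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)
        xq = xq.view(bsz, seqlen, self.n_local_heads, self.head_dim)
        xk = xk.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)
        xv = xv.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)
        # Apply RoPE to Q and K
        xq, xk = apply_rotary_emb(xq, xk, freqs_cis=freqs_cis)
        # Write the new keys/values into the KV cache, then read back the full prefix
        self.cache_k = self.cache_k.to(xq)
        self.cache_v = self.cache_v.to(xq)
        self.cache_k[:bsz, start_pos : start_pos + seqlen] = xk
        self.cache_v[:bsz, start_pos : start_pos + seqlen] = xv
        keys = self.cache_k[:bsz, : start_pos + seqlen]
        values = self.cache_v[:bsz, : start_pos + seqlen]
        # Repeat K/V heads when n_kv_heads < n_heads (GQA)
        keys = repeat_kv(keys, self.n_rep)
        values = repeat_kv(values, self.n_rep)
        # Move the head dimension forward: (bs, n_local_heads, seqlen, head_dim)
        xq, keys, values = xq.transpose(1, 2), keys.transpose(1, 2), values.transpose(1, 2)
        # Compute attention
        scores = torch.matmul(xq, keys.transpose(2, 3)) / math.sqrt(self.head_dim)
        if mask is not None:
            scores = scores + mask
        scores = F.softmax(scores.float(), dim=-1).type_as(xq)
        output = torch.matmul(scores, values)  # (bs, n_local_heads, seqlen, head_dim)
        output = output.transpose(1, 2).contiguous().view(bsz, seqlen, -1)
        return self.wo(output)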
2.3.2 Problem
It cannot be directly applied to low-rank compression.
Let’s first see whether RoPE can be applied to the low-rank compressed vectors, that is, whether the rotation can be applied directly to the latent vectors c_t^Q and c_t^{KV}.
The latent vectors are already compressed, and the compression may have discarded some information. RoPE, however, is a position-dependent rotation defined on the full query and key. Applying R_m and R_n directly to c_t^Q and c_t^{KV} is therefore not equivalent to applying positional encoding to the complete Q and K, and it cannot directly and effectively reflect the relative positional relationship of the original Q and K.
Incompatible with weight absorption
Let’s take a closer look at whether RoPE can be absorbed into the weights when it is applied to the full Q and K.
If we want Q and K to carry position information, we multiply them by the corresponding position encoding matrices:
Q̂ = R_m Q
K̂ = R_n K
If we then calculate Q^T K, it becomes:
S = Q^T R_m^T R_n K
DeepSeek-V2 compresses both Q and K, so the whole process becomes:
S = (W^{UQ} c_t^Q)^T R_m^T R_n W^{UK} c_t^{KV}
= {c_t^Q}^⊤ {W^{UQ}}^⊤ R_{n-m} W^{UK} c_t^{KV}
The formula now contains an additional matrix R_{n-m} = R_m^T R_n that depends on the position difference between the two tokens. This matrix changes with the relative position, so it is not a fixed matrix and cannot be calculated in advance. Furthermore, matrix multiplication is not commutative, so R_{n-m} cannot be moved to another part of the formula. Therefore, during inference, W^{UQ} and W^{UK} cannot be multiplied together directly, which means W^{UK} cannot be absorbed into W^Q.
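The position dependence is easy to see numerically with a single 2×2 rotation block. This is a minimal sketch in the column-vector convention used in the formulas above; the frequency theta = 0.5 and the positions are arbitrary illustrative values.

```python
import math

def rot(pos, theta=0.5):
    """2x2 RoPE rotation block for position `pos` (theta is an illustrative frequency)."""
    a = pos * theta
    return [[math.cos(a), -math.sin(a)], [math.sin(a), math.cos(a)]]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def between(m, n):
    """R_m^T R_n: the factor that sits between W^{UQ} and W^{UK} for query position m and key position n."""
    return matmul(transpose(rot(m)), rot(n))

def close(a, b, tol=1e-9):
    return all(abs(x - y) < tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))

same_offset = close(between(3, 7), between(10, 14))    # both pairs are 4 positions apart
is_offset_rotation = close(between(3, 7), rot(7 - 3))  # the product is a rotation by the offset
other_offset = close(between(3, 7), between(3, 8))     # offset 4 versus offset 5
print(same_offset, is_offset_rotation, other_offset)   # True True False
```

Because the middle factor changes with every query/key offset, there is no single matrix that could be pre-multiplied into the weights.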
The diagram below provides a more precise explanation. The top represents NoPE, and the bottom represents RoPE.

2.3.3 Solution
To address the incompatibility issue between RoPE and low-rank key-value joint compression in MLA, the DeepSeek team proposed a strategy to decouple RoPE. For a head, a high-dimensional vector represents its text information, and a low-dimensional vector represents its rotation position encoding information. The high-dimensional vector is called nope, and the low-dimensional vector is called rope.
- Content part: (q_t^C, k_t^C). This part carries most of the content information and is compressed.
- Position part: (q_t^R, k_t^R). It is further divided into two pieces.
  - A shared key k_t^R ∈ R^{d_h^R} carries the RoPE information.
  - Additional multi-head queries q_{t,i}^R ∈ R^{d_h^R} carry the RoPE position information.
Finally, the four variables are concatenated pairwise, [q_t^C, q_t^R] and [k_t^C, k_t^R], for the attention calculation. The compressed key part then needs no positional encoding during inference, which avoids the coupling between RoPE and the low-rank compression matrices and resolves the conflict between position information and inference efficiency.
The final dot product is shown in Figure 4.1. The first term, Figure 4.2, is calculated without RoPE, and only c_t^{KV} needs to be cached for it during inference. The second term, labeled 4.3, needs only a single cached k_t^R that is shared by all attention heads. That is, during the inference phase, the KV Cache entry generated by a single token contains two parts.
- The compressed latent vector of the key-value pair: c_t^{KV}, with dimension d_c.
- The shared key vector carrying RoPE information: k_t^R, with dimension d_h^R.

The concat process increases the dimensionality of the Q and K vectors. To handle this increased dimensionality, the model can choose to increase the number of attention heads or adjust the processing dimension of each head. The image below provides a clear comparison.

2.3.5 Combining with weight absorption
Let’s take a look at how to handle it after combining weight absorption. Here we need to add nope and rope as well. The formula evolves as follows.
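One identity this combination relies on is that the dot product of two concatenated vectors equals the sum of the dot products of their parts, so the nope term can use the absorbed weights while the rope term is computed separately. The sketch below checks this with toy dimensions and random values, both of which are assumptions for illustration.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rand_vec(n):
    return [random.uniform(-1, 1) for _ in range(n)]

random.seed(0)
nope_dim, rope_dim = 8, 4   # toy sizes; the configuration in this article uses 128 and 64

q_nope, q_rope = rand_vec(nope_dim), rand_vec(rope_dim)
k_nope, k_rope = rand_vec(nope_dim), rand_vec(rope_dim)

# Score computed on the concatenated vectors [q_nope, q_rope] and [k_nope, k_rope].
full_score = dot(q_nope + q_rope, k_nope + k_rope)

# The same score as the sum of a nope term and a rope term.
split_score = dot(q_nope, k_nope) + dot(q_rope, k_rope)

print(abs(full_score - split_score))  # zero up to floating-point rounding
```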
2.4 Resource Consumption
2.4.1 Number of parameters
MLA’s approach is derived from LoRA, which emphasizes reducing the number of parameters, and MLA does indeed achieve this reduction. With DeepSeek-V3’s parameter configuration, a pair of low-rank matrices has 2 × d_c × d = 2 × 512 × 7168 parameters, while a full parameter matrix in standard MHA has d × d = 7168 × 7168 parameters.
The specific parameters are as follows:
"vocab_size": 129280,
"dim": 7168,
"inter_dim": 18432,
"n_heads": 128,
"q_lora_rank": 1536,
"kv_lora_rank": 512,
"qk_nope_head_dim": 128,
"qk_rope_head_dim": 64,
"v_head_dim": 128,
The number of parameters for each matrix is as follows:
- W^{DKV}: dim * kv_lora_rank = 7168 * 512
- W^{UK}: kv_lora_rank * qk_nope_head_dim * n_heads = 512 * 128 * 128
- W^{UV}: kv_lora_rank * v_head_dim * n_heads = 512 * 128 * 128
- W^{KR}: dim * qk_rope_head_dim = 7168 * 64
- W^{DQ}: dim * q_lora_rank = 7168 * 1536
- W^{UQ}: q_lora_rank * qk_nope_head_dim * n_heads = 1536 * 128 * 128
- W^{QR}: q_lora_rank * qk_rope_head_dim * n_heads = 1536 * 64 * 128
- W^O: n_heads * v_head_dim * dim = 128 * 128 * 7168
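The products above can be evaluated directly from the configuration. This is a minimal sketch that counts only the projection matrices (norm weights and biases are ignored); the dictionary keys mirror the configuration names, while the function name is made up for this example.

```python
cfg = {
    "dim": 7168, "n_heads": 128, "q_lora_rank": 1536, "kv_lora_rank": 512,
    "qk_nope_head_dim": 128, "qk_rope_head_dim": 64, "v_head_dim": 128,
}

def mla_param_counts(c):
    """Parameter count of each MLA projection matrix for one attention layer."""
    return {
        "W_DKV": c["dim"] * c["kv_lora_rank"],
        "W_UK": c["kv_lora_rank"] * c["qk_nope_head_dim"] * c["n_heads"],
        "W_UV": c["kv_lora_rank"] * c["v_head_dim"] * c["n_heads"],
        "W_KR": c["dim"] * c["qk_rope_head_dim"],
        "W_DQ": c["dim"] * c["q_lora_rank"],
        "W_UQ": c["q_lora_rank"] * c["qk_nope_head_dim"] * c["n_heads"],
        "W_QR": c["q_lora_rank"] * c["qk_rope_head_dim"] * c["n_heads"],
        "W_O": c["n_heads"] * c["v_head_dim"] * c["dim"],
    }

counts = mla_param_counts(cfg)
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name}: {n:,}")
print(f"total: {total:,}")
```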
2.4.2 Memory Usage
However, what MLA emphasizes is reducing the KV cache, that is, the cached KV activations. Compared with classic MHA, GQA, and MQA, the vectors MLA actually caches are:
- c_t^{KV}, with dimension d_c.
- k_t^R, with dimension d_h / 2.
As shown in the figure below, we can see that MLA has a strong advantage in optimizing key-value cache and ensuring model performance.

Compared to MHA, MLA caches far fewer elements per layer. Compared to GQA, MLA’s KV cache is significantly smaller. Compared to MQA, MLA requires 2.25 times the storage (4.5 d_h versus 2 d_h per token per layer), but its quality is significantly better than MQA, and even better than MHA and GQA. It achieves both reduced inference cost and preserved model performance.
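The comparison can be reproduced by counting cached elements per token per layer. The sketch uses the head sizes quoted in this article; the GQA group count of 8 is a hypothetical value chosen only to have something to compare against.

```python
n_h, d_h = 128, 128        # number of heads and per-head dimension
d_c, d_rope = 512, 64      # MLA latent dimension and shared RoPE key dimension
n_groups = 8               # hypothetical GQA group count, for illustration only

cache_elems = {
    "MHA": 2 * n_h * d_h,        # one K and one V per head
    "GQA": 2 * n_groups * d_h,   # one K and one V per group
    "MQA": 2 * d_h,              # a single shared K and V
    "MLA": d_c + d_rope,         # latent c_t^{KV} plus shared k_t^R
}

for name, n in cache_elems.items():
    print(f"{name}: {n} elements per token per layer")

mla_vs_mqa = cache_elems["MLA"] / cache_elems["MQA"]
mha_vs_mla = cache_elems["MHA"] / cache_elems["MLA"]
print(mla_vs_mqa, mha_vs_mla)
```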
2.4.3 Computational complexity
Compared to MHA, the head dimensions of Q and K in MLA become d_c + d_h^R, and the head dimension of V becomes d_c. The following are some of the hyperparameters of DeepSeek V3:
- d_k (hidden dimension / model dimension): 7168
- n_h (number of attention heads): 128
- d_h (dimension of each attention head): 128
- d_c (KV compression dimension): 512, that is 4 d_h
- d_h^R (per-head RoPE dimension): 64, that is d_h / 2
Since the Q/K head size of each head in MLA increases significantly, the computational cost of MLA inference also increases. So why does it still improve inference efficiency? Actually, MLA improves efficiency because it leverages the characteristic of LLM inference where the bottleneck is memory access rather than computation.
2.4.4 Information Transfer
Some researchers, after rereading MLA, believe that what it actually does is “information transfer”: it moves the information unique to each key head into the corresponding query head, while storing the information shared across key heads in the KV cache.
- Objective of the improvement: save KV cache while compressing the per-head key-value information as little as possible.
- Improvement background: the reason for storing the K and V of every attention head for a token is that each k_head holds different information, which is used in the attention calculation with the corresponding q_head.
- Improvement idea: extract the information common to all k_heads of a token and compress it into the KV Cache. Transfer the information unique to each K head to the corresponding Q head.

Furthermore, GQA requires a strict match between the number of groups and the hardware scale, limiting the flexibility of model deployment. MLA, on the other hand, can dynamically adapt to different hardware configurations through latent spatial projection and decoupled weight merging.
2.5 Parallelism
In the decoding phase of large model inference, MLA cannot make effective use of tensor parallelism for its KV cache. Therefore, current open-source implementations primarily rely on data parallelism to process MLA, meaning that the KVCache for different requests is stored on different GPUs. The DeepSeek-V3 paper mentions using tensor parallelism and sequence parallelism.
- Tensor parallelism: MHA typically achieves tensor parallelism by splitting the head_num dimension. In MLA, however, the cached latent vector is shared by all heads, so splitting by head does not partition the cache.
- Data parallelism: the data is split by request, and the compressed latent vectors of different requests are stored on different GPUs.
- Sequence Parallelism: MLA uses sequence parallelism for assistance. That is, the KVCache is split according to the sequence dimension, and local attention is calculated using queries on each card, and then the results are reduced.
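The "reduce" step of sequence parallelism can be sketched as follows. The source only says that local results are reduced; merging them with log-sum-exp style rescaling is an assumption here (it is one common way to do it), and the function names, toy vectors, and two-shard split are all illustrative.

```python
import math

def local_attention(q, keys, values):
    """Attention of one query over one KV shard; returns (output, max score, sum of exp)."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) / denom for j in range(dim)]
    return out, m, denom

def merge(partials):
    """Combine per-shard results into the attention output over the whole sequence."""
    g = max(m for _, m, _ in partials)                        # global max score
    scales = [denom * math.exp(m - g) for _, m, denom in partials]
    total = sum(scales)
    dim = len(partials[0][0])
    return [sum(s * out[j] for s, (out, _, _) in zip(scales, partials)) / total
            for j in range(dim)]

q = [0.3, -0.2]
keys = [[1.0, 0.0], [0.5, 0.5], [-1.0, 2.0], [0.0, -1.5]]
values = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [-2.0, 0.5]]

full, _, _ = local_attention(q, keys, values)              # everything on one device
shards = [local_attention(q, keys[:2], values[:2]),        # "device 0" holds the first half
          local_attention(q, keys[2:], values[2:])]        # "device 1" holds the second half
merged = merge(shards)
seq_par_diff = max(abs(a - b) for a, b in zip(full, merged))
print(seq_par_diff)  # zero up to floating-point rounding
```

Each device only needs to send its partial output and two scalars, so the amount of data exchanged does not grow with the length of its KV shard.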
0x03 Calculation Process
Let’s break down the computational process of MLA during the inference phase.
3.1 Formula
First, we present the formulas corresponding to the transformation processes of Q, K, and V. The subsequent analysis will follow these formulas.

3.2 Original Process
We convert the above formula into a flowchart, the details of which are as follows:
- It is divided into three paths from top to bottom: Q, K, and V.
- Both Q and K are further subdivided into two paths: the upper path with green weights and activation values corresponds to the latent vector and low-rank part; the lower path with gray gradient weights and activation values corresponds to decoupled RoPE.
- The data flow of K’s bottom lane and V’s bottom lane is somewhat intertwined.
- “Cache” refers to data that is cached during the inference phase. It consists of two parts: the joint KV latent vector c_t^{KV}, and the key k_t^R to which RoPE is applied separately.
Here, we assume the number of heads n_h is 2, so the matrix sizes in the figure are not drawn to scale.

3.3 Absorption
3.3.1 Process
The second step involves applying the weight absorption process described in the paper, resulting in the following figure:
- The data to be cached during the inference phase remains unchanged.
- W^{UK} is absorbed into W^{UQ}.
  - The calculation logic for Q remains unchanged, but the shapes of the weights and activation values have been adjusted accordingly.
  - The top path of K directly eliminates one linear mapping calculation, turning it into a repeated copy, similar to K’s bottom lane.
- W^{UV} is absorbed into W^O.
  - The V-route linear mapping degenerates into the logic of repeated copying.
  - The final output mapping calculation logic remains unchanged, but the shapes of the weights and activation values are adjusted accordingly.

3.3.2 Absorption Results
We have organized the above diagram and obtained the following results of absorption.

3.3.3 MQA Form
The computational logic of the MLA inference phase is actually very similar to an MQA. Let’s compare them, excluding RoPE.

Considering only the first (non-RoPE) term of the attention calculation, MLA, like MQA, uses a separate Q per head and a K shared by all heads: C^{KV} plays the role of a single-head key and value that is copied to every head before a normal MHA is executed.
The difference is that the vectors in the dot product have dimension d_c instead of d_h. The actual setting in the paper is d_c = 4 d_h. In other words, Multi-Head Latent Attention is essentially Multi-Query Attention with the head dimension increased fourfold.
To make it clearer, we can further absorb and merge some weights in the diagram to obtain the following diagram.

- The calculation of Q degenerates into a common multi-head linear mapping.
- For each head, a portion of its dimensions remains unchanged, corresponding to the green part.
- The RoPE transformation is applied to the remaining dimensions of each head, corresponding to the red portion.
- The calculation of K degenerates into a single-head linear mapping.
- Similarly, RoPE transformation is applied only to a subset of dimensions.
- V then directly uses the portion of K that has not undergone the RoPE transformation, and copies it repeatedly.
0x04 Code
We primarily use the V2 code for analysis because it is more clearly structured. It is also important to note that the DeepSeek code differs from the paper in several places. The DeepseekV2Attention implementation in V2 is essentially the same as the naive implementation in V3, and does not actually save on KV-Cache. The non-naive version in V3, however, is consistent with the paper and saves GPU memory.
4.1 Configuration
We have extracted some relevant configuration information as follows. In the naive implementation, there is a 512-dimensional Latent KV, c_t^{KV}. It is mapped back to 128 heads, each head having 128 dimensions, k^C and v^C. Then concatenate the position vector k^R. The final inputs, q, k, and v are fed into a standard Multi Head Attention system for Attention calculation. Additionally, the code also uses norms, which are mentioned in the paper.

The specific configuration information is as follows:
- Compressed dimension of keys and values: d_c is set to 512, with the original embedding dimension d = 5120, resulting in a ratio of 1/10.
- Query compression dimension: d'_c is set to 1536, with a ratio of 0.3.
"num_hidden_layers": 60, # number of Transformer layers
"hidden_size": 5120, # size of the hidden layer
"num_attention_heads": 128, # number of attention heads
"kv_lora_rank": 512, # KV compression dimension
"q_lora_rank": 1536, # query compression dimension
"qk_rope_head_dim": 64, # per-head dimension of the decoupled (RoPE) query and key
"n_shared_experts": 2, # number of shared experts in each MoE layer
"n_routed_experts": 160, # number of routed experts in each MoE layer
"moe_intermediate_size": 1536, # intermediate hidden dimension of each MoE expert
"num_experts_per_tok": 6, # number of experts activated per token
"routed_scaling_factor": 16.0, # scaling factor for the routed experts
"rms_norm_eps": 1e-06 # epsilon value for RMS normalization
4.2 Definition
Given an input vector h_t ∈ R^{B×L×5120}, where B is the batch size and L is the sequence length:
class DeepseekV2Attention(nn.Module):
    """Multi-headed attention from 'Attention Is All You Need' paper"""
    def __init__(self, config: DeepseekV2Config, layer_idx: Optional[int] = None):
        super().__init__()
        self.config = config
        self.layer_idx = layer_idx
        self.attention_dropout = config.attention_dropout
        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.max_position_embeddings = config.max_position_embeddings
        self.rope_theta = config.rope_theta
        # Dimension of the compressed query latent vector, d'_c
        self.q_lora_rank = config.q_lora_rank
        # Per-head dimension of the query/key part that receives RoPE, d_h^R
        self.qk_rope_head_dim = config.qk_rope_head_dim
        # Dimension of the compressed key-value latent vector, d_c
        self.kv_lora_rank = config.kv_lora_rank
        # Per-head dimension of the value
        self.v_head_dim = config.v_head_dim
        # Per-head dimension of the query/key part that does not receive RoPE
        self.qk_nope_head_dim = config.qk_nope_head_dim
        # Each query/key head is the sum of the nope and rope parts
        self.q_head_dim = config.qk_nope_head_dim + config.qk_rope_head_dim
        self.is_causal = True
        self.q_a_proj = nn.Linear(
            self.hidden_size, config.q_lora_rank, bias=config.attention_bias
        )
        self.q_a_layernorm = DeepseekV2RMSNorm(config.q_lora_rank)
        self.q_b_proj = nn.Linear(
            config.q_lora_rank, self.num_heads * self.q_head_dim, bias=False
        )
        self.kv_a_proj_with_mqa = nn.Linear(
            self.hidden_size,
            config.kv_lora_rank + config.qk_rope_head_dim,
            bias=config.attention_bias,
        )
        self.kv_a_layernorm = DeepseekV2RMSNorm(config.kv_lora_rank)
        self.kv_b_proj = nn.Linear(
            config.kv_lora_rank,
            self.num_heads
            * (self.q_head_dim - self.qk_rope_head_dim + self.v_head_dim),
            bias=False,
        )
        self.o_proj = nn.Linear(
            self.num_heads * self.v_head_dim,
            self.hidden_size,
            bias=config.attention_bias,
        )
        self._init_rope()
        self.softmax_scale = self.q_head_dim ** (-0.5)
        if self.config.rope_scaling is not None:
            mscale_all_dim = self.config.rope_scaling.get("mscale_all_dim", 0)
            scaling_factor = self.config.rope_scaling["factor"]
            if mscale_all_dim:
                mscale = yarn_get_mscale(scaling_factor, mscale_all_dim)
                self.softmax_scale = self.softmax_scale * mscale * mscale
The corresponding information is as follows. Breaking the entire computation process into four parts, q_nope, k_nope, k_pe, and q_pe, is to decouple RoPE. The two variables ending in pe are used to store the rotation position encoding information. Deepseek-V2 compresses the kv cache into a single small matrix, which is then decompressed later.
# q = q.view(bsz, q_len, num_heads, q_head_dim).transpose(1, 2)
# q_nope, q_pe = torch.split(q, [qk_nope_head_dim, qk_rope_head_dim], dim=-1)
q_pe : torch.Size([16, 128, 1, 64])
q_nope : torch.Size([16, 128, 1, 128])
# query_states = k_pe.new_empty(bsz, num_heads, q_len, q_head_dim)
query_states : torch.Size([16, 128, 1, 192])
# kv = .view(bsz, kv_seq_len, num_heads, qk_nope_head_dim + v_head_dim).transpose(1, 2)
# k_nope, value_states = torch.split(kv, [qk_nope_head_dim, v_head_dim], dim=-1)
value_states : torch.Size([16, 128, 1024, 128])
k_nope : torch.Size([16, 128, 1024, 128])
# k_pe = k_pe.view(bsz, kv_seq_len, 1, qk_rope_head_dim).transpose(1, 2)
k_pe : torch.Size([16, 1, 1024, 64])
# key_states = k_pe.new_empty(bsz, num_heads, kv_seq_len, q_head_dim)
key_states : torch.Size([16, 128, 1024, 192])
4.3 Operation Q
We combined all the Q-related code for analysis. The overall process is as follows: when the model processes the hidden state calculated by the previous layer, hidden_size = 5120, it first compresses q to q_lora_rank = 1536, then expands it to the output dimension of q_b_proj, num_heads × q_head_dim, and finally splits it into two parts: q_pe and q_nope.
4.3.1 Variable Definition
The Q projection matrix in MLA, W^Q, undergoes a low-rank decomposition, which generates two matrices, q_a_proj and q_b_proj.
- The size of q_a_proj is [hidden_size, q_lora_rank] = [5120, 1536], corresponding to W^{DQ}.
- The size of q_b_proj is [q_lora_rank, num_heads * q_head_dim] = [1536, 24576], corresponding to W^{UQ} and W^{QR} stored together.
self.num_heads = config.num_attention_heads  # 128
self.q_lora_rank = config.q_lora_rank  # 1536
self.q_head_dim = config.qk_nope_head_dim + config.qk_rope_head_dim  # 128 + 64
# Compress the query, i.e. the down-projection
self.q_a_proj = nn.Linear(
    self.hidden_size, config.q_lora_rank, bias=config.attention_bias
)
self.q_a_layernorm = DeepseekV2RMSNorm(config.q_lora_rank)
# Map the compressed query back to the high-dimensional space, i.e. the up-projection
self.q_b_proj = nn.Linear(
    config.q_lora_rank, self.num_heads * self.q_head_dim, bias=False
)
4.3.2 Variable Operations
In DeepSeek-V2, the Q vector also uses low-rank compression.
- First, project the input vector into a low-dimensional space of 1536 dimensions: c_t^Q = W^{DQ} h_t.
- Then, project it into the multi-head vector space to obtain the first part of the Q vector, q_t^C.
- Then project it into the RoPE-related vector space to obtain the second part of the Q vector, q_t^R.
- Finally, concatenate the two parts to obtain the full query vector q_t = [q_t^C, q_t^R].

The specific implementation is as follows:
# hidden_states corresponds to h_t in the formulas; its shape is (batch_size, seq_length, hidden_size), where hidden_size is 5120
bsz, q_len, _ = hidden_states.size()
# The line below corresponds to equations (37) and (38): down-project, then up-project. q_a_proj is [5120, 1536] and q_b_proj is [1536, 24576]; together they are a low-rank factorization of a [5120, 24576] matrix W^Q, i.e. [5120, 24576] -> [5120, 1536] * [1536, 24576]
# First, the linear layer self.q_a_proj down-projects the input hidden_states
# Then, the linear layer self.q_b_proj up-projects the compressed vector
q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states)))
# Reshape into multi-head form, in preparation for equation (40)
# q_pe is handed to the RoPE module, so the shape has to be rearranged
q = q.view(bsz, q_len, self.num_heads, self.q_head_dim).transpose(1, 2)
# Split the last dimension into the nope and rope parts
q_nope, q_pe = torch.split(q, [self.qk_nope_head_dim, self.qk_rope_head_dim], dim=-1)
# Equation (39): apply RoPE to q and k
q_pe, k_pe = apply_rotary_pos_emb(q_pe, k_pe, cos, sin, position_ids)
# Allocate the query_states tensor
query_states = k_pe.new_empty(bsz, self.num_heads, q_len, self.q_head_dim)
# The next two lines correspond to equation (40)
query_states[:, :, :, : self.qk_nope_head_dim] = q_nope  # 128
query_states[:, :, :, self.qk_nope_head_dim :] = q_pe  # 64
4.4 Operating KV
We combined all the key-value related code for analysis. The model uses a compressed KV representation of only 576 dimensions (512 + 64), performing a down-projection followed by an up-projection during training. During inference, what needs to be cached is compressed_kv, which is then up-projected with kv_b_proj to obtain k and v.
4.4.1 Variable Definition
Similar to the Q vector, the KV vector also undergoes a low-rank decomposition, generating two matrices: kv_a_proj_with_mqa and kv_b_proj.
self.kv_lora_rank = config.kv_lora_rank  # 512, the latent dimension shared by key and value
self.qk_rope_head_dim = config.qk_rope_head_dim  # 64
self.q_head_dim = config.qk_nope_head_dim + config.qk_rope_head_dim  # 128 + 64
self.v_head_dim = config.v_head_dim  # 128
self.hidden_size = config.hidden_size  # 5120
self.kv_a_proj_with_mqa = nn.Linear(
    self.hidden_size,
    config.kv_lora_rank + config.qk_rope_head_dim,
    bias=config.attention_bias,
)
self.kv_a_layernorm = DeepseekV2RMSNorm(config.kv_lora_rank)
self.kv_b_proj = nn.Linear(
    config.kv_lora_rank,
    self.num_heads
    * (self.q_head_dim - self.qk_rope_head_dim + self.v_head_dim),
    bias=False,
)
4.4.2 Variable Operations
There are a few differences between the KV vector calculation and the formula. Some matrix operations are performed together, producing both K and V vectors at the same time, and then separated later.

Dimensional analysis shows that kv_lora_rank is 4 times that of qk_nope_head_dim, and K and V share the latent state. qk_rope_head_dim is only half the size of qk_nope_head_dim. Combined, 4 + 1/2 = 9/2, which is the source of the MLA KVCache per Token size shown in the diagram below.

The specific code implementation is as follows:
# Project the input hidden states with kv_a_proj_with_mqa (MQA, Multi-Query Attention, style) to obtain
# the compressed key-value representation (compressed_kv). This corresponds to Equations 41 and 43 (RoPE not yet applied).
compressed_kv = self.kv_a_proj_with_mqa(hidden_states)
# Split the result into two parts: the low-rank compressed KV part and the key part that will carry the position encoding (k_pe)
compressed_kv, k_pe = torch.split(
    compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1
)
# k_pe is passed to the RoPE module, so it has to be reshaped first
k_pe = k_pe.view(bsz, q_len, 1, self.qk_rope_head_dim).transpose(1, 2)
kv = (
    self.kv_b_proj(self.kv_a_layernorm(compressed_kv))
    .view(bsz, q_len, self.num_heads, self.qk_nope_head_dim + self.v_head_dim)
    .transpose(1, 2)
)
k_nope, value_states = torch.split(
    kv, [self.qk_nope_head_dim, self.v_head_dim], dim=-1
)
kv_seq_len = value_states.shape[-2]
if past_key_value is not None:
    kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
q_pe, k_pe = apply_rotary_pos_emb(q_pe, k_pe, cos, sin, position_ids)
key_states = k_pe.new_empty(bsz, self.num_heads, q_len, self.q_head_dim)
key_states[:, :, :, : self.qk_nope_head_dim] = k_nope
key_states[:, :, :, self.qk_nope_head_dim :] = k_pe
4.5 Attention operations
4.5.1 Variable Definition
o_proj corresponds to matrix W^O. The size is [num_heads * v_head_dim, hidden_size] = [128 * 128, 5120].
self.v_head_dim = config.v_head_dim # 128
self.num_heads = config.num_attention_heads # 128
self.hidden_size = config.hidden_size # 5120
self.o_proj = nn.Linear( # corresponds to Equation 47
    self.num_heads * self.v_head_dim,
    self.hidden_size,
    bias=config.attention_bias,
)
4.5.2 Variable Operations
After the Q, K, and V vectors are generated, the process is essentially the same as a standard MHA computation. The only difference is that RoPE is applied only to the q_pe and k_pe parts.
a = softmax((q_t^⊤ k_t + Mask) / √192)
= softmax((q_t^{C⊤} k_t^C + q_t^{R⊤} k_t^R + Mask) / √(128 + 64))
Then, the weighted sum over V is calculated, and all heads are flattened to obtain the Attention output:
o = a · v_t ∈ R^{B×L×H×128} ≅ R^{B×L×16384}
u = W^O o ∈ R^{B×L×5120}
The specific code is:
# Update the KV cache by concatenating the keys and values of previous positions.
# Note that what is stored here is the expanded MHA-style KV cache, not the compressed latent.
if past_key_value is not None:
    cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
    key_states, value_states = past_key_value.update( # update the KV cache
        key_states, value_states, self.layer_idx, cache_kwargs
    )
# What follows is standard MHA code: Q^T*K*V*O
attn_weights = (
    torch.matmul(query_states, key_states.transpose(2, 3)) * self.softmax_scale
)
if attention_mask is not None:
    attn_weights = attn_weights + attention_mask
# upcast attention to fp32
attn_weights = nn.functional.softmax(
    attn_weights, dim=-1, dtype=torch.float32
).to(query_states.dtype)
attn_weights = nn.functional.dropout(
    attn_weights, p=self.attention_dropout, training=self.training
)
attn_output = torch.matmul(attn_weights, value_states)
attn_output = attn_output.transpose(1, 2).contiguous()
attn_output = attn_output.reshape(bsz, q_len, self.num_heads * self.v_head_dim)
attn_output = self.o_proj(attn_output)
if not output_attentions:
    attn_weights = None
return attn_output, attn_weights, past_key_value
4.6 Forward Propagation
The complete forward pass is reproduced below for reference.
def forward(
    self,
    hidden_states: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    past_key_value: Optional[Cache] = None,
    output_attentions: bool = False,
    use_cache: bool = False,
    **kwargs,
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
    bsz, q_len, _ = hidden_states.size()
    q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states)))
    q = q.view(bsz, q_len, self.num_heads, self.q_head_dim).transpose(1, 2)
    q_nope, q_pe = torch.split(
        q, [self.qk_nope_head_dim, self.qk_rope_head_dim], dim=-1
    )
    compressed_kv = self.kv_a_proj_with_mqa(hidden_states)
    compressed_kv, k_pe = torch.split(
        compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1
    )
    k_pe = k_pe.view(bsz, q_len, 1, self.qk_rope_head_dim).transpose(1, 2)
    kv = (
        self.kv_b_proj(self.kv_a_layernorm(compressed_kv))
        .view(bsz, q_len, self.num_heads, self.qk_nope_head_dim + self.v_head_dim)
        .transpose(1, 2)
    )
    k_nope, value_states = torch.split(
        kv, [self.qk_nope_head_dim, self.v_head_dim], dim=-1
    )
    kv_seq_len = value_states.shape[-2]
    if past_key_value is not None:
        kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
    cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
    q_pe, k_pe = apply_rotary_pos_emb(q_pe, k_pe, cos, sin, position_ids)
    query_states = k_pe.new_empty(bsz, self.num_heads, q_len, self.q_head_dim)
    query_states[:, :, :, : self.qk_nope_head_dim] = q_nope
    query_states[:, :, :, self.qk_nope_head_dim :] = q_pe
    key_states = k_pe.new_empty(bsz, self.num_heads, q_len, self.q_head_dim)
    key_states[:, :, :, : self.qk_nope_head_dim] = k_nope
    key_states[:, :, :, self.qk_nope_head_dim :] = k_pe
    if past_key_value is not None:
        cache_kwargs = {"sin": sin, "cos": cos}
        key_states, value_states = past_key_value.update(
            key_states, value_states, self.layer_idx, cache_kwargs
        )
    attn_weights = (
        torch.matmul(query_states, key_states.transpose(2, 3)) * self.softmax_scale
    )
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask
    attn_weights = nn.functional.softmax(
        attn_weights, dim=-1, dtype=torch.float32
    ).to(query_states.dtype)
    attn_weights = nn.functional.dropout(
        attn_weights, p=self.attention_dropout, training=self.training
    )
    attn_output = torch.matmul(attn_weights, value_states)
    attn_output = attn_output.transpose(1, 2).contiguous()
    attn_output = attn_output.reshape(bsz, q_len, self.num_heads * self.v_head_dim)
    attn_output = self.o_proj(attn_output)
    if not output_attentions:
        attn_weights = None
    return attn_output, attn_weights, past_key_value
The corresponding example is shown below.

4.7 V3 Code
We also provide the V3 code below. The naive version in V3 does not actually reduce the KV cache; it even increases storage. The non-naive version in V3, consistent with the paper, saves GPU memory.
# from: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/model.py
class MLA(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        self.dim = args.dim
        self.n_heads = args.n_heads
        self.n_local_heads = args.n_heads // world_size
        self.q_lora_rank = args.q_lora_rank
        self.kv_lora_rank = args.kv_lora_rank
        self.qk_nope_head_dim = args.qk_nope_head_dim
        self.qk_rope_head_dim = args.qk_rope_head_dim
        self.qk_head_dim = args.qk_nope_head_dim + args.qk_rope_head_dim
        self.v_head_dim = args.v_head_dim
        if self.q_lora_rank == 0:
            self.wq = ColumnParallelLinear(self.dim, self.n_heads * self.qk_head_dim)
        else:
            self.wq_a = Linear(self.dim, self.q_lora_rank)
            self.q_norm = RMSNorm(self.q_lora_rank)
            self.wq_b = ColumnParallelLinear(self.q_lora_rank, self.n_heads * self.qk_head_dim)
        self.wkv_a = Linear(self.dim, self.kv_lora_rank + self.qk_rope_head_dim)
        self.kv_norm = RMSNorm(self.kv_lora_rank)
        self.wkv_b = ColumnParallelLinear(self.kv_lora_rank, self.n_heads * (self.qk_nope_head_dim + self.v_head_dim))
        self.wo = RowParallelLinear(self.n_heads * self.v_head_dim, self.dim)
        self.softmax_scale = self.qk_head_dim ** -0.5
        if attn_impl == "naive":
            self.register_buffer("k_cache", torch.zeros(args.max_batch_size, args.max_seq_len, self.n_local_heads, self.qk_head_dim), persistent=False)
            self.register_buffer("v_cache", torch.zeros(args.max_batch_size, args.max_seq_len, self.n_local_heads, self.v_head_dim), persistent=False)
        else:
            self.register_buffer("kv_cache", torch.zeros(args.max_batch_size, args.max_seq_len, self.kv_lora_rank), persistent=False)
            self.register_buffer("pe_cache", torch.zeros(args.max_batch_size, args.max_seq_len, self.qk_rope_head_dim), persistent=False)
The naive implementation is intuitive and well suited to learning, but it is not suitable for the decoding stage, which relies on the key-value cache: the naive path caches the full per-head keys and values. The naive implementation might therefore be used in the prefill stage, while the decoding stage needs a better computation method, namely the non-naive version.
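To make the difference concrete, the sketch below counts the elements each branch stores per token per layer, reading the shapes straight off the register_buffer calls above. The default dimensions are the DeepSeek-V2 values quoted earlier in this article and a single device (n_local_heads equal to n_heads) is assumed; the label "absorb" for the else branch is also an assumption of this sketch.

```python
def cache_elems_per_token(attn_impl, n_local_heads=128, kv_lora_rank=512,
                          qk_nope_head_dim=128, qk_rope_head_dim=64,
                          v_head_dim=128):
    """Elements stored per token per layer, following the buffer shapes above."""
    if attn_impl == "naive":
        qk_head_dim = qk_nope_head_dim + qk_rope_head_dim
        # k_cache holds qk_head_dim per head, v_cache holds v_head_dim per head.
        return n_local_heads * (qk_head_dim + v_head_dim)
    # else branch: kv_cache (latent) plus pe_cache (shared RoPE key).
    return kv_lora_rank + qk_rope_head_dim

naive_elems = cache_elems_per_token("naive")
absorb_elems = cache_elems_per_token("absorb")  # any value other than "naive"
print(naive_elems, absorb_elems, naive_elems / absorb_elems)
```

With these assumed sizes the naive buffers hold 40960 elements per token, more than a plain MHA cache with 128-wide keys and values would, while the compressed buffers hold 576.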
The specific comparison is shown in the image below.

0x05 Optimize the code
The released DeepSeek code does not implement certain optimizations, such as compression optimization and weight absorption. Therefore, we mainly use the implementation provided by Professor Zhang Mingxing, https://github.com/madsys-dev/deepseekv2-profile/tree/main, as the example to study.
5.1 Compression Optimization
In the current V2 code, the KV Cache in Attention still caches the full set of keys and values, decompressed from the latent vectors, instead of the compressed sets mentioned in the paper, compressed_kv and k_pe, which means that the KV Cache is not actually reduced.
We can make the following modification so that the KV Cache stores compressed_kv and the post-RoPE k_pe instead.
# Concatenate the compressed kv (c_t^{KV}) and the RoPE-applied key part of previous positions from the KV Cache
if past_key_value is not None:
    # The results should be
    # compressed_kv: [B, kv_seq_len, d_c]
    # k_pe: [B, 1, kv_seq_len, qk_rope_head_dim]
    compressed_kv, k_pe = past_key_value.update(compressed_kv, k_pe)
Professor Zhang Mingxing provides a more detailed solution.
# CacheCompressed
def forward(self, hidden_states_q: torch.Tensor, q_position_ids: torch.LongTensor, compressed_kv: torch.Tensor):
    ...
    kv_seq_len = compressed_kv.size(1)
    compressed_kv, k_pe = torch.split(
        compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1
    )
    k_pe = k_pe.view(bsz, kv_seq_len, 1, self.qk_rope_head_dim).transpose(1, 2)
    kv = self.kv_b_proj(compressed_kv) \
        .view(bsz, kv_seq_len, self.num_heads, self.qk_nope_head_dim + self.v_head_dim) \
        .transpose(1, 2)
    k_nope, value_states = torch.split(kv, [self.qk_nope_head_dim, self.v_head_dim], dim=-1)
    ...

def compress_kv(self, hidden_states_kv: torch.Tensor, kv_position_ids: torch.LongTensor) -> torch.Tensor:
    bsz, kv_seq_len, _ = hidden_states_kv.size()
    compressed_kv = self.kv_a_proj_with_mqa(hidden_states_kv)
    compressed_kv, k_pe = torch.split(
        compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1
    )
    compressed_kv = self.kv_a_layernorm(compressed_kv)
    k_pe = k_pe.view(bsz, kv_seq_len, 1, self.qk_rope_head_dim).transpose(1, 2)
    cos, sin = self.rotary_emb(k_pe)
    k_pe = apply_rotary_pos_emb(k_pe, cos, sin, kv_position_ids).view(bsz, kv_seq_len, self.qk_rope_head_dim)
    return torch.cat([compressed_kv, k_pe], dim=-1)
5.2 Weight Absorption
Even with a compressed cache, computing MLA this way still requires decompressing the cache into the complete K and V, which can easily cause an out-of-memory (OOM) crash. The DeepSeek-V2 paper proposes absorbing the KV decompression matrices into the Q projection and the output projection, so that the final attention result can be computed directly without decompressing the KV cache.

5.2.1 absorbed_cache_compressed.py
Unlike the paper, this code absorbs the K-related weights in kv_b_proj, corresponding to W^{UK}, into q_nope, corresponding to q^C, and it does so at runtime rather than pre-multiplying the matrices. The V-related weights in kv_b_proj, corresponding to W^{UV}, are absorbed into attn_out.
W^{UK}
Regarding the absorption of K, the non-RoPE part in the formula for calculating the attention score can be expanded as follows:
q_t^{C⊤} k_t^C
= (W^{UQ} c_t^Q)^⊤ W^{UK} c_t^{KV}
= c_t^{Q⊤} W^{UQ⊤} W^{UK} c_t^{KV}
The code is:
# Absorbed_CacheCompressed
def forward(self, hidden_states_q: torch.Tensor, q_position_ids: torch.LongTensor, compressed_kv: torch.Tensor):
    ...
    # View kv_b_proj per head, then slice out the K part (W^{UK}) and the V part (W^{UV})
    kv_b_proj = self.kv_b_proj.weight.view(self.num_heads, -1, self.kv_lora_rank)
    q_absorb = kv_b_proj[:, :self.qk_nope_head_dim, :]
    out_absorb = kv_b_proj[:, self.qk_nope_head_dim:, :]
    cos, sin = self.rotary_emb(q_pe)
    q_pe = apply_rotary_pos_emb(q_pe, cos, sin, q_position_ids)
    qk_head_dim = self.kv_lora_rank + self.qk_rope_head_dim
    query_states = k_pe.new_empty(bsz, self.num_heads, q_len, qk_head_dim)
    # Absorb W^{UK} into q_nope, so the query lives in the latent space
    query_states[:, :, :, : self.kv_lora_rank] = torch.einsum('hdc,bhid->bhic', q_absorb, q_nope)
    query_states[:, :, :, self.kv_lora_rank :] = q_pe
    ...
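The identity behind the K absorption can be checked numerically. The following is a minimal sketch with toy sizes; NumPy stands in for PyTorch here (the einsum subscripts carry over to torch.einsum unchanged), and all tensor names and shapes are illustrative assumptions rather than the repository's code.

```python
import numpy as np

rng = np.random.default_rng(0)
H, d_nope, d_c, L = 4, 8, 16, 5   # heads, nope head dim, latent dim, kv length (toy sizes)

W_UK = rng.standard_normal((H, d_nope, d_c))   # per-head K slice of kv_b_proj, layout 'hdc'
q_nope = rng.standard_normal((H, d_nope))      # one decode-step query per head
c_kv = rng.standard_normal((L, d_c))           # cached latents

# Unabsorbed: decompress every cached latent into per-head keys, then score.
k_nope = np.einsum('hdc,lc->hld', W_UK, c_kv)
scores_ref = np.einsum('hd,hld->hl', q_nope, k_nope)

# Absorbed: fold W_UK into the query once, then score against the latents directly.
q_absorbed = np.einsum('hdc,hd->hc', W_UK, q_nope)
scores_abs = np.einsum('hc,lc->hl', q_absorbed, c_kv)

print(float(np.abs(scores_ref - scores_abs).max()))
```

Both paths give the same scores up to floating-point error, but the absorbed path never builds the [H, L, d_nope] key tensor.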
W^{UV}
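The absorption of V is slightly more complex. For clarity, we use the Einstein summation convention to describe the process. Here b is the batch, h the head, q the query position, l the key position, d the value head dimension, c the latent dimension, and D the hidden size.
v_t = einsum('hdc,blc->blhd', W_UV, c_t_KV) # (1)
o = einsum('bhql,blhd->bhqd', a, v_t) # (2)
u = einsum('hdD,bhqd->bqD', W_o, o) # (3)
# Merging the three steps above gives the whole computation
u = einsum('hdc,blc,bhql,hdD->bqD', W_UV, c_t_KV, a, W_o)
# Using associativity to change the order of computation
o_ = einsum('bhql,blc->bhqc', a, c_t_KV) # (4)
o = einsum('bhqc,hdc->bhqd', o_, W_UV) # (5)
u = einsum('hdD,bhqd->bqD', W_o, o) # (6)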
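The reordering can be verified with a small numerical experiment. This is a sketch under stated assumptions: toy sizes, NumPy in place of PyTorch, attention weights laid out as 'bhql', and an output laid out as 'bqD' (summed over heads) so that all shapes are consistent.

```python
import numpy as np

rng = np.random.default_rng(0)
B, H, Q, L = 2, 3, 4, 5     # batch, heads, query length, kv length (toy sizes)
d, c, D = 6, 7, 8           # v_head_dim, kv_lora_rank, hidden_size (toy sizes)

W_UV = rng.standard_normal((H, d, c))
W_o = rng.standard_normal((H, d, D))
c_kv = rng.standard_normal((B, L, c))

# Random attention weights, normalized with a softmax over the key axis.
logits = rng.standard_normal((B, H, Q, L))
a = np.exp(logits - logits.max(-1, keepdims=True))
a /= a.sum(-1, keepdims=True)

# Order 1: decompress V first, then apply attention, then the output projection.
v = np.einsum('hdc,blc->blhd', W_UV, c_kv)
o = np.einsum('bhql,blhd->bhqd', a, v)
u_ref = np.einsum('hdD,bhqd->bqD', W_o, o)

# Order 2: apply attention in the latent space, then W_UV, then the output projection.
o_lat = np.einsum('bhql,blc->bhqc', a, c_kv)
o2 = np.einsum('bhqc,hdc->bhqd', o_lat, W_UV)
u_abs = np.einsum('hdD,bhqd->bqD', W_o, o2)

print(u_ref.shape, float(np.abs(u_ref - u_abs).max()))
```

The second order never materializes the decompressed values of shape [B, L, H, d], which is the point of the absorption.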
5.2.2 Move Elision
After the above optimizations, the concatenation step still causes a large amount of unnecessary data copying and broadcasting and consumes a lot of GPU memory, which can lead to an OutOfMemoryError. The MoveElision optimization therefore skips concatenating the RoPE and non-RoPE parts; instead it computes the attention scores of the two parts separately and adds them together.
# Absorbed_CacheCompressed_MoveElision
def forward(...):
    ...
    # After absorption, attn_weights is computed directly from compressed_kv, without decompressing it
    attn_weights = torch.matmul(q_pe, k_pe.transpose(2, 3)) + torch.einsum('bhqc,blc->bhql', q_nope, compressed_kv)
    attn_weights *= self.softmax_scale
    ...
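That the two-part sum equals the concatenated dot product can be checked directly. The sketch below uses toy sizes and NumPy in place of PyTorch; q_nope is assumed to be the already absorbed query that lives in the latent space, and the tensor names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
B, H, Q, L, d_c, d_r = 2, 3, 1, 6, 16, 4   # toy sizes

q_nope = rng.standard_normal((B, H, Q, d_c))     # absorbed query, latent space
q_pe = rng.standard_normal((B, H, Q, d_r))
compressed_kv = rng.standard_normal((B, L, d_c))
k_pe = rng.standard_normal((B, 1, L, d_r))       # one RoPE key shared by all heads

# With concatenation: broadcast the latents and k_pe to every head and copy them.
query_states = np.concatenate([q_nope, q_pe], axis=-1)
key_states = np.concatenate(
    [np.broadcast_to(compressed_kv[:, None], (B, H, L, d_c)),
     np.broadcast_to(k_pe, (B, H, L, d_r))],
    axis=-1,
)
scores_concat = query_states @ key_states.transpose(0, 1, 3, 2)

# Move elision: score the two parts separately and add, with no per-head copies.
scores_split = (q_pe @ k_pe.transpose(0, 1, 3, 2)
                + np.einsum('bhqc,blc->bhql', q_nope, compressed_kv))

print(float(np.abs(scores_concat - scores_split).max()))
```

The split form avoids allocating key_states, whose size grows with the number of heads even though every head sees the same latents.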
The code comparison is as follows:

5.2.3 Materializing Projection Matrices
The DeepSeek-V2 paper states:

However, it seems unnecessary to change the order and preprocess the model parameters, multiplying W^{UK} and W^{UQ} beforehand, and similarly multiplying W^{UV} and W^O. Therefore, Professor Zhang believes that this optimization step is not very necessary.
The specific code is as follows:
def forward(self, hidden_states_q: torch.Tensor, q_position_ids: torch.LongTensor, compressed_kv: torch.Tensor):
    '''
    Attention masks and past cache are removed.
    Input:
    - hidden_states_q: [bsz, q_len, hidden_size]
    - compressed_kv: [bsz, kv_len, kv_lora_rank]
    - position_ids: [bsz, q_len]
    '''
    bsz, q_len, _ = hidden_states_q.size()
    q_b_proj_rope, q_absorbed, out_absorbed = self.get_absorbed_proj()
    q = self.q_a_layernorm(self.q_a_proj(hidden_states_q))
    q_nope = torch.einsum('bqc,hdc->bhqd', q, q_absorbed)
    q_pe = torch.einsum('bqc,hdc->bhqd', q, q_b_proj_rope)
    cos, sin = self.rotary_emb(q_pe)
    q_pe = apply_rotary_pos_emb(q_pe, cos, sin, q_position_ids)
    kv_seq_len = compressed_kv.size(1)
    compressed_kv, k_pe = torch.split(
        compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1
    )
    k_pe = k_pe.view(bsz, 1, kv_seq_len, self.qk_rope_head_dim)
    attn_weights = (torch.matmul(q_pe, k_pe.mT) + torch.matmul(q_nope, compressed_kv.unsqueeze(-3).mT)) * self.softmax_scale
    attn_weights = nn.functional.softmax(
        attn_weights, dim=-1, dtype=torch.float32
    ).to(q_nope.dtype)
    attn_output = torch.einsum('bhql,blc->bhqc', attn_weights, compressed_kv)
    attn_output = torch.einsum('bhqc,dhc->bqd', attn_output, out_absorbed)
    return attn_output
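One plausible reason why pre-multiplying the matrices does not pay off is that the product of two thin low-rank factors is a much larger matrix. The count below is a rough sketch, not a measurement: q_lora_rank = 1536 is an assumed DeepSeek-V2 value that this section does not state, and only the query-side matrices W^{UQ} and W^{UK} are counted.

```python
q_lora_rank = 1536       # assumed DeepSeek-V2 value, not stated in this section
kv_lora_rank = 512
qk_nope_head_dim = 128
num_heads = 128

# Runtime absorption keeps W^{UQ} and W^{UK} as two thin matrices per head.
separate = num_heads * (q_lora_rank * qk_nope_head_dim
                        + qk_nope_head_dim * kv_lora_rank)

# Materialization stores their product: one q_lora_rank x kv_lora_rank matrix per head.
merged = num_heads * (q_lora_rank * kv_lora_rank)

print(separate, merged, merged / separate)
```

Under these assumptions the materialized matrices hold three times as many parameters, and a matrix-vector product with them costs three times as many multiply-accumulates per token.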
5.3 Fusion Operator
Furthermore, the prefill and decode stages can be handled differently, in which case the two stages follow different logic during inference.
- Prefill does not use matrix absorption, because absorption increases the computational cost in prefill.
- Decode uses matrix absorption, because the absorbed form needs far fewer operations than the non-absorbed form when the length of Q is 1.
After weight absorption, the formula is as follows:
(p · (c_kv · W^{UV})) · W^O
= (p · c_kv) · (W^{UV} · W^O)
= (softmax(q_nope · c_kv + q_pe · k_pe) · c_kv) · W^{UV} · W^O
This can be described in code as follows, which means that an MQA operator can be designed to implement it.
q_pe = W_QR(c_q)
q_nope = W_UQ_UK(c_q)
output = W_UV_O(MQA(q_pe, q_nope, c_kv, k_pe))
5.4 Reordering of Matrix Multiplication (Supplement @ 2025-04-19)
Content Reference: DeepSeek V3 Inference: MLA and MOE Analysis Arthur.
The specific features are as follows:
- Source: SGLang, applied to DeepSeek-V2.
- Feature: the computation order is changed using the associative law of matrix multiplication, which makes the attention computation more efficient.
- Details:
  - Original calculation order: q_nope k_nope + q_rope k_rope.
  - Improved calculation order: (q_nope^T W^{UK}) c.
This change utilizes the associative law of matrix multiplication, allowing computations to be reorganized across different dimensions during the decoding phase. When n_q = 1, the optimized method can significantly reduce the amount of computation.
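A rough operation count illustrates why the reordering helps in decode and hurts in prefill. This is a back-of-the-envelope sketch: it counts multiply-accumulates per head for the attention scores only, and it assumes the non-absorbed path decompresses k_nope from the cached latents at every step. The function name and default dimensions are choices made for this sketch.

```python
def score_macs(n_q, n_kv, absorbed, d_nope=128, d_rope=64, d_c=512):
    """Multiply-accumulates per head for the scores, starting from cached latents."""
    if absorbed:
        # (q_nope^T W^{UK}) c: fold W^{UK} into each query, then score in latent space.
        return n_q * d_nope * d_c + n_q * n_kv * (d_c + d_rope)
    # q_nope k_nope + q_rope k_rope: decompress each cached latent into k_nope first.
    return n_kv * d_c * d_nope + n_q * n_kv * (d_nope + d_rope)

# Decode: one new query against 4096 cached tokens.
decode_plain = score_macs(1, 4096, absorbed=False)
decode_absorbed = score_macs(1, 4096, absorbed=True)

# Prefill: 4096 queries against 4096 keys.
prefill_plain = score_macs(4096, 4096, absorbed=False)
prefill_absorbed = score_macs(4096, 4096, absorbed=True)

print(decode_plain / decode_absorbed, prefill_absorbed / prefill_plain)
```

With n_q = 1 the absorbed order avoids decompressing thousands of keys for a single query, whereas in prefill the wider latent dot product (576 instead of 192 dimensions) dominates and the original order is cheaper.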

0x06 Conversion
6.1 GQA
Grouped-Query Attention (GQA) is a variant of MHA designed to reduce the overhead of key-value caching. It divides the query heads into multiple groups, each sharing one key-value head. This approach reduces the size of the key-value cache by decreasing the number of key-value heads, but may sacrifice the model’s expressiveness. GQA can be viewed as a special case of MLA: the expanded keys and values in GQA are produced by copying, while MLA is not subject to this constraint and therefore offers greater expressiveness.
Although MLA has proven its efficiency and effectiveness in Deepseek V2/V3/R1, many major model providers still rely on GQA. To promote wider adoption of MLA, the paper “TransMLA: Multi-Head Latent Attention Is All You Need” proposes TransMLA, a post-training method that converts widely used GQA-based pre-trained models, such as LLaMA, Qwen, and Mixtral, into MLA-based models. After conversion, the model can undergo additional training to enhance its expressive power without increasing the size of the key-value cache.
6.1.1 Approach
The paper first proves that for the same key-value caching overhead, MLA’s expressive power is always greater than GQA’s. Specifically, any GQA configuration can be equivalently converted to an MLA representation, but the reverse is not true. This conclusion provides a theoretical basis for converting GQA-based models to MLA-based models.
During the equivalence transformation, the TransMLA method first copies the key matrix from GQA to match the number of query heads. Then, it decomposes this copied key matrix into the product of two smaller matrices, thus obtaining the low-rank representation in MLA. In this way, TransMLA can convert a GQA-based model to an MLA-based model without increasing the KV cache size.
6.1.2 Scheme
The first step is to duplicate the key matrix to match the number of query heads. In GQA, standard multi-head attention requires Q, K, and V to have the same number of heads, so K needs to be expanded from K_{n_k} to K_{n_q}. There are two ways to do this:
- Define the replication factor s = n_q / n_k, divide K into n_k blocks, copy each block s times, and concatenate them to obtain the extended matrix K'.
- Alternatively, move the copying operation to the parameter side, which essentially uses MHA instead of GQA: copy the projection matrix before calculating K.
See Figure (a) and Figure (b) below for details.

Because W'_K is formed by copying W_K, its maximum degree of freedom is n_k d_h. Therefore, its rank is at most n_k d_h. To understand this more formally, let’s examine it through singular value decomposition.
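The rank argument can be illustrated numerically. The sketch below replicates a toy GQA key projection on the parameter side, checks that its rank stays at n_k d_h, and factors it exactly with an SVD. The row-vector convention (k = x W), the toy sizes, and the names W_down and W_up are assumptions of this sketch, not TransMLA's code.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_q, n_k, d_h = 32, 8, 2, 4     # hidden size, query heads, KV heads, head dim (toy)
s = n_q // n_k                     # replication factor

W_K = rng.standard_normal((D, n_k * d_h))      # GQA key projection

# Replicate each KV head's block s times on the parameter side to get W'_K.
blocks = W_K.reshape(D, n_k, d_h)
W_K_rep = np.repeat(blocks, s, axis=1).reshape(D, n_q * d_h)

# Copying columns cannot raise the rank above n_k * d_h.
rank = int(np.linalg.matrix_rank(W_K_rep))

# Exact low-rank factorization W'_K = W_down @ W_up via SVD.
U, S, Vt = np.linalg.svd(W_K_rep, full_matrices=False)
r = n_k * d_h
W_down = U[:, :r] * S[:r]          # D x r, plays the role of a down projection
W_up = Vt[:r]                      # r x (n_q * d_h), plays the role of an up projection
recon_err = float(np.abs(W_down @ W_up - W_K_rep).max())

print(rank, recon_err)
```

The latent x @ W_down has n_k d_h entries per token, the same size as the original GQA key cache, which matches the claim that the conversion does not enlarge the cache.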

Furthermore, GQA requires a strict match between the number of groups and the hardware scale, for example 8 cards for g = 8, limiting the flexibility of model deployment. MLA, on the other hand, can dynamically adapt to different hardware configurations through latent spatial projection and decoupled weight merging. To compensate for performance loss, GQA requires increasing the size of the FFN layer, which increases model complexity, while MLA maintains performance without additional compensation through low-rank projection and dynamic routing.
6.2 MHA
Enabling LLMs, such as Llama, originally trained for MHA to quickly adapt for inference in MLA without retraining from scratch is both meaningful and challenging. The paper “Towards Economical Inference: Enabling DeepSeek’s Multi-Head Latent Attention in Any Transformer-based LLMs” presents the first data-efficient fine-tuning method, MHA2MLA, for converting from MHA to MLA. This method comprises two key components:
- For partial-RoPE, the paper removes RoPE from the dimensions of queries and keys that contribute less to attention scores.
- For the low-rank approximation, the paper introduces a joint SVD approximation based on pre-trained parameters of the keys and values.
These carefully designed strategies enable MHA2MLA to recover performance using only a very small fraction of the data, significantly reducing inference costs, while seamlessly integrating with compression techniques such as KV cache quantization.

6.2.1 Partial-RoPE
To migrate from standard MHA to MLA, the paper proposes a partial-RoPE fine-tuning strategy, which removes RoPE from a target proportion of the dimensions and converts them to NoPE.
MHA
MHA’s Full-RoPE encodes position information into the query and key through rotations at a specific frequency, as shown in the figure below.

Disassembly
In MLA, k_i consists of [k_{i,nope}; k_{i,rope}]. Therefore, we first need to split the fully RoPE-encoded MHA key k_{i,rope} into two parts as well: one without RoPE encoding and one with RoPE.
DeepSeek’s MLA does not actually apply RoPE inside each original head dimension d_h. Instead, it appends extra dimensions d_h^R that carry RoPE. When converting an existing MHA model, however, we can only work within the existing length d_h: we split k_{i,rope} and take out a d_r part, where d_r ≪ d_h, to keep partial RoPE encoding.
When converting from Full-RoPE to Partial-RoPE, which subspaces should keep the rotary encoding? The paper proposes four strategies:
- High-frequency retention: retain the r fastest-rotating subspaces.
- Low-frequency retention: retain the r slowest-rotating subspaces.
- Uniform sampling: select r subspaces at equal intervals.
- Head-wise 2-norm contribution selection: sort all subspaces according to their 2-norm scores and keep the top r.
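The four strategies amount to four ways of picking r indices out of the d_h / 2 two-dimensional RoPE subspaces. The following is a hypothetical sketch: it assumes the usual RoPE layout in which subspace 0 rotates fastest, it takes the 2-norm scores as a given array rather than computing them from queries and keys, and the function name is invented for illustration.

```python
import numpy as np

def select_rope_subspaces(n_sub, r, strategy, scores=None):
    """Return the sorted indices of the r subspaces that keep RoPE.

    Assumes subspace k rotates at base ** (-k / n_sub), so index 0 is the fastest.
    """
    if strategy == "high":          # fastest-rotating subspaces
        idx = np.arange(r)
    elif strategy == "low":         # slowest-rotating subspaces
        idx = np.arange(n_sub - r, n_sub)
    elif strategy == "uniform":     # equally spaced subspaces
        idx = (np.arange(r) * n_sub) // r
    elif strategy == "2-norm":      # top-r by a per-subspace contribution score
        idx = np.argsort(-np.asarray(scores), kind="stable")[:r]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return np.sort(idx)

# Hypothetical per-subspace scores for one head with 8 subspaces.
toy_scores = [0.1, 0.9, 0.2, 0.3, 0.8, 0.0, 0.4, 0.5]
for name in ("high", "low", "uniform", "2-norm"):
    print(name, select_rope_subspaces(8, 2, name, toy_scores).tolist())
```

All subspaces not returned by the selection lose their rotation and become part of the NoPE dimensions.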

The selected d_r of the d_h dimensions keep the RoPE position encoding, and the remaining d_h - d_r dimensions are treated as the position-free part in MLA, namely the nope part such as q_nope.
6.2.2 Low-rank approximation
For MHA, k_i = W_k x_i, v_i = W_v x_i. We have already used one of the four methods above to find the part that needs to be RoPE’d, so we can then take the corresponding part from W_k to get W^{KR}.
We also extract the non-RoPE parameters from the data:
k_{i,nope} = W_{k,nope} x_i
v_{i,nope} = W_{v,nope} x_i
Our goal is to construct MLA’s W^{DKV} from W_{k,nope} and W_{v,nope}.
After the conversion from Full-RoPE to Partial-RoPE, the next step is to obtain the second component of the MLA KV cache, c_{i,kv}. The paper proposes two SVD-based strategies, decoupled SVD and joint SVD, as shown in the figure below.

- Decoupled SVD, SVD_split: perform truncated SVD separately on W_{k,nope} and W_{v,nope}.
- Joint SVD, SVD_joint: to preserve the interaction between K_nope and V, perform a joint decomposition of the concatenated matrix [W_{k,nope}, W_v]. This decomposition is closer to the standard MLA form.
At this point, we have completed the processing of the key and value parts. Unlike MLA in DeepSeek, the query part does not undergo low-rank decomposition. Instead, the query is simply split into nope and rope parts that match those of the key.
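The two strategies can be compared on random weights. This is a sketch under stated assumptions: toy sizes, row-vector convention (k = x W), a latent budget of r split evenly between K and V in the decoupled case, and a hypothetical helper truncated_svd. It is not the MHA2MLA implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d_k, d_v, r = 32, 12, 16, 8     # hidden size, nope key dim, value dim, latent dim (toy)

W_k_nope = rng.standard_normal((D, d_k))
W_v = rng.standard_normal((D, d_v))
W_cat = np.concatenate([W_k_nope, W_v], axis=1)

def truncated_svd(W, rank):
    """Best rank-`rank` factorization W ~ down @ up (hypothetical helper)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank]

# SVD_split: two independent factorizations, each given half of the latent budget.
down_k, up_k = truncated_svd(W_k_nope, r // 2)
down_v, up_v = truncated_svd(W_v, r // 2)
split_approx = np.concatenate([down_k @ up_k, down_v @ up_v], axis=1)

# SVD_joint: one factorization of [W_k_nope, W_v]; down_kv plays the role of W^{DKV}.
down_kv, up_kv = truncated_svd(W_cat, r)
up_k_joint, up_v_joint = up_kv[:, :d_k], up_kv[:, d_k:]

x = rng.standard_normal((3, D))    # three token hidden states
c_kv = x @ down_kv                 # the shared latent that would be cached
k_approx, v_approx = c_kv @ up_k_joint, c_kv @ up_v_joint

err_split = float(np.linalg.norm(W_cat - split_approx))
err_joint = float(np.linalg.norm(W_cat - down_kv @ up_kv))
print(c_kv.shape, err_split, err_joint)
```

For the same total latent size, the joint factorization is the best rank-r approximation of the concatenated matrix in the Frobenius norm, so its reconstruction error cannot exceed that of the split variant.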
0xFF Reference
DP MLA For DeepSeek In Sglang is Xiao Xiao.
DeepSeek V3, R1, Janus-Pro series modeling methods for interpreting durian pastries
[LLM Algorithm] MLA technology shines in DeepSeek-R1; Tsinghua TransMLA converts GQA to MLA with a single click. SmartMindAI
First Efficient Parameter Fine-Tuning Framework: Using DeepSeek’s MLA in Any LLMs AcademicDaily00 [AcademicDaily]
DeepSeekV2’s MLA (Multi-head Latent Attention) Explains the Mission of a Drop of Water
DeepSeek Model Interpretation: Scaling Law, MLA, MoE JMXGODLZ
Still using MHA? MLA is here! A summary and reflection on MLA for DeepSeek-v2 (rainbow)
A Comprehensive Guide to DeepSeek-V2 (Chinese Model for Transformer Modification): Detailed Explanation of MoE, GRPO, MLA v_JULY_v
MLA Learning Notes (including matrix absorption analysis during inference) - A powerful tool for saving large model key-value caches (BBuf)
A Brief Read of DeepSeek-V2 Technical Report AGI Dream Factory
Developing DeepSeek-V2 from scratch using PyTorch
Illustrated Explanation of Mixtral 8 * 7b Inference Optimization Principles and Source Code Implementation by Mengyuan
From MHA to MLA: Attention Optimization: Discussing DeepSeek’s Pinduoduo-level Inference Prices [zartbot]
Continuing our discussion on MLA, DeepSeek-MoE, and SnowFlake Dense-MoE, Zabot’s Eraser [zartbot]
Some Analysis on MHLA (Multi-Head Latent Attention) (Zhengxiao Du)
[LM Base] About MLA (including code) in DeepSeek-V2 (by Mo Ran)
deepseek-v2 MLA Deep Analysis of Single-Word Zhuo
Deepseek-V2 Technical Explanation (Captain)
What are your thoughts on DeepSeek’s release of the MoE large model, DeepSeek-V2? (Zheng Huabin)
The Extreme Pull-Off Between Caching and Performance: From MHA, MQA, GQA to MLA (Su Jianlin)
DeepSeek-V2 High-Performance Inference (1): Tenfold Speed-Up of MLA Operator via Matrix Absorption ZHANG Mingxing
Speed reading Deepseek v2 (Part 1) – Understanding MLA Bruce’s Wanderings
What are your thoughts on DeepSeek’s release of the MoE large model, DeepSeek-V2? - Zhihu (zhihu.com)
Deepseek-V2 Technical Report Interpretation! The Most Detailed on the Entire Internet! (qq.com) [BaoBao Algorithm Notes] 2
DeepSeek-V2 High-Performance Inference Optimization Notes: MLA Optimization madsys-dev
GQA paper reading and related thoughts clvsit
LLM acceleration techniques: Multi Query Attention deephub
Large Model Fundamentals | Attention Mechanisms | MHA | Sparsity | MQA | GQA Controllers of Wellness
Attention optimization: Flash Attention and Paged Attention, MQA, and GQA miangangzhen
Write LoRA code from scratch
Large Model Lightweight Fine-Tuning (LoRA): Training Speed and Memory Usage Analysis - Top Secret Ambush
MLKV: Cross-layer KV Cache Sharing, Reducing Memory Footprint (AI Chat)
[Deep Learning] DeepSeek Core Architecture - MLA: Analyzing the Technical Details of Low-Rank Joint Compression Optimization of KV Caching and Improvement of Inference Efficiency Zhao Nanxia
Deep Seek-R1 Model Architecture In-Depth Analysis (Part 2): MLA
SGLang DP MLA Feature Interpretation BBuf
[LLM Paper Explained] MLA Technology Shows Its Power in DeepSeek-R1, Tsinghua TransMLA Converts GQA to MLA with One Click AI-PaperDaily
TransMLA: Multi-Head Latent Attention Is All You Need
Learning and thoroughly understanding the DeepSeek MLA algorithm from a code perspective
The most detailed guide online! DeepSeekMLA Multi-Head Latent Variable Attention: From Algorithm Principles to Code Implementation
Deepseek Technology Explained (1) - Thorough Understanding of MLA (Multi-Head Latent Attention) Jiang Fuchun
[Code Learning] Learning Deepseek-v2 Inference Code - MLA - Part 1
[Code Learning] Learning Deepseek-v2 Inference Code - MLA - Part 3
[Code Learning] Learning Deepseek-v2 Inference Code - MLA - Part 4
[Code Learning] Learning Deepseek-v2 Inference Code - MLA - Part 2
DeepSeek Open Sources FlashMLA: A Detailed Explanation of MLA from Principles to Code Du Lingxiao
First parameter-efficient fine-tuning framework: Using DeepSeek’s MLA in any LLMs
How to convert MHA in a pre-trained model into MLA? Du Lingxiao
Finally figured out the multi-head latent attention mechanism in DeepSeek!! — Programmer Xiaohan
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 High-Performance Inference (1): Tenfold Speed-Up of MLA Operators through Matrix Absorption
Kernel design of DeepSeek MLA in FlashInfer
A detailed explanation of DeepSeek MLA matrix ablation (formath, 2025-02-24)
sglang mla code analysis hcy
MLA and Matrix Absorption in DeepSeek V2/V3 ariesjzj
Kernel design of DeepSeek MLA in FlashInfer yzh119
Finally figured out the multi-head latent attention mechanism in DeepSeek!! Programmer Xiaohan
What are the highlights of FlashMLA, the project open-sourced on the first day of DeepSeek Open Source Week? (SIY.Z)
MLA Learning Notes: A Powerful Tool for Saving Key-Value Cache in Large Models (Including Matrix Absorption Analysis During Inference)
The Qwen architecture was transformed into Deepseek, and the R1 plan was replicated by Meng Fanxu.
DeepSeek V2 “Multi-Head Latent Attention” Paper Interpretation (Part 1): Large Model Coffee Time
Does Deepseek MLA absolutely require code ingestion? (Code mover)
DeepSeek V3 Reasoning: MLA and MOE Analysis Arthur
Some fragmented memories triggered by DeepSeek MLA (YyWangCS)
Discuss the matrix calculation order in deep learning performance optimization (YyWangCS)
[Deepseek v3 Technical Report Study] 1. MLA Duludulu
Can concat in attention be replaced with addition? Zhai Feiyue
MLA achieves understanding of David
SGLang MLA implementation of BBuf parsing
Understanding the position and role of FlashMLA in the DeepSeek MLA calculation process
The Mystery of MLA Absorption: Little Zhu Pulling the Aircraft Carrier
MLA Principles Introduction (Simplified Version) opter
DeepSeek-V3/R1 Inference Efficiency Analysis (v0.17) zartbot
DeepSeek V3/R1 Inference Efficiency Analysis (2): DeepSeek Full-Power Reverse Engineering Analysis by Han Shen
DeepSeek V3/R1 Inference Efficiency Analysis (3): Decode Configuration Generalization Discussion (Han Shen)
DeepSeek V3/R1 Inference Efficiency Analysis (1): Some Irresponsible Estimates of the Decoding Throughput Limit of DeepSeek V3/R1 (Han Shen)
MoE Inference On AnyScale MoE-On-AnyScale
Understanding the computational characteristics of prefill and decode based on chunked prefill: Chayenne Zhao
The Architectural Issues Behind the Separation of LLM and PD - Geek Boge
DeepSeek MLA Inference Optimization Watson
DeepSeek-V3 MTP Engineering Implementation Thoughts by Geek Boge
A few thoughts: Why is DeepPhone so fast? (Clouds open)
Should prefill and decode be separated onto different cards? (Chayenne Zhao)
1. DeepSeek Model Learning Notes Li Weihua
DeepSeek-V3 (671B) Model Parameter Decomposition Calculation ZihaoZhao
In-depth analysis of vLLM: Deekseek and vLLM-1 stephenxi
DeepSeek MLA Inference Process and Code Implementation in SGLang (Durian Pastry)
The evolution path from MHA to MQA to GQA to MLA : If I were given an AI
The Annotated Transformer https://nlp.seas.harvard.edu/2018/04/03/attention.html
Attention Is All You Need https://arxiv.org/pdf/1706.03762.pdf
Fast Transformer Decoding: One Write-Head is All You Need https://arxiv.org/pdf/1911.02150.pdf
https://www.researchgate.net/figure/led-dot-product-self-attention-mechanism_fig1_363923096
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints https://arxiv.org/pdf/2305.13245.pdf
How Attention works in Deep Learning: understanding the attention mechanism in sequence models https://theaisummer.com/attention/
A simple overview of RNN, LSTM and Attention Mechanism https://medium.com/swlh/simple-overview-of-rnn-lstm-and-attention-mechanism-9e844763d07b
A Brief Discussion on Transformer Initialization, Parameterization, and Standardization https://spaces.ac.cn/archives/0
https://theaisummer.com/self-attention/
https://zhuanlan.zhihu.com/p/626820422
Are Sixteen Heads Really Better than One? https://arxiv.org/pdf/1905.10650.pdf
This post is all you need (Volume 1) – Unveiling the Transformer Layer by Layer https://zhuanlan.zhihu.com/p/420820453
The Illustrated Transformer https://jalammar.github.io/illustrated-transformer/
Multi-Query Attention is All You Need https://blog.fireworks.ai/multi-query-attention-is-all-you-need
Sequence Parallelism and Tensor Parallelism in DeepSeek MLA (YyWangCS)