Exploring the Transformer Series (28) --- DeepSeek MLA

0x00 Overview

The basic idea of MLA (Multi-head Latent Attention) is to compress the attention input h_t into a low-dimensional latent vector c_t^{KV} whose dimension is d_c, and d_c is much smaller than the original dimension. When attention needs to be calculated, this latent vector can be mapped back to a higher-dimensional space. Therefore, only the latent vector c_t^{KV} needs to be stored. This can significantly reduce memory usage.

This process can be described more formally using the following formula. c_t^{KV} represents the latent vector. W^{DKV} is a compression matrix, where the superscript D represents “down projection”, that is, a dimensionality reduction operation. It is responsible for compressing h_t from the original multi-head dimension to d_c. W^{UK} and W^{UV} are projection matrices responsible for mapping the shared latent vector c_t^{KV} back to a higher-dimensional space. Only this latent vector c_t^{KV} needs to be stored. This allows us to obtain the key and value corresponding to different text features without having to store the corresponding key and value for each text feature.

Similarly, the query vector can be mapped to a low-dimensional latent vector and then mapped back to the original high-dimensional space. MLA also incorporates a weight absorption technique that reduces computational overhead.
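
Written out, the compression and reconstruction steps described above are as follows. Positional encoding is omitted, and the query symbols c_t^Q, W^{DQ}, and W^{UQ} follow the notation used later in this article:

```latex
\begin{aligned}
c_t^{KV} &= W^{DKV} h_t, \\
k_t^{C} &= W^{UK} c_t^{KV}, \qquad v_t^{C} = W^{UV} c_t^{KV}, \\
c_t^{Q} &= W^{DQ} h_t, \qquad q_t^{C} = W^{UQ} c_t^{Q}.
\end{aligned}
```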

[Figure 2801]

Note:

The complete list of articles is here. It’s estimated to eventually have around 35 articles. This list will be updated after each subsequent article is published. (Cnblogs Exploring Transformer Series: Article List)
This series is a study and interpretation of papers, blogs, and code, drawing on many articles from online friends. I would like to express my gratitude to them, and their names will be listed in the references. Because there are so many references in this series, there may be omissions in the citations. If the original authors or other friends find any omissions, please point them out, and I will add them to the references.

0x01 Principle

1.1 Problem

A major obstacle for standard Transformers is the space consumed by the key-value (KV) cache: multi-head attention requires each attention head to store its previously generated key and value vectors separately. The storage requirement of the KV cache grows linearly with sequence length (and with batch size), so memory consumption rises sharply for long contexts. GPU memory is limited, and a large KV cache means fewer requests can be processed simultaneously, that is, a smaller batch size. To reduce KV cache requirements, researchers have proposed methods such as Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). These methods reduce cache requirements, but they also degrade model performance. In addition, during attention computation the entire KV cache is read and each element is used only once or a few times, resulting in extremely low MFU (Model FLOPs Utilization). Because each request has its own KV cache, this problem cannot be solved by increasing the batch size.

Therefore, a key issue is how to reduce the key-value cache during inference, thereby enabling inference of longer contexts on fewer devices, or increasing the batch size with the same context length to achieve faster inference speed or greater throughput, ultimately reducing inference costs.

1.2 Current Situation

Let’s first summarize the current status of various solutions to see if there’s room for improvement. The following diagram illustrates the approaches for MHA, GQA, MQA, and MLA.

[Figure 2802]

The diagram shows MHA, GQA, MQA, and MLA from left to right. The shaded bars represent results cached in GPU memory. MHA, GQA, and MQA all cache K and V in GPU memory. The characteristics of these schemes are as follows.

MHA: The KV cache, like the Q matrix, follows a one-to-one model across attention heads. MHA splits an attention calculation into multiple heads, each with its own Q, K, and V, and both K and V must be stored for every head. The number of elements cached per token is 2 n_h d_h l. GQA and MQA, by contrast, use fewer K/V heads than Q heads.
MQA: All query heads share a single key head and a single value head, so only the shared K and V need to be stored. The number of elements cached per token is 2 d_h l. When attention is calculated, the shared K head and V head are broadcast to every query head, and each head is then computed individually.
GQA: All Q heads are divided into g groups (g = n_g), and the Q heads in a group share one K head and one V head. The number of elements cached per token is therefore 2 n_g d_h l. When attention is calculated, each key-value head is copied to all Q heads in its group.

n_h is the number of heads. n_g is the number of GQA groups. d_h is the dimension of a single attention head. l is the number of model layers. h_t ∈ R^d represents the input of the t-th token at an attention layer.
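
The per-token cache sizes above can be tabulated with a small helper. This is only a sketch: the function name `kv_cache_elems_per_token` and the example configuration (64 heads, 8 groups, head dimension 128, 80 layers) are illustrative assumptions, not taken from any particular model.

```python
def kv_cache_elems_per_token(n_kv_heads: int, d_h: int, n_layers: int) -> int:
    """Elements cached per token: K and V (factor 2) for every KV head in every layer."""
    return 2 * n_kv_heads * d_h * n_layers

# Hypothetical configuration, chosen only for illustration.
n_h, n_g, d_h, l = 64, 8, 128, 80

mha = kv_cache_elems_per_token(n_h, d_h, l)  # one KV head per query head
gqa = kv_cache_elems_per_token(n_g, d_h, l)  # one KV head per group
mqa = kv_cache_elems_per_token(1, d_h, l)    # a single shared KV head

print(mha, gqa, mqa)
```

With these assumed numbers, GQA stores n_h / n_g times fewer elements than MHA, and MQA stores n_h times fewer.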

1.3 Improvement Ideas

MLA is an improvement on the MHA, GQA, and MQA schemes. Its idea is to enhance information compression capabilities, corresponding to number 1 in the diagram below, and enrich information expression capabilities, corresponding to number 2 in the diagram below. In fact, the two numbers also correspond to two key points in the data flow from input to Q, K, and V, which are two sides of the same coin: while enhancing the matrix’s expressive capabilities, it also makes the compression capabilities greater.

[Figure 2803]

This leads to a common dilemma for researchers: they need both stronger compression, to reduce the KV cache overhead during inference, and stronger expressiveness, to alleviate the performance loss caused by MQA and GQA. In other words, the new solution should have expressiveness as close as possible to MHA.

1.3.1 Enhance information compression capabilities

Ideas

From a certain perspective, MQA and GQA are also forms of low-rank compression. MQA compresses the 2 n_h cached heads to 2, and GQA compresses them to 2 g. However, compression and performance are hard to balance: GQA compresses less aggressively than MQA, so it performs better than MQA.

Therefore, we need to ask whether information can be compressed even further while maintaining effectiveness. Since MQA has already reduced the number of key-value heads to the minimum of one, the number of heads cannot be reduced any further. This means we have to look at the keys and values themselves. Currently, both GQA and MQA cache the key and the value, K and V, which are different tensors. So, is it possible to merge the two into one? Is it possible to make each cached key-value entry smaller than before? Inspired by LoRA, an M × N matrix can be approximated by the product of an M × k matrix and a k × N matrix. If a K or V projection is split into the product of two smaller matrices, the memory usage of the KV cache can be reduced.

Plan

The core of MLA is to perform low-rank joint compression of attention keys and values to reduce the size of the key-value (KV) cache during inference, thereby improving inference efficiency. Unlike GQA and MQA, which directly compress the KVCache head dimension, MLA uses a down-projection matrix W^{DKV}. The key and value of multiple attention heads are projected into a low-dimensional shared latent vector space, replacing the traditional head-by-head storage method.

Specifically, MLA transforms the KV matrix into a low-rank form: representing the original matrix as the product of two smaller matrices, equivalent to latent vectors.

The HiddenState of the input matrix is first subjected to a low-rank transformation, compressing a HiddenState with shape [S,H] into a latent vector with shape [S,CH], c_t^{KV}, where CH ≪ H. H is the token dimension.
The compressed KV vector c_t^{KV} is stored in GPU memory as the KV cache, which achieves the goal of reducing its size. In the V2 paper, k_t changes from W^K h_t to W^{UK} W^{DKV} h_t. What used to be cached was W^K h_t; what is cached now is only the W^{DKV} h_t factor of k_t.
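
The change in what gets cached can be illustrated with a minimal NumPy sketch. The sizes and random weights below are hypothetical; the point is only that caching W^{DKV} h_t and up-projecting later yields the same key as applying the low-rank product W^{UK} W^{DKV} directly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_c, d_k = 64, 8, 32            # hypothetical sizes with d_c << d

W_DKV = rng.standard_normal((d_c, d))    # down projection
W_UK = rng.standard_normal((d_k, d_c))   # up projection for keys
h_t = rng.standard_normal(d)             # hidden state of one token

c_kv = W_DKV @ h_t                 # this is what gets cached (d_c numbers)
k_from_cache = W_UK @ c_kv         # key rebuilt from the cache at attention time
k_direct = (W_UK @ W_DKV) @ h_t    # key from the equivalent low-rank W^K

print(c_kv.shape, k_from_cache.shape)
print(np.allclose(k_from_cache, k_direct))
```

The cached vector has d_c entries instead of d_k, and the effective key projection W^{UK} W^{DKV} has rank at most d_c.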

Question

However, there is a problem: if a large K/V matrix is simply split into two smaller matrices for caching, the complete K matrix still needs to be calculated during inference, which defeats the purpose of caching, since the purpose of caching is to reduce computation.

The question then becomes: Is there a way to reduce cache size without increasing computation during inference?

1.3.2 Enriching Information Expression

Ideas

We can observe that in MQA and GQA attention calculations, only a simple broadcast or copy mechanism is used to copy each key-value head to the corresponding query heads. Take GQA as an example. To reduce KV cache usage, it stores a compact set of keys and values, C^{KV}. The formula below shows how k is obtained; the operation for v is analogous and omitted here.

First, the vector is split in half into two parts, designated K and V respectively.
Then each part is further divided into g portions.
Each portion is replicated h/g times to assemble enough K and V for h attention heads.

Here, W^{UK} is a combination of simple linear operations such as splitting and replication. Its expressive power is limited, so the compression ratio it can support is not large.

k = W^{UK} C^{KV} = W^{UK} [k^1, ..., k^g, v^1, ..., v^g]
  = [k^1, ..., k^1, k^2, ..., k^2, ..., k^g, ..., k^g]
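
To make the "simple replication" concrete, the sketch below builds GQA's W^{UK} explicitly as a fixed 0/1 matrix. The sizes (2 groups, head dimension 3, 2 query heads per group) and the variable names are assumptions chosen for illustration.

```python
import numpy as np

g, d_h, reps = 2, 3, 2             # hypothetical: 2 groups, head dim 3, 2 query heads per group
n_h = g * reps

rng = np.random.default_rng(0)
c_kv = rng.standard_normal(2 * g * d_h)   # [k^1..k^g, v^1..v^g] flattened

# Select the key half, then copy each group's key `reps` times.
select_k = np.hstack([np.eye(g * d_h), np.zeros((g * d_h, g * d_h))])
replicate = np.kron(np.repeat(np.eye(g), reps, axis=0), np.eye(d_h))
W_UK = replicate @ select_k               # shape [n_h * d_h, 2 * g * d_h]

k = (W_UK @ c_kv).reshape(n_h, d_h)       # one row per query head
print(k)
```

Every entry of this W^{UK} is 0 or 1 and nothing in it is learned, so heads in the same group receive identical keys.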

Since MQA and GQA have limited information representation capabilities, could we introduce a matrix transformation to replace these simple linear transformation operations, splitting and copying? For example, by adaptively learning for each q, we could enrich the information representation of this layer.

Plan

We have obtained the latent vector c_t^{KV}. During inference, the up-projection matrices of each head, W^{UK} for keys and W^{UV} for values, can then be used to reconstruct K and V from this latent vector.

Specifically, MLA will:

Load compressed KVCache latent vectors c_t^{KV}.
Then, through the up-projection matrices W^{UK} and W^{UV}, perform two rank-raising transformations that convert them into K and V matrices, each with shape [S,H]. This recovers the key and value of each head by mapping the latent vector back to the high-dimensional space. Because W^{UK} and W^{UV} are learned matrices, these transformations are far more expressive than the combination of simple linear operations used in GQA.
Perform MHA calculation. In this way, MLA caches only the latent vectors during inference, instead of the complete key-value pairs.

MLA is essentially a lossy compression of key-value information, but it can be trained to learn how to improve the density of stored information while preserving key details as much as possible. This avoids the information loss associated with grouped query attention and multi-query attention, thus achieving better performance while reducing key-value caching. From the computational characteristics of the MLA operator, both of these issues are addressed simultaneously.

On the one hand, low-rank compression significantly reduces the KV cache resource overhead during inference. Reducing the KV cache during inference enables inference of longer contexts on fewer devices, or increases the batch size with the same context length, achieving faster inference speed or greater throughput, ultimately reducing inference costs.
On the other hand, the multi-head attention computed after MLA decompression offers high computational intensity, proportional to the number of heads, which helps fully utilize the GPU's computing resources and alleviates the performance loss caused by MQA and GQA. MLA compresses the KV cache through a low-rank transformation, which, according to the formulas, introduces additional rank-raising computations and requires storing their activations. However, thanks to the associative property of matrix multiplication, the weights of the rank-raising transformation can be fused with other weights, so the attention calculation can be completed directly in the attention kernel without the additional computational and storage overhead.

1.3.3 Resolving Positional Encoding Conflicts

However, compression conflicts with RoPE positional encoding: after matrix absorption, c_t^{KV} carries no position-related information. The reason is that RoPE is position-sensitive for both keys and queries. Relying solely on c_t^{KV} to compress the KV cache is therefore not viable, and additional information is needed to represent the positional relationship between q and k. To overcome this, DeepSeek proposes a compromise: two matrices, W^{QR} and W^{KR}, extract RoPE-related features, adding an extra d_h^R dimensions to both q and k to carry the RoPE encoding. The original d_h dimensions do not use RoPE, and the total per-head length becomes d_h + d_h^R. In other words, MLA borrows the idea of MQA and builds cache variables shared by all heads, c_t^{KV} and k_t^R, which significantly reduces the KV cache size. c_t^{KV} is the low-dimensional vector of the low-rank decomposition, taken after the down projection and before the up projection, and k_t^R can be viewed as an MQA-style RoPE key.
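
The conflict can be seen in a toy example. The sketch below uses a single 2-D rotation as a stand-in for RoPE; the weights, sizes, and rotation frequency are hypothetical. Without RoPE, the matrix between the two latent vectors is a fixed product that can be precomputed. With RoPE, a position-dependent rotation sits between the two weight matrices, so no single merged matrix works for all token pairs.

```python
import numpy as np

def rot(angle: float) -> np.ndarray:
    """2-D rotation matrix, the building block of RoPE."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

rng = np.random.default_rng(0)
W_UQ = rng.standard_normal((2, 4))   # hypothetical: maps a 4-dim query latent to a 2-dim head
W_UK = rng.standard_normal((2, 3))   # hypothetical: maps a 3-dim KV latent to a 2-dim head
theta = 0.5                          # hypothetical rotation frequency

def merged_matrix(t: int, j: int) -> np.ndarray:
    """Matrix sitting between c_t^Q and c_j^KV once RoPE is applied to q and k."""
    return W_UQ.T @ rot(t * theta).T @ rot(j * theta) @ W_UK

no_rope = W_UQ.T @ W_UK   # fixed, can be precomputed once

same_position = np.allclose(merged_matrix(5, 5), no_rope)             # rotations cancel
different_position = np.allclose(merged_matrix(5, 3), no_rope)        # matrix changes
relative_only = np.allclose(merged_matrix(5, 3), merged_matrix(9, 7)) # depends on t - j only
print(same_position, different_position, relative_only)
```

Because the merged matrix changes with the offset t - j, MLA keeps the RoPE-carrying dimensions in separate vectors (q_t^R and k_t^R) and leaves them out of the absorbed product.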

See the image below for details.

[Figure 2804]

1.4 Architecture Diagram & Process

For comparison, the following diagram shows the mathematical formulas for MHA, which requires caching 2 n_h d_h l elements for each token. For Qwen 72B, with 80 layers and 64 heads, that is 2 × 80 × 64 × d_h elements. Here q_{t,i}, k_{t,i}, and v_{t,i} are all column vectors. t is the index of the current token, j ranges over tokens 1 through t, and i is the index of the attention head.

[Figure 2805]

The following diagram shows the architecture of MLA and its formulas.

[Figure 2806]

In the diagram, the yellow area is primarily used to calculate Q, the Q matrix in Attention. The green area is mainly used to calculate the position-insensitive part of K. The purple area calculates the position-sensitive part of K. The gray area aggregates K, and the red area calculates V. The specific process is as follows:

Dimensionality-reducing compression of the query, Q: the t-th token in the input sequence, h_t, is compressed by a down-projection matrix W^{DQ} into the latent vector c_t^Q, whose dimension is much smaller than that of the input token. This corresponds to number 37 in the diagram.
Joint compression of key, K, and value, V: the t-th token in the input sequence, h_t, is compressed by a down-projection matrix W^{DKV} into the latent vector c_t^{KV}. Its dimension d_c is much smaller than the dimension d of the input token. During inference, the compressed content MLA caches is c_t^{KV}, that is, d_c × l elements per token, where l is the number of model layers (plus the small shared RoPE key k_t^R described in the next step). This corresponds to number 41 in the diagram.
Decoupled RoPE strategy: To preserve the model's sensitivity to positional information in the sequence, MLA applies a decoupled Rotary Position Embedding (RoPE) technique. Because RoPE is incompatible with low-rank KV compression, MLA introduces an additional query vector q_t^R and a shared key vector k_t^R to carry the RoPE information. This avoids the coupling between RoPE and the low-rank compression matrices and resolves the conflict between positional information and inference efficiency. This roughly corresponds to labels 39 and 43 in the diagram.
Recovering information: When attention is computed, c_t^{KV} passes through the up-projection matrices W^{UK} and W^{UV} to reconstruct the keys and values, corresponding to numbers 42 and 45 in the diagram. The key of each attention head is then concatenated with the shared key vector carrying RoPE information, k_t^R, to form the key input for MHA, which corresponds to number 44 in the diagram. c_t^Q passes through the matrices W^{UQ} and W^{QR} to generate the query vector q_t^C, corresponding to number 38 in the diagram, and the query vector carrying RoPE information, q_t^R, corresponding to number 39 in the diagram. The two are concatenated to form the query input of MHA, which corresponds to number 40 in the diagram.
Attention calculation. This corresponds to number 46 in the diagram.
Finally, the outputs of the multiple heads are concatenated and passed through a linear mapping W^O to obtain the final output. This corresponds to number 47 in the diagram.

The characteristics of MLA can be seen from the image:

From a qualitative perspective, it can save memory because:

Before entering the standard MHA algorithm, the longer KV vector is replaced with a compressed vector. Previously, both K and V vectors were cached. Now, only the compressed vector is stored.
The compressed vector is not only smaller: it can also be reconstructed into per-head K and V, although these are reconstructions rather than the K and V of standard MHA.

Quantitatively speaking, each Transformer layer only caches the vector shown in the blue box in the above formula: c_t^{KV} and k_t^R. The rest can be recovered using “matrix absorption”. The sizes of these two vectors are:

c_t^{KV} dimension is d_c = 4 × d_h. d_h is the vector dimension of a single head. c_t^{KV} is shared by multiple heads.
k_t^R dimension is d_h^R = d_h / 2. k_t^R is shared by multiple heads.

For comparison, MQA caches one d_h-dimensional k and one d_h-dimensional v per layer, 2 d_h in total, whereas MLA caches 4 d_h + d_h / 2 = 4.5 d_h, which is 2.25 times the storage of MQA. MHA caches 2 n_h d_h per layer, so MLA's cache is smaller than MHA's whenever n_h is greater than 2.25, which is always the case in practice.
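
The arithmetic behind this comparison can be checked in a few lines. The head dimension and head count below are assumed values used only for illustration, and `mla_cache_per_layer` is a hypothetical helper name.

```python
def mla_cache_per_layer(d_h: int) -> float:
    """Per-token, per-layer MLA cache: c_t^{KV} (4 * d_h) plus k_t^R (d_h / 2)."""
    return 4 * d_h + d_h / 2

d_h, n_h = 128, 128   # hypothetical head dimension and head count

mla = mla_cache_per_layer(d_h)
mqa = 2 * d_h
mha = 2 * n_h * d_h

print(mla / mqa)   # MLA relative to MQA
print(mha / mla)   # how many times smaller MLA is than MHA
```

The first ratio is independent of d_h, and the second equals n_h / 2.25, which is why MLA beats MHA on cache size for any realistic head count.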

1.5 Code

The code below shows the definition of MLA in the DeepSeek V3 source code.

import math

import torch
from torch import nn

# ModelArgs, world_size, attn_impl, Linear, ColumnParallelLinear,
# RowParallelLinear, and RMSNorm are defined elsewhere in the DeepSeek V3 source.

class MLA(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        # Hidden dimension
        self.dim = args.dim
        # Total number of attention heads
        self.n_heads = args.n_heads
        # Number of attention heads local to each parallel process
        self.n_local_heads = args.n_heads // world_size
        # Dimension d'_c of the compressed query latent vector
        self.q_lora_rank = args.q_lora_rank # low-rank compression dimension for q
        # Dimension d_c of the compressed key-value latent vector
        self.kv_lora_rank = args.kv_lora_rank # low-rank compression dimension for kv
        # Per-head dimension of the query/key part without rotary position encoding, $d_h$
        self.qk_nope_head_dim = args.qk_nope_head_dim
        # Corresponds to $d_h^R$: per-head dimension of the query/key part with RoPE applied
        self.qk_rope_head_dim = args.qk_rope_head_dim
        # $d_h + d_h^R$: the head size is the non-RoPE part plus the RoPE part
        self.qk_head_dim = args.qk_nope_head_dim + args.qk_rope_head_dim
        # Per-head dimension of value
        self.v_head_dim = args.v_head_dim

        if self.q_lora_rank == 0:
            # No low-rank decomposition: fall back to the conventional MHA query projection
            self.wq = ColumnParallelLinear(self.dim, self.n_heads * self.qk_head_dim)
        else:
            # This is $W^{DQ}$, used to produce $c_t^Q$
            # Down-projection matrix, yields the compressed q vector
            self.wq_a = Linear(self.dim, self.q_lora_rank)
            self.q_norm = RMSNorm(self.q_lora_rank)
            # $W^{UQ}$
            # Up-projection matrix, used to recover the q vector
            self.wq_b = ColumnParallelLinear(self.q_lora_rank, self.n_heads * self.qk_head_dim)
        # $ [W^{DKV}; W^{KR}] $
        # Down-projection matrix, yields the compressed kv vector
        self.wkv_a = Linear(self.dim, self.kv_lora_rank + self.qk_rope_head_dim)
        self.kv_norm = RMSNorm(self.kv_lora_rank)
        # Up-projection matrix, used to recover the kv vectors
        # $ [W^{UK}; W^{UV}] $
        self.wkv_b = ColumnParallelLinear(self.kv_lora_rank, self.n_heads * (self.qk_nope_head_dim + self.v_head_dim))
        # Output weight matrix
        self.wo = RowParallelLinear(self.n_heads * self.v_head_dim, self.dim)
        # Compute 1/sqrt(d_k)
        self.softmax_scale = self.qk_head_dim ** -0.5
        if args.max_seq_len > args.original_seq_len:
            mscale = 0.1 * args.mscale * math.log(args.rope_factor) + 1.0
            self.softmax_scale = self.softmax_scale * mscale * mscale

        if attn_impl == "naive":
            # In naive mode the KV cache stores uncompressed data; each key has size d_h + d_h^R,
            # which does not save memory and in fact increases GPU memory consumption
            self.register_buffer("k_cache", torch.zeros(args.max_batch_size, args.max_seq_len, self.n_local_heads, self.qk_head_dim), persistent=False)
            self.register_buffer("v_cache", torch.zeros(args.max_batch_size, args.max_seq_len, self.n_local_heads, self.v_head_dim), persistent=False)
        else:
            # In non-naive mode the compressed c of size d_c is stored
            self.register_buffer("kv_cache", torch.zeros(args.max_batch_size, args.max_seq_len, self.kv_lora_rank), persistent=False)
            self.register_buffer("pe_cache", torch.zeros(args.max_batch_size, args.max_seq_len, self.qk_rope_head_dim), persistent=False)

Clearly, the MLA operator is an attention mechanism tailored to the characteristics of modern GPU hardware. By rebalancing storage and computation, it can fully leverage the advantages of modern GPUs. We will now analyze several core implementation points of MLA in detail.

0x02 Key Points

The core elements of MLA are as follows:

Low-rank key-value joint compression reduces the resource consumption of the key-value cache. During attention computation, the compressed vectors are up-projected back to a higher dimension, which enhances the model's expressive power.
Weight absorbing reduces the computational cost of upward projection.
The conflict between RoPE and weight absorption is resolved by using the decoupled RoPE strategy.

2.1 Low-rank KV joint compression

2.1.1 Low-rank decomposition

Low-rank matrix factorization is a particularly effective method for discovering low-dimensional structure in data. Its core idea is to decompose a large matrix into the product of two or more smaller, simpler matrices, which typically have lower rank.

Using low-rank decomposition in neural network layers generally trades extra computation for lower memory cost. Variations of this approach are popular in scenarios like LoRA fine-tuning, where the constraint is total memory cost rather than computational overhead or inference speed. The advantage is that the factored matrices use fewer parameters and, because they add layers, are to some extent more expressive. Their product approximates or equals a larger matrix, so in theory the factors can be multiplied together to recover an approximation of the original matrix.

The downside is that we now have to perform the operation twice every time we use these matrices, that is, for each compression and decompression layer, we double the total number of matrix multiplications in exchange for making them smaller. And because we restrict them to matrices of rank r or lower, we obviously lose some of the representational power of the original matrices.
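
The trade-off can be seen in a small NumPy sketch. A truncated SVD is used here as one convenient way to obtain the two factors (MLA learns its factors during training instead), and the sizes and rank are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 48, 4                       # hypothetical sizes and target rank

# Build a matrix that is exactly rank r, so the factorization is lossless here.
W = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]                      # m x r factor
B = Vt[:r, :]                             # r x n factor

x = rng.standard_normal(n)
y_full = W @ x                            # one large multiplication
y_low = A @ (B @ x)                       # two small multiplications

print(W.size, A.size + B.size)            # parameter counts before and after
print(np.allclose(y_full, y_low))
```

The factored form stores far fewer numbers but needs two multiplications per use. For a matrix whose true rank exceeds r, the same truncation would give only an approximation.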

2.1.2 Approach

Traditional attention mechanisms directly map the input X to the per-head dimensions of Q, K, and V. MQA and GQA indirectly compress the head dimension of the KV cache through a sharing mechanism. The core idea of MLA is to represent KV in a LoRA-like manner. Specifically, a compressed space is built during prefilling by compressing the hidden-size dimension of the input matrix: the input X is first mapped to a latent vector c, which is stored. Simply put, an n × n matrix can be approximated by the product of an n × d matrix and a d × n matrix, where d ≪ n, which reduces storage requirements. Before attention is calculated in the decoding stage, c is restored to the original QKV dimensions through up-projection matrices. This reduces the amount of key and value data cached during inference, thereby improving inference efficiency.

There is another issue here: under this low-rank scheme, the matrices W^Q, W^K, and W^V are all replaced with low-rank factorizations. Substituting low-rank matrices for full-rank ones could hurt model quality. The fact that DeepSeek made this substitution work suggests that W^Q, W^K, and W^V may be redundant and have significant low-rank structure.

[Figure 2807]

In implementation, the weight matrices of Q, K, and V are typically fused to improve the computational and memory efficiency of the GPU. Unlike performing independent projections, using a combined weight matrix optimizes the computation process.

2.1.3 Projecting downwards

The diagram above illustrates the downward projection, in which h_t is the input vector. W^{DKV} and W^{DQ} are compression matrices used for dimensionality reduction. c_t^{KV} and c_t^Q are the compressed KV and Q latent vectors, respectively. The dimension of the latent vectors is much smaller than the product of the per-head dimension and the number of attention heads. c_t^{KV} does not depend on the head index i, so a single copy is cached for all heads. Essentially, we no longer cache the keys and values directly, nor vectors as large as h_t; we cache c_t^{KV} and recover k_t and v_t dynamically through computation.

[Figure 2808]

For KV, construct a shared dimensionality reduction mapping matrix W^{DKV}. It is used to reduce the dimensionality of the model input.
W^{DKV} projects the input h_t, the hidden state, onto latent vector c_t^{KV}. This is the joint latent vector of the key and value. It compresses a HiddenState with shape [S,H] to a shape of [S,d_c]. d_c is much smaller than the original dimensions of multi-head key and value. MLA does not retain the full hidden dimensions, but instead reduces their size.
The compressed key-value vectors are stored in GPU memory as the key-value cache. During inference, only the latent vector c_t^{KV} of each layer needs to be cached, because all attention heads in a layer share it. Since the dimension of c_t^{KV} is much smaller than that of K and V, the number of KV cache elements produced per token in MLA changes from 2 n_h d_h l to d_c l. This greatly reduces the memory footprint of the KV cache.
For Q, use a dimension reduction mapping matrix W^{DQ}. This is used to reduce the dimensionality of the model input. This is unrelated to reducing the KV cache. It primarily aims to reduce the number of parameters and the GPU memory occupied by corresponding activations during training.

2.1.4 Upward projection

When MHA is computed during the decode phase, the KV cache is loaded, and W^{UK} and W^{UV} project c_t^{KV} upwards to restore a larger size. This larger size can match the original input h_t, and it can also be adjusted based on the attention head configuration. DeepSeek expands the key and value back to d = d_h n_h dimensions. As the diagram shows, the new k_t^C and v_t^C are each divided into n_h equal parts, so each attention head has its own K and V vectors, the same number as in MHA.
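
The shapes involved in the down and up projections can be traced with a short NumPy sketch. All sizes are hypothetical, the weights are random rather than trained, and row vectors are used (tokens along the first axis), so the projections appear as right-multiplications.

```python
import numpy as np

rng = np.random.default_rng(0)
S, d, d_c = 5, 64, 8          # hypothetical: 5 tokens, hidden size 64, latent size 8
n_h, d_h = 4, 16              # hypothetical: 4 heads of size 16, so n_h * d_h = d

H = rng.standard_normal((S, d))               # hidden states, shape [S, d]
W_DKV = rng.standard_normal((d, d_c))         # down projection
W_UK = rng.standard_normal((d_c, n_h * d_h))  # up projection for keys
W_UV = rng.standard_normal((d_c, n_h * d_h))  # up projection for values

C_KV = H @ W_DKV                              # cached latent, shape [S, d_c]
K = (C_KV @ W_UK).reshape(S, n_h, d_h)        # per-head keys, shape [S, n_h, d_h]
V = (C_KV @ W_UV).reshape(S, n_h, d_h)        # per-head values, shape [S, n_h, d_h]

print(C_KV.shape, K.shape, V.shape)
print(C_KV.size, K.size + V.size)             # cached elements: latent vs full K and V
```

Only `C_KV` would be kept between decoding steps; `K` and `V` are transient and, unlike in GQA, differ from head to head.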

See the image below for details. W^{UK}, W^{UV}, and W^{UQ} are all projection matrices used for dimensionality increase. Note that RoPE is omitted here. It will be expanded and updated later in conjunction with RoPE.

[Figure 2809]

Combining the downward and upward projections, we can see that W^Q, W^K, and W^V are each split into two factors, in LoRA form. In this form, MLA can be seen as MQA extended with LoRA, and the cost of each projection drops from d × d to 2 × d × d_c. Compared with the previous single-matrix form, compressing the information and then restoring the original dimensions helps the network learn more effective information. It achieves better results at the same low rank, which is the fundamental reason why MLA can compress the KV cache further than GQA.

The diagram below shows how to split the data: the top part is MLA, and the bottom part is MQA for comparison.

[Figure 2810]

In fact, the paper “TransMLA: Multi-Head Latent Attention Is All You Need” analyzes the expressive power of MLA. The paper points out that traditional GQA models share the same key-value pairs within the same group when calculating attention, which limits their expressive power. MLA, however, overcomes this limitation through low-rank decomposition and a unique projection matrix design.

See the image below for details. In MLA, taking W_K^b as an example, if its vectors are orthogonal, then multiplying X W_K^a by each of them produces a different output for each channel. In GQA, by contrast, all heads within the same group produce the same output. This structural difference gives MLA stronger expressive power at the same key-value cache size.

[Figure 2811]

2.1.5 Complete Process

The complete comparison process is shown in the diagram below. The top of the diagram shows the overall approach. The bottom shows a comparison between MLA and GQA, which is divided into two parts: the upper part uses formulas to see how MLA enhances expressiveness, and the lower part shows the complete process.

[Figure 2812]

2.2 Weight Absorption

2.2.1 Current Status

We have already saved the compressed latent vectors through down projection, which reduces the memory footprint of the KV cache, and enhanced expressiveness through the up-projection matrices. However, MLA emphasizes that the amount of activations should also be reduced, and so far we have not seen how. Although the compressed KV cache occupies less memory, it still requires up projection at every inference step: W^{UK} and W^{UV} must recompute k_{t,i} and v_{t,i} from the cached c_t^{KV}. In number and dimension, these keys and values are on par with MHA and larger than in GQA and MQA. The up-projected keys and values are enormous and can lead to out-of-memory (OOM) errors. Not only do k_{t,i} and v_{t,i} still exist and occupy a significant amount of memory, but recomputing them introduces new computational overhead and becomes a bottleneck.

2.2.2 Weight Absorption

Since the computational cost is too high each time, DeepSeek wondered if it could reduce this computational cost, which also reduces the memory usage of new key-value pairs, while preserving compressed latent vectors. They then introduced the weight absorption technique. That is, the authors optimized these formulas using the properties of matrix associativity, avoiding recalculating the key and value for each query. Below is the original text from their paper:

[Figure 2813]

Note: Matrix absorption computation refers to the process of using linear algebra techniques such as the associative law of matrix multiplication or low-rank decomposition to change the order of matrix multiplication and recombine certain matrix factors, so that matrix products that originally needed to be calculated independently are merged together, avoiding the generation of large matrices, thereby reducing computational complexity and memory overhead.

For example, given three matrices A ∈ R^{m,k}, B ∈ R^{k,p}, and C ∈ R^{p,n}, as can be seen from matrix multiplication, (A × B) × C = A × (B × C). However, their computational complexity is different. (A × B) × C has computational complexity 2 × m × k × p + 2 × m × p × n = 2 × m × p × (k + n), while A × (B × C) has computational complexity 2 × m × k × n + 2 × k × p × n = 2 × n × k × (m + p). When n is significantly smaller than both m and p, the second computation order performs much better than the first.
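
The two cost formulas above can be compared numerically. The helper names and the shapes below are hypothetical; n is deliberately chosen much smaller than m and p.

```python
def flops_left_first(m: int, k: int, p: int, n: int) -> int:
    """Cost of (A @ B) @ C with A: m x k, B: k x p, C: p x n."""
    return 2 * m * k * p + 2 * m * p * n

def flops_right_first(m: int, k: int, p: int, n: int) -> int:
    """Cost of A @ (B @ C) for the same shapes."""
    return 2 * k * p * n + 2 * m * k * n

m, k, p, n = 4096, 512, 4096, 1   # hypothetical shapes with n much smaller than m and p

left = flops_left_first(m, k, p, n)
right = flops_right_first(m, k, p, n)
print(left, right, left / right)
```

Both orders give the same result matrix, so choosing the cheaper one is free in terms of accuracy. Weight absorption in MLA relies on the same freedom to regroup the product.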

2.2.3 Derivation

KQ merger

Let’s examine the specific form of Dot-Attention to see how we can circumvent this problem through a simple yet ingenious identity transformation. First, the training phase proceeds as usual, with limited optimization space. Then, during the inference phase, we use the following formula, without positional encoding, to see that during inference, if we combine {W^{UQ}}^⊤ W^{UK} into a single position-independent matrix W as the projection matrix of Q, then we can use c_t to replace the original k_t. This avoids repeatedly calculating the intermediate results q and k.

2814

Here, transpose ⊤ represents swapping the last two dimensions of the tensor shape. The shapes of each tensor are shown below. Note that num_heads is extracted as a single dimension because the final attention weights are independent between heads.

C^Q: [batch_size, 1, q_len, q_lora_rank]
W^{UQ}: [num_heads, q_lora_rank, qk_nope_head_dim]
W^{UK}: [num_heads, kv_lora_rank, qk_nope_head_dim]
C^K: [batch_size, 1, kv_len, kv_lora_rank]

Because c_t^{KV} is cached at every step, the cached vectors can take part in the calculation directly, without explicitly computing K. Furthermore, the matrix W can be computed beforehand as {W^{UQ}}^⊤ W^{UK}, since both factors are learned weights that stay fixed at inference time.

The code is described as follows:

"""Source: https://mathmach.com/8b428574/"""
import tensorflow as tf

# Absorb W_UK into W_UQ
W_UQ = tf.reshape(W_UQ, [q_lora_dim, num_head, head_dim])
W_UQ = tf.transpose(W_UQ, perm=[1, 0, 2]) # [num_head, q_lora_dim, head_dim]
W_UK = tf.reshape(W_UK, [kv_lora_dim, num_head, head_dim])
W_UK = tf.transpose(W_UK, perm=[1, 2, 0]) # [num_head, head_dim, kv_lora_dim]
# Batched matrix product per head (an element-wise * would be wrong here)
W_UQUK = tf.matmul(W_UQ, W_UK) # [num_head, q_lora_dim, kv_lora_dim]

# Compute the q·k inner product
c_Q = tf.reshape(c_Q, [batch_size, q_seq_len, q_lora_dim])
c_KV = tf.reshape(c_KV, [batch_size, kv_seq_len, kv_lora_dim])
c_KV = tf.transpose(c_KV, perm=[0, 2, 1]) # [batch_size, kv_lora_dim, kv_seq_len]
c_Q_product_W_UQUK = tf.einsum('bij,hjk->bhik', c_Q, W_UQUK) # [batch_size, num_head, q_seq_len, kv_lora_dim]
q_product_k = tf.einsum('bhik,bkj->bhij', c_Q_product_W_UQUK, c_KV) # [batch_size, num_head, q_seq_len, kv_seq_len]
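The identity behind the KQ merge can be checked on tiny integer matrices. The sketch below is for a single head, uses row vectors (so the merged matrix appears as W_UQ · W_UK^T), and all sizes and values are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

# Toy sizes: q_lora = 3, kv_lora = 2, head_dim = 2, one query token, two cached tokens.
c_q = [[1, 2, 3]]                  # [q_len, q_lora]
c_kv = [[1, 0], [2, -1]]           # [kv_len, kv_lora]
w_uq = [[1, 0], [0, 1], [1, 1]]    # [q_lora, head_dim]
w_uk = [[2, 1], [0, 3]]            # [kv_lora, head_dim]

# Without absorption: decompress q and k, then take the dot products.
q = matmul(c_q, w_uq)
k = matmul(c_kv, w_uk)
scores_plain = matmul(q, transpose(k))

# With absorption: precompute W = W_UQ @ W_UK^T and score against the latent directly.
w_merged = matmul(w_uq, transpose(w_uk))   # [q_lora, kv_lora]
scores_absorbed = matmul(matmul(c_q, w_merged), transpose(c_kv))

print(scores_plain, scores_absorbed)
```

Both paths give the same scores, but the second one never materializes k.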

VO merge

In addition, traditional methods require first calculating the Value vector v_t^C. Then, attention is calculated and projected onto the final output layer. We can directly absorb W^{UV} into W^O. Here, the final output calculation is simplified. The absorption formula is as follows.

(p · (c_kv · W^{UV})) · W^O
= (p · c_kv) · (W^{UV} · W^O)
= (softmax(q_nope · c_kv + q_pe · k_pe) · c_kv) · W^{UV} · W^O

This can be described in code as follows:

q_pe = W_QR(c_q)
q_nope = W_UQ_UK(c_q)
output = W_UV_O(MQA(q_pe, q_nope, c_kv, k_pe))

Note that mathematical equivalence has to be preserved carefully through transposes and similar operations. See the diagram below. The weights of each attention head can be absorbed into a single matrix. Therefore, in actual code, a higher-dimensional tensor can be used to absorb all heads at once, as illustrated in the code below.

2815

Code description:

"""Source: https://mathmach.com/8b428574/"""
# Absorb W_UV into W_O
W_O = tf.reshape(W_O, [num_head, head_dim, hidden_dim])
W_UV = tf.reshape(W_UV, [kv_lora_dim, num_head, head_dim])
W_UV = tf.transpose(W_UV, perm=[1, 0, 2]) # [num_head, kv_lora_dim, head_dim]
# Batched matrix product per head (an element-wise * would be wrong here)
W_OUV = tf.matmul(W_UV, W_O) # [num_head, kv_lora_dim, hidden_dim]

# Compute the output u
# RoPE(...) stands for a rotary position embedding helper that is not shown here
q_R = RoPE(tf.reshape(tf.tensordot(c_Q, W_QR, axes=1), [batch_size, q_seq_len, num_head, rope_dim])) # [batch_size, q_seq_len, num_head, rope_dim]
k_R = RoPE(tf.tensordot(h, W_KR, axes=1)) # [batch_size, kv_seq_len, rope_dim]
q_product_k_rope = tf.einsum('bijk,bhk->bijh', q_R, k_R) # [batch_size, q_seq_len, num_head, kv_seq_len]
q_product_k_rope = tf.transpose(q_product_k_rope, perm=[0, 2, 1, 3]) # [batch_size, num_head, q_seq_len, kv_seq_len]
scale = tf.sqrt(tf.cast(head_dim + rope_dim, tf.float32))
attention_weight = tf.nn.softmax((q_product_k + q_product_k_rope) / scale, axis=-1) # [batch_size, num_head, q_seq_len, kv_seq_len]
# c_KV is expected as [batch_size, kv_lora_dim, kv_seq_len] here, i.e. already transposed in the q·k step
attention_weight_product_c_KV = tf.einsum('bijk,bhk->bijh', attention_weight, c_KV) # [batch_size, num_head, q_seq_len, kv_lora_dim]
u = tf.einsum('bijh,ihd->bjd', attention_weight_product_c_KV, W_OUV) # [batch_size, q_seq_len, hidden_dim]

Combination

Combining the current mergers, we get the following:

O = A W^O
  = ϕ(QK^T) V W^O
  = ϕ[H W^Q (C^{KV} W^{UK})^T] C^{KV} W^{UV} W^O
  = ϕ[H (W^Q {W^{UK}}^T) {C^{KV}}^T] C^{KV} (W^{UV} W^O)

Thus, during inference, W^{UK} can be merged with the query-side weights W^{UQ} and W^{DQ}, while W^{UV} can be merged with W^O. After merging, all KV-related computation happens in the low-dimensional latent space, so C^{KV} never has to be decompressed back into the higher-dimensional space. Furthermore, all of the matrices above are model weights, which stay fixed during inference and can be treated as constants. When deploying an inference service, the two products can be computed once at model load time, saving two matrix multiplications on every subsequent forward pass, so the merge itself adds no runtime overhead.
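As a sanity check on the combined derivation, the following sketch compares the naive path (decompress K and V, then attend) with the absorbed path for a single head. It omits RoPE and the softmax scale, as the formula above does, and every size and value is a toy assumption.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(scores):
    out = []
    for row in scores:
        top = max(row)
        exps = [math.exp(s - top) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Toy sizes: d = 3, d_h = 2, d_c = 2, one query token, three cached tokens.
h = [[1.0, -1.0, 0.5]]                         # [1, d]
w_q = [[1.0, 0.0], [0.5, 1.0], [-1.0, 2.0]]    # [d, d_h]
c_kv = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # [kv_len, d_c]
w_uk = [[1.0, 2.0], [0.0, 1.0]]                # [d_c, d_h]
w_uv = [[2.0, 0.0], [1.0, 1.0]]                # [d_c, d_h]
w_o = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]      # [d_h, d]

# Naive path: decompress K and V from the latent, then attend.
k = matmul(c_kv, w_uk)
v = matmul(c_kv, w_uv)
p_naive = softmax_rows(matmul(matmul(h, w_q), transpose(k)))
o_naive = matmul(matmul(p_naive, v), w_o)

# Absorbed path: merge the weights once, then work only with the latent c_kv.
w_qk = matmul(w_q, transpose(w_uk))   # [d, d_c]
w_vo = matmul(w_uv, w_o)              # [d_c, d]
p_abs = softmax_rows(matmul(matmul(h, w_qk), transpose(c_kv)))
o_abs = matmul(matmul(p_abs, c_kv), w_vo)

print(o_naive, o_abs)
```

The two outputs agree up to floating-point rounding, and in the absorbed path k and v are never formed.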

2.2.4 Discussion

train

The paper repeatedly mentions the use of weight absorption during the inference phase, which is easy to understand because the weight matrix is fixed at this point.

So why not combine W^{UK} and W^{UV} directly during the training phase? The reasons are roughly as follows:

From the perspective of gradient updates, leaving the weights unabsorbed keeps optimization simpler.
From a projection perspective, having K and V share W^{DKV} imposes a constraint on the space. This weight tying helps the model converge better, improves its generalization ability, and also enhances its stability.

Therefore, MLA is similar to MHA during the training phase. Apart from the additional low-rank projection step and applying RoPE to only part of the dimensions, the only difference is that the per-head dimension of Q and K grows from d_k to d_k + d_R; everything else is the same as MHA.

MHA

Next, if weight absorption is so effective, why doesn’t MHA use it?

Let’s first look at the characteristics of the inference stage.

First, the calculation formula in MHA is as follows. In the standard MHA implementation, the query, key, and value embeddings are calculated separately. Then, the weight matrix of self-attention is calculated using the query embedding and key embedding. Finally, this weight matrix is multiplied by the value embedding to obtain the final result.

Z = softmax( q_t^T k_i / √d_k ) v_i W^O
  = softmax( h_t^T (W^Q)^T W^K h_i / √d_k ) h_i W^V W^O

It looks like (W^Q)^T W^K and W^V W^O both have the potential to be absorbed. Secondly, during the decoding computation, the input Q often contains only one token, which naturally provides an opportunity to simplify the calculation.

2816

Currently, it seems that MHA offers many advantages for matrix absorption. However, the reality is not as simple as that. We use q_t^T k_i as an example to analyze why it is not suitable for absorption and why MLA can improve efficiency.

For a single head, n_h = 1, corresponding to matrix multiplication [1,d] × [d,d_h] × [d_h,d] × [d,1]. There are several possibilities:

Standard KV Cache.
Bundle (W^Q)^T W^K together and apply the combined weights to x.
Bundle (W^Q)^T W^K together, but cache only x instead of k and v.

Based on the above analysis, the standard key-value cache is already optimal in terms of both space and computation. Although caching only x halves the cache, applying the merged matrix at runtime increases the computational load, so it is not a good trade-off in the long run.

Now let’s look at MLA, which applies a low-rank factorization to W^K. Compared with MHA, both storage and computation improve: the key-value cache shrinks to the compressed latent vector plus the RoPE component, W^{UK} can be merged into W^Q, and W^{UV} can be merged into W^O.

No merger

In the specific implementation process, choices need to be made based on the actual situation. For example, Li Weihua has a wonderful discussion on this topic in https://developnotes.readthedocs.io/zh-cn/latest/deepseek.html#id1.

Consider the following operation: Y = X A B, C = A B. Here X ∈ R^{m×d} are the input hidden states. A ∈ R^{d×d_c} and B ∈ R^{d_c×n} are weight matrices. C ∈ R^{d×n} is the matrix after absorption.

Computing Y = X A B directly takes 2 m d d_c + 2 m n d_c = 2 m d_c (d + n) FLOPs. After merging, with C = A B precomputed, Y = X C takes 2 m d n FLOPs. When d_c is small enough that d n > d_c (d + n), the merged form is the more expensive one, so weight absorption is not always worthwhile.

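Alternatively, let’s look at the actual MLA code. The known configuration is as follows:

"hidden_size": 5120, # hidden size
"kv_lora_rank": 512, # KV compression dimension
"q_lora_rank": 1536, # query compression dimension
"qk_rope_head_dim": 64, # per-head dimension of the decoupled (RoPE) query and key
"qk_nope_head_dim": 128 # per-head dimension of the query and key without RoPE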

As you can see, merging W^{UQ} and W^{UK} into one matrix significantly increases the computational load. During prefill, you should therefore not actually perform the “absorption”. Instead, “absorb” should be understood as exploiting the associativity of matrix multiplication: choose which matrices to multiply first, and cache only the compressed latent vectors c_t^{KV}.
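Plugging the configuration values into the FLOP formulas makes the cost difference visible. The sketch below counts FLOPs per token and per head for mapping c^Q into the kv_lora_rank space, either in two steps or through a pre-merged matrix. The variable names are illustrative only.

```python
q_lora_rank = 1536
kv_lora_rank = 512
qk_nope_head_dim = 128

# Two-step: c_Q @ W_UQ (1536 -> 128), then @ W_UK^T (128 -> 512).
flops_two_step = 2 * q_lora_rank * qk_nope_head_dim + 2 * qk_nope_head_dim * kv_lora_rank

# Merged: c_Q @ (W_UQ @ W_UK^T), a single 1536 -> 512 projection.
flops_merged = 2 * q_lora_rank * kv_lora_rank

ratio = flops_merged / flops_two_step
print(flops_two_step, flops_merged, ratio)
```

Under these assumptions the merged matrix costs three times as many FLOPs per token and head as the two-step product.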

2.3 Decoupling RoPE

To improve the model’s sensitivity to contextual information in the sequence, MLA employs decoupled Rotary Position Embedding (RoPE). Our analysis so far has skipped a crucial component: position encoding. RoPE is incompatible with the low-rank KV compression matrices and conflicts with weight absorption, so it cannot simply be dropped in. To address this issue, MLA introduces an additional query vector q_t^R and a shared key vector k_t^R to carry the RoPE information. As can be seen from the architecture diagram, DeepSeek’s q and k each have two parts, namely [q_t^R, q_t^C] and [k_t^R, k_t^C].

One part is the compressed part: [q_t^C] and [k_t^C].
The other part carries the RoPE positional encoding through a separate RoPE path: [q_t^R] and [k_t^R].

Finally, the two parts are concatenated to form the Q and K matrices. This decouples RoPE from the low-rank compressed matrix, resolving the conflict between positional information and inference efficiency.

2.3.1 RoPE Background

The code below is a condensed version of Llama 3’s attention computation. With RoPE, both the query and the key are position-dependent. Before computing attention, the code first applies W^Q, W^K, and W^V to obtain Q, K, and V. Then RoPE, that is, multiplication by a rotation matrix, is applied to Q and K to inject relative position information into them.

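import math
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    # __init__ is not shown; it defines wq, wk, wv, wo, cache_k, cache_v, n_rep, and the head counts.
    # apply_rotary_emb and repeat_kv are helper functions defined elsewhere in the Llama 3 code.
    def forward(self, x: torch.Tensor, start_pos: int, freqs_cis: torch.Tensor, mask: Optional[torch.Tensor]):
        bsz, seqlen, _ = x.shape
        # Project the input to Q, K, and V
        xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)
        xq = xq.view(bsz, seqlen, self.n_local_heads, self.head_dim)
        xk = xk.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)
        xv = xv.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)
        # Apply RoPE to Q and K
        xq, xk = apply_rotary_emb(xq, xk, freqs_cis=freqs_cis)

        # Update and read the KV cache
        self.cache_k = self.cache_k.to(xq)
        self.cache_v = self.cache_v.to(xq)
        self.cache_k[:bsz, start_pos : start_pos + seqlen] = xk
        self.cache_v[:bsz, start_pos : start_pos + seqlen] = xv
        keys = self.cache_k[:bsz, : start_pos + seqlen]
        values = self.cache_v[:bsz, : start_pos + seqlen]

        # Repeat K/V heads to match the query heads, then move the head dimension forward
        keys = repeat_kv(keys, self.n_rep)
        values = repeat_kv(values, self.n_rep)
        xq, keys, values = xq.transpose(1, 2), keys.transpose(1, 2), values.transpose(1, 2)

        # Compute attention over the position-encoded Q and K
        scores = torch.matmul(xq, keys.transpose(2, 3)) / math.sqrt(self.head_dim)
        if mask is not None:
            scores = scores + mask
        scores = F.softmax(scores.float(), dim=-1).type_as(xq)
        output = torch.matmul(scores, values)  # (bs, n_local_heads, seqlen, head_dim)
        output = output.transpose(1, 2).contiguous().view(bsz, seqlen, -1)
        return self.wo(output)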

2.3.2 Problem

It cannot be directly applied to low-rank compression.

Let’s first see whether RoPE can be applied directly to the low-rank compressed vectors, that is, whether RoPE can be folded into the compressed representations of K and V.

Because the low-rank representations of K and V are already compressed, the compression may have lost some information. RoPE, however, acts position by position on the query and key, and applying R_m and R_n directly to c_t^Q and c_t^{KV} is not equivalent to applying positional encoding to the complete Q and K, so it cannot faithfully reflect the relative positional relationship of the original Q and K.

Incompatible with weight absorption

Let’s take a closer look at whether RoPE can be absorbed into the weights when it is applied to the decompressed Q and K.

If we want Q and K to carry position information, we will multiply them by the corresponding position encoding matrices:

Q̂ = R_m Q
K̂ = R_n K

If we then calculate Q^T K, it becomes:

S = Q^T R_m^T R_n K

DeepSeek-V2 compresses both Q and K, so the whole process becomes:

S = (W^{UQ} c_t^Q)^T R_m^T R_n W^{UK} c_t^{KV}
  = {c_t^Q}^⊤ {W^{UQ}}^⊤ R_{m-n} W^{UK} c_t^{KV}

The formula now contains an additional matrix R_{m-n} that depends on the relative position of the two tokens. This matrix changes with the relative position, so it is not a fixed matrix and cannot be computed in advance. Furthermore, matrix multiplication is not commutative, so R_{m-n} cannot be moved elsewhere in the formula. Therefore, during inference, W^{UQ} and W^{UK} can no longer be multiplied together directly, which means W^{UK} cannot be absorbed into W^Q.
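The position dependence can be seen with a single 2×2 RoPE block. The sketch below is a toy illustration with an arbitrary frequency `theta`; it checks that R_m^T R_n depends only on the offset between the two positions, and that it changes when the offset changes, which is why it cannot be folded into fixed weights.

```python
import math

def rotation(pos, theta=0.1):
    """2x2 rotation block for position `pos`; theta is an arbitrary toy frequency."""
    a = pos * theta
    return [[math.cos(a), -math.sin(a)], [math.sin(a), math.cos(a)]]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def relative(m, n):
    """R_m^T R_n, the matrix that ends up between the two up-projections."""
    return matmul(transpose(rotation(m)), rotation(n))

def close(a, b):
    return all(math.isclose(x, y, abs_tol=1e-12)
               for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

same_offset = close(relative(3, 7), relative(10, 14))       # both pairs are 4 positions apart
different_offset = close(relative(3, 7), relative(3, 9))    # 4 apart versus 6 apart
print(same_offset, different_offset)
```

Every query/key pair with a different offset needs a different middle matrix, so no single precomputed product of W^{UQ} and W^{UK} can cover them all.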

The diagram below provides a more precise explanation. The top represents NoPE, and the bottom represents RoPE.

2817

2.3.3 Solution

To address the incompatibility between RoPE and low-rank joint key-value compression in MLA, the DeepSeek team proposed decoupling RoPE. For each head, a high-dimensional vector represents its content information, and a low-dimensional vector represents its rotary position encoding information. The high-dimensional vector is called nope, and the low-dimensional vector is called rope.

Content section: (q_t^C, k_t^C). This section stores most of the content information and is compressed.
Positional section: (q_t^R, k_t^R). It is further divided into two parts.
- A shared key k_t^R ∈ R^{d_h^R} carries the RoPE information.
- Additional multi-head queries q_{t,i}^R ∈ R^{d_h^R} carry the RoPE positional information.

Finally, these four variables are concatenated in pairs, q_t = [q_t^C, q_t^R] and k_t = [k_t^C, k_t^R], for the attention calculation. The compressed part of the key then needs no positional encoding during inference, which avoids the coupling between RoPE and the low-rank compression matrix and resolves the conflict between positional information and inference efficiency.

The final score calculation is shown in Figure 4.1. The first term, labeled 4.2, is computed as in the case without RoPE, and only c_t^{KV} needs to be cached for it during inference. The second term, labeled 4.3, needs only a single k_t^R, cached once and shared by all attention heads. That is, during the inference phase, the KV cache generated by a single token contains two parts.

Compressed latent vectors of key-value pairs need to be cached: c_t^{KV}, with dimension d_c.
Shared key vector carrying RoPE information: k_t^R, with dimension d_h^R.
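The reason the two cached parts can be scored separately is that the dot product of concatenated vectors is the sum of the dot products of their parts. A minimal sketch with made-up numbers:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy vectors for one query head and one cached token.
q_c, q_r = [1.0, 2.0, -1.0], [0.5, -0.5]   # content part and RoPE part of the query
k_c, k_r = [0.0, 1.0, 3.0], [2.0, 4.0]     # content part of the key and the shared RoPE key

score_concat = dot(q_c + q_r, k_c + k_r)         # score on the concatenated vectors
score_split = dot(q_c, k_c) + dot(q_r, k_r)      # nope term plus rope term
print(score_concat, score_split)
```

The nope term can therefore go through the absorbed weights and the latent c_t^{KV}, while the rope term uses k_t^R directly.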

2818

The concat process increases the dimensionality of the Q and K vectors. To handle this increased dimensionality, the model can choose to increase the number of attention heads or adjust the processing dimension of each head. The image below provides a clear comparison.

2819

2.3.5 Combining with weight absorption

Let’s take a look at how to handle it after combining weight absorption. Here we need to add nope and rope as well. The formula evolves as follows.

2.4 Resource Consumption

2.4.1 Number of parameters

MLA’s approach is derived from LoRA, which emphasizes reducing the number of parameters, and MLA does indeed achieve this reduction. With DeepSeek-V3’s parameter configuration, the two low-rank matrices have the following parameter counts: 2 × d_c × d = 2 × 512 × 7168. The parameter matrix of a normal MHA has the following number of parameters: d × d = 7168 × 7168.

The specific parameters are as follows:

"vocab_size": 129280,
"dim": 7168,
"inter_dim": 18432,
"n_heads": 128,
"q_lora_rank": 1536,
"kv_lora_rank": 512,
"qk_nope_head_dim": 128,
"qk_rope_head_dim": 64,
"v_head_dim": 128,

The number of parameters for each matrix is as follows:

W^{DKV}: dim * kv_lora_rank = 7168 * 512
W^{UK}: kv_lora_rank * qk_nope_head_dim * n_heads = 512 * 128 * 128
W^{UV}: kv_lora_rank * v_head_dim * n_heads = 512 * 128 * 128
W^{KR}: dim * qk_rope_head_dim = 7168 * 64
W^{DQ}: dim * q_lora_rank = 7168 * 1536
W^{UQ}: q_lora_rank * qk_nope_head_dim * n_heads = 1536 * 128 * 128
W^{QR}: q_lora_rank * qk_rope_head_dim * n_heads = 1536 * 64 * 128
W^O: n_heads * v_head_dim * dim = 128 * 128 * 7168
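The per-matrix counts can be reproduced from the configuration values. The sketch below assumes W^{UK} uses qk_nope_head_dim and W^{UV} uses v_head_dim (both 128 here), and ignores biases and normalization weights.

```python
cfg = {
    "dim": 7168, "n_heads": 128, "q_lora_rank": 1536, "kv_lora_rank": 512,
    "qk_nope_head_dim": 128, "qk_rope_head_dim": 64, "v_head_dim": 128,
}

params = {
    "W_DKV": cfg["dim"] * cfg["kv_lora_rank"],
    "W_UK": cfg["kv_lora_rank"] * cfg["qk_nope_head_dim"] * cfg["n_heads"],
    "W_UV": cfg["kv_lora_rank"] * cfg["v_head_dim"] * cfg["n_heads"],
    "W_KR": cfg["dim"] * cfg["qk_rope_head_dim"],
    "W_DQ": cfg["dim"] * cfg["q_lora_rank"],
    "W_UQ": cfg["q_lora_rank"] * cfg["qk_nope_head_dim"] * cfg["n_heads"],
    "W_QR": cfg["q_lora_rank"] * cfg["qk_rope_head_dim"] * cfg["n_heads"],
    "W_O": cfg["n_heads"] * cfg["v_head_dim"] * cfg["dim"],
}
total = sum(params.values())

for name, count in params.items():
    print(f"{name}: {count:,}")
print(f"total: {total:,}")
```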

2.4.2 Memory Usage

However, MLA emphasizes the reduction of KV-cache, that is, the reduction of KV activation values. Compared with classic MHA, GQA, and MQA, the actual cached vector of MLA is:

c_t^{KV} dimension is d_c.
k_t^R dimension is d_h / 2.

As shown in the figure below, we can see that MLA has a strong advantage in optimizing key-value cache and ensuring model performance.

2820

Compared to MHA, MLA caches far fewer elements per layer. Compared to GQA, MLA’s KV cache is significantly smaller. Compared to MQA, MLA needs 2.25 times as much storage, but its quality is significantly better than MQA’s, and even better than that of MHA and GQA. It thus achieves both reduced inference cost and preserved model performance.
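The comparison can be reproduced by counting cached elements per token and per layer. The sketch below uses n_h = 128, d_h = 128, d_c = 512, and d_h^R = 64; the GQA group count of 8 is purely an assumption for illustration.

```python
n_heads = 128
head_dim = 128
kv_lora_rank = 512       # d_c
rope_head_dim = 64       # d_h^R
gqa_groups = 8           # assumed group count, for illustration only

cache_elems = {
    "MHA": 2 * n_heads * head_dim,          # K and V for every head
    "GQA": 2 * gqa_groups * head_dim,       # K and V for every group
    "MQA": 2 * head_dim,                    # one shared K and V
    "MLA": kv_lora_rank + rope_head_dim,    # latent c_t^{KV} plus shared k_t^R
}

mla_vs_mqa = cache_elems["MLA"] / cache_elems["MQA"]
print(cache_elems, mla_vs_mqa)
```

The MLA entry equals 4.5 d_h, which is where the 2.25x ratio against MQA's 2 d_h comes from.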

2.4.3 Computational complexity

Compared to MHA, the head dimensions of Q and K in MLA become d_c + d_h^R, and the head dimension of V becomes d_c. The following are some of the hyperparameters of DeepSeek V3:

d_k (hidden dimension/model dimension): 7168
n_h (Number of attention heads): 128
d_h (Dimensions of each attention head): 128
d_c (KV compression dimension): 512, that is 4 d_h
d_h^R (RoPE head related dimensions): 64, that is d_h / 2

Since the Q/K head size of each head in MLA increases significantly, the computational cost of MLA inference also increases. So why does it still improve inference efficiency? Actually, MLA improves efficiency because it leverages the characteristic of LLM inference where the bottleneck is memory access rather than computation.

2.4.4 Information Transfer

Some researchers, after rereading MLA, believe that its function is actually “information transfer”: the information unique to each KV head is transferred to the corresponding Q head, while the information shared across KV heads is stored in the KV cache.

Objective of the improvement: To shrink the KV cache while compressing the per-head KV information as little as possible.
Improvement background: The K and V values of all attention heads are saved for each token because each k_head carries different information, which is used in the attention calculation with the corresponding q_head.
Improvement idea: Extract the information common to all K heads of a token and compress it into the KV cache. Transfer the information unique to each K head to the corresponding Q head.

2821

Furthermore, GQA requires the number of groups to match the hardware scale, which limits deployment flexibility. MLA, on the other hand, can adapt to different hardware configurations through latent-space projection and decoupled weight merging.

2.5 Parallelism

In the decoding phase of large-model inference, MLA does not benefit from tensor parallelism in the usual way. Current open-source implementations therefore rely primarily on data parallelism for MLA, meaning that the KV cache for different requests is stored on different GPUs. The DeepSeek-V3 paper mentions using tensor parallelism and sequence parallelism.

Tensor parallelism: MHA typically achieves tensor parallelism by splitting along the head_num dimension. In MLA, however, the cached latent vector is shared by all heads, so splitting by head does not shard the KV cache.
Data parallelism: The data is split by request, and the compressed latent vectors for different requests are stored on different GPUs.
Sequence parallelism: MLA uses sequence parallelism as a complement. The KV cache is split along the sequence dimension, each GPU computes local attention for the queries, and the partial results are then reduced.
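The reduction step in sequence parallelism can be sketched as follows. Each shard returns an unnormalized output together with its local softmax maximum and sum, and the shards are combined by rescaling to a common maximum. This is a common way to merge partial attention and is shown here as an assumption; the source does not say how any particular implementation performs the reduction. All function names and values are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def full_attention(q, keys, values):
    """Reference: softmax attention of one query over the whole cache."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

def shard_attention(q, keys, values):
    """One device: unnormalized output plus the local max and local sum."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    acc = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return acc, top, sum(weights)

def reduce_shards(partials):
    """Combine per-shard results by rescaling them to a common max."""
    global_top = max(top for _, top, _ in partials)
    scales = [math.exp(top - global_top) for _, top, _ in partials]
    total = sum(s * denom for s, (_, _, denom) in zip(scales, partials))
    dim = len(partials[0][0])
    return [sum(s * acc[i] for s, (acc, _, _) in zip(scales, partials)) / total
            for i in range(dim)]

q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

reference = full_attention(q, keys, values)
merged = reduce_shards([
    shard_attention(q, keys[:2], values[:2]),   # first half of the sequence
    shard_attention(q, keys[2:], values[2:]),   # second half of the sequence
])
print(reference, merged)
```

In MLA the per-shard keys and values would both come from the cached latent vectors, but the merge logic is independent of that detail.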

0x03 Calculation Process

Let’s break down the computational process of MLA during the inference phase.

3.1 Formula

First, we present the formulas corresponding to the transformation processes of Q, K, and V. The subsequent analysis will follow these formulas.

2822

3.2 Original Process

We convert the above formula into a flowchart, the details of which are as follows:

It is divided into three paths from top to bottom: Q, K, and V.
Both Q and K are further subdivided into two paths: the upper path with green weights and activation values corresponds to the latent vector and low-rank part; the lower path with gray gradient weights and activation values corresponds to decoupled RoPE.
The data flow of K’s bottom lane and V’s bottom lane is somewhat intertwined.
“Cache” marks the data cached during the inference phase. It consists of two parts: the joint KV latent vector c_t^{KV}, and the key k_t^R to which RoPE is applied separately.

Here, we assume the number of heads n_h is 2; the matrix sizes in the figure are not drawn to scale.

2823

3.3 Absorption

3.3.1 Process

The second step involves applying the weight absorption process described in the paper, resulting in the following figure:

The data to be cached during the inference phase remains unchanged.
W^{UK} is absorbed into W^{UQ}.
The calculation logic for Q remains unchanged, but the shapes of the weights and activation values have been adjusted accordingly.
The top path of K directly eliminates one linear mapping calculation logic, turning it into a repeated copy, similar to K’s bottom lane.
W^{UV} is absorbed into W^O.
The V-route linear mapping degenerates into the logic of repeated copying.
The final output mapping calculation logic remains unchanged, but the shapes of the weights and activation values are adjusted accordingly.

2824

3.3.2 Absorption Results

We have organized the above diagram and obtained the following results of absorption.

2825

3.3.3 MQA Form

The computational logic of the MLA inference phase is actually very similar to an MQA. Let’s compare them, excluding RoPE.

2826

For MLA, considering only the first part of the attention calculation, both use per-head Q and a shared K, where C^{KV} plays the role of a single-head key-value pair that is copied to every head before running normal MHA.

The difference is that here the dimension of the vector dot product is d_c instead of d_h. The actual setting in the paper is d_c = 4 d_h. In other words, Multi-Head Latent Attention is essentially Multi-Query Attention with the head dimension increased by four times.

To make it clearer, we can further absorb and merge some weights in the diagram to obtain the following diagram.

2827

The calculation of Q degenerates into a common multi-head linear mapping.
For each head, a portion of its dimensions remains unchanged, corresponding to the green part.
Apply the RoPE transformation to the remaining dimensions of each head, corresponding to the red portion.
The calculation of K degenerates into a single-head linear mapping.
Similarly, RoPE transformation is applied only to a subset of dimensions.
V then directly uses the portion of K that has not undergone the RoPE transformation, and copies it repeatedly.

0x04 Code

We primarily use the V2 code for analysis because it is more clearly structured. Note that the DeepSeek code differs from the paper in several places. The DeepseekV2Attention implementation in V2 is essentially the same as the naive implementation in V3 and does not actually save on KV cache. The non-naive version in V3, however, is consistent with the paper and saves GPU memory.

4.1 Configuration

We have extracted the relevant configuration below. In the naive implementation, there is a 512-dimensional latent KV, c_t^{KV}. It is mapped back to 128 heads of 128 dimensions each, giving k^C and v^C, and the position vector k^R is then concatenated. The final q, k, and v are fed into standard multi-head attention. Additionally, the code also uses norms, which are mentioned in the paper.

2828

The specific configuration information is as follows:

Compressed dimensions of keys and values: d_c is set to 512, with the original embedding dimension d = 5120, resulting in a ratio of 1/10.
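Query compression dimension: d'_c is set to 1536, a ratio of 0.3 relative to d = 5120.

"num_hidden_layers": 60, # number of Transformer layers
"hidden_size": 5120, # hidden size
"num_attention_heads": 128, # number of attention heads
"kv_lora_rank": 512, # KV compression dimension
"q_lora_rank": 1536, # query compression dimension
"qk_rope_head_dim": 64, # per-head dimension of the decoupled (RoPE) query and key
"n_shared_experts": 2, # number of shared experts in each MoE layer
"n_routed_experts": 160, # number of routed experts in each MoE layer
"moe_intermediate_size": 1536, # intermediate hidden dimension of each MoE expert
"num_experts_per_tok": 6, # number of experts activated per token
"routed_scaling_factor": 16.0, # scaling factor for the routed experts
"rms_norm_eps": 1e-06 # epsilon for RMS normalization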

4.2 Definition

Given an input vector h_t ∈ R^{B×L×5120}, where B is the batch size and L is the sequence length:

class DeepseekV2Attention(nn.Module):
    """Multi-headed attention from 'Attention Is All You Need' paper"""
    def __init__(self, config: DeepseekV2Config, layer_idx: Optional[int] = None):
        super().__init__()
        self.config = config
        self.layer_idx = layer_idx
        self.attention_dropout = config.attention_dropout
        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.max_position_embeddings = config.max_position_embeddings
        self.rope_theta = config.rope_theta
        # Dimension of the compressed query latent vector, d'_c
        self.q_lora_rank = config.q_lora_rank
        # Per-head dimension of the query/key part that RoPE is applied to, d_h^R
        self.qk_rope_head_dim = config.qk_rope_head_dim
        # Dimension of the compressed key-value latent vector, d_c
        self.kv_lora_rank = config.kv_lora_rank
        # Per-head dimension of value
        self.v_head_dim = config.v_head_dim
        # Per-head dimension of the query/key part without RoPE
        self.qk_nope_head_dim = config.qk_nope_head_dim
        # Each query/key head has nope + rope dimensions
        self.q_head_dim = config.qk_nope_head_dim + config.qk_rope_head_dim
        self.is_causal = True

        self.q_a_proj = nn.Linear(
            self.hidden_size, config.q_lora_rank, bias=config.attention_bias
        )
        self.q_a_layernorm = DeepseekV2RMSNorm(config.q_lora_rank)
        self.q_b_proj = nn.Linear(
            config.q_lora_rank, self.num_heads * self.q_head_dim, bias=False
        )

        self.kv_a_proj_with_mqa = nn.Linear(
            self.hidden_size,
            config.kv_lora_rank + config.qk_rope_head_dim,
            bias=config.attention_bias,
        )
        self.kv_a_layernorm = DeepseekV2RMSNorm(config.kv_lora_rank)
        self.kv_b_proj = nn.Linear(
            config.kv_lora_rank,
            self.num_heads
            * (self.q_head_dim - self.qk_rope_head_dim + self.v_head_dim),
            bias=False,
        )

        self.o_proj = nn.Linear(
            self.num_heads * self.v_head_dim,
            self.hidden_size,
            bias=config.attention_bias,
        )
        self._init_rope()

        self.softmax_scale = self.q_head_dim ** (-0.5)
        if self.config.rope_scaling is not None:
            mscale_all_dim = self.config.rope_scaling.get("mscale_all_dim", 0)
            scaling_factor = self.config.rope_scaling["factor"]
            if mscale_all_dim:
                mscale = yarn_get_mscale(scaling_factor, mscale_all_dim)
                self.softmax_scale = self.softmax_scale * mscale * mscale

The corresponding information is as follows. Breaking the entire computation process into four parts, q_nope, k_nope, k_pe, and q_pe, is to decouple RoPE. The two variables ending in pe are used to store the rotation position encoding information. Deepseek-V2 compresses the kv cache into a single small matrix, which is then decompressed later.

# q = q.view(bsz, q_len, num_heads, q_head_dim).transpose(1, 2)
# q_nope, q_pe = torch.split(q, [qk_nope_head_dim, qk_rope_head_dim], dim=-1)
q_pe : torch.Size([16, 128, 1, 64])
q_nope : torch.Size([16, 128, 1, 128])
# query_states = k_pe.new_empty(bsz, num_heads, q_len, q_head_dim)
query_states : torch.Size([16, 128, 1, 192])
    
# kv = kv_b_proj(kv_a_layernorm(compressed_kv)).view(bsz, kv_seq_len, num_heads, qk_nope_head_dim + v_head_dim).transpose(1, 2)
# k_nope, value_states = torch.split(kv, [qk_nope_head_dim, v_head_dim], dim=-1)
value_states : torch.Size([16, 128, 1024, 128])
k_nope : torch.Size([16, 128, 1024, 128])  
# k_pe = k_pe.view(bsz, kv_seq_len, 1, qk_rope_head_dim).transpose(1, 2)
k_pe : torch.Size([16, 1, 1024, 64])
# key_states = k_pe.new_empty(bsz, num_heads, kv_seq_len, q_head_dim)
key_states : torch.Size([16, 128, 1024, 192])

4.3 Operation Q

We combined all the Q-related code for analysis. The overall process is as follows: when the model processes the hidden state calculated by the previous layer, hidden_size = 5120, it first compresses q to q_lora_rank = 1536, then expands it to the output dimension of q_b_proj, num_heads × q_head_dim, and finally splits it into two parts: q_pe and q_nope.

4.3.1 Variable Definition

Q projection matrix in MLA, W^Q, undergoes a low-rank decomposition, which generates two matrices, q_a_proj and q_b_proj.

The size of q_a_proj is [hidden_size, q_lora_rank] = [5120, 1536], corresponding to W^{DQ}.
The size of q_b_proj is [q_lora_rank, num_heads * q_head_dim] = [1536, 24576], corresponding to W^{UQ} and W^{QR} stored together.
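A quick calculation shows what the low-rank factorization saves for the query projection under the V2 configuration. The sketch ignores biases and the RMSNorm weights, and the variable names are illustrative.

```python
hidden_size = 5120
q_lora_rank = 1536
num_heads = 128
q_head_dim = 128 + 64   # qk_nope_head_dim + qk_rope_head_dim

q_out_dim = num_heads * q_head_dim                                   # output width of q_b_proj
full_rank_params = hidden_size * q_out_dim                           # a single [5120, 24576] W^Q
low_rank_params = hidden_size * q_lora_rank + q_lora_rank * q_out_dim  # q_a_proj + q_b_proj

print(q_out_dim, full_rank_params, low_rank_params)
```

The factorized form needs roughly 36% of the parameters of a single full-rank projection of the same shape.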

self.num_heads = config.num_attention_heads # 128
self.q_lora_rank = config.q_lora_rank # 1536
self.q_head_dim = config.qk_nope_head_dim + config.qk_rope_head_dim # 128 + 64

# Compress the query, i.e. down-projection
self.q_a_proj = nn.Linear(
    self.hidden_size, config.q_lora_rank, bias=config.attention_bias
)
self.q_a_layernorm = DeepseekV2RMSNorm(config.q_lora_rank)
# Map the compressed query back to the high-dimensional space, i.e. up-projection
self.q_b_proj = nn.Linear(
    config.q_lora_rank, self.num_heads * self.q_head_dim, bias=False
)

4.3.2 Variable Operations

In DeepSeek-V2, the Q vector also uses low-rank compression.

First, project the input vector into a low-dimensional space of 1536 dimensions: c_t^Q = W^{DQ} h_t.
Then, project it into the multi-head vector space to obtain the first part of the Q vector, q_t^C.
Then project it into the RoPE-related vector space to obtain the second part of the Q vector, q_t^R.
Finally, concatenate the two parts to obtain the full query vector q_t = [q_t^C, q_t^R].

2829

The specific implementation is as follows:

# hidden_states corresponds to h_t in the formulas; its shape is (batch_size, seq_length, hidden_size), where hidden_size is 5120
bsz, q_len, _ = hidden_states.size()

# The next line corresponds to equations 37 and 38: down-project, then up-project. q_a_proj is [5120, 1536] and q_b_proj is [1536, 24576]; together they are a low-rank decomposition of the W^Q matrix [5120, 24576], i.e. [5120, 24576] -> [5120, 1536] * [1536, 24576]
# First, the linear layer self.q_a_proj down-projects the input hidden_states
# Then, the linear layer self.q_b_proj up-projects the compressed vector
q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states)))

# Reshape into multi-head form, in preparation for (or, put differently, as the inverse of) equation 40
# q_pe is passed to the RoPE module, so the shape needs rearranging
q = q.view(bsz, q_len, self.num_heads, self.q_head_dim).transpose(1, 2)

# Split the last dimension into the nope and rope parts
q_nope, q_pe = torch.split(q, [self.qk_nope_head_dim, self.qk_rope_head_dim], dim=-1)

# Equation 39: apply RoPE to q and k (k_pe comes from the KV path)
q_pe, k_pe = apply_rotary_pos_emb(q_pe, k_pe, cos, sin, position_ids)

# Allocate the query_states tensor
query_states = k_pe.new_empty(bsz, self.num_heads, q_len, self.q_head_dim)

# The next two lines correspond to equation 40
query_states[:, :, :, : self.qk_nope_head_dim] = q_nope # 128
query_states[:, :, :, self.qk_nope_head_dim :] = q_pe # 64

4.4 Operating KV

We combined all the key-value related code for analysis. The model uses a compressed KV design: the down-projection outputs only 576 dimensions (the 512-dimensional latent plus the 64-dimensional RoPE key), and training performs this dimensionality reduction followed by an up-projection. During inference, what needs to be cached is compressed_kv, which is then up-projected with kv_b_proj to obtain k and v.

4.4.1 Variable Definition

Similar to the Q vector, the KV vector also undergoes a low-rank decomposition, generating two matrices: kv_a_proj_with_mqa and kv_b_proj.

self.kv_lora_rank = kv_lora_rank # 512, the latent dimension shared by key and value (kv_b_proj reads all 512 dimensions for both)
self.qk_rope_head_dim = config.qk_rope_head_dim # 64
self.q_head_dim = config.qk_nope_head_dim + config.qk_rope_head_dim # 128 + 64
self.v_head_dim = config.v_head_dim # 128
self.hidden_size = config.hidden_size # 5120

self.kv_a_proj_with_mqa = nn.Linear(
    self.hidden_size,
    config.kv_lora_rank + config.qk_rope_head_dim,
    bias=config.attention_bias,
)
self.kv_a_layernorm = DeepseekV2RMSNorm(config.kv_lora_rank)
self.kv_b_proj = nn.Linear(
    config.kv_lora_rank,
    self.num_heads
    * (self.q_head_dim - self.qk_rope_head_dim + self.v_head_dim),
    bias=False,
)

4.4.2 Variable Operations

There are a few differences between the KV vector calculation and the formula. Some matrix operations are performed together, producing both K and V vectors at the same time, and then separated later.

2830

Dimensional analysis shows that kv_lora_rank (512) is 4 times qk_nope_head_dim (128), and K and V share this latent state. qk_rope_head_dim (64) is only half of qk_nope_head_dim. Combined, 4 + 1/2 = 9/2 head dimensions per token, which is the source of the MLA KV cache size per token shown in the diagram below.

2831
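
The 9/2 figure can be reproduced with a few lines of arithmetic. This is a minimal sketch using the dimensions quoted in this article; the function names are made up for illustration, and sizes are counted in elements per token per layer rather than in bytes.

```python
# Per-token, per-layer KV cache size in elements.
num_heads = 128
qk_nope_head_dim = 128   # d_h
qk_rope_head_dim = 64    # d_h^R
v_head_dim = 128
kv_lora_rank = 512       # d_c

def expanded_cache_elems():
    # Expanded cache: full keys (nope + rope) and values for every head.
    return num_heads * (qk_nope_head_dim + qk_rope_head_dim + v_head_dim)

def mla_cache_elems():
    # Compressed cache: one shared latent plus one shared RoPE key.
    return kv_lora_rank + qk_rope_head_dim

ratio_to_head_dim = mla_cache_elems() / qk_nope_head_dim
print(mla_cache_elems(), ratio_to_head_dim, expanded_cache_elems())
```

The compressed cache holds 576 elements per token, which is 4.5 times a single head dimension, while caching expanded keys and values for all heads would hold 40960.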

The specific code implementation is as follows:

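# Project the input hidden states with kv_a_proj_with_mqa (the k_pe part is shared by all heads, as in Multi-Query Attention) to get the compressed KV representation compressed_kv. This corresponds to formulas 41 and 43 (RoPE has not been applied yet).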
compressed_kv = self.kv_a_proj_with_mqa(hidden_states)

# Split the result into two parts: the low-rank compressed KV latent and the key part that will carry the positional encoding (k_pe)
compressed_kv, k_pe = torch.split(
    compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1
)

# k_pe will be passed to the RoPE module, so its shape has to be rearranged
k_pe = k_pe.view(bsz, q_len, 1, self.qk_rope_head_dim).transpose(1, 2)

kv = (
    self.kv_b_proj(self.kv_a_layernorm(compressed_kv))
    .view(bsz, q_len, self.num_heads, self.qk_nope_head_dim + self.v_head_dim)
    .transpose(1, 2)
)

k_nope, value_states = torch.split(
    kv, [self.qk_nope_head_dim, self.v_head_dim], dim=-1
)

kv_seq_len = value_states.shape[-2]
if past_key_value is not None:
    kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)

cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
q_pe, k_pe = apply_rotary_pos_emb(q_pe, k_pe, cos, sin, position_ids)

key_states = k_pe.new_empty(bsz, self.num_heads, q_len, self.q_head_dim)
key_states[:, :, :, : self.qk_nope_head_dim] = k_nope
key_states[:, :, :, self.qk_nope_head_dim :] = k_pe

4.5 Attention operations

4.5.1 Variable Definition

o_proj corresponds to matrix W^O. The size is [num_heads * v_head_dim, hidden_size] = [128 * 128, 5120].

self.v_head_dim = config.v_head_dim # 128
self.num_heads = config.num_attention_heads # 128
self.hidden_size = config.hidden_size # 5120

self.o_proj = nn.Linear( # corresponds to formula 47
    self.num_heads * self.v_head_dim,
    self.hidden_size,
    bias=config.attention_bias,
)

4.5.2 Variable Operations

After generating the QKV vectors, the process is essentially the same as standard MHA calculation. The only difference is that RoPE is applied only to the q_pe and k_pe parts.

a = softmax((q_t^⊤ k_t + Mask) / √192)
  = softmax((q_t^{C⊤} k_t^C + q_t^{R⊤} k_t^R + Mask) / √(128 + 64))

Then, the weighted sum over V is calculated, and all heads are flattened to obtain the Attention output:

o = a · v_t ∈ R^{B×L×H×128} ≅ R^{B×L×16384}
u = W^O o ∈ R^{B×L×5120}

The specific code is:

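# Update and concatenate the historical KV cache with the entries before the current position. As the code shows, what is stored here is the expanded MHA KV cache (key_states and value_states), not the compressed kv plus the RoPE'd part of k
if past_key_value is not None:           
    cache_kwargs = {"sin": sin, "cos": cos}  # Specific to RoPE models
    key_states, value_states = past_key_value.update( # update the KV cache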
        key_states, value_states, self.layer_idx, cache_kwargs
    )

# What follows is standard MHA code: Q^T*K*V*O
attn_weights = (
    torch.matmul(query_states, key_states.transpose(2, 3)) * self.softmax_scale
)

if attention_mask is not None:
    attn_weights = attn_weights + attention_mask

# upcast attention to fp32
attn_weights = nn.functional.softmax(
    attn_weights, dim=-1, dtype=torch.float32
).to(query_states.dtype)
attn_weights = nn.functional.dropout(
    attn_weights, p=self.attention_dropout, training=self.training
)
attn_output = torch.matmul(attn_weights, value_states)
attn_output = attn_output.transpose(1, 2).contiguous()
attn_output = attn_output.reshape(bsz, q_len, self.num_heads * self.v_head_dim)
attn_output = self.o_proj(attn_output)
if not output_attentions:
    attn_weights = None

return attn_output, attn_weights, past_key_value

4.6 Forward Propagation

We have extracted the complete forward propagation code below for your better understanding.

def forward(
    self,
    hidden_states: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    past_key_value: Optional[Cache] = None,
    output_attentions: bool = False,
    use_cache: bool = False,
    **kwargs,
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
    bsz, q_len, _ = hidden_states.size()

    q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states)))
    q = q.view(bsz, q_len, self.num_heads, self.q_head_dim).transpose(1, 2)
    q_nope, q_pe = torch.split(
        q, [self.qk_nope_head_dim, self.qk_rope_head_dim], dim=-1
    )

    compressed_kv = self.kv_a_proj_with_mqa(hidden_states)
    compressed_kv, k_pe = torch.split(
        compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1
    )
    k_pe = k_pe.view(bsz, q_len, 1, self.qk_rope_head_dim).transpose(1, 2)
    kv = (
        self.kv_b_proj(self.kv_a_layernorm(compressed_kv))
        .view(bsz, q_len, self.num_heads, self.qk_nope_head_dim + self.v_head_dim)
        .transpose(1, 2)
    )

    k_nope, value_states = torch.split(
        kv, [self.qk_nope_head_dim, self.v_head_dim], dim=-1
    )
    kv_seq_len = value_states.shape[-2]
    if past_key_value is not None:
        kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
    cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)

    q_pe, k_pe = apply_rotary_pos_emb(q_pe, k_pe, cos, sin, position_ids)

    query_states = k_pe.new_empty(bsz, self.num_heads, q_len, self.q_head_dim)
    query_states[:, :, :, : self.qk_nope_head_dim] = q_nope
    query_states[:, :, :, self.qk_nope_head_dim :] = q_pe

    key_states = k_pe.new_empty(bsz, self.num_heads, q_len, self.q_head_dim)
    key_states[:, :, :, : self.qk_nope_head_dim] = k_nope
    key_states[:, :, :, self.qk_nope_head_dim :] = k_pe
    if past_key_value is not None:
        cache_kwargs = {"sin": sin, "cos": cos}
        key_states, value_states = past_key_value.update(
            key_states, value_states, self.layer_idx, cache_kwargs
        )

    attn_weights = (
        torch.matmul(query_states, key_states.transpose(2, 3)) * self.softmax_scale
    )

    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask

    attn_weights = nn.functional.softmax(
        attn_weights, dim=-1, dtype=torch.float32
    ).to(query_states.dtype)
    attn_weights = nn.functional.dropout(
        attn_weights, p=self.attention_dropout, training=self.training
    )
    attn_output = torch.matmul(attn_weights, value_states)
    attn_output = attn_output.transpose(1, 2).contiguous()
    attn_output = attn_output.reshape(bsz, q_len, self.num_heads * self.v_head_dim)
    attn_output = self.o_proj(attn_output)

    if not output_attentions:
        attn_weights = None

    return attn_output, attn_weights, past_key_value

The corresponding example is shown below.

2832

4.7 V3 Code

We also provide the V3 code below. The naive version in V3 does not actually shrink the KV cache (it stores fully expanded keys and values, which even adds storage), while the non-naive version in V3, consistent with the paper, saves GPU memory.

# from: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/model.py
class MLA(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        self.dim = args.dim
        self.n_heads = args.n_heads
        self.n_local_heads = args.n_heads // world_size
        self.q_lora_rank = args.q_lora_rank
        self.kv_lora_rank = args.kv_lora_rank
        self.qk_nope_head_dim = args.qk_nope_head_dim
        self.qk_rope_head_dim = args.qk_rope_head_dim
        self.qk_head_dim = args.qk_nope_head_dim + args.qk_rope_head_dim
        self.v_head_dim = args.v_head_dim

        if self.q_lora_rank == 0:
            self.wq = ColumnParallelLinear(self.dim, self.n_heads * self.qk_head_dim)
        else:
            self.wq_a = Linear(self.dim, self.q_lora_rank)
            self.q_norm = RMSNorm(self.q_lora_rank)
            self.wq_b = ColumnParallelLinear(self.q_lora_rank, self.n_heads * self.qk_head_dim)

        self.wkv_a = Linear(self.dim, self.kv_lora_rank + self.qk_rope_head_dim)
        self.kv_norm = RMSNorm(self.kv_lora_rank)
        self.wkv_b = ColumnParallelLinear(self.kv_lora_rank, self.n_heads * (self.qk_nope_head_dim + self.v_head_dim))
        self.wo = RowParallelLinear(self.n_heads * self.v_head_dim, self.dim)
        self.softmax_scale = self.qk_head_dim ** -0.5

        if attn_impl == "naive":
            self.register_buffer("k_cache", torch.zeros(args.max_batch_size, args.max_seq_len, self.n_local_heads, self.qk_head_dim), persistent=False)
            self.register_buffer("v_cache", torch.zeros(args.max_batch_size, args.max_seq_len, self.n_local_heads, self.v_head_dim), persistent=False)
        else:
            self.register_buffer("kv_cache", torch.zeros(args.max_batch_size, args.max_seq_len, self.kv_lora_rank), persistent=False)
            self.register_buffer("pe_cache", torch.zeros(args.max_batch_size, args.max_seq_len, self.qk_rope_head_dim), persistent=False)

The naive implementation is intuitive and suitable for learning, but it is not suitable for the decoding stage, because decoding relies on the key-value cache and the naive cache is fully expanded. Therefore, the naive implementation might be used in the prefill stage, but a better computation method is needed during the decoding phase, namely the non-naive version.

The specific comparison is shown in the image below.

2833

0x05 Optimize the code

The DeepSeek code does not provide concrete implementations of certain optimizations, such as compression optimization and weight absorption. Therefore, we mainly use the implementation provided by Professor Zhang Mingxing, https://github.com/madsys-dev/deepseekv2-profile/tree/main, as an example for learning.

5.1 Compression Optimization

In the current V2 code, the KV cache in attention still stores the full keys and values decompressed from the latent vectors, rather than compressed_kv and k_pe as described in the paper, which means that the KV cache is not actually reduced.

We can make the following modification so that the KV cache stores compressed_kv together with the k_pe after RoPE.

# Use the KV cache to concatenate the compressed kv (c_t^{kv}) and the RoPE'd part of k from earlier positions with the current ones
if past_key_value is not None:
    # The results should be
    # compressed_kv: [B, kv_seq_len, d_c]
    # k_pe: [B, 1, kv_seq_len, qk_rope_head_dim]
    compressed_kv, k_pe = past_key_value.update(compressed_kv, k_pe)

Professor Zhang Mingxing provided a more detailed solution.

# CacheCompressed
def forward(self, hidden_states_q: torch.Tensor, q_position_ids: torch.LongTensor, compressed_kv: torch.Tensor):
    ...
    kv_seq_len = compressed_kv.size(1)
    compressed_kv, k_pe = torch.split(
        compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1
    )
    k_pe = k_pe.view(bsz, kv_seq_len, 1, self.qk_rope_head_dim).transpose(1, 2)
    kv = self.kv_b_proj(compressed_kv) \
        .view(bsz, kv_seq_len, self.num_heads, self.qk_nope_head_dim + self.v_head_dim) \
        .transpose(1, 2)
    
    k_nope, value_states = torch.split(kv, [self.qk_nope_head_dim, self.v_head_dim], dim=-1)
    ... 
    
def compress_kv(self, hidden_states_kv: torch.Tensor, kv_position_ids: torch.LongTensor) -> torch.Tensor:
    bsz, kv_seq_len, _ = hidden_states_kv.size()
    compressed_kv = self.kv_a_proj_with_mqa(hidden_states_kv) 
    compressed_kv, k_pe = torch.split(
        compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1
    )
    compressed_kv = self.kv_a_layernorm(compressed_kv)
    k_pe = k_pe.view(bsz, kv_seq_len, 1, self.qk_rope_head_dim).transpose(1, 2)
    cos, sin = self.rotary_emb(k_pe) 
    k_pe = apply_rotary_pos_emb(k_pe, cos, sin, kv_position_ids).view(bsz, kv_seq_len, self.qk_rope_head_dim)
    return torch.cat([compressed_kv, k_pe],dim=-1)

5.2 Weight Absorption

Even with the compressed cache, computing MLA this way still materializes the decompressed full K and V, which can easily cause an OOM crash. The DeepSeek-V2 paper proposes that the KV decompression matrices can be absorbed into the Q projection and the output projection, so that the final attention result can be calculated directly without decompressing the KV cache.

2834

5.2.1 absorbed_cache_compressed.py

Unlike the paper, this code absorbs the K part of the kv_b_proj weights, corresponding to W^{UK}, into q_nope, corresponding to q^C, and does so at runtime rather than pre-merging the weights. The V part of the kv_b_proj weights, corresponding to W^{UV}, is absorbed into attn_out.

`W^{UK}`

Regarding the absorption of K, the non-RoPE part in the formula for calculating the attention score can be expanded as follows:

q_t^{C⊤} k_t^C
= (W^{UQ} c_t^Q)^⊤ W^{UK} c_t^{KV}
= c_t^{Q⊤} W^{UQ⊤} W^{UK} c_t^{KV}

The code is:

# Absorbed_CacheCompressed
def forward(self, hidden_states_q: torch.Tensor, q_position_ids: torch.LongTensor, compressed_kv: torch.Tensor):
    ...
    kv_b_proj = self.kv_b_proj.weight.view(self.num_heads, -1, self.kv_lora_rank)
    q_absorb = kv_b_proj[:, :self.qk_nope_head_dim,:]
    out_absorb = kv_b_proj[:, self.qk_nope_head_dim:, :]
    
    cos, sin = self.rotary_emb(q_pe)
    q_pe = apply_rotary_pos_emb(q_pe, cos, sin, q_position_ids)
    
    qk_head_dim = self.kv_lora_rank + self.qk_rope_head_dim
    query_states = k_pe.new_empty(bsz, self.num_heads, q_len, qk_head_dim)
    query_states[:, :, :, : self.kv_lora_rank] = torch.einsum('hdc,bhid->bhic', q_absorb, q_nope)
    query_states[:, :, :, self.kv_lora_rank :] = q_pe
    ...
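
The equivalence behind this absorption can be checked numerically. The sketch below uses NumPy instead of PyTorch to stay lightweight, with toy sizes that are not the real model dimensions; the variable names are chosen here for illustration and only the einsum pattern 'hdc,bhid->bhic' is taken from the code above.

```python
import numpy as np

rng = np.random.default_rng(0)
B, H, Lq, Lk = 2, 4, 3, 5          # toy batch, heads, query length, cache length
d_nope, d_c = 8, 16                # toy head dimension and latent dimension

W_UK = rng.standard_normal((H, d_nope, d_c))      # per-head K up-projection, 'hdc'
q_nope = rng.standard_normal((B, H, Lq, d_nope))  # non-RoPE queries, 'bhid'
c_kv = rng.standard_normal((B, Lk, d_c))          # cached latent vectors, 'blc'

# Path 1: decompress the keys for every cached token, then take dot products.
k_nope = np.einsum('hdc,blc->bhld', W_UK, c_kv)
scores_expanded = np.einsum('bhid,bhld->bhil', q_nope, k_nope)

# Path 2: absorb W_UK into the query, then score against the latent directly.
q_absorbed = np.einsum('hdc,bhid->bhic', W_UK, q_nope)
scores_absorbed = np.einsum('bhic,blc->bhil', q_absorbed, c_kv)

max_abs_diff = float(np.abs(scores_expanded - scores_absorbed).max())
print(max_abs_diff)
```

Both paths give the same scores up to floating-point error, but the second never builds a per-head key tensor of length Lk.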

`W^{UV}`

The absorption of V is slightly more complex. For clarity, we use Einstein’s summation convention to describe the process.

v_t = einsum('hdc,blc->blhd', W_UV, c_t_KV)  # (1)
o   = einsum('bhql,blhd->bhqd', a, v_t)      # (2)
u   = einsum('hdD,bhqd->bqD', W_o, o)        # (3)

# Merging (1)-(3) into a single expression:
u   = einsum('hdc,blc,bhql,hdD->bqD', W_UV, c_t_KV, a, W_o)

# Reordering by associativity, so that V is never decompressed per cached token:
o_  = einsum('bhql,blc->bhqc', a, c_t_KV)  # (4)
o   = einsum('bhqc,hdc->bhqd', o_, W_UV)   # (5)
u   = einsum('hdD,bhqd->bqD', W_o, o)      # (6)
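
The reordering for V can be verified numerically in the same spirit. This is a minimal NumPy sketch with toy sizes; here the attention weights are indexed as (batch, head, query, key) and the final output as (batch, query, hidden), which is an indexing choice made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
B, H, Lq, Lk = 2, 4, 3, 5
d_v, d_c, D = 8, 16, 32            # toy value head dim, latent dim, hidden size

W_UV = rng.standard_normal((H, d_v, d_c))   # per-head V up-projection, 'hdc'
W_o = rng.standard_normal((H, d_v, D))      # output projection split by head, 'hdD'
c_kv = rng.standard_normal((B, Lk, d_c))    # cached latent vectors, 'blc'
a = rng.random((B, H, Lq, Lk))
a /= a.sum(axis=-1, keepdims=True)          # rows sum to 1, like softmax output

# Order A: decompress V for every cached token, then weight and project.
v = np.einsum('hdc,blc->blhd', W_UV, c_kv)
o_a = np.einsum('bhql,blhd->bhqd', a, v)
u_a = np.einsum('hdD,bhqd->bqD', W_o, o_a)

# Order B: weight the latent first, then apply W_UV once per query.
o_latent = np.einsum('bhql,blc->bhqc', a, c_kv)
o_b = np.einsum('bhqc,hdc->bhqd', o_latent, W_UV)
u_b = np.einsum('hdD,bhqd->bqD', W_o, o_b)

print(u_a.shape, float(np.abs(u_a - u_b).max()))
```

Order B applies W_UV to Lq rows instead of Lk rows, which is where the saving comes from when the cache is long and the query is short.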

5.2.2 Move Elision

After the above optimizations, the concatenation step generates a large amount of unnecessary data copying and broadcasting and also consumes a lot of GPU memory, which can lead to OutOfMemoryError. Therefore, we adopt the MoveElision optimization strategy, which skips concatenating the RoPE and non-RoPE parts and instead computes the attention scores of the two parts separately and adds them together.

# Absorbed_CacheCompressed_MoveElision
def forward(...):
    ...
    # After absorption, attn_weights is computed directly from compressed_kv, with no need to expand it
    attn_weights = torch.matmul(q_pe, k_pe.transpose(2, 3)) + torch.einsum('bhqc,blc->bhql', q_nope, compressed_kv)
    attn_weights *= self.softmax_scale
    ...
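
That the two-part sum equals the score computed from concatenated tensors can be confirmed with a small NumPy sketch. The sizes are toy values, and q_nope is assumed to be already absorbed into the latent space, as in the code above.

```python
import numpy as np

rng = np.random.default_rng(0)
B, H, Lq, Lk = 2, 4, 3, 5
d_c, d_rope = 16, 6                # toy latent and RoPE dimensions

q_nope = rng.standard_normal((B, H, Lq, d_c))    # absorbed queries, latent space
q_pe = rng.standard_normal((B, H, Lq, d_rope))   # RoPE part of the queries
c_kv = rng.standard_normal((B, Lk, d_c))         # cached latent, shared by heads
k_pe = rng.standard_normal((B, 1, Lk, d_rope))   # cached RoPE key, shared by heads

# With concatenation: keys are broadcast to every head and copied into one buffer.
q_cat = np.concatenate([q_nope, q_pe], axis=-1)
k_cat = np.concatenate(
    [np.broadcast_to(c_kv[:, None], (B, H, Lk, d_c)),
     np.broadcast_to(k_pe, (B, H, Lk, d_rope))],
    axis=-1,
)
scores_cat = q_cat @ np.swapaxes(k_cat, -1, -2)

# Move elision: score the two parts separately and add, with no copies of the keys.
scores_split = q_pe @ np.swapaxes(k_pe, -1, -2) + np.einsum('bhqc,blc->bhql', q_nope, c_kv)

print(float(np.abs(scores_cat - scores_split).max()))
```

The concatenated path has to materialize a key buffer of shape (B, H, Lk, d_c + d_rope), while the split path reads the shared cache as is.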

The code comparison is as follows:

2835

5.2.3 Materializing Projection Matrices

The DeepSeek-V2 paper states:

2836

However, changing the order and preprocessing the model parameters, that is, multiplying W^{UK} with W^{UQ} and W^{UV} with W^O ahead of time, seems unnecessary. Professor Zhang therefore considers this optimization step of little value.

The specific code is as follows:

def forward(self, hidden_states_q: torch.Tensor, q_position_ids: torch.LongTensor, compressed_kv: torch.Tensor):
    '''
    Attention masks and past cache are removed.
    Input: 
    - hidden_states_q: [bsz, q_len, hidden_size]
    - compressed_kv: [bsz, kv_len, kv_lora_rank + qk_rope_head_dim]
    - q_position_ids: [bsz, q_len]
    '''
    bsz, q_len, _ = hidden_states_q.size()
    q_b_proj_rope, q_absorbed, out_absorbed = self.get_absorbed_proj()
    q = self.q_a_layernorm(self.q_a_proj(hidden_states_q))
    q_nope = torch.einsum('bqc,hdc->bhqd', q, q_absorbed)
    q_pe = torch.einsum('bqc,hdc->bhqd', q, q_b_proj_rope)
    
    cos, sin = self.rotary_emb(q_pe)
    q_pe = apply_rotary_pos_emb(q_pe, cos, sin, q_position_ids)
    kv_seq_len = compressed_kv.size(1)
    compressed_kv, k_pe = torch.split(
        compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1
    )
    k_pe = k_pe.view(bsz, 1, kv_seq_len, self.qk_rope_head_dim)
    
    attn_weights = (torch.matmul(q_pe, k_pe.mT) + torch.matmul(q_nope, compressed_kv.unsqueeze(-3).mT)) * self.softmax_scale

    attn_weights = nn.functional.softmax(
        attn_weights, dim=-1, dtype=torch.float32
    ).to(q_nope.dtype)
    attn_output = torch.einsum('bhql,blc->bhqc', attn_weights, compressed_kv)
    attn_output = torch.einsum('bhqc,dhc->bqd', attn_output, out_absorbed)
    return attn_output

5.3 Fusion Operator

Furthermore, the prefill and decode stages can be handled differently, so the two stages follow different logic during inference.

Prefill does not use matrix absorption, because absorbing in the prefill stage increases the computational cost.
Decode uses matrix absorption, because the operation count with absorption is much smaller than without it. This is because the length of Q is 1 at this point.
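
A back-of-the-envelope count illustrates why the two stages differ. The sketch below is a rough model under stated assumptions: it covers only the non-RoPE score term of one head, it assumes the non-absorbed path decompresses every cached key from the latent, and it ignores memory traffic; the function name and the sequence length 4096 are chosen here for illustration.

```python
def score_flops(q_len, kv_len, absorb, d_nope=128, d_c=512):
    """Rough FLOP count for the non-RoPE attention scores of one head."""
    if absorb:
        project = q_len * d_nope * d_c      # fold W^{UK} into each query
        scores = q_len * kv_len * d_c       # dot products in the latent space
    else:
        project = kv_len * d_c * d_nope     # decompress every cached key
        scores = q_len * kv_len * d_nope    # dot products in the head space
    return 2 * (project + scores)           # one multiply and one add each

decode = {mode: score_flops(1, 4096, mode) for mode in (False, True)}
prefill = {mode: score_flops(4096, 4096, mode) for mode in (False, True)}
print("decode :", decode)
print("prefill:", prefill)
```

Under this model, absorption wins clearly when q_len is 1, while for q_len equal to kv_len the quadratic term grows with d_c instead of d_nope and absorption becomes the more expensive option.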

After weight absorption, the formula is as follows:

(p · (c_kv · W^{UV})) · W^O
= (p · c_kv) · (W^{UV} · W^O)
= (softmax(q_nope · c_kv + q_pe · k_pe) · c_kv) · W^{UV} · W^O

This can be described in code as follows, which means that an MQA operator can be designed to implement it.

q_pe = W_QR(c_q)
q_nope = W_UQ_UK(c_q)
output = W_UV_O(MQA(q_pe, q_nope, c_kv, k_pe))

5.4 Reordering of Matrix Multiplication (Supplement @ 2025-04-19)

Content Reference: DeepSeek V3 Inference: MLA and MOE Analysis Arthur.

The specific features are as follows:

Solution source: SGLang, applied to DeepSeek-V2.
Solution features: changing the computation order, based on the associativity of matrix multiplication, improves the computational efficiency of the attention mechanism.
Solution details:
- Original order: q_nope k_nope + q_rope k_rope, where k_nope = W^{UK} c.
- Improved order: the non-RoPE term is computed as (q_nope^T W^{UK}) c.

This change utilizes the associative law of matrix multiplication, allowing computations to be reorganized across different dimensions during the decoding phase. When n_q = 1, the optimized method can significantly reduce the amount of computation.

2837

0x06 Conversion

6.1 GQA

Grouped-Query Attention (GQA) is a variant of MHA designed to reduce the overhead of the key-value cache. It divides the query heads into groups, with each group sharing one key-value head. This approach reduces the size of the key-value cache by decreasing the number of key-value heads, but may sacrifice the model’s expressiveness. GQA can be viewed as a special case of MLA: in GQA the per-head keys and values are produced by replication, while MLA is not subject to this constraint and therefore offers greater expressiveness.

Although MLA has proven its efficiency and effectiveness in Deepseek V2/V3/R1, many major model providers still rely on GQA. To promote wider adoption of MLA, the paper “TransMLA: Multi-Head Latent Attention Is All You Need” proposes TransMLA, a post-training method that converts widely used GQA-based pre-trained models, such as LLaMA, Qwen, and Mixtral, into MLA-based models. After conversion, the model can undergo additional training to enhance its expressive power without increasing the size of the key-value cache.

6.1.1 Approach

The paper first proves that for the same key-value caching overhead, MLA’s expressive power is always greater than GQA’s. Specifically, any GQA configuration can be equivalently converted to an MLA representation, but the reverse is not true. This conclusion provides a theoretical basis for converting GQA-based models to MLA-based models.

During the equivalence transformation, the TransMLA method first replicates the key matrix from GQA to match the number of query heads. Then, it decomposes this replicated key matrix into the product of two smaller matrices, thus obtaining the low-rank representation in MLA. In this way, TransMLA can convert a GQA-based model to an MLA-based model without increasing the KV cache size.

6.1.2 Scheme

The first step is to duplicate the key matrix to match the number of query heads. In GQA, standard multi-head attention computation requires Q, K, and V to have the same number of heads, so K needs to be expanded from n_k heads (K_{n_k}) to n_q heads (K_{n_q}). There are actually two ways to do this:

Define replication factor s = n_q / n_k, divide K into n_k blocks, copy each block s times, and concatenate them to obtain the extended matrix K'.
Another approach is to move the copying operation to the parameter side, which is essentially using MHA instead of GQA. Copy the projection matrix before calculating K.

See Figure (a) and Figure (b) below for details.

2838

Because W'_K is formed by copying W_K, its maximum degree of freedom is n_k d_h. Therefore, its rank is at most n_k d_h. To understand this more formally, let’s examine it through singular value decomposition.
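
The two replication routes and the rank bound can be illustrated with NumPy. This is a toy sketch, not the TransMLA code: the sizes are arbitrary, and the factorization at the end uses an explicit replication matrix E (a name introduced here) as the simplest valid low-rank form, whereas the article describes the analysis through SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d_h, n_k, n_q, T = 32, 4, 2, 8, 5    # hidden size, head dim, KV heads, Q heads, tokens
s = n_q // n_k                           # replication factor

X = rng.standard_normal((T, D))
W_K = rng.standard_normal((D, n_k * d_h))

# Way 1: compute K with n_k heads, then copy each head block s times.
K = X @ W_K
K_rep = np.repeat(K.reshape(T, n_k, d_h), s, axis=1).reshape(T, n_q * d_h)

# Way 2: copy the head blocks of the projection matrix first, then project.
W_K_rep = np.repeat(W_K.reshape(D, n_k, d_h), s, axis=1).reshape(D, n_q * d_h)
K_from_weights = X @ W_K_rep

# The replicated matrix has n_q * d_h columns but rank at most n_k * d_h.
rank = int(np.linalg.matrix_rank(W_K_rep))

# One low-rank factorization: W_K_rep = W_K @ E, with E a 0/1 replication matrix.
E = np.repeat(np.eye(n_k * d_h).reshape(n_k * d_h, n_k, d_h), s, axis=1)
E = E.reshape(n_k * d_h, n_q * d_h)
print(rank, np.allclose(W_K @ E, W_K_rep))
```

In this factorization W_K plays the role of a down-projection whose output is cached, and E plays the role of an up-projection, which is why the cache size does not grow.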

2839

Furthermore, GQA requires a strict match between the number of groups and the hardware scale (for example, 8 cards for g = 8), which limits the flexibility of model deployment. MLA, on the other hand, can dynamically adapt to different hardware configurations through latent-space projection and decoupled weight merging. To compensate for the performance loss, GQA requires a larger FFN layer, which increases model complexity, while MLA maintains performance without additional compensation through low-rank projection and dynamic routing.

6.2 MHA

Enabling LLMs, such as Llama, originally trained for MHA to quickly adapt for inference in MLA without retraining from scratch is both meaningful and challenging. The paper “Towards Economical Inference: Enabling DeepSeek’s Multi-Head Latent Attention in Any Transformer-based LLMs” presents the first data-efficient fine-tuning method, MHA2MLA, for converting from MHA to MLA. This method comprises two key components:

For partial-RoPE, the paper removes RoPE from the dimensions of queries and keys that contribute less to attention scores.
For the low-rank approximation, the paper introduces a joint SVD approximation based on pre-trained parameters of the keys and values.

These carefully designed strategies enable MHA2MLA to recover performance using only a very small fraction of the data, significantly reducing inference costs, while seamlessly integrating with compression techniques such as KV cache quantization.

2840

6.2.1 Partial-RoPE

To achieve the migration from standard MHA to MLA, the paper proposes a partial-RoPE fine-tuning strategy, which removes RoPE from a target proportion of the dimensions and converts them to NoPE.

MHA

MHA’s Full-RoPE encodes position information into the query and key through rotations at a specific frequency, as shown in the figure below.

2841

Disassembly

In MLA, k_i is formed as [k_{i,nope}; k_{i,rope}]. Therefore, we first need to split the MHA key k_{i,rope} into two parts as well: one without RoPE and one with RoPE.

DeepSeek’s MLA does not actually apply RoPE within the original head dimension d_h. Instead, it appends extra dimensions d_h^R that carry RoPE. When converting an existing MHA model, however, the whole head has length d_h, so we can only split k_{i,rope} and take out a d_r part, where d_r ≪ d_h, to keep RoPE.

When converting from Full-RoPE to Partial-RoPE, which subspace do we choose for rotation coding? The paper proposes four strategies:

High-frequency retention: retain the r fastest rotating subspaces.
Low-frequency retention: retain the r slowest rotating subspaces.
Uniform sampling: select r subspaces with equal intervals.
Head-wise 2-norm Contribution Selection: sort all subspaces according to their 2-norm scores and keep the top r.
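
The four selection strategies can be sketched as index selection over the two-dimensional RoPE subspaces. This is an illustration under assumptions, not the paper's code: subspace k is assumed to rotate with frequency base^(-2k/d), so index 0 is the fastest; uniform sampling is implemented with a fixed integer stride; and the 2-norm scores are taken as a precomputed input, since how they are computed is not specified here. The function name is hypothetical.

```python
import numpy as np

def select_rope_subspaces(num_subspaces, r, strategy, scores=None):
    """Return sorted indices of the r subspaces that keep RoPE."""
    idx = np.arange(num_subspaces)
    if strategy == "high":            # fastest-rotating subspaces
        keep = idx[:r]
    elif strategy == "low":           # slowest-rotating subspaces
        keep = idx[-r:]
    elif strategy == "uniform":       # equally spaced subspaces
        keep = idx[:: num_subspaces // r][:r]
    elif strategy == "2-norm":        # largest precomputed contribution scores
        keep = np.argsort(scores)[::-1][:r]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return np.sort(keep)

print(select_rope_subspaces(64, 8, "uniform"))
```

The first three strategies depend only on the frequency index, while the last one is data dependent and can pick a different set for each head.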

2842

Of the d_h dimensions, the selected d_r keep RoPE position encoding, and the remaining d_h - d_r are treated as the position-free part in MLA, namely q_nope.

6.2.2 Low-rank approximation

For MHA, k_i = W_k x_i, v_i = W_v x_i. We have already used one of the four methods above to find the part that needs to be RoPE’d, so we can then take the corresponding part from W_k to get W^{KR}.

We also extract the non-RoPE parameters from the weights:

k_{i,nope} = W_{k,nope} x_i
v_{i,nope} = W_{v,nope} x_i

Our goal is to construct MLA’s W^{DKV} from W_{k,nope} and W_{v,nope}.

After converting from Full-RoPE to Partial-RoPE, the next step is to obtain the second component of the MLA KV cache: c_{i,kv}. The paper proposes two SVD-based strategies, decoupled SVD and joint SVD, as shown in the figure below.

2843

Decoupled SVD (SVD_split): perform truncated SVD separately for W_{k,nope} and W_{v,nope}.
Joint SVD (SVD_joint): to preserve the interaction between K_nope and V, perform a joint decomposition on the concatenated matrix [W_{k,nope}, W_v]. This decomposition is more in line with the standard MLA format.
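
A joint truncated SVD can be sketched in NumPy as follows. This is a minimal illustration under assumptions: weights are stored as [output_dim, input_dim], the two matrices are stacked along the output dimension, and the singular values are split evenly (square roots) between the down- and up-projections, which is one of several equally valid choices. The function name is hypothetical.

```python
import numpy as np

def joint_svd(W_k_nope, W_v, r):
    """Factor stacked [W_k_nope; W_v] into a shared down-projection and two up-projections."""
    W = np.concatenate([W_k_nope, W_v], axis=0)        # [d_k + d_v, D]
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(S[:r])
    W_dkv = root[:, None] * Vt[:r]                     # [r, D], produces the cached latent
    W_up = U[:, :r] * root[None, :]                    # [d_k + d_v, r]
    d_k = W_k_nope.shape[0]
    return W_dkv, W_up[:d_k], W_up[d_k:]

rng = np.random.default_rng(0)
D, d_k, d_v, r = 32, 12, 12, 6
shared = rng.standard_normal((r, D))
W_k_nope = rng.standard_normal((d_k, r)) @ shared      # toy weights with rank exactly r
W_v = rng.standard_normal((d_v, r)) @ shared

W_dkv, W_uk, W_uv = joint_svd(W_k_nope, W_v, r)
x = rng.standard_normal(D)
c = W_dkv @ x                                          # the only vector that would be cached
print(c.shape, np.allclose(W_uk @ c, W_k_nope @ x), np.allclose(W_uv @ c, W_v @ x))
```

The toy weights are built to have rank exactly r so that the reconstruction is exact; with real pre-trained weights the truncation is only an approximation, which is why fine-tuning is used to recover performance.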

At this point, we have completed the processing of the key and value parts. Unlike MLA in DeepSeek, the query part does not undergo low-rank decomposition. Instead, the query is only split into nope and rope parts that match the nope and rope parts of the key.

0xFF Reference

DP MLA For DeepSeek In Sglang is Xiao Xiao.

DeepSeek V3, R1, Janus-Pro series modeling methods for interpreting durian pastries

[LLM Algorithm] MLA technology shines in DeepSeek-R1; Tsinghua TransMLA converts GQA to MLA with a single click. SmartMindAI

First Efficient Parameter Fine-Tuning Framework: Using DeepSeek’s MLA in Any LLMs AcademicDaily00 [AcademicDaily]

DeepSeekV2’s MLA (Multi-head Latent Attention) Explains the Mission of a Drop of Water

DeepSeek Model Interpretation: Scaling Law, MLA, MoE JMXGODLZ

Still using MHA? MLA is here! A summary and reflection on MLA for DeepSeek-v2 (rainbow)

A Comprehensive Guide to DeepSeek-V2 (Chinese Model for Transformer Modification): Detailed Explanation of MoE, GRPO, MLA v_JULY_v

MLA Learning Notes (including matrix absorption analysis during inference) - A powerful tool for saving large model key-value caches (BBuf)

A Brief Read of DeepSeek-V2 Technical Report AGI Dream Factory

Developing DeepSeek-V2 from scratch using PyTorch

Illustrated Explanation of Mixtral 8 * 7b Inference Optimization Principles and Source Code Implementation by Mengyuan

From MHA to MLA: Attention Optimization: Discussing DeepSeek’s Pinduoduo-level Inference Prices [zartbot]

Continuing our discussion on MLA, DeepSeek-MoE, and SnowFlake Dense-MoE, Zabot’s Eraser [zartbot]

Some Analysis on MHLA (Multi-Head Latent Attention) (Zhengxiao Du)

[LM Base] About MLA (including code) in DeepSeek-V2 (by Mo Ran)

deepseek-v2 MLA Deep Analysis of Single-Word Zhuo

Deepseek-V2 Technical Explanation (Captain)

What are your thoughts on DeepSeek’s release of the MoE large model, DeepSeek-V2? (Zheng Huabin)

The Extreme Pull-Off Between Caching and Performance: From MHA, MQA, GQA to MLA (Su Jianlin)

DeepSeek-V2 High-Performance Inference (1): Tenfold Speed-Up of MLA Operator via Matrix Absorption ZHANG Mingxing

Speed reading Deepseek v2 (Part 1) – Understanding MLA Bruce’s Wanderings

What are your thoughts on DeepSeek’s release of the MoE large model, DeepSeek-V2? - Zhihu (zhihu.com)

Deepseek-V2 Technical Report Interpretation! The Most Detailed on the Entire Internet! (qq.com) [BaoBao Algorithm Notes] 2

DeepSeek-V2 High-Performance Inference Optimization Notes: MLA Optimization madsys-dev

GQA paper reading and related thoughts clvsit

LLM acceleration techniques: Multi Query Attention deephub

Attention optimization: Flash Attention and Paged Attention, MQA, and GQA miangangzhen

Write LoRA code from scratch

Large Model Lightweight Fine-Tuning (LoRA): Training Speed and Memory Usage Analysis - Top Secret Ambush

MLKV: Cross-layer KV Cache Sharing, Reducing Memory Footprint (AI Chat)

[Deep Learning] DeepSeek Core Architecture - MLA: Analyzing the Technical Details of Low-Rank Joint Compression Optimization of KV Caching and Improvement of Inference Efficiency Zhao Nanxia

Deep Seek-R1 Model Architecture In-Depth Analysis (Part 2): MLA

SGLang DP MLA Feature Interpretation BBuf

[LLM Paper Explained] MLA Technology Shows Its Power in DeepSeek-R1, Tsinghua TransMLA Converts GQA to MLA with One Click AI-PaperDaily

TransMLA: Multi-Head Latent Attention Is All You Need

Learning and thoroughly understanding the DeepSeek MLA algorithm from a code perspective

The most detailed guide online! DeepSeekMLA Multi-Head Latent Variable Attention: From Algorithm Principles to Code Implementation

Deepseek Technology Explained (1) - Thorough Understanding of MLA (Multi-Head Latent Attention) Jiang Fuchun

[Code Learning] Learning Deepseek-v2 Inference Code - MLA - Part 1

[Code Learning] Learning Deepseek-v2 Inference Code - MLA - Part 3

[Code Learning] Learning Deepseek-v2 Inference Code - MLA - Part 4

[Code Learning] Learning Deepseek-v2 Inference Code - MLA - Part 2

DeepSeek Open Sources FlashMLA: A Detailed Explanation of MLA from Principles to Code Du Lingxiao

First parameter-efficient fine-tuning framework: Using DeepSeek’s MLA in any LLMs

How to convert MHA in a pre-trained model into MLA? Du Lingxiao

Finally figured out the multi-head latent attention mechanism in DeepSeek!! — Programmer Xiaohan

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-V2 High-Performance Inference (1): Tenfold Speed-Up of MLA Operators through Matrix Absorption

Kernel design of DeepSeek MLA in FlashInfer

A detailed explanation of DeepSeek MLA matrix ablation (formath, 2025-02-24)

sglang mla code analysis hcy

MLA and Matrix Absorption in DeepSeek V2/V3 ariesjzj

Kernel design of DeepSeek MLA in FlashInfer yzh119

Finally figured out the multi-head latent attention mechanism in DeepSeek!! Programmer Xiaohan

What are the highlights of FlashMLA, the project open-sourced on the first day of DeepSeek Open Source Week? (SIY.Z)

MLA Learning Notes: A Powerful Tool for Saving Key-Value Cache in Large Models (Including Matrix Absorption Analysis During Inference)

The Qwen architecture was transformed into Deepseek, and the R1 plan was replicated by Meng Fanxu.

DeepSeek V2 “Multi-Head Latent Attention” Paper Interpretation (Part 1): Large Model Coffee Time

Does Deepseek MLA absolutely require code ingestion? (Code mover)

DeepSeek V3 Reasoning: MLA and MOE Analysis Arthur

Some fragmented memories triggered by DeepSeek MLA (YyWangCS)

Discuss the matrix calculation order in deep learning performance optimization (YyWangCS)

[Deepseek v3 Technical Report Study] 1. MLA Duludulu

Can concat in attention be replaced with addition? Zhai Feiyue

MLA achieves understanding of David

SGLang MLA implementation of BBuf parsing

Understanding the position and role of FlashMLA in the DeepSeek MLA calculation process

The Mystery of MLA Absorption (Little Zhu Pulling the Aircraft Carrier)

MLA Principles Introduction (Simplified Version) (opter)

DeepSeek-V3/R1 Inference Efficiency Analysis (v0.17) (zartbot)

DeepSeek V3/R1 Inference Efficiency Analysis (2): Reverse-Engineering Analysis of the Full-Scale DeepSeek Deployment (Han Shen)

DeepSeek V3/R1 Inference Efficiency Analysis (3): Decode Configuration Generalization Discussion (Han Shen)

DeepSeek V3/R1 Inference Efficiency Analysis (1): Some Irresponsible Estimates of the Decoding Throughput Limit of DeepSeek V3/R1 (Han Shen)

MoE Inference On AnyScale (MoE-On-AnyScale)

Understanding the computational characteristics of prefill and decode based on chunked prefill (Chayenne Zhao)

The Architectural Issues Behind Prefill/Decode (PD) Separation in LLMs (Geek Boge)

DeepSeek MLA Inference Optimization (Watson)

DeepSeek-V3 MTP Engineering Implementation Thoughts (Geek Boge)

A few thoughts: Why is DeepPhone so fast? (Clouds open)

Should prefill and decode be separated onto different cards? (Chayenne Zhao)

1. DeepSeek Model Learning Notes (Li Weihua)

DeepSeek-V3 (671B) Model Parameter Decomposition Calculation (ZihaoZhao)

In-depth analysis of vLLM: DeepSeek and vLLM-1 (stephenxi)

DeepSeek MLA Inference Process and Code Implementation in SGLang (Durian Pastry)

The evolution path from MHA to MQA to GQA to MLA (If I Were Given an AI)

The Annotated Transformer https://nlp.seas.harvard.edu/2018/04/03/attention.html

Attention Is All You Need https://arxiv.org/pdf/1706.03762.pdf

Fast Transformer Decoding: One Write-Head is All You Need https://arxiv.org/pdf/1911.02150.pdf

https://www.researchgate.net/figure/led-dot-product-self-attention-mechanism_fig1_363923096

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints https://arxiv.org/pdf/2305.13245.pdf

How Attention works in Deep Learning: understanding the attention mechanism in sequence models https://theaisummer.com/attention/

A simple overview of RNN, LSTM and Attention Mechanism https://medium.com/swlh/simple-overview-of-rnn-lstm-and-attention-mechanism-9e844763d07b

https://pytorch-forecasting.readthedocs.io/en/latest/_modules/pytorch_forecasting/models/temporal_fusion_transformer/_modules.html#ScaledDotProductAttention

A Brief Discussion on Transformer Initialization, Parameterization, and Standardization https://spaces.ac.cn/archives/0

https://theaisummer.com/self-attention/

https://zhuanlan.zhihu.com/p/626820422

Are Sixteen Heads Really Better than One? https://arxiv.org/pdf/1905.10650.pdf

This post is all you need (Volume 1) – Unveiling the Transformer Layer by Layer https://zhuanlan.zhihu.com/p/420820453

The Illustrated Transformer https://jalammar.github.io/illustrated-transformer/

Multi-Query Attention is All You Need https://blog.fireworks.ai/multi-query-attention-is-all-you-need

Sequence Parallelism and Tensor Parallelism in DeepSeek MLA (YyWangCS)