Exploring the Transformer Series (25) --- KV Cache Optimization for Handling Long Text Sequences

0x00 Overview

Even with a KV cache, in long sequence scenarios, it only reduces the computational cost of the K and V results themselves. However, the computational cost of subsequent parts remains high, and the storage pressure on K and V is also very heavy. This seriously hinders our ability to use longer sequences as input and generate even longer sequences. To solve this problem, we need to optimize the KV cache.

The core of KV Cache optimization lies in reducing memory consumption. This goal can be achieved by further compressing the “K” (key) or “V” (value) in key-value pairs. The choice of these compression techniques directly affects the model’s efficiency, especially in terms of memory usage and processing speed. Compressing key-value pairs means shortening the K and V, that is, discarding or compressing some K and V. By discarding these tokens, we effectively set the corresponding attention score to zero and approximate the attention matrix as a sparser matrix.

Reducing sequence length is the most researched area in KV-Cache compression. Currently, it can be broadly divided into several main directions:

Discarding unimportant tokens (sparserization). Unimportant tokens are identified using attention scores or attention coefficients, and their corresponding key-value pairs are discarded. This part of the research, like quantization in LLM, has made significant progress since the introduction of activation outliers. While this method can improve efficiency, it carries the risk of discarding key-value pairs, especially for tasks requiring deep understanding of long-range context.
Token reuse. Reusing tokens reduces sequence length. For example, reusing system prompts is an engineering optimization; from an algorithmic perspective, it’s a lossless optimization that requires no intervention from the training side.
This is a retrieval-based method. For example, the KVCache can be offloaded to the CPU and organized as pages or clusters. The similarity between q and each page or cluster is calculated to determine which pages or clusters to use. Only the top-k pages or clusters with the highest scores are loaded into GPU memory for computation at a time.

Note: The complete list of articles is here. It’s estimated to eventually have around 35 articles. This list will be updated after each subsequent article is published.
Cnblogs Exploring Transformer Series: Article List

0x01 Optimization Basis

The theoretical basis for reducing sequence length is mainly the following two points:

Inherent sparsity in the attention mechanism of large language models (LLMs).
Tokens have different levels of importance.

1.1 Sparsity

Despite the significant computational demands of transformers, the attention mechanism within LLMs does inherently exhibit sparsity. Similarly, the historical records used as the attention mechanism in a KV cache are inevitably sparse. Of course, the degree of sparsity may vary depending on the specific downstream task.

Figure (a) below illustrates the varying levels of attention sparsity across different models when performing a summarization task using the CNN/DailyMail dataset. This variation is evident at all levels of the model, including the entire model, individual layers, and specific parts of the model. In (b), the cumulative distribution function (CDF) depicts the relationship between attention scores and the proportion of total context. It can be seen that even with a 50% reduction in the KV cache size, the average cumulative attention score for key tokens remains at a fairly high level, approximately between 90% and 95%. This allows for the generation of similar tokens.

The authors of the paper “H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models” also found that KV Cache is actually very sparse, and only 5% is needed to achieve the same effect, that is, to generate the same token.

Therefore, we can make certain compromises and trade-offs between memory consumption and model accuracy.

2501

1.2 Importance

One reason for sparsity may be the difference in importance between tokens.

The core module of an LLM is Attention: by multiplying matrices Q and K, it computes an attention score matrix of size $O(N^2)$ , and then uses this attention score matrix to calculate a weighted average of the V matrix, resulting in the output matrix of the attention layer. Each row of the attention score matrix represents the relevance score between each token in Q and each token in K, and each row of the attention layer output matrix represents the weighted average of each token with the other tokens. Since it’s a weighted average, it means that the importance of each token is not the same; some tokens have higher weights, while others have lower weights.

In fact, only a small subset of tokens receive the most attention during text generation. This highlights the importance of specific key tokens and their central role in understanding context and facilitating text generation. Different researchers have undertaken various efforts to discover these tokens.

Some work has demonstrated a bias towards initial tokens. This bias stems from the cumulative effect of attention scores accumulating during decoding iterations towards the initial tokens. That is, regardless of their relevance to the language modeling task, initial tokens are assigned a surprisingly large number of attention scores (largely due to their starting position, not the tokens themselves), and these tokens are referred to as “attention sinks.” When the text length exceeds a certain limit, the header tokens are discarded, which can cause significant problems in the softmax stage, thus affecting subsequent inference results.
Some works have also found that sparse KV caches can exhibit strong temporal locality. The sparsity distribution follows a power-law distribution, with a small number of fixed tokens consistently dominating decoding. That is, when calculating the attention score, only a small subset of tokens contributes the most to the model’s value, and these tokens are important throughout the entire generation process. These tokens are called “Heavy-Hitters” (H2). For example, the paper “Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time” found that the attention scores of individual tokens exhibit a strong power-law distribution, meaning that some tokens are far more important to the attention mechanism than others. Based on this, the authors proposed the “Persistence of Importance” hypothesis: during the generation process, only tokens with significant importance will have a sustained impact on subsequent generation steps. The paper further points out that the attention patterns between most Transformer layers are highly similar, with an overlap rate exceeding 90%. Therefore, by identifying “high-weighted score tokens,” the memory footprint of the KV Cache can be effectively reduced. Furthermore, by pre-allocating space in the KV Cache and storing only the necessary tokens, the computational complexity can be simplified to a constant level.

Of course, there is another possibility: because the model is trained, it does not need to pay attention to the entire sequence (e.g., Mistral AI’s Mistral-7B), so it will no longer pay attention to certain tokens or pay very little attention to them.

1.3 Summary

We already know that attention modules tend to disproportionately focus more attention on a few words in a sequence. Therefore, we can avoid storing the keys and values of words that the model pays little or no attention to.

By discarding some tokens with low weights in the KVCache, preventing them from participating in attention calculations, we set their corresponding attention scores to zero and approximate the attention matrix with a sparser attention matrix. This allows us to compress the KVCache length while maintaining model performance, thus storing more tokens with a limited amount of GPU memory and improving inference efficiency for long texts. Another possible reason for not storing all keys and values in the sequence is that we can recalculate missing keys and values in each iteration, essentially trading time for space.

Therefore, the problem of compressing the length of the KVCache can be transformed into finding a more relevant token in the sequence through algorithm design.

0x02 Sparsification

The primary method for reducing sequence length is sparsification. KVCache Sparsification is an optimization method that improves efficiency by reducing unnecessary storage and computation. Its core idea is to select key tokens that require attention and discard irrelevant tokens. This is an algorithmic optimization that directly reduces computation, thereby reducing the size and computational complexity of the KV Cache.

1.1 Classification

The various sparsity schemes in KV Cache essentially differ in their methods of selecting the tokens to focus on. However, since some token attention calculations are discarded, it is theoretically lossy, potentially requiring the training side to ensure model performance. Sparsity is currently a mainstream research direction in KV Cache, and different papers classify it differently. Two review papers refer to sparsity as Token Dropping and KV Cache Selection. The following figure shows the different names and subdivisions of sparsity in the two review papers.

Token discarding is a technique that identifies and discards unimportant tokens. However, a key challenge with these methods is determining which tokens are unimportant.
The KV Cache selection mechanism aims to reduce the memory utilization of the KV cache, minimize inference latency, and improve the overall throughput of large language models. KV Cache selection mechanisms are categorized by their execution process: static KV Cache selection mechanisms only perform token filtering during the pre-filling stage, and the selected tokens remain unchanged in subsequent decoding steps; dynamic KV Cache selection mechanisms continuously update the KV cache during the decoding stage, achieving adaptive cache management.

2502

This paper mainly divides sparsification into the following categories:

Static sparsity. When calculating and predicting the next token, only historical token information of a single window size is maintained. The sequence length is then adjusted by adjusting the attention window. The number of tokens within the window in static sparsity is fixed.
Dynamic sparsity. The sequence length is adjusted by modifying the attention window. The tokens within the dynamic sparsity window are not fixed.
For sparsity reduction in the prefill stage, the main focus is on optimizing the sequence length of the prompt and extracting the most important tokens.
Sparsity considerations for layer characteristics. The attention layer of the LLM model has its own characteristics, which also affects the sparsity strategy of the KV cache. Therefore, in KV cache management, it is necessary to distinguish or accumulate the processing based on the contribution of attention weights, and also to dynamically adjust the number of key tokens participating in attention calculation at each layer.
Other solutions include sparsification combined with speculative sampling, and sparsification tailored to head characteristics.

We also used the paper “A Survey on Large Language Model Acceleration based on KV Cache Management” for comparative learning. The parts in that paper corresponding to sparsity are KV Cache Selection, KV Cache Budget Allocation, and KV Cache Merging. Because they are not orthogonal to our classification method, we added labels to the graph.

2503

We will now learn through case studies.

1.2 Static Sparsity

Static sparsity can be achieved through attention mechanisms, namely sparse attention or linear attention. The core idea of sparse attention is to select appropriate tokens for computation during inference. This approach is particularly helpful for long sequences, significantly reducing the KV cache size and computational cost of the attention component. Linear attention mechanisms (such as Linear Transformer, RWKV, and Mamba) reduce memory requirements by replacing standard attention mechanisms with mechanisms linearly related to sequence length. However, this approach may reduce the model’s expressive power, leading to performance degradation in tasks requiring complex, long-distance lexical dependencies.

For example, Mistral-7B uses a variant of the attention mechanism known as sliding window attention (SWA) or local attention. Local attention ensures that we don’t need to focus on the entire sequence, but only on the last 4096 adjacent tokens to construct token representations, so that tensor pairs exceeding the window size are never stored in the KV cache. For another example,

The paper “Efficient Streaming Language Models with Attention Sinks” analyzes four Attention implementations, as shown in the figure below. Here, T represents the T-th token to be predicted, and L represents the maximum sequence length during pretraining (typically 2k, 4k, etc.). Since this primarily targets long sequence scenarios, T is much larger than L.

2504

The first is the original attention implementation, and the latter three are sparsification implementations. These sparsification methods are all static, that is, the correlation between tokens is a fixed paradigm, and the related tokens of each token are at a fixed distance.

Dense Attention is a vanilla Attention implementation. Each token can see both the current token and previous tokens, with a computational complexity of $O(T^2)$ and a KVCache storage complexity of $O(T)$ . Due to the high complexity, T is relatively small during pre-training. During inference, when the text length exceeds the pre-training length, the model performance degrades significantly, resulting in a larger PPL value.
Window Attention caches the key-value pairs (KV) of the most recent n tokens. This method is equivalent to maintaining a fixed-size sliding window only on the KV states of the most recent tokens. The gray dashed boxes represent tokens evicted from the cache after exceeding the window length. The computational complexity is $O(T \times L)$ , which is low, the cache is small, and it is efficient in inference. However, the model performs poorly because performance drops drastically once the keys and values of the tokens are deleted.
Sliding Window with Recomputation. This attention mechanism is similar to Window Attention, but the difference is that Sliding Window does not cache the tokens of the window. Instead, it reconstructs state from the nearest $L$ tokens for each new token’s $KV$ state. This approach outperforms Window Attention, with a significantly lower PPL value. The main reason is that the recomputation uses the tokens in the window as the initial tokens, thus preserving the initial tokens while ensuring that computation is limited to a single window, greatly reducing the storage complexity of the KVCache. While accuracy is guaranteed due to the recomputation, performance is significantly reduced. Although it performs well on long texts, its $O(TL^2)$ complexity due to the secondary attention in context recomputation makes it quite slow, rendering this method unsuitable for practical streaming applications.
StreamingLLM retains the attention sink (a few initial tokens) used for stable attention computation and incorporates the most recent tokens. It is efficient and provides stable performance on extended text. The yellow box represents the initial few tokens, that is, the attention sink, with a computational complexity of $O(T \times L)$ . The model accuracy is quite good, comparable to recalculated sliding window attention.

Next, we will analyze window attention and StreamingLLM.

1.2.1 Window Attention

Window Attention uses a sliding window mechanism to solve long text reasoning challenges, where tokens that fall outside the window are permanently evicted and become inaccessible.

In traditional attention mechanisms, each token undergoes attention with all preceding tokens. However, based on our everyday language habits, the relevance between words in a sentence varies greatly. For the current token, generally, the closer the words are, the stronger the relevance; tokens farther apart tend to provide less information, so it seems unnecessary to waste resources on attention with these distant tokens. Therefore, the sliding window attention mechanism works by having each token only perform attention with the preceding L tokens, including itself. In other words, attention is only calculated with tokens that are close to the current token. This keeps the cache size within L. Because each token only performs attention calculations with its neighbors, the computational complexity is $O(TL)$ , and the KVCache storage complexity is $O(L)$ . This method significantly reduces the storage overhead of the KVCache, decreasing it from linear complexity to constant complexity.

A typical example is Longformer, as shown in the figure below.

2505

(a) demonstrates the traditional “full attention” mechanism, in which each newly generated token pays attention to all previous tokens in the sequence.
(b) describes “sliding window attention.” This method maintains a fixed-size sliding window of the most recent tokens. Given a fixed window size $w$ , for each token, attention is paid to its two sides, that is, $\frac{1}{2}w$ tokens on each side. Sliding windows allow us to maintain a fixed-size cache. Once the sequence grows beyond the sliding-window token count, we can cycle through the cache and start overwriting. Because $w$ can be set relatively small, its computational complexity scales linearly with the input sequence length $n$ , namely $O(n \times w)$ , making it more suitable for shallow capture of local information. Moreover, because the position in the cache is irrelevant, all position-related information is encoded through positional embedding, making it easy to implement and effective. However, this method limits the model’s ability to capture comprehensive semantic information from the past, resulting in lower text generation quality and decreased accuracy.

One might wonder, while tokens further away may contain less information, doesn’t that mean they’re completely useless to the current token? Is it really reasonable to completely exclude them from the sliding window? Actually, Sliding Window Attention doesn’t completely ignore token information outside the window; rather, it indirectly utilizes tokens outside the window as the number of model layers increases. This can be compared to the receptive field in CNN technology. If multiple layers are stacked, starting from layer 0, the receptive field of the corresponding token expands forward by W with each layer upward. Although at each layer it can only see a portion of the preceding tokens at its “farthest,” if the model is deep enough, it will definitely be able to see all the preceding tokens at some layer. This layer’s window attention can access all input locations, producing a large receptive field that can construct a representation across the entire input information. In a Transformer with $l$ layers, the receptive field size of the top layer is $l \times w$ assuming $w$ is fixed for all layers. Depending on the application, it might be better to use different $w$ values for each layer to strike a balance between efficiency and model representation power. The specific details are shown in the image below.

2506

(c) demonstrates a variant called “inflated window attention,” which has similar accuracy and limitations to window attention. Its characteristics are as follows.
- To further increase the receptive field without increasing computational cost, the sliding window can be “expanded.” This is similar to dilated convolution, where the window has a spacing of size d, thus amplifying the intervals between receptive positions. This allows the token to capture information from further away at its current position, focusing on a wider field of view and considering more comprehensive contextual information. Due to this more comprehensive consideration of contextual information, the dilated sliding window mechanism performs better than the ordinary sliding window mechanism.
- At each position, Q still focuses on K or W items, but these W items are not consecutive but spaced out. Let d be the interval between these spaces, then the range of K items that Q focuses on is $w \times d$ . Computationally, the sliding window retains its performance advantages.
- Assuming fixed values d and w for all layers, the receptive field size is $l \times d \times w$ , which can reach thousands of tokens even for small values of d. In multi-head attention, each attention head computes a different attention score. Setting up each head with different expansion configurations can improve performance by allowing some heads without expansion to focus on local contexts, while others with expansion focus on longer contexts.
(d) is a global sliding window mechanism that integrates global information. Sliding windows and expanded attention are insufficient for learning task-specific representations and need to be borrowed from other sources. Therefore, global attention is introduced to supplement these two methods. Its characteristics are as follows:
- Global attention works the same as a regular transformer, focusing on the entirety of K. Depending on the specific task, global computations are introduced at chosen locations, allowing the model to receive global information during the computation process.
- A token with global attention will pay attention to all tokens in the entire sequence, and all tokens in the sequence will also pay attention to it. Since the number of these tokens is relatively small and independent of n, the complexity of combining local and global attention is still $O(n)$ . While specifying global attention is task-specific, it is a simple way to add inductive bias to the model’s attention, much simpler than existing task-specific approaches that use complex architectures to combine information across small blocks of input.

1.2.2 StreamingLLM

StreamingLLM leverages the “attention deposition” effect, combining the key-value pairs of early tokens with recent context to optimize long sequence processing, enabling support for infinitely long inputs while generating infinitely long outputs.

StreamingLLM comes from the paper “Efficient Streaming Language Models with Attention Sinks”. Its goal is to enable models to handle inputs of infinite length (here, “infinite length input” is different from “infinite length context” - the former does not need to remember all the input content).

While the Window Attention scheme compresses the length of the KVCache, its model performance is poor, primarily because the initial tokens are discarded. These tokens are highly important, and discarding them severely impacts model performance, significantly reducing accuracy. StreamingLLM addresses this shortcoming. StreamingLLM discovers a key phenomenon in attention: the key-value pairs retained in the initial sequence tokens maintain crucial model performance. This attention sinking effect is manifested through the accumulation of asymmetric attention weights in early positions, independent of semantic meaning. This method leverages this characteristic, combining the attention sink position with the most recent context for efficient processing.

The StreamingLLM strategy, building upon Window Attention, constructs a sliding window cache by retaining only the initial positional tokens (sink tokens) and the last adjacent tokens (local attention). Therefore, the StreamingLLM KV Cache has a fixed length, consisting of a fixed portion (typically 1 to 4 tokens) and a sliding portion. This preserves the characteristics of Window Attention, resulting in a KV Cache storage complexity of $O(L)$ and a computational complexity of $O(TL)$ , while ensuring that the model performance does not significantly degrade due to the loss of initial tokens. PPL is similar to Sliding Window Attention with recomputation.

Attention sink

The authors of StreamingLLM discovered that by adding just the first few tokens (e.g., four) to window attention and manipulating positional encoding, the model’s point-of-view (PPL) output doesn’t improve; the output remains stable up to 20k words. The paper calls this phenomenon “Attention Sink.” It’s as if the text’s attention is “sinking” to the beginning, meaning the initial tokens attract a large amount of attention. Furthermore, the absolute distance and semantic information between the initial tokens and the generated tokens are unimportant; what matters is that the initial tokens are the first or first few tokens - these tokens act as anchors, hence their particularly high weight.

2507

The image above visualizes the average attention logic across 256 sentences in lama-2-7B, with each sentence being 16 characters long. Observations include:

The attention graphs of the first two layers (layer 0 and layer 1) exhibit a “local” pattern, with the most recent token receiving more attention.
Apart from the first two layers, the focus afterwards is mainly on the first token.

Removing the key-value pairs (KV) of these initial tokens results in a significant portion of the denominator being removed from the SoftMax function used in attention calculations. This change causes a substantial shift in the distribution of attention scores, deviating from expectations in normal inference environments.

Why does this phenomenon occur? Evan Miller provides an insightful interpretation in “Attention Is Off By One”.

The problem with using softmax is that it forces each attention head to make an annotation, even if it has no information to add to the output vector. Using softmax to choose among discrete alternatives is great; using it for optional annotation (ie as input into addition) is, like, not cool, man. disallowed.

In the Attention mechanism, the output of Softmax represents the probability of a key/query match. If the softmax value at a certain position is very large, the weight at that position will be significantly updated during backpropagation. However, sometimes the attention mechanism cannot determine which position is more worthy of attention, but Softmax requires the sum of the values at all positions to be 1. Therefore, the model must “express” its stance and assign larger weights to certain positions. Consequently, the model tends to transfer unnecessary attention values to specific tokens, which are called initial tokens.

Therefore, Evan Miller improved Softmax and proposed “SoftMax-off-by-One”: by adding 1 to the denominator of softmax, all position values can be summed to a value other than 1, and Attention has the right not to “express an opinion” at any position.

2508

Ideas

Based on the characteristics of attention sinks, the authors of the paper refer to previous solutions for multi-turn dialogue scenarios and propose a solution for StreamingLLM, which introduces initial tokens (KV) on top of window attention.

Anchor points plus windows. The KV cache in StreamingLLM can be divided into two parts: a fixed part (usually 1 to 4 tokens) and a sliding part.
- Fixed part. The initial tokens (attention sink part) of the entire sequence are retained. The figure shows 4 initial tokens. Their function is to identify and save the model’s inherent attention sink, anchor the initial tokens for inference, perform attention calculations, and stabilize model performance.
- The sliding window’s key-value (KV) cache retains the three most recent tokens, with a fixed window size of 3. Its purpose is to allow large models to be output infinitely, achieving a “streaming” effect: without a KV cache, computation would be too slow, but with a KV cache, memory usage increases too much with the model’s length. Therefore, some KV cached data needs to be discarded.
The calculation involves performing attention calculations only with the tokens within the window and the initial tokens of the sequence. Furthermore, during training, “SoftMax-off-by-One” is used instead of regular softmax to address the attention sink problem.

2509

In the implementation, the “Rolling Buffer Cache” technique is used to discard certain rows in the KV cache, while also controlling the position encoding. The diagram below illustrates the operation of the Rolling Buffer Cache. When we use a sliding window, the KV cache no longer needs to store the KV information of all tokens. You can think of it as a fixed-capacity (W) cache, which is “rolled up” as the token index increases.

2510

question

While StreamingLLM remembers the attention sink and the key-value cache corresponding to the tokens in the most recent window, relying solely on these k tokens in the attention mechanism cannot provide the necessary accuracy. This is because it currently discards intermediate tokens based on their position; what if an intermediate token is actually important? This is the problem with static token sparsity: the token candidate set is too fixed. This token candidate set design stems from patterns discovered by the authors through observing correlations between tokens in certain datasets. These patterns cannot be guaranteed to be universally applicable, leading to the loss of recent context in key token attention and the absence of key context in window attention, which degrades the model’s performance on long sequence tasks.

1.3 Dynamic Sparsity

Dynamic token sparsity essentially maintains a set of highly relevant historical tokens for the currently processed token. However, unlike static token sparsity, the construction of this set is no longer determined by token distance or fixed tokens, but rather by designing an algorithm to filter historical tokens. Static token sparsity is a subset of dynamic token sparsity, so theoretically, dynamic token sparsity should perform better than static token sparsity.

However, dynamically determining which tokens are key tokens during inference, especially when the input sequence contains unknown or unseen tokens, is a significant challenge. There are two main approaches to address this challenge:

Scoring function. A scoring function is used to identify key tokens. For example, we introduce a scoring function $f_\theta$ for each token to identify k key tokens from a total of n tokens.
The expulsion strategy involves retaining key-value pairs (KVs) deemed important by the system and discarding those deemed unimportant. The determination of importance is crucial; most systems retain the top-B results for each layer based on their attention scores.

In fact, these two are two sides of the same coin: a dynamic evaluation strategy is used to determine which key-value pairs need to be retained and which need to be discarded, and runtime key/value eviction is used to reduce the size of the key-value cache.

2511

It’s important to note that most works assume the persistence of attention patterns across iterations; that is, if a token is considered unimportant in the current iteration (i.e., has a low attention weight), it is likely to remain unimportant when generating future tokens. Based on this assumption, when the KV cache size exceeds its budget, they evict tokens with low attention weights from the KV cache in each iteration. The keys and values of the evicted tokens are permanently excluded from subsequent iterations after being removed from memory.

Let’s learn through a few case studies.

1.3.1 H2O

The paper “H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models” proposes a classic approach called H2O. H2O observes that attention computation is primarily driven by a set of high-influence tokens known as “pivotal tokens” or “heavy hitters.” Therefore, H2O reformulates caching optimization as a dynamic sub-model problem, utilizing cumulative attention scores to guide token retention decisions.

motivation

Sparse key-value caches do not necessarily possess 100% temporal locality. For example, when decoding the 6th token, an LLM may not necessarily need tokens 1 through 5, but could potentially need any number of them. Imagine an extreme case where decoding the 6th token requires tokens 1, 2, and 4, and decoding the 7th token requires tokens 3, 5, and 6. Because while not all tokens are used at any given time, each token could potentially be used at any time, a key-value cache containing all historical tokens must be maintained in memory in this scenario. The paper refers to this situation as Dynamic Sparsity, as shown in the first image on the left below.

One intuitive idea is that sparse key-value (KV) caches might possess strong temporal locality similar to dense KV caches: only a small portion of the KV caches differ between two decoding steps, and tokens sparsely discarded in a certain decoding step will never be used again. This achieves temporal locality similar to dense KV caches while reducing KV cache memory usage. The paper refers to this type of situation as Static Sparsity (Local) and Strided Sparsity (Strided), as shown in the second and third figures from the left in the image below.

H2O argues that strong temporal locality sparsity, such as local sparsity and equal-step sparsity, is too naive; simply looking at the K most recent elements or the K most recent elements with equal step sizes cannot represent all historical information. A better sparsity strategy must exist that satisfies strong temporal locality. H2O discovered this strategy during the decoding process by leveraging the accumulated token importance, as shown in the fourth figure from the left.

2512

The authors of H2O believe the ideal way to drop tokens is to predict the attention scores of future tokens to the current token, and then drop those tokens with low scores. Since future tokens are unknown, they can only be predicted.

Because tokens with higher attention scores have a greater impact on the calculation results, the authors use the sum of previous attention scores to predict subsequent attention scores. In a later appendix, the authors prove that the H2O assumption’s greedy strategy for constructing the KV cache is near-optimal in the submodular case, meaning it exhibits marginal effects as the scale increases. Therefore, the H2O design problem has converged to a very small problem: the dynamic submodular problem. That is, how to design a greedy algorithm that ensures each evicted token is submodularly optimal, minimizing the impact of each discarded token on the generation process.

plan

The design goal of H2O is to determine a key-value (KV) cache maintenance/eviction strategy. H2O sets a maximum number of cached tokens (budget). Each time the cache budget is reached, the strategy uses a greedy algorithm to calculate the token score, discarding the token with the lowest accumulated attention score each time, thus only retaining tokens that consistently achieve high attention scores throughout the iterations. This makes the decoding performance based on sparse KV caches similar to that based on dense KV caches.

The authors of H2O discovered that influential tokens (“pivotal tokens” or “heavy hitters”) in a given step remain influential in future steps. In other words, we can be confident that discarded low-influence tokens will still be relatively negligible in future generation steps and can therefore be safely discarded. Based on this observation, the authors proposed a greedy historical token eviction strategy called the “Heavy-Hitters Oracle” (H2O): For a given token A, A and all tokens preceding A will pay attention to A. Summing these attentions gives A an “importance score.” The importance score for each token is calculated, and then the tokens with the lowest scores are discarded. H2O uses the attention score as the scoring function, denoted as $f_\theta(acc; attn)$ . This allows us to identify tokens that consistently receive high attention scores during the hint and token generation phases as the most critical or key tokens. There is actually room for improvement here, because the absolute value of the attention score is used to reflect the influence of the token. The summation implies the assumption that “the older the token, the more important it is”. The older the token, the more terms are summed, and the final value may be larger.

The specific algorithm flow is shown in the following figure:

Initialize the relevant historical token set and set the maximum capacity of the set.
Iteratively update the historical token collection.
- When the size of the historical token set equals the maximum capacity, the current token is added to the historical token set.
- When the size of the historical token set is greater than or equal to the maximum capacity, the relevance score between the current token and all tokens in the historical token set is calculated (implemented by summing the columns of the attention matrix and then taking the top k). The token with the lowest score is removed from the set, and the current token is added to the set. In other words, at each generation step, the historical token with the lowest relevance to the current token is permanently removed, and the removed historical token will not participate in the attention calculation of any subsequent token. In natural language, this means that the attention score corresponding to the sparsed part of the KV cache is set to 0, while the other parts are calculated normally.
This ensures that the sizes of the two key-value (KV) caches are equal, with at most one key-value pair differing in position. Specifically, either a key-value pair is deleted from the previous time step’s cache and added to the cache of the currently generated token; or the cache of the previous time step’s cache is retained without adding the cache of the currently generated token. This KV cache always satisfies strong temporal locality.

The backslash "" in the diagram is a set operation symbol: relative difference. “U \ A” represents all elements in set U but not in set A. For example, the relative difference {1, 2, 3} \ {2, 3, 4} is {1}, while the relative difference {2, 3, 4} \ {1, 2, 3} is {4}.

2513

A similar approach to H2O is Scissorhands. While H2O discards only one term at a time, Scissorhands discards as many terms as possible based on the target compression ratio (e.g., reducing the KV cache by 30%). Scissorhands simply retains the most recent terms and those with the highest attention scores within the history window. H2O, on the other hand, discards the terms with the lowest cumulative attention scores, retaining only those terms that consistently maintain high attention scores throughout the iterations.

Optimization points

Existing research on reducing KV cache size by eviction of tokens inherently faces several challenges. Methods like H2O assume that the attention pattern remains unchanged across iterations; in other words, H2O uses the sparsity of the previous KV cache layer to guide the sparsity of the next. However, this is not the case in reality, as the attention pattern is dynamic between iterations. Tokens considered unimportant in the current iteration may become important in subsequent iterations. Tokens evicted in previous generation processes may have a high correlation with the current token, but because they were evicted in previous generation processes, it is impossible to recall the token and complete the attention calculation with the current token.

Therefore, H2O has a narrow evaluation window, which makes it highly efficient within the KV cache budget. However, as the sequence length exceeds the KV cache budget, it begins to struggle to cope with the dynamics of attention patterns and fails to capture these dynamic changes well.

Furthermore, the H2O method sets the number of key/value tokens retained as a fixed percentage of the input sequence length. In this case, the KV cache budget remains constant regardless of the number of tokens generated. However, this fixed KV cache budget has some limitations. The number of key/value tokens required for each layer is different, and each query token also requires a different number of key/value tokens to effectively represent the attention pattern of the baseline model. Moreover, even for adjacent query tokens, the required number of key tokens can vary significantly. The fixed KV cache budget ignores this variability between query tokens, and if the number of retained key/value tokens is insufficient to accurately represent the attention weights of the baseline model, it will lead to inefficient KV cache management. Therefore, it is necessary to dynamically adjust the number of key/value tokens loaded and computed based on the needs of each query token, while considering the variability between different layers and query tokens, to achieve more efficient KV cache management.

1.3.2 keyformer

In generative inference, approximately 90% of attention weights are concentrated on a specific subset of tokens, referred to as “key” tokens. These tokens are crucial for large language models (LLMs) to understand context but may not be within the sliding window of window attention. Keyformer introduces a hybrid attention approach, as shown in Figure (d), which combines the most recent tokens with the preceding key tokens when generating the next token. Furthermore, Keyformer reveals that token removal distorts the underlying softmax probability distribution. Considering the critical role of the softmax distribution in token saliency evaluation, Keyformer incorporates regularization techniques to mitigate these distributional perturbations.

2514

motivation

The figure below compares the distribution of the score function under full attention with the distribution after reducing the KV cache. As shown in Equation 2, when the KV cache is reduced, tokens with lower scores are discarded. Due to the nature of the isoftmax function, this disrupts the distribution of the score function, leading to the loss of contextual information, reduced accuracy, and potentially lower quality of generated text. Therefore, it is necessary to enhance the model’s ability to extract recent key tokens.

2515

plan

Scoring function

Keyformer identifies key tokens by applying regularization to the unnormalized log-odds (logits) to adjust for changes in the scoring function caused by discarded tokens. Specifically, Keyformer proposes a novel scoring function, denoted as $f_\theta(\text{Keyformer})$ , to address the limitations of cumulative attention-based scoring functions ( $f_\theta(acc\; attn)$ ). This new scoring function integrates the Gumbel noise distribution into the unnormalized logits. However, it fails to account for discarded tokens when forming the underlying probability distribution. To correct this, Keyformer introduces a temperature parameter, denoted as $\tau$ .

The detailed analysis is as follows:

Keyformer employs log-odds regularization. It maintains the model’s robustness and adaptability by introducing an additional distribution ( $\zeta$ ) to regularize the reduced log-odds. This helps identify key tokens even when unknown context exists during inference. Keyformer adds this noise to the unnormalized log-odds derived from $QK^T$ . Furthermore, the type of distribution added significantly impacts the final probability distribution, corresponding to label 1 in the diagram below. Here, $y_i$ is the adjusted log-odds (logits), $x_i$ is the unnormalized log-odds, and $\zeta_i$ is the additional distribution used for regularization.
When calculating probabilities, discarded tokens are considered using a temperature parameter $\tau$ . The “temperature” parameter ( $\tau$ ) plays a crucial role in smoothing the probability distribution. This parameter controls the degree of randomness in the probabilities. $\tau$ is particularly important when tokens are removed from the KV cache, as they cannot be reintroduced without recalculating their keys and values. Higher $\tau$ values ( $\tau \rightarrow \infty$ ) produce uniform probabilities, assigning the same score to all tokens. Conversely, lower $\tau$ values ( $\tau \rightarrow 0$ ) produce a sharper distribution, prioritizing specific tokens based on unnormalized logits. Furthermore, as more tokens are discarded, we need a more uniform or random probability distribution. Therefore, we systematically increase the randomness in our scoring function $f_\theta$ by progressively increasing $\tau$ and $\Delta\tau$ .
The chosen distribution is inspired by the Gumbel distribution. The Gumbel distribution is particularly well-suited for our key token identification task because it describes the distribution of maximum values in the sample set and is biased towards the initial token. This makes it ideal for modeling key tokens in long sequences. Label 3 shows the standard Gumbel probability density function (pdf) applied to unnormalized logits, while label 4 shows the pdf of logits after adding Gumbel noise.

2516

algorithm

While the relevance of the current token is important for identifying key tokens, the model should maintain consistent behavior across most generated tokens. To identify key tokens based on this consistent behavior, Keyformer applies a cumulative scoring function ( $f_\theta$ ) in both the hinting and token generation phases. Without accumulation, a token’s relevance would depend solely on its correlation with previous tokens.

During the hint processing phase, Keyformer calculates the key and value of all n tokens within the hint length $S_n$ to predict the first token. Given a KV cache budget, Keyformer maintains a window of the most recent w tokens, discards n-k tokens from the n-w token window, and identifies k-w tokens based on the Keyformer scoring function. The combination of the key tokens (k-w) and the most recent tokens (w) forms the reduced KV cache.

During the token generation phase, Keyformer operates using a shrunken KV cache. The first generated token focuses only on the k tokens in the KV cache, as shown in decoding step 2 in the diagram below. The most recent window w shifts one token to the right, while the scoring function $f_\theta$ accumulates the scoring function from the previous decoding step. In each decoding step of the token generation phase, the model identifies k-w key tokens from a window of size k+1-w. Therefore, one token is added and one token is removed from the most recent window.

In addition, the temperature parameter $\tau$ adjusts the number of tokens removed from the probability distribution of the scoring function by increasing $\Delta\tau$ .

2517

The detailed algorithm of Keyformer is shown in the figure below.

2518

Optimization points

The following technologies are similar to H2O:

SubGen is an efficient compression technique developed for KV cache. Empirical evidence suggests that key embeddings in the attention module exhibit significant clustering trends. Based on this key insight, SubGen designs a novel caching method with sublinear complexity that employs online clustering of key tokens and online $\ell_2$ sampling of values.
LESS (Low-rank Embedding Sidekick with Sparse policy) learns the residual between the original attention output and the attention output approximated by the sparse policy. It achieves this by accumulating the information discarded by the sparse policy into a low-rank cache or state of constant size, thus allowing queries to still access information to recover previously omitted areas in the attention graph.

The aforementioned method based on permanent eviction faces two major limitations.

First, irreversible expulsion of tokens can impair the model’s performance on long sequence tasks, especially in needle-in-a-haystack scenarios, where these methods have proven difficult to adapt to multi-turn dialogue environments. Secondly, choosing a KV cache during the decoding stage introduces computational overhead, which adversely affects decoding latency and end-to-end acceleration.

To address these challenges, several studies have focused on developing key-value (KV) cache selection strategies for the decoding phase without requiring permanent eviction. For example:

InfLLM: Block-level storage full KV cache, GPU retains key tokens, and secondary parts are offloaded to CPU.
SqueezedAttention: Offline K-means clustering generates semantic key clusters, which are then loaded during inference.
SparQ: Selectively retrieves the hidden dimension of cache keys based on salient elements of the query vector, and approximates the calculation of attention.

These methods typically employ multi-tiered caching systems (e.g., CPU-GPU hierarchical caching) and leverage advanced data structures and system-level enhancements to optimize retrieval efficiency, thereby achieving efficient inference while reducing GPU key-value cache footprint. To accelerate key token retrieval, these studies propose index-based methods that organize and access key-value caches at the block or cluster level, employing multi-tiered caching (e.g., CPU-GPU hierarchical storage) to avoid permanent removal and optimize retrieval efficiency, thus enabling efficient query and retrieval operations.

1.3.3 Summary

The paper “A Survey on Large Language Model Acceleration based on KV Cache Management” provides a summary of static and dynamic sparsification methods.

2519

1.3 Sparsity for prefill

Existing key-value (KV) caching compression methods primarily focus on optimization during the generation stage, neglecting KV caching compression during the input stage. However, in practice, large-scale model applications such as dialogue (especially RAG) and document processing are characterized by very long inputs and relatively short outputs. The input may already be enough to fill up GPU memory. Or even if GPU memory is sufficient, each computation step can be very slow due to the large number of tokens requiring interaction. Therefore, the input-stage KV caching is a bottleneck in both memory and computation. For example, if our input prompt is 16KB, storing all the KV tokens corresponding to this 16KB text would put immense pressure on both GPU memory and computation. Therefore, we can process the KV cache during the prefill stage by dropping some less important tokens, that is, compressing the KV generated from the prompt. The decoding stage proceeds as usual, eliminating the need for additional computation during decoding and thus accelerating the process.

1.3.1 FastGen

FastGen identifies five attention patterns (local, punctuation-oriented, sparse distribution, etc.) and then identifies important tokens during pre-filling. For each token, FastGen performs targeted profiling based on the characteristics of different attention patterns to discern the internal structure of the attention module. Then, it applies different compression strategies to the identified structures accordingly, adaptively building a key-value cache. FastGen does not set a cache budget; instead, it sets the maximum approximation error of the attention matrix, thus focusing on preserving model accuracy.

Insight

Previous research on compressed key-value caches has not utilized the complex attention structure in LLM. The attention structure in LLM has the following characteristics:

Note that the module has a rich structure.
Different attention heads typically have different functions and follow different patterns.
Not all attention modules need to pay attention to all representations.

The above characteristics indicate that a compression strategy needs to be customized for each individual attention head. The left side of the image below shows four common attention structures, and the right side shows the attention graph of three attention heads in the same layer. Specifically, special lexical units (green) + punctuation lexical units (orange) + local attention (blue). Gray represents other lexical units.

2520

algorithm

In addition to the traditional full key-value (KV) caching strategy, FastGen considers four basic KV cache compression strategies. The four KV cache compression strategies are:

Special tokens. Only special tokens, such as sentence beginning tokens <s> and instruction tokens [INST], are retained in the KV cache.
Punctuation marks. Only punctuation marks, such as ”.”, ”:”, and ”?”, are retained in the KV cache.
Locality. This strategy evicts long-term context tokens, retaining only the last adjacent token. Some attention modules primarily focus on local context, or what we commonly refer to as Locality. In this strategy, once the relative distance between the context token and the current token exceeds a threshold, the context token’s key-value cache is evicted. The threshold is determined by a predefined ratio of the local context length budget to the input sequence length.
Frequency. Monitor the cumulative attention score for each token, then treat these scores as the token usage frequency, and only store the most frequently used tokens in the KV cache.

FastGen can be divided into two steps:

First, at the end of the pre-filling stage, the model’s attention layers are profiled to identify the behavior and structural patterns of various attention heads. Guided by this profiling, the most suitable compression strategy that satisfies the error objective is adaptively selected for each head, and a key-value cache is built. Like other methods, FastGen assumes that the identified attention patterns will remain unchanged in future generation steps. If the error objective is too stringent to be achieved, FastGen reverts to the regular key-value cache.
Then, at each generation step, instead of indiscriminately adding a new KV vector to each newly generated token, the KV cache is managed according to the compression strategy chosen for each token. For example, long-range text emphasizing local context is evicted from the attention header, non-special tokens in the attention header centered on a special token are discarded, and the standard KV cache is only used for attention headers that broadly pay attention to all tokens.

1.3.2 SnapKV

The core idea of SnapKV is to identify important tokens based on the user’s question. LLMs exhibit strong consistency and stability in their attention patterns towards input tokens during generation, and these patterns can be observed within a small window at the end of the input sequence. Therefore, SnapKV only retains the key-value pairs corresponding to input tokens that are continuously followed by attention heads, thus significantly reducing the size of the key-value cache. Simultaneously, SnapKV also incorporates clustering techniques such as pooling, which can capture complete local information while filtering out most irrelevant tokens.

Insight

The authors conducted an experiment, dividing the input into windows of 128 tokens each, and using the last 20 windows for computation. For any token in the prefix, let’s call it $t_i$ . We take one of the 20 windows, let’s call it $w_i$ . Every element in $w_i$ interacts with $t_i$ because every element in $w_i$ comes after $t_i$ . Therefore, $w_i$ and $t_i$ will calculate an attention score. For window $w_i$ , $t_i$ will have 128 attention scores. The average of these 128 attention scores is taken as the importance value of $t_i$ . Thus, for window $w_i$ , each token in the prefix can have its importance value calculated, and then the top K most important tokens in the prefix during the generation stage are selected based on the average score.

Then, the authors compared the overlap rate between the important tokens selected by the window and those selected during the generation stage. By comparing the important tokens selected by the 20 windows with the important tokens actually selected during the generation stage, the authors found that the tokens selected by the last window and those selected during the generation stage had a high degree of overlap. Therefore, this demonstrates that the importance of the input can be determined using the input prompt itself. This is what the paper states:

Pattern can be identified before generation.
Pattern is consistent during generation.

In addition, the author observed several other phenomena.

Regardless of the length of the generation context, specific tokens in the hints consistently exhibit higher attention weight during the generation process, and this is relatively stable.
The location of the question in the prompt does not significantly change the consistency of the observed attention pattern. That is, attention to relevant features can be obtained regardless of the location of the question.
Attention patterns are highly context-dependent, indicating that these patterns are strongly associated with the specific instructions given by the user.

Therefore, a context-aware KV cache compression method may result in better performance.

plan

SnapKV simplifies FastGen’s approach, focusing solely on retrieving tokens based on their importance scores. Of all the cue tokens, only a subset carry crucial information for response generation; these tokens retain their importance during the generation phase. SnapKV employs an end-positioned observation window to detect these important context tokens. Then, it concatenates their corresponding key-value pairs with the tokens in the observation window.

SnapKV does not retain a KV cache of all input tokens during the Prefill phase, but instead adopts a sparse approach.

During the prefill phase, the Prompt for each Attention Head is divided into two parts: Prefix and Window, as shown in orange and green in the diagram below. These can be set as representative tokens and window tokens, corresponding to number 1 in the diagram. The specific attention features that each Attention Head consistently focuses on during generation can be obtained from the “Observation” window located at the end of the Prompt.

The sparse tokens are selected by comparing the attention scores of the tokens in the window and the tokens in the prefix. That is, assuming a token A precedes the window token, the window token will have an attention score for A. The sum of the attention scores of all window tokens for A is the “importance score” of A. The tokens with the highest importance scores are the representative tokens. This corresponds to number 2 in the diagram below.

The top k tokens are selected based on their attention scores. Furthermore, before selecting the top k, a one-dimensional pooling operation is performed on the attention matrix. The paper explains that this helps LLM retain more complete information, because using only the content selected by the window would result in incomplete generated information. Pooling nearby tokens can fill in the context, similar to Gaussian blur. This corresponds to icon 3 in the image below.

Finally, the KV Cache of the selected token and the KV Cache of all tokens in the last window are combined to form the KV Cache of the Prompt. This corresponds to icon number 3 in the image below.

SnapKV only selects critical KV caches during the Prefill phase and then uses these sparse KV caches during the Decoding phase to accelerate the generation of long-context scenarios. Furthermore, the Decoding phase does not update the Prompt KV cache.

2521

The specific algorithm is as follows.

2522

1.4 Sparsification based on layer characteristics

The layered architecture of LLM results in different information extraction modes across layers, and the KV cache of each layer contributes differently to the model performance.

The attention layer of the LLM model has its own characteristics, which also affect the sparsity strategy of the KV cache. For example, to estimate how many keys/values in the KV cache need to be retained, the InfiGen authors sorted the attention weights of each query token in descending order and accumulated the weights of key tokens until the accumulated weights reached 0.9. The following figure shows the distribution of the number of key tokens required to reach a total attention weight of 0.9 in different layers (Layer 0 and Layer 18). The horizontal axis of the histogram represents the number of key tokens, and the vertical axis represents the number of query tokens.

The attention pattern at layer 0 is more dispersed, and the number of key tokens required for different query tokens varies greatly, requiring reference to more key tokens to capture effective attention.
The vast majority of queries at layer 18 only require 16 to several dozen key tokens to achieve a cumulative weight of 0.9. Only a very small number of tokens require more key tokens, indicating that the attention pattern at higher levels is more focused and can obtain effective attention by relying on fewer key tokens.

2523

These observations reveal that each layer contributes to the overall attention to varying degrees, so they cannot be directly excluded. Furthermore, the attention patterns across different layers differ significantly. Therefore, a unified key-value cache compression across layers may not be optimal.

In KV Cache management, it is necessary to extract heterogeneity from information between LLM layers, differentiate or accumulate it based on the contribution of attention weights, and dynamically adjust the number of key tokens participating in attention calculation at each layer. By intelligently allocating memory resources according to the importance of each component to prediction accuracy, memory utilization can be optimized while minimizing accuracy degradation.

Note: The paper “A Survey on Large Language Model Acceleration based on KV Cache Management” categorizes both layer-based and head-based sparsity into “KV Cache Budget Allocation.” KV cache budget allocation addresses this challenge by intelligently allocating memory resources based on the importance of each component to prediction accuracy. For example, it allocates more KV cache to lower layers where information is more dispersed, while reducing the allocation to higher layers where information is more concentrated. A detailed summary is shown in the figure below.

2524

Let’s take a look at some specific cases.

1.4.1 PyramidKV

The paper “PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling” explores the effect of cross-layer shared kvcache by studying the attention mechanism between different layers. It finds that lower layers exhibit a uniform attention distribution across the input sequence, while upper layers show concentrated attention on specific tokens. Therefore, PyramidKV employs a pyramid-shaped memory allocation strategy, allocating more KV cache to lower layers where information is more dispersed, with each KV containing less information. Simultaneously, it reduces the KV cache in higher layers, concentrating information in fewer key tokens at higher levels, while selecting tokens with high attention values at each layer. Furthermore, during the decoding phase, PyramidInfer dynamically maintains a set of important tokens through frequent updates driven by attention values.

Insight

The authors studied the layer-by-layer attention graphs of the Llama model for multi-document question answering and discovered a pyramidal information funneling pattern in the attention layers.

In the lower layers of the model (e.g., layer 0), the attention scores exhibit an approximately uniform distribution, indicating that the information is relatively scattered. This suggests that the model globally aggregates information from all available content at lower layers, rather than prioritizing specific paragraphs.
When the encoded information reaches the intermediate layer (6-18), it gradually shifts to a localized attention pattern focused on the paragraphs themselves. At this stage, attention is primarily concentrated on tokens within the same document, indicating that the model has aggregated information within individual paragraphs. In other words, the model has concentrated the necessary information into a subset of tokens.
This trend continues and intensifies in the upper layers (24-30), and the phenomena of “Attention Sink” and “Massive Activation” can be observed.

Massive activation, a term from the paper “Massive activations in large language models,” refers to the situation in the hidden states of a model where a very small number of activations are much larger than other activations; these activations are called “massive activations.”

plan

The authors chose to improve compression efficiency by dynamically allocating cache budgets based on attention patterns. Different key-value cache mechanisms are used for different characteristics of the layers.

Use the entire kvcache in the first few layers. Because these layers aggregate global information, the information is more dispersed.
Deeper layers use less key-value caching. In these layers, the attention mechanism focuses heavily on a few key tokens, so retaining only these key tokens ensures consistent output and reduces memory usage. Other tokens, due to their low values, have a low probability of being sampled, so caching these tokens can be discarded.

The KV cache ultimately takes the form of a pyramid, as mentioned in the paper, known as PyramidKV. Once the KV cache budget is determined for each layer, PyramidKV selects KVs to cache based on attention at each Transformer layer. The final portion ( $\alpha$ of the KV cached tokens), that is, the Instruction Tokens, are retained across all Transformer layers.

2525

To quantify the importance of each token in the generation process, the authors measured the level of attention each token received from the instruction token and used this measurement to select important tokens for the KV cache. The specific implementation process is as follows:

Since the first and nearest neighbor tokens contain important information and have a significant impact on the generation effect, they cannot be discarded. Therefore, hyperparameters are used to control the selection of these tokens, such as always retaining the initial 64 tokens (sink tokens) and the most recent 128 tokens (instruction tokens).
Determine the total number of tokens to be retained in the first and last layers. For example, the first layer might retain all tokens, and the last layer might retain 1/10 of the tokens. The number of tokens that need to be retained in the intermediate layers decreases linearly from the first to the last layer. The specific token count is determined by using the instruction token and the attention score calculated for that token.
Once the number of tokens k is determined, the top k can be selected. Since the final hidden state is obtained by weighting and averaging the weights and values of each row of the qk matrix, the selection rule is the weight value of each row of the qk matrix.
During decoding, each layer only loads the corresponding number of kvcache instances for computation.

1.4.2 PyramidInfer

PyramidInfer also employs a pyramid-shaped budget allocation strategy, selecting tokens with high attention values at each level. Furthermore, during the decoding phase, PyramidInfer dynamically updates and maintains important tokens based on their attention values.

motivation

Previous solutions often overlooked the interdependencies between layers and the significant memory consumption during precomputation. The authors of PyramidInfer discovered that the number of key-value pairs influencing future generation decreases layer by layer, and that these crucial KVs can be extracted through the consistency of attention weights. Based on these findings, the authors proposed PyramidInfer, which compresses the KV cache by preserving key context layer by layer.

The image below shows the difference between PyramidInfer, StreamingLLM, and H2O. In PyramidInfer, the KV cache decreases layer by layer, becoming sparser as you go further down the hierarchy.

2526

plan

The execution process of PyramidInfer is shown in the following figure:

During the Prefill phase, PyramidInfer retains only the Pivotal Context (PVC) of each layer to compress the KV Cache.
During the Decoding phase, PyramidInfer updates the PvC based on the new, most recent Token.

2527

1.4.3 ZigZagKV

ZigZagKV is actually an optimization of SnapKV. In other words, it is also a KV cache compression algorithm for input prompts, but ZigZagKV is sparsified based on the characteristics of the layers, so it is placed here.

motivation

To further improve this, ZigZagKV defines the following metrics:

MBA (Minimum Budget size required to maintain 90% of the total attention score) is essentially the minimum number of key-value pairs needed to maintain 90% of the attention score.
LMBA (Layer Minimum MBA size to maintain attention score) is the average MBA of all attention heads in a layer, calculated by each attention head. A higher LMBA means more tokens are needed to maintain accuracy, which in turn requires more GPU memory to store the key-value cache.
LMBO (Layer-wise Minimum Budget for Output): This parameter is used to analyze how many key-value pairs need to be retained to make the outputs similar, based on the similarity of the outputs.

See the diagram below for details, where $a_i$ is the i-th element of A, and n is the number of tokens in the current KV Cache.

2528

Through experiments, the authors discovered:

The LMBA values of different layers vary considerably across different models; the LMBA values of different layers within the same model also vary considerably.
The number of key-value pairs required for different layers of the same model can vary greatly while maintaining a relatively consistent output.

Based on the two observations above, the author concludes that it is unreasonable to allocate the same amount of video memory to each layer of the model to store the KV cache.

Space allocation

How much space should be allocated to store the KV cache for each layer?

An intuitive idea is to allocate space to each layer according to the proportion calculated by LMBA, with larger layers receiving more and smaller layers receiving less. However, this allocation method has a problem: if the LMBA of a certain layer is particularly large, it might result in some layers receiving very little space, meaning these layers can essentially not store any key-value pairs. A layer with virtually no key-value cache will naturally compromise performance. Therefore, we reserve a minimum size for each layer, and the allocation formula is changed. This is the main improvement of ZigZagKV. SnapKV allocates space evenly to each layer, while ZigZagKV allocates space adaptively.

Having solved the space allocation problem for each KV cache layer, another issue remains: which K and V values to store. The token selection in ZigZagKV is very similar to that in SnapKV. However, while SnapKV distributes memory evenly across all layer heads, ZigZagKV doesn’t allocate the same amount of memory to each attention head. Instead, it allocates memory based on the number of tokens each head needs to retain the majority of its information.

2529

1.4.4 Layer-Condensed KV Cache

LCKV proposes computing and caching only a small subset of keys and values across layers, or even just the top layer, and then allowing lower-level queries to pair with the stored keys and values for inference. This approach not only significantly improves inference speed and reduces memory consumption but also aligns orthogonally with existing memory-saving techniques, enabling direct integration for further optimization. While this mechanism makes the next attention computation dependent on the top-level keys and values of the previous attention, contradicting the parallel training of Transformers, LCKV introduces an approximate training method to support parallel training.

Specifically, the Transformer’s layer-by-layer encoding of model input is essentially a step-by-step processing of token information. Therefore, the authors of the Layer-Condensed KV Cache believe that the information in the last layer is the most important and comprehensive, and the information obtained from the previous layers can be ignored, retaining only the KV cache of the last layer. When generating subsequent tokens, their corresponding KV caches are retrieved from the last layer. This way, the model no longer needs to calculate and store the key-value pairs of tokens from the previous layers, nor does it need the $W^K W^V$ matrices from the previous layers, greatly saving computational and memory overhead. See the diagram below.

2530

1.5 Other Solutions

1.5.1 Sparsity based on head characteristics

It performs finer-grained per-head budget allocation based on sparsity for head features, enabling precise memory allocation among individual attention heads within each layer, providing more flexible and targeted optimization opportunities.

AdaKV: It allocates caches to each head based on differences in attention patterns and maximizes the retention of information by optimizing the L1 loss bound.
RazorAttention describes two distinct retrieval heads: an “echo head” focuses on previously encountered identical tokens, and an “induction head” focuses on previous tokens that are duplicates of the current token. The framework implements a differentiated caching strategy, maintaining complete cache entries for the retrieval heads while compressing remote attention into merged compensating attention that is not part of the retrieval heads.
DuoAttention introduces a parameterized approach to distinguish between two types of attention mechanisms: retrieval heads, crucial for comprehensive long-context processing; and streaming heads, primarily handling recent attention and attention absorbers. This classification is achieved by learning parameters that automatically identify retrieval heads requiring full attention.

1.5.2 Speculative Sampling Sparsity

The idea behind speculative sampling sparsity is to generate candidate tokens using a draft model and then select the most important tokens.

The paper “TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding” proposes TriForce, a hierarchical speculative sampling system primarily used for long sequence generation. TriForce aims to address the acceleration problem in the context of long sequence generation. In the long sequence generation setting, in addition to the model’s own weights, the key-value cache for model inference also occupies a large amount of GPU memory, and becomes a key factor limiting inference speed when the context is long enough.

Therefore, TriForce introduces a target model that uses only a portion of the KV cache to reduce the constraint of the full KV cache on inference speed. Simultaneously, it employs a multi-level speculative inference strategy to further reduce the inference overhead of the draft model. The core idea of TriForce is shown in the figure below, which mainly consists of three processes involving two levels of speculative sampling:

Speculation Phase 1 (This phase can be executed multiple times until multiple candidate tokens that meet the conditions are obtained):
- We used a relatively small LLaMA-68M model as a draft model, and used the StreamingLLM approach to quickly generate multiple candidates.
- The LLaMA2-7B-128K model was used as the target model, and a sparse key-value cache scheme was employed for parallel verification. The verified token served as a new draft. Due to the use of a sparse key-value cache, the results may be lossy.
Speculation Phase 2 (executed once): Using the LLaMA2-7B-128K model as the target model, parallel validation is performed using a full key-value cache scheme. This phase ensures that the generated results are lossless.

2531

In terms of caching, TriForce adopts a search-based approach:

Divide the key and value into different chunks.
Calculate the distance between the query and the center of each chunk (using the mean as the center).
Select the topk chunks as sparse key-value caches for Attention calculation.

2532

1.5.3 Cluster Sparsity

To accelerate the retrieval of key attention, some research has proposed index-based methods that organize and access KV caches at the block or cluster level, achieving efficient query and retrieval operations. InfLLM maintains a complete KV cache within a block while facilitating long sequence processing through a hierarchical storage strategy. This framework employs CPU-GPU memory orchestration, retaining basic attention and current computation units in GPU memory while offloading less frequently accessed units to CPU memory. To further improve the accuracy of top-k block retrieval, the Quest framework proposes a fine-grained block representation method based on the minimum and maximum key values in the KV cache blocks.

SqueezedAttention employs K-means clustering in the offline phase to group semantically similar keys, with each group represented by a centroid. During inference, it compares the input query against these centroids to identify and load semantically relevant keys within the context. Similarly, RetrievalAttention uses Approximate Nearest Neighbor Search (ANNS) to index KV cache attention. Furthermore, EM-LLM dynamically segments incoming attention into sporadic events. It also implements a hybrid retrieval mechanism that combines semantic similarity matching with temporal context to efficiently access relevant KV cache segments. Likewise, ClusterKV groups attention into semantic clusters and selectively recalls them during inference, achieving high accuracy and efficiency for LLM.

We will use ClusterKV as an example for learning.

Research on how to discard K and V is further subdivided into two directions:

Non-recallable KV Cache: This directly discards some key-value pairs (K and V) that the system deems unimportant. The key is how to determine what is unimportant. For example, it retains the KV cache corresponding to the attention sink and the tokens in the most recent window, discarding everything else, thus keeping computation and GPU memory manageable.
Recallable Key-Value Cache: In reality, the importance of a token in the key-value cache changes dynamically. A token may be unimportant at the beginning but become important later. To achieve better results, we cannot simply discard a key-value pair. Instead, we should discard key-value pairs at times and reload them into memory for continued use, depending on the specific circumstances.

The image below compares several methods, with asterisks indicating important tokens. We can see that the important tokens are actually dynamically changing.

2533

A key observation of ClusterKV is that for a given q-value, the cosine distances of the corresponding K values are relatively small. Based on this observation, the authors used K-means clustering to cluster similar K values and their corresponding V values using the similarity of K values as the classification criterion. Each cluster then has a centroid vector, which is used to calculate importance. If the cosine distance (1-cosine value) between q and a centroid vector is small, then the cosine distances between q and all vectors within that cluster should also be small; in other words, clusters with smaller distances tend to be larger. Therefore, we can load the entire cluster for subsequent calculations while minimizing the number of important tokens missed.

2534

The clustering methods are as follows:

First, the attention sink part is retained. The paper simply retains the first 16 tokens.
Clustering was performed on the results of the prefill phase. The authors’ experiments showed that 400 classes were the optimal trade-off for a 32K context.
During the decoding phase, clustering is performed every m steps. The paper, by default, performs a new clustering operation after every 320 tokens are generated.

2535

When selecting a category, first set a fixed size B, and then select from the most important categories according to their importance, continuing until the sum of all elements in all categories exceeds B.

2536

0x02 Reuse

Reuse is also an important method to reduce the length of KV sequences. In this section, we mainly introduce two schemes: KV cache merging and prefix reuse.

2.1 KV Cache Merging

There’s no definitive answer to the question of whether to delete unimportant key-value pairs to save memory, or to keep them as is but compress them without deletion. Deletion might immediately alleviate memory pressure, but could potentially impact model performance. Conversely, more advanced compression techniques strive to reduce memory usage while maintaining information integrity.

KV cache merging explores methods to compress or consolidate KV caches to retain as much information as possible about discarded tokens without significantly reducing model accuracy. This can be seen as an extension of the token discarding strategy mentioned earlier. Instead of employing a uniform compression strategy, KV cache merging leverages inherent redundancy within and between layers to dynamically optimize memory utilization. These methods aim to reduce the size of the KV cache while retaining the critical information needed for accurate attention computation, thereby achieving efficient inference in resource-constrained environments. Existing KV cache merging strategies can be categorized into two main approaches: intra-layer merging, which focuses on merging KV caches within each layer to reduce memory usage per layer; and cross-layer merging, which addresses cross-layer redundancy to eliminate unnecessary duplication. Both approaches offer complementary advantages, providing flexibility in balancing memory savings and model performance degradation. A summary of KV cache merging is shown in Table 4.

2537

2.1.1 Inter-layer merging

We will use MiniCache as an example to learn about inter-layer merging.

MiniCache merges similar key-value pairs in mid-to-high-level layers. Specifically, MiniCache decomposes the key-value pairs of each token in adjacent similar layers into shared direction vectors and magnitudes, storing only the shared direction vector, attention magnitude, and non-scalable attention to maximize memory efficiency.

motivation

The authors discovered that adjacent layers in the deeper parts of the KV cache within the LLM exhibit high similarity, which can be used to compress and merge KV caches. Furthermore, they introduced a token preservation strategy to prevent the merging of KV caches with significantly different depths. This method can also be used orthogonally with other KV cache quantization schemes.

Ideas

The diagram below illustrates the main differences between MiniCache and previous methods. MiniCache’s key feature is its focus on inter-layer redundancy in the KV cache along the depth dimension of the LLM, an aspect previously overlooked by intra-layer-based methods. Here, T refers to the last timestamp of pre-decoding, and T+1 refers to the first timestamp of decoding.

2538

The following diagram shows MiniCache’s compression and retention strategies.

(a) describes the cross-layer compression process. The model obtains KV caches from layers l and l-1 and merges them into the shared state via equation (3). In addition, the model also selects non-mergeable attentions for retention and then stores the merged caches, retained attentions, and quantities in C.
(b) illustrates the recovery process for layers l and l-1, including quantity rescaling and attention-preserving recovery as described in equation (2).

2539

algorithm

The following diagram shows the detailed execution flow of MiniCache.

Obtain KV Cache: During the Prefill phase, KV Cache is generated layer by layer.
Cross-layer merging: When the merging start layer S is reached, the KV Cache of the current layer L is merged with the KV Cache of the previous layer L-1 to reduce redundancy.
Cache: Store the merged KV cache for future use.
Deletion: During the Decoding phase, unnecessary or redundant KV caches are deleted to optimize memory usage.
Loading and generating: Retrieve the required KV cache for generating output.
Recovery: Apply error mitigation mechanisms, including rescaling and retention recovery, to the acquired KV Cache to minimize errors introduced during merging and compression.
Update: After the recovery phase, update the shared KV Cache with the final KV Cache.

2540

2.3 Intra-layer merging

We use DMC to learn intra-layer merging.

In a normal Transformer, at each time step $t$ , $k_t$ and $v_t$ are appended to the KV Cache.

In DMC, the KV Cache update process is different, as shown in Algorithm 1 in the figure below. At each time step, DMC decides whether to directly append the current key-value pair to the cache, or to perform a weighted average of them with the top item in the cache.

Specifically, DMC uses a decision variable $\alpha_t \in \{0, 1\}$ and an importance variable $\omega_t \in [0, 1]$ . We use the first neuron in $k_t$ and $q_t$ to extract these two scores. DMC decides based on $\alpha_t$ whether to append the KV representations $k_t$ and $v_t$ to the cache or to accumulate them with the last element of the cache.

The memory in DMC grows in a sublinear manner, which, while not as constant as the memory during inference in Linear Attention Transformer, is significantly better than that of Transformer.

2541

2.2 Prefix Reuse

LLM services have evolved from stateless to stateful systems, leveraging the dependencies inherent in reasoning requests to optimize key-value caches. These dependency-based techniques can be categorized into two types: cross-request and intra-request.

Cross-request techniques leverage dependencies across requests. A notable technique is prefix reuse, also known as context caching, which reuses the key-value cache for requests sharing the same hint prefix, thereby speeding up the pre-filling phase. The ability to reuse the key-value cache across requests significantly improves latency (especially first-word latency) and throughput (by drastically reducing the memory footprint of concurrent requests with shared prefixes).
On the other hand, in-request techniques leverage dependencies within a single request. A well-known example is split inference, which breaks a request into two sub-requests for better scheduling, and sequential parallelism, which breaks a request into multiple sub-requests to distribute the load.

This article introduces prefix reuse; the next article will introduce separate reasoning.

2.2.1 Requirements

Most current LLM applications generally follow the system prompt + user prompt paradigm when constructing input. We also often encounter scenarios with long system prompts and multi-turn dialogues.

Scenarios involving long system prompts. During the batch execution of an LM program, if multiple Large Language Model calls share the same prefix, the system prompt is identical across different requests, and this part is located at the beginning of the prompt, so its key and value are also the same across requests. System prompts may be shared across processes for extended periods, from seconds to even hours or days. These predefined long system prompts can lead to memory waste with low hit rates and are static and inflexible when frequently refreshed in large-scale deployments.
Multi-turn dialogue scenarios. In multi-turn dialogue scenarios, the prompt for the next turn can actually be viewed as a combination of the prompt and completion of the previous turn. Each turn of dialogue needs to rely on the context of all previous turns of dialogue, and the key-value cache in the previous turns must be recalculated in each subsequent turn.

In both scenarios, data in the key-value cache can theoretically be reused multiple times. If the key-value caches from the system prompt and previous rounds can be saved for reuse in subsequent requests, multiple redundant system prompt key-value caches can be avoided. This significantly reduces the time spent on the first token, improves the service’s maximum throughput, and reduces the time to compute the first token (TTFT) for a new round of requests. For example, for the system prompt, if these caches are named using key-value pairs from a self-attention mechanism, requests with the same prompt prefix can share the same key-value cache, effectively reducing computational redundancy and memory consumption. In multi-turn dialogue applications, reuse can eliminate the need for recompute in previous rounds when generating dialogues.

The ability to reuse KV Cache across requests will result in significant improvements in latency (especially the latency of the first token) and throughput (by greatly reducing the memory footprint of concurrent requests with shared prefixes).

However, this optimization is highly complex and faces numerous technical challenges, as it requires additional storage and more sophisticated memory management. For example, each inference process is an independent application, and these copies operate in parallel. The challenge lies in efficiently reusing the key-value cache (shared prompt prefix) across multiple calls and instances. However, existing systems lack effective mechanisms to support this reuse. Consequently, most leading inference systems currently adopt a relatively conservative approach: they do not analyze the prefix-sharing characteristics across large batches, nor do they schedule requests sharing the same prefix together. Instead, they schedule tasks at the request level, discarding the key-value cache after processing a request. This not only leads to the premature eviction of shareable key-value caches, preventing this valuable resource from being shared across multiple requests, but also extends the lifetime of shareable key-value caches, slowing down execution and resulting in significant redundant computation and memory waste.

The image below illustrates four typical models of key-value caching: few-shot learning examples, questions in self-consistency, chat history in multi-turn chat, and search history in tree-of-thought. Blue boxes indicate shareable hints, green boxes indicate non-shareable parts, and yellow boxes mark non-shareable model outputs. Existing systems struggle to automatically handle all these patterns.

2542

The following section will introduce the implementation of Prefix Caching in some large-scale model inference systems.

2.2.1 Prompt Cache

The paper “Prompt Cache: Modular Attention Reuse For Low-latency Inference” proposes Prompt Cache, a method to accelerate LLM inference by reusing attention across different LLM prompts. Prompt Cache employs a method called a schema to explicitly define reusable text segments, referred to as prompt modules. This schema ensures positional accuracy during attention reuse. Prompt Cache then populates its cache when a schema is loaded. During inference, if these “cache” segments appear in the input prompt, the system uses pre-computed key-value attention in memory, only computing the uncached text. Prompt Cache is essentially a space-for-time tradeoff technique.

The image below shows the differences between fully autoregressive generation (a), KV Cache (b), and Prompt Cache (c), with each method displaying three steps, and each box representing a token. Blue boxes represent prompts.

The LLM receives a prompt and predicts the next token (A). It then appends the generated token (A) to the prompt to predict the next token (B). This process continues until a stopping condition is met.
KV Cache computes attention only once for the prompt and reuses the KV Cache in subsequent steps.
Prompt Cache can reuse key-value states across services, thus omitting attention calculations for prompt words. During loading mode, Prompt Cache populates the cache and reuses the cached state in prompt words based on that mode.

As the size of the cache segment increases, the performance advantage of the Prompt Cache becomes more apparent, because the computational overhead of attention is quadratic with the size of the input sequence, while the space and computational complexity of the Prompt Cache are linear with its size.

2543

Insight

Many input prompts contain overlapping text segments, such as system messages, prompt templates, and context-provided documentation. By pre-compiling and storing the attention of these frequently occurring text segments on an inference server, they can be effectively reused when they appear in user prompts.

However, two challenges arise when reusing attention across prompts.

Attention is position-encoded, and the attention of a text segment can only be reused when the segment appears in the same position. How can we ensure that the reused attention is in the same position as the current prompt?
How can the system efficiently identify text segments that may have been cached?

To address these two issues, Prompt Cache combines two ideas.

The Prompt Markup Language (PML) is used to format the structure of prompt words. PML marks reusable text segments as modules, or prompt modules. It not only solves the second problem mentioned above but also opens up new avenues for solving the first problem because a unique location ID can be assigned to each prompt module.
The authors discovered that LLM can handle attention with discontinuous position IDs. This means that attention from different parts can be extracted and concatenated to form subsets. This allows users to select prompt modules as needed, and even update certain prompt modules at runtime.

plan

A schema is a document that defines prompt modules and describes their relative positions and hierarchical structure. LLM users write prompts in PML so that attention can be reused based on them. Importantly, users must derive prompts from an agenda, which is also written in PML.

When the Prompt Cache receives a prompt, it first processes its schema and calculates the attention of its prompt module. It reuses these states for the prompt modules within the prompt as well as for other prompts derived from the same agenda.

2544

The image above shows a sample prompt based on the example schema, which can explain the reuse mechanism of the Prompt Cache:

PML explicitly defines reusable prompt modules in both the schema and the prompts. Prompts can include new text segments, parameters, or ending segments at excluded module locations.
Prompt Cache pre-computes the attention of all modules in the agenda and caches it for future reuse.
When a prompt is provided to the Prompt Cache, the Prompt Cache parses it to ensure consistency with the declared schema. The Prompt Cache verifies the validity of the imported module. Then, the Prompt Cache retrieves the attention of the imported prompt module cache from the cache, calculates the attention of the parameters and the new text segment, and concatenates them to generate the attention of the entire prompt.

2.2.2 Hydragen

Hydragen is a hardware-aware, precise implementation of attention mechanisms, suitable for situations involving shared prefixes. Hydragen computes attention for shared prefixes and unique suffixes separately. It utilizes a single shared key-value cache to handle common prefixes across multiple requests, then concatenates the queries across sequences, achieving efficient prefix attention and reducing redundant memory reads.

The following diagram shows the structural description of Hydragen:

Left: An example of an LLM inference scenario where the chatbot model handles many shared big shared prefixes (system prompts).
Right: An overview of Hydragen, where the overall attention is broken down into attention on shared prefixes and attention on individual suffixes.

2545

When computing attention, sequences sharing a common prefix have partially overlapping keys and values. Our goal is to separate the computation of this partial overlap into two separate operations: focusing on the shared prefix, where there is total key-value overlap; and focusing on the unique suffix, where there is no overlap.

The derivation is illustrated in the diagram below. Hydragen aims to avoid directly calculating SDP(Q, K, V), instead using the results of sub-computations $SDP(Q, K_1, V_1)$ and $SDP(Q, K_2, V_2)$ to calculate this value. The challenge in dividing attention lies in the softmax operation, as the softmax denominator is calculated by summing all the exponentially sized attention scores in the sequence. To combine the two sub-computations, Hydragen employs a denominator rescaling technique inspired by the blocked softmax calculation of FlashAttention. This technique allows Hydragen to obtain the final result $SDP(Q, K, V)$ by calculating the softmax denominator of the entire sequence and rescaling the attention output accordingly.

2546

The image below shows a code example.

2547

2.2.3 ChunkAttention

ChunkAttention organizes key-value cache blocks into prefix trees, supporting runtime detection and sharing of common prefixes across multiple requests, and improving data locality through a two-stage partitioning algorithm. Specifically, ChunkAttention decomposes the overall key/value tensor into smaller blocks and then structures them into an auxiliary prefix tree. Moreover, based on the prefix tree-based key-value cache, ChunkAttention implements a two-stage partitioning algorithm, which enhances data locality during self-attention computation when shared system prompts are present.

Prefix Tree

ChunkAttention assumes that the key-value cache should be prefix-aware. Therefore, ChunkAttention organizes the key-value caches of all decoded sequences into a prefix tree. The key-value cache only stores the key/value tensors of the sequence currently being decoded. Thus, the key-value cache is prefix-aware and can dynamically detect and remove redundancy at runtime without human intervention.

Each path in a prefix tree defines a sequence. Multiple trees can exist simultaneously on a server; for example, application developers may design different system prompts. The parent-child relationships in the tree define the subset of sequences covered by each block. The root node covers all sequences, while leaf nodes cover only one sequence. A key property of prefix trees is that the sequences covered by each block are contiguous along the sequence index dimension.

The diagram below shows the structure of the KV cache stored in the prefix tree. $S_0$ , $S_1$ , and $S_2$ are generic and can be shared. However, the problems are different and cannot be shared. Each node is a chunk C that stores three key elements:

c context tokens shared by the sequence $S_i, \ldots, S_j$ .
A slice of a key tensor of size $b \times h \times c \times d$ corresponding to c tokens.
The corresponding value tensor slice.

2548

During the inference process, there are three possible scenarios: i) a new sequence is added, ii) a completed sequence leaves, and iii) all sequences decode a single token together. Each scenario can be transformed into a prefix tree operation. When a new sequence is added, the prefix tree is searched and updated to insert the new path. When a completed sequence leaves, the prefix tree is updated to remove its path. In each decoding iteration, the new token is appended to a leaf block, or a new block is added when the leaf block is full.

By sharing a common prefix, ChunkAttention increases the number of sequences that can be processed simultaneously to approximately $1/(1-r)$ . The sharing ratio r is defined by the number of shared tokens $n_s/(n_p + n_c)$ , where $n_c$ is the number of completed tokens. In memory-constrained inference scenarios, this helps to increase batch size, thereby improving throughput.

Two-phase partitioning (TPP)

During pre-filling, ChunkAttention performs a prefix lookup to avoid repeatedly computing KV projections and position embeddings for matching prompt prefixes. For non-matching suffix tokens, KV projections and position embeddings are still computed, and the key/value tensor is partitioned and inserted into the prefix tree. We then apply an existing highly optimized self-attention kernel, such as FlashAttention, to the entire key/value tensor.

The partitioning strategy in ChunkAttention is based on online softmax and inspired by FlashAttention. However, FlashAttention is not flexible enough when dealing with non-contiguous memory or variable sequence lengths, making it more suitable for model training than inference. During decoding, it offers almost no benefit when the query token count is always 1. ChunkAttention excels at handling variable sequence lengths during decoding and batches attention operations on several sequences to reduce memory operations. Therefore, ChunkAttention and FlashAttention are complementary.

During iterative decoding, self-attention computation is divided into two phases: block-first and sequence-first. These two phases focus on different slices of the query tensor, key-value cache blocks, and use different parallelization strategies. Query tensors with sequences of matching prompt prefixes are batched together, and attention is performed on them along with the key/value tensors.

The process is shown in the diagram below. This implicitly includes head dimension processing.

2549

The block-first phase only processes blocks shared by multiple sequences. Since GPUs have more streaming multiprocessors than the number of heads and partitioning by head would inefficiently utilize hardware resources, we perform additional key/value partitioning. The specific algorithm is shown in the figure below: traverse the shared blocks in the prefix tree, execute the partial attention kernel partial_attn during the traversal, and save the partial attention results to memory.

2550

In the sequence-first stage, we load the partial attention results of the shared blocks from the block-first stage and continue processing the blocks associated with a specific sequence. We partition the sequence by slicing the i-th row of Q, where each q processed by the sequence-first kernel is a vector, as shown in the algorithm diagram below.

2551

2.2.4 AttentionStore

AttentionStore allows for the reuse of key-value caches in subsequent rounds of the same dialogue, effectively reducing the overhead of redundant computations in multi-round dialogues. To improve the efficiency of AttentionStore, its authors devised a KV cache truncation scheme that features overlapping KV cache access, hierarchical placement of KV caches, and decoupling of positional encoding.

motivation

The current LLM service engine has significant efficiency issues when performing multi-turn dialogues, mainly due to the repeated calculation of key-value caches in multi-turn dialogues.

During a conversation, the LLM engine stores intermediate data, namely key-value pairs, in high-bandwidth memory on the GPU. However, when the conversation ends and the session becomes inactive, the LLM engine typically discards the KV cache associated with that session to make room for other active sessions. When the same session becomes active again, the LLM engine needs to recompile the entire KV cache. This results in the same KV cache being calculated repeatedly, wasting valuable GPU computing resources. As the number of conversation rounds increases, the overhead of repeated calculations increases linearly.

As shown in Figure (a): In the first round of the conversation, the LLM service engine generates key-value caches for q1 and a1. However, after completing the first round, the LLM service engine discards these key-value caches to reclaim HBM space. In the second round, the LLM service engine needs to regenerate the key-value caches for q1 and a1. In the third round, it needs to regenerate the key-value caches for q1, a1, q2, and a2. As the session expands and historical tokens accumulate, the overhead of repeated computation increases significantly.

2552

plan

AttentionStore introduces a hierarchical key-value caching system that uses cost-effective storage media to store key-value caches for all requests, thereby significantly improving the efficiency of multi-turn conversations.

In traditional attention mechanisms, the key-value cache is discarded after each dialogue. AttentionStore stores the KV cache in a KV cache system. When a dialogue session is inactive, the relevant KV cache is saved. If the same session is reactivated in the future, its KV cache is retrieved from the KV cache system and reused for inference. In this way, AttentionStore only needs to partially pre-populate new input tokens for the new round of dialogue, without having to re-pre-populate all historical tokens. For example, during the third round of dialogue inference, the KV caches for q1, a1, q2, and a2 are reused, and only q3 needs to be pre-populated, as shown in Figure (b) above.

In addition, AttentionStore has also made many design improvements to address technical challenges. Through these designs, AttentionStore effectively solves the key-value cache management problem in multi-turn dialogues, significantly improving the efficiency and performance of multi-turn dialogues.

To reduce the overhead of accessing the KV cache from slow storage media, AttentionStore employs a hierarchical preloading and asynchronous saving scheme, overlapping KV cache access with GPU computation.
Meanwhile, to ensure that the key-value cache that needs to be accessed is always in the fastest level, AttentionStore uses a scheduling-aware acquisition and eviction scheme, which intelligently places the key-value cache in different levels based on the prompts provided by the inference task scheduler.
Furthermore, AttentionStore avoids key-value cache invalidation due to context window overflow by decoupling positional encoding and effectively truncating the key-value cache. This innovation ensures that the stored key-value cache remains valid at all times.

2.2.5 Radix Attention

The RadixAttention technique proposed in SGLang’s paper is one of the typical solutions for realizing KV cache reuse and a representative of efficient memory management.

Unlike existing systems that discard the key-value cache after a request is generated, RadixAttention utilizes a radix tree to achieve fast matching, insertion, and replacement of the cache. Furthermore, in RadixAttention, both the prefix and the key-value cache generated during the Generate phase are cached. This maximizes the likelihood of the key-value cache being reused by new requests. In contrast, ChunkAttention only focuses on the prefix itself.

Ideas

The key points of SGLang’s Prefix Caching solution are as follows:

Unlike previous “use and discard” systems, the RadixAttention algorithm does not discard the KV cache after each request is processed, but keeps it in GPU memory.
The RadixAttention algorithm also stores the mapping between sequence tokens and the KV cache in the RadixTree data structure. Compared to traditional trie, the edges of a RadixTree can be labeled not only as single elements but also as sequences of elements of variable length. When necessary, a large node already in the tree can be dynamically split into smaller nodes to meet the needs of dynamic shared prefixes. This feature greatly accelerates data processing speed.
When a new request is received, the scheduler uses a radix tree for prefix matching. If a cache hit occurs, the scheduler reuses the cached key-value tensor to satisfy the request.
Because GPU memory is limited, cached key-value tensors cannot be retained indefinitely. Therefore, the RadixAttention algorithm includes an eviction policy. Optimal cache reuse may be incompatible with scheduling methods like First-Come, First-Served. Therefore, RadixAttention is equipped with a modified scheduler that prioritizes requests that match cache prefixes.

plan

Given the precious and rapidly depleted nature of GPU memory resources, cached key-value tensors cannot be retained indefinitely. To make caching more efficient, RadixAttention has specifically designed two strategies:

The Least Recently Used eviction policy prioritizes removing leaf nodes that have not been accessed for the longest time, making room for new content. This policy not only effectively releases the memory occupied by evicted leaf nodes but also fully utilizes their common ancestor nodes, maximizing memory reuse. As the system runs, when these ancestor nodes also become orphaned leaf nodes and meet the eviction criteria, they will also be evicted in an orderly manner, thus maintaining the health and efficiency of the entire caching system.
Cache-Aware Scheduling Strategy. We define cache hit rate as the number of cached prompt tokens divided by the number of prompt tokens. When there are many requests in the queue, the order in which requests are executed significantly affects cache hit rate. For example, if the request scheduler frequently switches between different, unrelated requests, it can lead to cache jitter and a lower hit rate. Therefore, RadixAttention designs a cache-aware scheduling algorithm to improve cache hit rate. This algorithm sorts requests by the length of their matching prefixes and prioritizes requests with longer matching prefixes, rather than using a first-come, first-served scheduling approach.

Figure (1) to (9) represent the dynamic changes of the Radix Tree.

The entire tree structure represents the sharing and branching operations of requests. Each edge has a label representing a substring or token sequence. Nodes are color-coded to reflect different states: green indicates a newly added node, blue indicates a cached node accessed at the current time, and red indicates a node that has been evicted. Newly arriving requests first match the longest prefix in the Radix Tree to find the furthest node. If a request still has tokens that do not match an existing prefix, a new token child node is added. Each time a new token is added, the new token node is marked green, and the tokens of its prefix nodes are marked blue. This token primarily updates the node’s access time; when cache space is insufficient, recently unaccessed nodes are cleared. The specific operation flow is as follows:

In step (1), the radix tree is initially empty.
In step (2), the system prompt is “You are a helpful assistant.” The server processes the incoming user message “Hello” and replies with “hi” using the LLM output. The entire dialogue is merged into a large node and saved to a new node a in the Radix Tree.
In step (3), the same user inputs a new prompt, the server finds the historical conversation prefix in the radix tree, and reuses its KV cache. The new round is appended to the tree as a new node “b”.
In step (4), a new chat session is started. The server splits the large node “b” in step (3) into two nodes “c” and “d” to allow the two chat sessions to share the system prompt.
In step (5), the second chat session continues, and the user requests “Write a story”. However, due to memory limitations, node “c” must be evicted. The new round is cached in the new node “d”.
In step (6), the server receives a batch of few-shot learning queries that share the same few-shot example set. The server processes these queries and inserts them into the tree. Because the new queries do not share any prefix with existing nodes, the root node is split, generating node “e” to enable sharing.
In step (7), the server receives a batch of other few-shot learning queries. These queries share the same few-shot example set as step (6), so the nodes are split and new nodes “f”, “g”, and “h” are inserted.
In step (8), the server receives a new message from the first chat session. It evicts all nodes in the second chat session because they are the least recently used.
In step (9), the server receives a request to sample more answers for the question in node “j”. To make room for these requests, it evicts nodes “i”, “k”, and “l”.

2553

0x03 Based on the search scheme

A more aggressive strategy is to store the KV cache on a richer but slower external device, or even a different storage medium, if GPU memory is insufficient. This approach transforms KV cache management into a retrieval challenge: retrieving relevant key-value pairs on demand and reintegrating them into the model. While this reduces memory usage on the main device, it increases the complexity of the retrieval and integration process.

3.1 InfiniGen

Modern LLM servicing systems support offloading data to CPU memory to efficiently service LLMs within limited hardware budgets. These offload-based inference systems are even beginning to support offloading the key-value cache to CPU memory, allowing users to generate longer contexts beyond the GPU’s memory capacity. However, due to the low PCIe bandwidth between the CPU and GPU, transferring the large key-value cache from CPU memory to the GPU has become a new performance bottleneck in LLM inference.

The paper “InfiniGen: Efficient generative inference of large language models with dynamic KV cache management” uses a predictive mechanism to offload the main part of the KV cache to the CPU, retaining only the critical components. When a new request arrives, InfiniGen uses the limited information it retains to speculatively reload a small portion of the KV cache onto the GPU. This saves memory space on high-speed computing devices without impacting performance, while avoiding excessive data exchange.

InfiniGen was designed to address the increased inference latency caused by insufficient PCIe bandwidth in Offload. However, the final solution is based on an algorithmic approach, primarily aiming to save decoding time.

question

The diagram below shows a high-level timing diagram of different ways to execute a Transformer block, from which performance bottlenecks can be identified.

(a) indicates the case where the KV Cache resides entirely in GPU memory. In this scenario, the KV Cache load latency involves only read operations from GPU memory, which is negligible due to the high bandwidth of GPU memory. However, the maximum batch size or sequence length is limited by the GPU memory capacity, which is typically smaller than CPU memory.
(b) indicates that the KV Cache is offloaded to CPU memory. Although offload-based inference systems alleviate the limitations of batch size and sequence length, transferring hundreds of GB of KV Cache to the GPU for attention computations significantly increases the overall execution time of Transformer blocks due to PCIe bandwidth limitations.
(c) represents the traditional prefetching technique. In this case, only a portion of the loading latency can be hidden by the computation of the previous Transformer block. While quantizing and compressing the KV cache may reduce data transfer overhead in offload-based systems, this does not solve the fundamental problem because quantization does not address the root cause of the KV cache problem - the linear growth of KV entries with sequence length. Therefore, intelligent KV cache management is necessary to mitigate performance overhead while retaining its advantages.
(d) indicates that by identifying the key-value pairs that are critical for calculating attention scores, the amount of KV cache that needs to be loaded can be reduced. In attention calculations, the keys and values of some tokens are more important than those of others. If appropriate tokens are selected, skipping the attention calculation of some less critical tokens will not significantly reduce the model’s accuracy.

2554

Insight

InfiniGen’s core insight lies in its ability to infer a small number of key tokens crucial for subsequent Transformer attention layer computations by performing a minimal review of the current layer input and subsequent partial query weights and key caches. This approach allows us to prefetch only the necessary KV cache entries, thereby mitigating the overhead of fetching data from host memory in offload-based LLM service systems.

Large language models exhibit outliers in the input tensors of Transformer blocks. These outliers are elements with significantly larger values compared to other elements. In LLMs, these outliers tend to appear in a few fixed channels between layers. The dot product between any row of the model input and a column of the weight matrix, such as $W^K$ , can have a similarly large magnitude, which in turn leads to outlier channels in the query and key matrices.

The attention score is highly dependent on a few columns with large amplitudes in the query and key matrices. These large amplitude columns have a significant impact on the attention pattern because the dot product between the query and key is greatly influenced by these few columns.

The InfiniGen authors observed that by tilting the query and key matrices, making a small subset of channels significantly larger than others, and using only these channels to compute the attention score matrix, it is possible to effectively predict which tokens are important. Essentially, we multiply the Q and K matrices with an orthogonal matrix A, aligning it with the direction in which Q is stretched most significantly, thus generating the corresponding tilted matrices Q and K.

To find such an orthogonal matrix A, we employ Singular Value Decomposition, a widely used matrix factorization technique in linear algebra. For a real matrix Q of size $m \times n$ , its SVD decomposition can be expressed in the following form:

$\mathbf{Q} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^{\top}$

Where U and V are orthogonal matrices of size $m \times m$ and $n \times n$ respectively, and $\Sigma$ is an $m \times n$ diagonal matrix whose non-zero values on the diagonal represent singular values, where $k = \min(m,n)$ .

For example, the diagram below illustrates how the column vectors of $V^T$ are transformed into the column vectors of Q when both m and n are 2. In (a), the orthogonal unit vectors are first stretched to points on the ellipse whose semi-axis lengths correspond to the elements on the diagonal of $\Sigma$ . These vectors are then rotated and/or reflected through matrix U. On the other hand, (b) shows how the orthogonal matrix A is rotated such that the resulting q1 is much larger than q2. Specifically, A rotates vectors to the ellipse axes. This process emphasizes the magnitude of q1 relative to q2, allowing us to effectively use only q1 to predict the attention score while ignoring q2.

2555

design

InfiniGen is built on two key design principles.

First, while calculating Layer i-1, we simultaneously calculate which token key-value caches are needed for Layer i. That is, we predict and select the key-value cache entries that are crucial to generating the next output token, while discarding non-critical entries.
Secondly, it utilizes CPU memory capacity to maintain a KV cache pool on the CPU, rather than on the GPU, to ensure that critical KV cache values can be identified for all outputs and layers with large window sizes while mitigating the problem of limited GPU memory capacity in the generation of long content.

Overall Plan

The diagram below illustrates the operation process of InfiniGen.

2556

Offline phase

In the offline phase, InfiniGen modifies the weight matrix to generate a skewed query and key matrix. To achieve this, InfiniGen first performs a forward pass of the model once using the sample input. During this process, InfiniGen collects the query matrix from each layer and performs singular value decomposition on each query matrix. The skew matrix for each layer is obtained using the decomposition matrix of the query matrix. This skew matrix is then multiplied by the weight matrix of each query and key in the corresponding layer to obtain the skewed query and key matrix.

It is important to note that:

Skew does not change the original functionality. Even after skew, the attention layer will produce the same computational results.
Skewing is a one-time offline process and does not incur any runtime overhead because we are modifying a weight matrix that remains unchanged at runtime.
Because we utilize a column pattern derived from the model’s inherent properties rather than the input, after skewing, the values will exhibit a high degree of skewness regardless of the different inputs we compute for the query and key, thus improving the effectiveness of our prefetch module.

The diagram above, number 1, illustrates how InfiniGen creates these partial matrices.

Prefill stage

During the pre-filling phase, InfiniGen selects several important columns from the query weight matrix and key cache to infer attention patterns and generates a partial query weight and key cache matrix used in the decoding phase.

InfiniGen’s prefetch module is built upon the key observation that the attention inputs of successive attention layers in large language models are highly similar. The similarity of attention inputs across successive layers in Transformer models is primarily due to layer normalization and residual connections. Layer normalization reduces scale differences in the input data, making the inputs of each layer numerically closer. Residual connections allow the output of each layer to be directly added back to the input, further enhancing the similarity between layers. Therefore, although each layer processes the input and produces new outputs, these outputs retain a high degree of similarity to the inputs of the previous layer when used as inputs to the next layer.

This corresponds to number 2 on the diagram.

Decoding stage

Label 3 shows how InfiniGen calculates the inferred attention score.

Decoding Phase: In the decoding phase, InfiniGen infers the attention pattern of the next layer and determines the key keys and values to be prefetched.

In layer i-1, we use the partial query weight matrix and key cache determined in the pre-filling phase for layer i, along with the attention input for layer i-1. After multiplying the partial query and partial key cache, InfiniGen prefetches only the keys and values of tokens whose attention scores are greater than the highest attention score minus $\alpha$ . Since multiple attention heads are computed in parallel, we ensure that each head in the same layer prefetches the same number of tokens by averaging the number of tokens between the highest score and the threshold across all heads.

By reducing the amount of key-value cache that needs to be loaded and computed, InfiniGen effectively reduces loading latency while maintaining output quality similar to the original model with a full KV cache. Furthermore, since InfiniGen does not need to load a fixed number of tokens from CPU memory, it only utilizes the necessary PCI interconnect bandwidth. InfiniGen performs speculation and prefetching starting from layer 1 because outliers, crucial for leveraging input similarity, can occur during computation at layer 0.

KV Cache resource pool management

We manage the key-value cache as a pool, offloading it to CPU memory and prefetching only the necessary amount of data to the GPU. While CPU memory is cheaper and has a larger capacity than GPU memory, its capacity is still limited. Therefore, in some deployment scenarios, limiting the size of the KV cache pool and removing less important KV entries that are not frequently selected by query tokens can be crucial. We extended the design to incorporate user-defined memory size limits. At runtime, when the CPU memory size reaches the user-defined limit, the KV cache pool manager selects a victim KV entry for eviction. Subsequently, the manager overwrites the selected victim entry with newly generated keys and values and updates the corresponding portion of the key cache in the GPU. It’s important to note that the order of KV entries can be arbitrary, as long as the keys and values of the same token maintain the same relative positions in the KV cache pool.

The strategy for selecting victims is crucial as it directly impacts model accuracy. We consider counter-based strategies as well as two widely used software-cached eviction strategies: First-In-First-Out and Least Recently Used.

0xFF Reference

Z. Zheng, X. Ji, T. Fang, F. Zhou, C. Liu, and G. Peng, “Batchllm: Optimizing large batched llm inference with global prefix sharing and throughput-oriented token batching,” 2024. [Online]. Available: https://arxiv.org/abs/2412.03594

L. Zheng, L. Yin, Z. Xie, C. Sun, J. Huang, CH Yu, S. Cao, C. Kozyrakis, I. Stoica, JE Gonzalez, C. Barrett, and Y. Sheng, “Sglang: Efficient execution of structured language model programs,” 2024. [Online]. Available: https://arxiv.org/abs/2312.07104

L. Ye, Z. Tao, Y. Huang, and Y. Li, “Chunkattention: Efficient selfattention with prefix-aware kv cache and two-phase partition,” 2024. [Online]. Available: https://arxiv.org/abs/2402.15220

C. Hu, H. Huang, J. Hu, J. Xu, X. Chen, T. Xie, C. Wang, S. Wang, Y. Bao, N. Sun, and Y. Shan, “Memserve: Context caching for disaggregated llm serving with elastic memory pool,” 2024. [Online]. Available: https://arxiv.org/abs/2406.17565

GPU-accelerated LLM paper praised | InfiniGen efficiently implements generative inference in LLM through dynamic KV cache management: Motivation for GPU developers

LLM inference latency analysis

ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition Zhao Shangchun

AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving

Modular Attention Reuse for Low-Latency Inference(@yale.edu)

Paper: “Efficiently Programming Large Language Models using SGLang”

ZigZagKV and KVCache achieve an 80% compression rate, resulting in near-lossless performance.

ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition(@microsoft.com)

Efficient Prompt Caching via Embedding Similarity(@UC Berkeley)

Modular Attention Reuse for Low-Latency Inference(@yale.edu)

33x LLM inference performance improvement: Best practices from Character.AI

[ CLA] Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

[ GQA] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

[KV Cache Optimization] MQA/GQA/YOCO/CLA/MLKV Notes: Sharing DefTruth in Intra-Layer and Inter-Layer KV Cache

[LLM Inference Service Optimization] DistServe Quick Read—Prefill & Decode Decoupling, Model Parallelism Strategy & GPU Resource Allocation Decoupling ( by A-Jie)

[LLM] KV cache detailed explanation with illustrations, video memory, computational complexity analysis, and code. Don’t laugh at Fourier’s code.

[LLM Performance Optimization] A Discussion on Performance Optimization Directions for Long Text Inference ( by A-Jie)

[ MLKV] MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding

[ MQA] Fast Transformer Decoding: One Write-Head is All You Need

[Prefill Optimization][10,000 words] Principles & Diagrams of vLLM Automatic Prefix Cache (RadixAttention): First Token Latency Optimization DefTruth

[ YOCO] You Only Cache Once: Decoder-Decoder Architectures for Language Models

A Survey on Efficient Inference for Large Language Models

akaihaoshuai

Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding

Confused with four attention mechanism and their performance mentioned by paper · Issue #33 · mit-han-lab/streaming-ll

Continuous Batching: A Powerful Tool for Improving LLM Deployment Throughput (Magic Square AI)

DejaVu: KV Cache Streaming For LLM

Efficient Streaming Language Models with Attention Sinks

GPT Model and K/V Cache: A Guide to Bright Moon and Rising Sun

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

https://lmsys.org/blog/2024-01-17-sglang/

Hydragen: High-Throughput LLM Inference with Shared Prefixes (@Stanford University)

Inference without Interference: Decoupled LLM Inference Services for Mixed Downstream Workloads (Zhu Yan Jian Zhi)

How might Kimi Chat’s large model achieve a lossless context of 2 million characters? GiantPandaCV

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

LaViT: That works too. Microsoft proposed directly using the attention weights of the previous layer to generate the attention weights of the current layer | CVPR 2024 VincentLee

LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding

LLaMa Principles and Source Code – Deconstruction (KV-Cache, Rotary Positional Embedding, RMS Norm, Grouped Query Attention, SwiGLU) (Yuezero_)

LLM Inference Series: 3. KV caching unveiled

Exploring LLM Inference Optimization (2): Detailed Explanation of Transformer Model KV Caching Technology (Baihai IDP)

Exploring LLM Inference Optimization (3): How to Effectively Control the Memory Usage of KV Cache and Optimize Inference Speed? Baihai IDP

LLM Inference Beginner’s Guide ②: In-depth Analysis of Key-Value Caching OneFlow

LLM Inference Acceleration 02 KV Cache The Contemplative Stoic Nine

LLM Inference Acceleration 3: Summary of Inference Optimizations Mooncake/AttentionStore/vllm0.5/cache optimizations etc.

LLM reasoning acceleration research akaihaoshuai

A Brief Description of the LLM Inference Algorithm ( by Lu Chun)

Six latest works optimizing LLM KV caches, including MiniCache and PyramidInfer. (AI Talk)

MLKV: Cross-layer KV cache sharing, reducing memory footprint.

MLSys 2024 Keyformer: Select key tokens to reduce KV caching by 50% MLSys2024

MODEL TELLS YOU WHAT TO DISCARD: ADAPTIVE KV CACHE COMPRESSION FOR LLMS

Mooncake (1): Making mooncakes on the dark side of the moon, Kimi’s split reasoning architecture centered on KVCache. ZHANG Mingxing

Mooncake (2): Kimi How to handle the “torrential traffic”? Predictive scheduling strategy under a separate architecture ZHANG Mingxing

Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving Hand-pulled Pancake Bear

Orca: How to Serve Large-scale Transformer Models WXB506

OSDI22 ORCA: A classic example of LLM Serving system optimization, featuring dynamic batching and a 36x increase in throughput. MLSys2024

OSDI24’s latest masterpiece: InfiniGen, a high-efficiency key-value cache management framework for large model inference, achieving a 3x increase in throughput and a 32% improvement in accuracy. (MLSys2024 )

SnapKV: Key-Value Cache Sparsity, Zero-Fine-Tuning Accelerates Long-Sequence LLM Inference AI Casual Discussion

SnapKV: LLM Knows What You are Looking for Before Generation

Token Sparsity + Channel Sparsity: 16x LLM Inference Acceleration for AI Chat

Transformers KV Caching Explained

TriForce: KV Cache sparsity + speculative sampling, 2.3x lossless acceleration of LLM. vAttention: used to serve LLM BBuf without Paged Attention.

YOCO: Long sequence optimization, 1M long text 10x LLM inference acceleration AI chat

You Only Cache Once: Decoder-Decoder Architectures for Language Models

ZHANG Mingxing: Mooncake (1): Making mooncakes on the dark side of the moon, Kimi’s KVCache-centered split reasoning architecture

[Tearing Down an LLM-KV Cache] Why is there no Q-Cache?? ( by Xiaodonggua AIGC)

Understanding KVCache: Why Did You Kaichao

Accelerate LLM Inference with KV Cache but No Q Cache? (DefTruth)

A Comprehensive Analysis of LLM Inference Optimization: Technology, Applications, and Challenges (AI Talk)

Random Thoughts on Mooncake by Xu Xinran

Dao Dao Ning: Notes: A Brief Analysis of Llama.cpp Code (Part 1): Parallel Mechanism and KVCache

Dao Dao Ning: Notes: A Brief Analysis of Llama.cpp Code (Part Two): Data Structure and Sampling Method Analysis; KV Cache Optimization Technique in the Autoregressive Generation Process of the GPT Model (sudit)

Illustrated Explanation of Mixtral 8 * 7b Inference Optimization Principles and Source Code Implementation by Mengyuan

Illustrated Guide to Accelerating Large Model Computation Series: Decoupled Inference Architecture 1, Starting with DistServe and the Mighty Ape

Illustrated Guide to Accelerating Large Model Computation Series: Separate Inference Architecture 2, Chunked-Prefills for Fuzzy Separation and Merging Boundaries ( Ape)

The Taizu Long Fist of Parallel Inference in Large Models: Deciphering Jeff Dean’s Outstanding MLSys 23 Paper by Fang Jiarui

Large Model Inference Optimization — Prefill Stage seq_q Segmentation kimbaol

Large Model Inference Optimization Practice: KV Cache Reuse and Speculative Sampling Alibaba Technology

Large Model Inference Optimization Techniques - KV Cache [Eat Jelly Without Spitting Out the Jelly Shell] https://www.zhihu.com/people/liguodong-iot

Accelerating Inference for Large Models: The Hidden Approach of KV Cache Sparsity

Accelerating Large Model Inference: Learning Key Values from Graphs and Cache from Graphs

Understanding KV Cache for Large Model Inference Performance Optimization (Young)

100x Acceleration of Large Model Inference: Sparse KV Cache ( Zhang)

Plain Language Explanation of Continuous Batching Gnuey Iup

A Review of Inference Performance Optimization for Large Language Models (Young)

Microsoft MInference: Million Token Sequence, 10x Acceleration

Critical Thinking | H2O: Solving the Long Context Problem of LLM from a Caching Perspective DSTS Center [Data Space Technology and Systems]

Building a High-Performance Large Model Inference Platform: Prefill and Decode Separation Series (Part 1): Microsoft’s New Product SplitWise Improves GPU Utilization Through PD Separation. Doraemon is Not a Dream.

Mooncakes in the Dark: Centered on KVCache, a split inference architecture boosts LLM throughput by 75%.

Extreme Exploration: In-depth Analysis of LLM Inference Speed MLSys2024

Unorthodox Approach: Modifying the KV Cache to Allow for Infinite Output of Large Models!

Dr. Fu Yao - Full-Stack Transformer Inference Optimization Season 2: Deploying Long Context Models - Translation Reinforcement Apprentice

Let’s talk about split reasoning in large-scale model reasoning. (by Dao Dao Ning)

Let’s talk about optimization problems in large model inference services. (by Dao Dao Ning)

A-Jie: In-depth analysis of vLLM & PagedAttention papers (Part 1) – Current status and optimization strategies of LLM services

Efficient Generation: An Interpretation of the Splitwise Paper

https://yaofu.notion.site/Full-Stack-Transformer-Inference-Optimization-Season-2-Deploying-Long-Context-Models-ee25d3a77ba14f73b8ae19147f77d5e2#d03b8a2a583246cea5a1fdeb8f583586

Longformer

Efficient Streaming Language Models with Attention Sinks

https://www.evanmiller.org/attention-is-off-by-one.html?continueFlag=5d0e431f4edf1d8cccea47871e82fbc4

Massive activations in large language models. arXiv preprint arXiv:2402.17762, 2024. Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu.

PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference

MiniCache: KV Cache Compression in Depth Dimension for Large Language Models

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

ZigZagKV and KVCache achieve an 80% compression rate, resulting in near-lossless compression. (Du Lingxiao )

SnapKV: LLM Knows What You are Looking for Before Generation

FastGen: Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

Prompt Cache: Modular Attention Reuse for Low-Latency Inference (@yale.edu)

Cache Similarity: Efficient Prompt Caching via Embedding Similarity (@UC Berkeley)

Shared Prefixes: Hydragen: High-Throughput LLM Inference with Shared Prefixes (@Stanford University)

ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition (@microsoft.com)

ClusterKV, a near-lossless KV cache compression method. (Du Lingxiao )

Illustrated Guide to Accelerating Large Model Computation Series: Separate Inference Architecture 2, Chunked-Prefills for Fuzzy Separation and Merging Boundaries ( Ape)

Abridged version of “A Review of LLM Acceleration Research Based on KV Cache Management” by Chang Hua and Andy

[KV Cache Optimization] MQA/GQA/YOCO/CLA/MLKV Notes: Sharing DefTruth in Intra-Layer and Inter-Layer KV Cache

Venkat Raman. Essential math & concepts for llm inference, 2024. URL https://venkat.eu/essential-math-concepts-for-llm-inference#heading-insights-from-model-latency-amp-understanding-hardware-utilization-on-modern-gpus.

Yao Fu. Challenges in deploying long-context transformers: A theoretical peak performance analysis. arXiv preprint arXiv:2405.08944, 2024

GQA, another key-value cache compression method besides MLA: Dynamic Memory Compression (DMC) BBuf

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding Gray Pupil Six Points

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

| [Transformer Series] (25) KV Cache Optimization for Handling Long Text Sequences

Exploring the Transformer Series (25) --- KV Cache Optimization for Handling Long Text Sequences

Exploring the Transformer Series (25) --- KV Cache Optimization for Handling Long Text Sequences

0x00 Overview

0x01 Optimization Basis

1.1 Sparsity

1.2 Importance

1.3 Summary

0x02 Sparsification

1.1 Classification

1.2 Static Sparsity

1.2.1 Window Attention

1.2.2 StreamingLLM

Attention sink

Ideas

question

1.3 Dynamic Sparsity

1.3.1 H2O

motivation

plan

Optimization points

1.3.2 keyformer

motivation

plan

Scoring function

algorithm

Optimization points

1.3.3 Summary

1.3 Sparsity for prefill

1.3.1 FastGen

Insight

algorithm

1.3.2 SnapKV

Insight

plan

1.4 Sparsification based on layer characteristics

1.4.1 PyramidKV

Insight

plan

1.4.2 PyramidInfer

motivation

plan

1.4.3 ZigZagKV

motivation

Space allocation

1.4.4 Layer-Condensed KV Cache

1.5 Other Solutions

1.5.1 Sparsity based on head characteristics

1.5.2 Speculative Sampling Sparsity

1.5.3 Cluster Sparsity

0x02 Reuse

2.1 KV Cache Merging

2.1.1 Inter-layer merging

motivation

Ideas

algorithm

2.3 Intra-layer merging

2.2 Prefix Reuse

2.2.1 Requirements

2.2.1 Prompt Cache

Insight

plan

2.2.2 Hydragen

2.2.3 ChunkAttention

2.2.4 AttentionStore

motivation

plan

2.2.5 Radix Attention

Ideas

plan

0x03 Based on the search scheme

3.1 InfiniGen

question

Insight

design

Overall Plan

Offline phase

Prefill stage

Decoding stage

KV Cache resource pool management