Exploring the Transformer Series (24) --- KV Cache Optimization

0x00 Introduction

Current large-scale language model (LLM) service systems employ key-value (KV) caches to avoid redundant key-value projection calculations during the decoding phase. While this is an effective solution for generating short sequences for a single client request, when dealing with multiple clients, each request maintains its own KV cache, increasing the overall KV cache size during inference. Furthermore, even for a single client request, KV caches can significantly impact inference performance when generating long sequences or processing multi-turn dialogues. For example, beam search and parallel sampling are widely used to generate better outputs or provide candidate options to clients. These techniques also increase the KV cache size, similar to batch inference, because they process multiple sequences simultaneously.

The core problem with key-value caches (KV Caches) lies in their high memory and memory access bandwidth consumption, coupled with the introduction of repetitive computations during the generation phase. This prevents us from processing or generating very long sequences and handling large volumes of data, hindering the maximization of hardware efficiency. It’s like a librarian busy searching for information on every book but having no time to fetch books for customers. Furthermore, the overall trend is towards increasingly larger LLM window lengths, leading to a major contradiction: the conflict between the need for ever-growing LLM window lengths and limited GPU memory.

Therefore, optimizing the key-value cache is essential, and key-value cache compression technology has become a hot research direction in the field of LLM inference. This article will guide you through an in-depth analysis of the various characteristics of the key-value cache, elaborate on the various methods currently used to optimize the use of key-value cache space in LLMs, clarify the interrelationships between them, and compare their core ideas.

Note: Because there is so much to optimize for KV Cache, we will divide it into three articles for detailed study.

Note: The complete list of articles is here. It’s estimated to eventually have around 35 articles. This list will be updated after each subsequent article is published.
Cnblogs Exploring Transformer Series: Article List

0x01 Background Knowledge

1.1 Measurement Indicators

Inference acceleration can generally be reflected in two aspects: throughput and latency. See the figure below for the metrics of these two aspects.

2401

Generative LLMs can be used for a variety of tasks with different SLOs (Service Level Objectives). For batch processing tasks (such as summarizing), latency metrics like TTFT or TBT are less important than throughput. On the other hand, for latency-sensitive tasks (such as conversational APIs), TTFT and TBT are more important metrics, and their SLOs are more stringent.

1.1.1 Throughput

Throughput is a metric from a system perspective, or more specifically, a cost indicator for LLM services, representing the computational cost the service provider incurs for each token generated. There are various ways to evaluate throughput, such as tokens per second (tps), which is the number of output tokens the inference server can process per unit of time for all users and requests. If we consider not only pre-filling and decoding but also memory limitations and context switching, we can consider session-based throughput, which is the number of concurrent user interactions within a given time period—an end-to-end target.

$\begin{align} \texttt{throughput} = \frac{\texttt{number of user interaction sessions}}{\texttt{time}}\notag \end{align}$

Throughput measures the number of requests or tokens completed across all users and requests, thus ignoring latency requirements. Researchers have introduced goodput, which is the number of requests completed per second that meet SLO (TTFT and TPOT) requirements. Because goodput captures the request throughput while maintaining SLOs, it considers both cost and quality of service.

1.1.2 Delay

The total time required for the model to generate a complete response for the user can be calculated using two metrics: TTFT and TPOT.

Time To First Token (TTFT)

TTFT (Time to Validation Time) is the time required for the model to generate the first output token after the user enters a query. In real-time human-computer interaction, providing users with a fast response is crucial; the faster the result, the better the user experience. However, this is less important for offline workloads. In practice, we typically set a TTFT SLO (Service Level Objective). For example, suppose we set P90 TTFT SLO = 0.4s, which means our requirement for the system is: “90% of requests must have a TTFT value <= 0.4”.

Time Per Output Token (TPOT)

TPOT is the average time required for the inference system to generate subsequent tokens based on user requests, which is the total generation time divided by the total number of generated tokens.

This metric corresponds to how each user perceives the model’s “speed.” Higher latency leads to longer wait times for customers, significantly impacting the user experience. However, a good user experience can be achieved as long as the generation speed exceeds human reading speed. For example, 100 milliseconds/token in TPOT means each user generates 10 tokens per second, or approximately 450 words per minute, which is faster than the average person’s reading speed.

Time Between Tokens (TBT)

Some researchers have also suggested using the time between tokens (TBT) because the definition of TBT more explicitly specifies the delay between the generation of two tokens. Therefore, designing the Service Level Objective (SLO) based on the upper limit of TBT can better reflect the user experience during streaming interaction.

TBT and TPOT may behave inconsistently in the following situations:

Uneven generation speed: If the model’s generation speed is inconsistent, TBT will show this fluctuation, while TPOT will average out these differences. If some tokens are generated particularly slowly while others are generated quickly, TBT will reflect this difference, while TPOT will provide an average value.
Batch processing effect: Some models may perform batch processing, generating multiple tokens at once. In this case, TBT may exhibit a periodic pattern (fast within batches, slow between batches), while TPOT smooths out this effect.
Long sequence generation: When generating long sequences, the model may gradually slow down due to the growth of context. TBT clearly shows this gradual slowdown trend, while TPOT only reflects the overall average speed.
Hardware resource fluctuations: If the server load is unstable or there is resource contention, the generation speed may suddenly slow down at certain times. TBT can capture these instantaneous fluctuations, while TPOT will average them out.
Model architecture characteristics: Some model architectures may be faster or slower in generating specific types of tokens. TBT can reflect this speed difference that varies by token type, while TPOT provides an overall average.
Caching effect: If the model uses caching to speed up generation, some tokens may be generated particularly quickly. TBT exhibits this uneven speed, while TPOT averages out this effect.

In summary, TBT provides a more granular view of performance, reflecting changes and fluctuations during the generation process. TPOT, on the other hand, provides a holistic, average performance metric. When analyzing model performance, it is best to consider both metrics simultaneously, rather than relying solely on the commonly used TPOT.

Key Indicators

The inference process of LLM is generally divided into two stages: prefill and decoding. The prefill stage is responsible for processing the complete content of the input prompt. It is computationally intensive but highly parallel and generates the first token. The decoding stage generates subsequent tokens one by one through an autoregressive method. Although the computational amount per step is small, the generation of each new token must repeatedly access the key-value cache corresponding to all previously generated tokens.

Because the prefilling and decoding phases have different characteristics, different metrics are used to measure their respective SLOs. Specifically, the prefilling phase mainly focuses on the delay between the arrival of a request and the generation of the first tag, i.e., TTFT. On the other hand, the decoding phase focuses on the delay between the generation of consecutive tags for the same request, i.e., TBT, or TPOT. Maximizing the overall effective throughput is the main goal of LLM inference optimization, which mainly involves TTFT and TBT, or simply put, optimizing the time consumption of the prefilling phase + optimizing the time consumption of the decoding phase.

1.2 Memory Crisis

The throughput of LLM inference services is primarily constrained by GPU memory limitations. Research teams have found that existing systems waste 60% to 80% of GPU memory due to a lack of fine-grained memory management methods, with the wasted memory mainly coming from the KV cache. The figure below shows the GPU memory allocation of a 13B model during inference on an A100 40GB GPU (others represent the activation size generated during the forward process; these activations are used up and then discarded, thus occupying a small amount of GPU memory). From this figure, we can intuitively understand the GPU memory usage of the KV cache during inference.

2402

The main reasons for the memory crisis are as follows:

The most expensive operations in LLM are matrix multiplication and softmax, which are themselves memory-bound.
The weights and activations in LLM are already close to reaching memory capacity.
Key-value (KV) caching further exacerbates the demand for memory (growing linearly with sentence length and batch size), while storage capacity is determined by sequence length and model dimensionality. For a given LLM, the memory allocated to the KV cache continues to grow linearly with increasing batch size and sequence length, eventually exceeding the available memory capacity.
Long sequence access bottleneck: As the number of tokens increases, the amount of KV cache data that needs to be read each time a new token is generated increases linearly, leading to increased pressure on video memory bandwidth and increased latency.

The following diagram provides a detailed view of the KV cache’s memory usage during inference, as well as the key metrics focused on at different stages of inference. Specifically:

Model memory : The model size remains constant and is unaffected by sequence length or batch size.
Peak Memory : Peak memory = Model memory + kvcache.
Latency : Total generation time = TTFT + TPOT * number of tokens generated.

2403

As can be seen from the diagram, the relationship between latency and memory during the generation process is analyzed in detail below.

The green label 1 indicates the standby phase, during which the model waits for input, and the memory represents the model weights.
Green label 2 indicates the Prefill stage, which has the following characteristics:
- It can be computed in parallel. The light blue label 1 represents the time required to generate the first token, which has a relatively low latency.
- It will generate a large number of KV caches at once, so memory usage will surge in a short period of time.
- Assuming a batch size of 1, the larger the sequence length, the greater the computational intensity, and it will usually be located in the Compute Bound region.
- Pre-padding long inputs requires more computation time and GPU memory than short inputs;
Green label 3 represents the Decoding stage, which has the following characteristics:
- Parallel computation is not possible. The light blue label 2 represents the latency of a token generated during the Decoding phase. This latency is relatively high, and it will gradually increase as the sequence lengthens.
- The increase in video memory is relatively slow, but the increase in video memory is proportional to the length of the sequence.
- After pre-filling, the large KV cache residing on the GPU HBM significantly limits the number of concurrent users that can be served;
- During decoding, repeatedly reading the KV cache from HBM to SM causes significant context switching latency.
- The larger the batch size, the greater the computational intensity and the higher the theoretical peak performance, but at this point it is usually located in the memory bound area.
The green label 4 indicates that the corresponding KV cache will be released after a sequence has been processed. Furthermore, model parameters are shared between different sequences, while KV cache is not.

The figure below shows the execution time and GPU memory usage of OPT-6.7B inference with and without a KV cache. The x-axis step index represents the length of the output sequence. Without a KV cache, the execution time increases rapidly; with a KV cache, only the attention weights and scores of newly generated tokens are computed as vector-matrix multiplications, and the execution time remains almost constant across different steps. This reduction in runtime comes at the cost of GPU memory usage, which gradually increases over time due to the increasing size of the KV tensor.

2404

Therefore, optimizing the KV cache, saving GPU memory, and improving inference throughput have become key issues that the LLM inference framework needs to address.

1.3 KV Cache Issue

Let’s summarize the problems that exist in traditional key-value caches.

Memory usage is an issue. As the total sequence length (context window size) or the number of sequences processed at once (batch size, i.e., throughput) increases, the KV cache also grows linearly. Therefore, there is practically no upper limit to the size of the KV cache, while GPU memory is obviously limited. This makes the KV cache the primary cache resource. In longer contexts, the KV cache size can even exceed the size of the model parameters.
Memory fragmentation. Each newly generated token produces a corresponding K and V vector, which are stored in the cache. In the continuous $\{K, k_n\} \longrightarrow K$ and $\{V, v_n\} \longrightarrow V$ operation, memory is repeatedly allocated and released, leading to a large number of fragmentation problems.
Memory management issues. As sequence length increases, cache size grows linearly, leading to increased management complexity. For example, traditional strategies (such as LRU and LFU) are difficult to adapt to LLM access patterns, requiring the design of dedicated strategies; cache memory requirements grow linearly with sequence length and number of layers, easily exceeding hardware limitations, necessitating optimization of multi-storage device collaboration. Adaptive caching strategies are needed to handle dynamic load balancing in the face of changing access patterns; and in a distributed environment, how to coordinate cross-node consistency and resources is a key concern.
Memory bandwidth bottleneck. As the number of tokens increases, the amount of KV cache data that needs to be read each time a new token is generated grows linearly, and the inability to process the KV cache in batches exacerbates the severe memory bandwidth bottleneck.
The GPU is underutilized. The decoding phase has shifted from computationally intensive to memory-intensive. This means that GPU computing power is no longer the bottleneck during decoding; to accelerate decoding, we need to focus on memory access. However, we are acquiring large weight matrices and ever-growing KV caches just to perform trivial matrix-to-vector operations. We end up spending more time on data loading than on actual numerical computation, which clearly leads to inefficient use of GPU computing power.

0x02 Overall Approach

2.1 Classification

Given the limited memory of GPUs, strategies to reduce memory footprint are a win-win-win situation, as they improve hardware utilization and cost-effectiveness while reducing latency and increasing throughput. Therefore, the memory pressure on KV caches has spurred innovations in many different directions, such as novel attention architectures (e.g., MQA, GQA, SWA), cache compression strategies (e.g., H2O, Scissorhands, FastGen), efficient memory management (e.g., PagedAttention, RadixAttention), and quantization and storage capacity scaling (e.g., in load balancing systems, single-host and multi-host model parallelism).

But how do we categorize these KV cache optimization strategies? Some researchers have already provided their answers, and these answers vary.

For example, the image below comes from the paper “A Survey on Large Language Model Acceleration based on KV Cache Management”. Based on the significant impact of KV Cache on LLM inference time and memory requirements, this paper systematically reviews existing optimization strategies and divides them into three levels: token-level optimization, model-level optimization, and system-level optimization.

Token-level optimization refers to improving the efficiency of key-value cache management by focusing on fine-grained details. It involves careful selection, organization, and compression at the token level without requiring architectural changes to the original model or system parallelism techniques. Optimization methods fall into five categories: key-value cache selection, budget allocation, merging, quantization, and low-rank decomposition. These methods achieve initial optimizations in computational efficiency and memory requirements by directly manipulating token-level data, laying the foundation for subsequent, higher-level improvements.
- Key-Value Cache Selection: Focuses on priority ordering and storing only the most relevant tokens. Prioritizes storing tokens most important for inference, reducing redundant caching and improving memory utilization by evaluating token relevance.
- KV Cache Budget Allocation: Dynamically allocate memory resources among tokens to ensure optimized cache distribution under limited memory constraints and ensure efficient utilization.
- KV Cache Merging: Identifies and merges similar or overlapping KV pairs to reduce redundant storage and further compress the cache size.
- KV Cache Quantization: Minimize memory footprint by reducing the storage precision of KV pairs (e.g., from Float32 to Float16 or lower bits), while balancing precision loss with performance gains.
- KV Cache Low-Rank Decomposition: This technique uses low-rank decomposition to decompose the KV matrix into smaller-dimensional representations, compressing the cache size while retaining key information.
Model-level optimization refers to optimizing KV cache management from an architectural design perspective. This involves adjusting the model structure or introducing new mechanisms to reduce cache dependencies, fundamentally reducing the generation and usage requirements of the KV cache, and improving overall efficiency. Specific strategies include:
- Attention grouping and sharing: Check for redundancy in key-value pairs and group and share KV caches within or across Transformer layers to reduce storage requirements.
- Architecture changes: Design new attention mechanisms or build external modules for KV optimization, restructure the model computation process for KV optimization, and reduce direct reliance on caching.
- Non-Transformer architectures: These architectures employ other memory-efficient designs, such as using recurrent neural networks to optimize the key-value cache in traditional Transformers.
System-level optimization refers to optimizing KV cache management from the perspective of underlying hardware and resource scheduling. It involves classic issues such as memory allocation and task scheduling, aiming to maximize hardware resource utilization and improve efficiency in diverse computing environments. Specific technologies include:
- Memory management: Optimize memory usage through architectural innovations, such as virtual memory adaptation, intelligent prefix sharing, and layer-aware resource allocation, to ensure efficient cache storage and access.
- Scheduling strategies: Maximize cache reuse through prefix-aware methods, achieve fair context switching through preemptive techniques, or implement fine-grained cache control through design-specific mechanisms to address diverse optimization objectives.
- Hardware-aware design: Proposes solutions for single/multi-GPU, I/O optimization, heterogeneous computing and SSD storage to improve the access speed and storage capacity of KV Cache.

This classification framework, from micro to macro, proposes specific technical paths for different challenges in KV cache management, providing clear theoretical guidance for understanding and designing efficient acceleration solutions.

2405

For example, the image below comes from the paper “Thus Spake Long-Context Large Language Model.” It mainly categorizes token dropping, token merging, and layer-wise sharing. This classification method gets straight to the point. Since the size of the KV cache is determined by the product of the cache sequence length, the number of layers, the number of KV heads, the number of feature dimensions, and the type of data stored, we can optimize the overhead by improving each of these factors. In particular, since the optimization of the sequence length has been discussed the most, it can be further divided into token dropping and token merging.

2406

This article draws on previous research while incorporating the author’s own thoughts. The author believes that, broadly speaking, KV cache optimization can be approached in three directions: optimization from a formula perspective; optimization based on characteristics; and optimization based on stages. These directions are not necessarily orthogonal, as some solutions may optimize from both the model structure characteristics and formula perspectives. For better analysis, solutions that can be analyzed from a formula perspective are categorized as such.

2.2 Optimization from the perspective of formula

As shown in the previous article, the formula for calculating the peak memory usage (in bytes) of the KV cache is as follows:

$batch\_size \times sequence\_length \times head\_number \times head\_dimension \times layer\_number \times kv\_bits \times 2$

Taking QWen as an example, the comparison between 4K context and 100K context is as follows:

2407

From the formula above, we can see that the LLM inference service allocates GPU memory space for each token. The size of this GPU memory space is related to several dimensions, including batch size, sequence length, head dimension, number of heads, number of model layers, and the size of the data types stored in the KVCache. Our goal is to process as many tokens as possible within a given GPU memory space. Therefore, there are two optimization strategies:

Compress the batch size and sequence length dimensions to represent the full token information with less token information;
Compression is applied to several dimensions, including the header dimension, the number of headers, the number of model layers, and the data types stored in KVCache, to reduce the memory overhead of each token.

We will examine each item one by one to see the specific factors affecting the KV cache and the methods to reduce the KV cache memory usage.

Note: “Optimization from a formula perspective” basically corresponds to “token-level optimization” in the paper “A Survey on Large Language Model Acceleration based on KV Cache Management”.

2.2.1 Options that are difficult to use

We found that the following options are difficult to utilize.

Reduce the number of model layers

Some studies determine the importance of a layer by comparing the similarity of its input and output embeddings, thus pruning less important layers. This can simultaneously reduce model weights and KV cache. However, reducing the number of layers doesn’t yield significant benefits.

Reduce batch size

Batch size is closely related to the performance of sub-stages in inference. For example, the prefill stage can achieve greater computational intensity and higher throughput with a relatively small batch size; while the decoding stage requires a larger batch size to achieve relatively high computational intensity and throughput.

While reducing batch size can decrease the memory footprint of the key-value cache and lower latency, it also reduces the number of objects that can be processed in a single run, thus decreasing hardware utilization and reducing cost-effectiveness. Therefore, in most cases, we do not want to reduce batch size; instead, we try to increase it as much as possible.

2.2.2 Optimization Directions for Accelerating LLM Inference

From this, we can identify some specific directions for optimizing the KV cache:

Reduce sequence length: For excessively long prompts, compression is achieved by reducing the number of slots in the key-value cache, thereby reducing the length of the input and output sequences (or the prompt + completion portion). For example, for a prompt of 128K tokens, only 1024 tokens are selected for storage in the key-value cache.
Reduce the number of attention heads: MQA (multi-query attention) and GQA (grouped-query attention) reduce memory usage by reducing the number of heads in the key-value cache.
Reduce key bits. Since FP16 occupies 2 bytes, the optimization mainly involves quantization, converting model parameters from FP16 to int8/int4, and reducing the bytes occupied by each parameter from 2 bytes to 1 byte/0.5 bytes. While keeping the number of key-value cache slots unchanged, compress the data format from FP16 to lower-precision formats such as int8 or int4.
Reduce head dimensions. DeepSeek V2’s MLA introduces a LoRA-like concept that effectively reduces the size of the key-value (KV) head.
Optimize KV cache memory management: Currently, the effective storage rate of KV cache on GPUs is relatively low, which can be optimized using methods similar to PagedAttention. Because it is not orthogonal, we will also include the memory management part in the “Optimization by Characteristics” section for analysis later.

2408

Furthermore, the paper “A Survey on Large Language Model Acceleration based on KV Cache Management” summarizes model-level optimizations. These optimizations optimize KV cache reuse by designing new transformer architectures or mechanisms, typically requiring model retraining or fine-tuning. Model-level optimizations are not orthogonal to the classification in this paper; therefore, the solutions discussed in this paper are added to the diagram.

2409

2.2 Optimize according to characteristics

As mentioned earlier, prefilling and decoding have different characteristics, for example:

The prefilling stage demands enormous computing resources. Even small batches of prefilling tasks, or even a single long prefilling task, can saturate the GPU’s computing power. The prefilling stage is the computational bottleneck; as the input length increases, the time consumption increases superlinearly (throughput decreases; if the time consumption increases linearly, then the average throughput remains unchanged).
In contrast, the decoding task requires a larger batch size to fully utilize computing resources. The decoding stage is the transmission bottleneck. As the batch size increases, the time consumption increases slightly (mainly because the transmission time increases due to the increase in data volume, while the computation time is considered to remain basically unchanged), and the throughput increases approximately linearly with the input length.

The diagram below illustrates the use of the KV cache during the prefill and decoding phases. In the initialization phase (corresponding to the first iteration), the LLM generates a key-value cache for each token in the input prompt. In the subsequent decoding phase, the LLM only needs to compute the lookup, key, and value for a newly generated token, progressively simplifying the process by utilizing the pre-computed key-value cache. Therefore, the execution time of the decoding phase iterations is typically shorter compared to the initialization phase (i.e., the first iteration).

2410

Therefore, we need to handle the KV cache differently based on these characteristics. This part overlaps with the system-level optimizations in the paper “Thus Spake Long-Context Large Language Model”.

2411

2.2.1 prefill

Several optimization approaches exist to address the high computational intensity of the prefill stage, such as parallel generation and splitting the prompt. These will be analyzed in more detail in later chapters.

2.2.2 Decoding

Although the computational load of each Decoding Step increases linearly, the computation time doesn’t increase significantly due to memory access bottlenecks. If the number of Decoding Steps were reduced by a larger proportion, it’s highly likely the overall request latency would be lowered. Therefore, it’s necessary to find ways to increase the number of Decoding requests or achieve a similar effect. For scenarios with relatively low computational intensity in the Decoding phase, there are two optimization approaches:

Inter-request optimization. Continuous Batching, which involves batching between different requests, can increase the batch size in the decoding stage, thereby making better use of GPU computing power. However, it cannot reduce the latency of individual requests, and if there are few user requests, such as only a single user, then batching can never be achieved.
In-request optimization. Validating multiple Decoding Steps (Speculative Decoding) within a single request, although the computational cost of each Decoding Step increases linearly, does not significantly increase computation time due to memory access bottlenecks. If the number of Decoding Steps is reduced by a larger proportion, it is very likely to reduce the overall request latency.

We will provide a detailed analysis in a subsequent article.

2.2.3 Separation or Not

Because prefilling and decoding have different characteristics and significantly different computational costs, they can easily interfere with each other. For example, in continuous batch processing tasks, a newly added prefill task can significantly increase the latency of the decoding task, causing requests in the decoding stage to “get stuck” every time a prefill request enters the system. Therefore, some implementations decouple prefilling and decoding onto different GPUs and customize parallel strategies for each stage. This naturally solves the two problems mentioned above.

There is no interference between pre-filling and decoding, allowing both stages to reach their respective SLOs more quickly.
Resource allocation and parallel strategies are decoupled, thereby allowing for tailored optimization strategies for pre-filling and decoding.

Of course, some researchers have questioned the separation strategy, arguing that while PD separation provides flexibility, it leads to a serious waste of GPU resources. Specifically, the utilization of HBM capacity and bandwidth of GPUs running the computationally intensive Prefill stage is low. The memory-intensive Decode stage also faces the problem of low utilization of computing resources. Therefore, another school of thought emphasizes the need for fusion or computing power migration (for example, offloading some of the memory-intensive Attention computation operations in the Decode stage to the Prefill node), which we will analyze in detail in a subsequent article.

2.2.4 Memory Management

This can be considered from two perspectives.

First, let’s look at it from a system perspective. Many current inference frameworks suffer from memory fragmentation and excessive memory retention, resulting in low memory utilization. Because sequence lengths are highly variable and unpredictable, the memory requirements of the key-value cache are unknown, making it difficult to pre-define storage space for the cache. Therefore, many inference frameworks pre-allocate a contiguous rectangular storage space for each request. However, this allocation method easily leads to insufficient GPU memory utilization, thus affecting the throughput during model inference. Therefore, system-level memory management optimizes memory usage through architectural innovations, such as virtual memory adaptation, intelligent prefix sharing, and layer-aware resource allocation, ensuring efficient cache storage and access. We will discuss this in this article.

The figure below summarizes some existing memory management schemes in the review paper. The architecture design, using vLLM with PagedAttention and vTensor as an example, adopts classic operating system principles to create a flexible dynamic memory allocation system, optimizing physical memory usage through complex mapping and virtual memory abstraction. Prefix-aware designs such as ChunkAttention and MemServe further optimize this approach by organizing data structures to achieve efficient cache deduplication and shared common prefixes, thereby improving memory utilization and computational efficiency. These innovations collectively demonstrate the potential to significantly enhance LLM services through memory management.

2412

Secondly, the paper “Thus Spake Long-Context Large Language Model” proposes viewing the KV cache as part of memory management, which is from a business perspective. Although KV cache optimization strives for longer contexts in practice, it essentially seeks a balance between efficiency and performance. Cache optimization does not attempt to push the limits of LLM capabilities because it does not change the organization of context information. Long-context LLMs based on standard KV cache mechanisms still face some limitations, including the need for read-only access and reading all information at once, making them unsuitable for more complex scenarios. This prompts the incorporation of memory management into LLMs, treating the KV cache as a specific memory instance. The most intuitive improvement of this memory management compared to KV cache is its more flexible access method, avoiding reading the entire KV cache at once. MemTrans stores the KV cache during the pre-training phase in external storage to provide more relevant information during inference. MemLong extends this concept to long contexts by storing KV caches of context fragments and retrieving KV pairs based on relevance to guide inference. This paper will not elaborate on this aspect.

2.2.5 Scheduling

This section mainly focuses on scheduling strategies implemented based on the characteristics of KV Cache, primarily through three approaches:

Prefix-aware scheduling strategies. Prefix-aware methods, such as BatchLLM and RadixAttention, aim to maximize cache reuse.
Preemptive and fairness-oriented scheduling. Preemptive techniques for achieving fair context switching are exemplified by FastServe and FastSwitch.
Layer-specific hierarchical scheduling methods. These methods employ fine-grained cache control tailored to the characteristics of each layer to address diverse optimization objectives. LayerKV and CachedAttention are typical examples.

The following diagram illustrates these categories.

2413

Furthermore, after separating the prefilling and decoding servers, the core of further optimization lies in how to schedule the prefilling and decoding servers with maximum efficiency. There are two key points here:

How to optimize KV cache transmission? Because storage is cheaper and more abundant than computing resources, KV cache is stored in different parts of the system and then transmitted over the network.
How to maximize batch size during the decoding phase (but too large a batch size will also lead to a longer average generation time for a single token).

2.3 Optimize according to stages

The image below shows the partitioning method proposed in the paper “Keep the Cost Down: A Review on Methods to Optimize LLM’s KV Cache Consumption”.

2414

This paper unfolds according to the time sequence of LLM, dividing the optimization schemes for KV Cache from the training phase to the deployment phase, and then to the post-training phase.

The training phase primarily involves KV cache compression methods that can be used for model pre-training. During this phase, the model has the greatest plasticity. The main adjustments made at this stage are to the model architecture, which can reduce the size of the generated key and value vectors to a quarter or even less while preserving the excellent properties of Attention. Alternatively, the architecture can be altered to completely abandon attention, which has quadratic complexity. These methods are generally the most effective, but they are not suitable for applying to existing models or in computationally limited scenarios.
The deployment phase primarily involves optimizing the KV cache using different frameworks. These methods do not require significant modifications to the KV cache itself, but can significantly improve its efficiency in the same environment.
The post-training phase primarily involves real-time optimization methods for the KV cache, mainly including three approaches: deletion, merging, and quantization. During inference, compression optimization can be performed from the model input prompt level to the embedding level and then to the KV cache level.

0x03 Optimize according to the formula

Next, we will analyze how to optimize based on the characteristics of the following formula through specific cases.

$batch\_size \times sequence\_length \times head\_number \times head\_dimension \times layer\_number \times kv\_bits \times 2$

3.1 Reduce the number of heads to focus on.

As mentioned earlier, the KVCache size of a single token is determined by four main factors: Layers, NumHead, HeadDim, and the KVCache data type. Therefore, there are four main directions for compressing the KVCache size of each token. Let’s first look at how to optimize it by reducing NumHead.

For a given model architecture, model size is primarily controlled by the number of layers and heads; reducing the number of attention heads might mean choosing a smaller model. However, upon closer inspection, we find that only the number of key and value heads needs to be reduced, as the number of query heads does not affect the size of the KV cache. Similar to layer-level optimization, reducing the number of heads significantly impacts the representational power of LLMs. To preserve performance, head-level optimization often relies on sharing strategies. This is the core idea behind MQA (Multi-Query Attention) and GQA (Grouped Query Attention) architectures. The sole purpose of these multi-head attention (MHA) variants is to train the base model with fewer attention heads, thereby reducing the size of the KV cache. However, these models are typically used after fine-tuning for a specific task. There are also KV cache compression algorithms that profile which heads are more important for generation, allocating more cache space to them and reducing or eliminating space for those that are less important.

2415

Furthermore, fine-tuning the existing model can further optimize the head size. For example, SHA calculates the cosine similarity between head weight matrices and groups similar heads to share a single key-value cache. DHA uses a centroid alignment method to calculate the similarity between heads, effectively compressing MHA into GQA by linearly fusing the key-value caches of similar heads.

We will dedicate a separate article to introducing MQA and GQA in a later article.

3.2 Reduce the number of bytes occupied by each parameter

This direction leads to the application of quantization techniques in KV Cache. KV Cache quantization can not only significantly reduce memory usage during inference, but more importantly, it can maintain the inference performance of the model while compressing memory usage.

We will learn about this together when we introduce quantitative methods later.

3.3 Reduce attention head dimension

In many model families, the hidden dimension of the attention head may remain constant regardless of the model size. Therefore, it is difficult to optimize the KV cache by reducing the attention head dimension unless the model is changed. Furthermore, it’s also possible that the hidden dimension is already 128, and further reduction would negatively impact performance.

To reduce head dimensionality, one possible approach is low-rank decomposition, as LoRa utilizes this idea. DeepSeek V2’s MLA introduces a similar LoRA concept, effectively reducing the size of the key-value (KV) head. ECH applies a low-rank decomposition based on SVD to the grouped head weight matrix, achieving KV compression similar to GQA, but unlike GQA, it does not use average fusion. Neurocache applies low-rank compression to the head matrix and uses the most similar cache in attention computation.

Furthermore, there are also sparsity features targeting/utilizing channels. For example, Double Sparsity combines token sparsity with channel sparsity:

Token sparsity focuses on using important tokens to compute attention, which is sparsity in the sequence dimension.
Channel sparsity is the use of important channels to identify important tokens, which is sparsity in the hidden dimension.

We will discuss MLA in a separate article later. Here we will discuss sparsification for channels.

3.2.1 Double Sparsity

The paper “Post-Training Sparse Attention with Double Sparsity” uses token sparsity for long sequence scenarios and also introduces channel sparsity, combining token sparsity and channel sparsity to achieve good speedup.

Insight

The authors’ key insight is that the sparsity of Channels is typically relatively static, allowing for offline calibration to accurately and efficiently identify important tokens, thus making inference more efficient. Furthermore, this approach can be combined with Offload to significantly reduce memory footprint.

2416

plan

Double Sparsity is a post-training sparse attention mechanism. Its overall approach is relatively simple.

Offline Calibration

First, important channels are identified through offline calibration.

Specifically, as shown in the diagram below, drawing inspiration from the method used in AWQ to identify important channels, we use a small amount of sample data to identify which channels are more important through an offline approach. Several different strategies are employed here:

Activation Outlier: Identify important channels based on the Outlier in the Activation, indicated by the red text.
Q-Outlier: Identifies important channels based on the Outlier in the Query, indicated by the green area.
K-Outlier: Identifies important channels based on the Outlier in the Key, shown in blue.
QK-Outlier: Identifies important channels based on Query * Key, indicated by the purple section.

2417

Real-time computing

After offline processing, the actual calculations will be performed. This involves three steps:

Extract queries based on the index of important channels, and store them contiguously.
The Attention Score is calculated using a sparse but contiguously stored Query and Key Cache.
The top-k important tokens (key/value cache) are selected based on the Attention Score, and then these important tokens are used to calculate the Attention.

As shown in the figure below.

2418

Double Sparsity-Offload

As can be seen from the above, while the Double Sparsity solution can reduce GPU memory I/O to some extent, it requires additional storage for the Label Cache, which actually increases GPU memory usage. To solve this problem, the authors proposed offloading the entire Key/Value Cache to CPU memory and dynamically loading it into GPU memory when needed, which is known as Double Sparsity-Offload.

However, if the corresponding Key/Value Cache is loaded into GPU memory after each use of the Label Cache to obtain the Top-k Tokens, the latency can be significant and cannot be overlapped with the computation. To address this issue, the authors introduced the Double Buffer, whose core idea is to asynchronously prefetch the Key/Value Cache of the next layer’s relevant tokens while computing the current Transformer Layer. Why can the Key/Value Cache of the next layer be preloaded? Because the authors compared the Attention Heads of adjacent layers and found that, except for the first two and last two layers, the similarity between subsequent adjacent layers is very high, even exceeding 95%. Therefore, the current layer can be used to approximate the next layer.

3.4 Reduce Layers

3.4.1 Overall Approach

It’s important to note that this doesn’t actually reduce the number of layers, but rather caches only the key-value pairs for a portion of the model layers. For example, it retains only the critical context keys and values layer by layer, and merges the key-value caches of adjacent layers.

Regarding the layer dimension, the basic assumption is that some tasks may not require full-depth computation, and skipping some layers during pre-padding may simultaneously reduce the number of floating-point operations during pre-padding and the size of the KV cache. We will learn through several practical examples below.

2419

3.4.2 LayerSkip

The overall idea behind LayerSkip is to skip some layers and then validate them, which is essentially a speculative inference approach. During training, LayerSkip applies different dropout rates to different layers, and all Transformer layers share the same early exit loss. During inference, because the accuracy of early exits in early layers has been improved through training, there is no need to add auxiliary layers. Furthermore, LayerSkip introduces a self-speculating decoding method, using the remaining layers for validation and correction after early exits, thereby reducing memory usage and improving efficiency.

motivation

Before designing LayerSkip, the authors conducted an experiment to observe what each layer of the Transformer actually does. Before compression, it’s crucial to ask ourselves: Does the object to be compressed contain redundant information? Is a full compression necessary? If so, how can we improve inference performance without significantly degrading the model’s effectiveness? Therefore, the purpose of the LayerSkip experiment was to identify the most critical layers.

The authors used the LLaMA 7B HumanEval test suite to test code generation. They connected an LM Head layer after each TransformerLayer in LLaMA-7B and statistically analyzed the probability distribution of each token generated by each TransformerLayer throughout the generation process. An example experiment is shown below.

2420

The experimental statistical results are shown in the figure below. Two phenomena can be observed from the table in the figure:

The results generated by the earlier layers are almost unrelated to the final result, mainly because the weights of the earlier layers and the weights of the LM Head have not been adapted, resulting in poor generation. The later the layers are, the closer the generated results are to the final result, eventually converging to the final result. Moreover, intermediate layers sometimes “change their minds.” For example, the model may have predicted the label “range” at layer 7, but changed its mind between layers 22 and 26 and then re-determined it as “range”.
In most cases, not all layers need to be executed; exiting the computation early can still yield the correct final result. The authors found that, most of the time, the final token is predicted in the final few layers. In a 32-layer LLM, a token requires an average of 23.45 layers to be predicted. If a zero-computational-cost predictor could perfectly predict which layer to exit at, we could reduce the computational cost by 26%. Therefore, it is necessary to make the LLM use fewer layers to predict each token and how to “change minds.” The authors aim to reduce the reliance on later layers of the model, using only these later layers to predict difficult tokens. This avoids situations like the example above where 32 layers were used to predict the “for” loop.

2421

plan

Therefore, to address the two phenomena mentioned above, the authors propose LayerSkip, which utilizes “early exit” combined with speculative sampling to optimize inference performance. The early-exit layer performs an unembedding operation on the non-final Transformer layer of the model, effectively skipping the remaining Transformer layers. Specific contributions include:

On the training side: The early exit layer was not adapted to LMHead, resulting in poor generation performance. Therefore, during training, the earlier layers need to adapt to early exit, enabling the model to directly generate tokens. Specifically, an Early Exit Loss was added to the early exit layer during training to enhance its generation capabilities.
On the inference side: A self-speculated decoding strategy is adopted, leveraging the fast inference speed of the early-retirement layer to generate multiple draft tokens. Then, during the verification phase, the full layer is run, and the verified draft tokens are returned. The right side of the diagram below shows the overall architecture of the self-speculated decoding strategy. Light green nodes represent the state KVCache cached in GPU memory, dark green nodes represent nodes requiring computation, and transparent nodes represent nodes skipped from computation. As shown in the diagram, only the early-retirement layer and earlier layers require KVCache; subsequent layers do not. Therefore, LayerSkip has two major optimizations: accelerating prediction performance and reducing GPU memory overhead, potentially increasing the maximum service throughput.

2422

3.4.3 YOCO

LayerSkip primarily uses a “pruning” layer approach, utilizing only a portion of the KVCache layers to complete the computation of all results. Besides the “pruning” layer approach, the KVCache can also be compressed using a “sharing” layer approach, or what could be called cross-layer inference (Cross-Attention mechanism).

The paper “You Only Cache Once: Decoder-Decoder Architectures for Language Models” proposes the YOCO scheme for cross-layer sharing of key-value caches. It selects a fixed number of the first few layers to generate the key-value cache.

YOCO constructs a dual-decoder architecture consisting of two decoder modules: a self-decoder and a cross-decoder. The self-decoder efficiently encodes the global key-value cache, while the cross-decoder reuses these caches through cross-attention. The execution process of the entire model is similar to the Decoder-Only Transformer model, but it selects a fixed number of the first few layers to generate the KV cache, retaining only one global KV cache. YOCO’s computational flow also allows pre-filling to exit early, thus achieving a faster pre-filling stage without changing the final output and reducing GPU memory usage. This design can significantly reduce GPU memory requirements while maintaining global attention capabilities.

The figure below summarizes the in-layer model modification schemes in the paper “A Survey on Large Language Model Acceleration based on KV Cache Management”, and shows the characteristics of YOCO.

2423

Model Architecture

Unlike Decoder Only models such as LLaMa, YOCO adopts a Decoder-Decoder model architecture. It’s called a Decoder-Decoder architecture because the meanings of the two “Decoders” are not entirely the same. The model architecture is shown in the figure.

2424

YOCO’s two main decoders are: the Self-Decoder, which is responsible for generating the global KV Cache, and the Cross-Decoder, which is responsible for using the global KV Cache.

The characteristics of a self-decoder are as follows:

YOCO’s first L/2 layers are stacked from Self-Decoder layers, and the global KV Cache is output by the last layer (L/2 layer) Self-Decoder layer, meaning there is only one layer with a global KV Cache.
The Self-Decoder is similar to a regular decoder. However, if only the second half of the Transformer Block is replaced with a Cross-Decoder, the upper limit of memory and computation savings is also fixed, at most half. Therefore, the design of the Self-Decoder is crucial to significantly save memory and reduce computation. YOCO chose Efficient Self-Attention, such as Slide Window Attention and Multi-Scale Retention proposed in other papers by the YOCO authors. It only needs to store the key-value cache within the window. Its computational cost and key-value cache size are related to the window size or chunk size, rather than the sequence length.
The KV Cache generated by the Self-Decoder will be directly used by the subsequent Cross-Decoder.

The characteristics of Cross-Decoder are as follows:

YOCO’s last L/2 layers are stacked from Cross-Decoder layers.
The Cross-Decoder itself does not generate a KV cache; instead, the KV cache generated by the Self-Decoder is loaded and used by the first Cross-Decoder layer. This means that the KV cache for all subsequent L/2 layer Cross Attention layers is the same. In the Decoding Stage, each Step expands the newly generated KV cache from the L/2 layer into the Global KV Cache.
Cross-Decoder uses KV Cache for cross-attention.

Training and reasoning

Because it only has one KV cache layer, the prefill and decode processes during inference differ from those of a traditional Transformer. YOCO can save significant time spent on prefilling during the inference phase.

Prefill Phase: Based on YOCO’s characteristics, the Prefill phase can skip the generation of the first token, that is, skip the Cross-Decoder. In the traditional Transformer architecture, each layer calculates the KVCache, so each layer runs and generates the KVCache during the Prefill phase. However, in YOCO, only the Self-Decoder generates the KVCache, and the generation of the next token depends on the value of the current token. Therefore, during Prefill, after all the input tokens are calculated by the Self-Decoder, only the last token needs to run the subsequent Cross-Decoder to complete the generation of the first token.
Decode Phase: In this phase, the model input is only the current token, and the output is the next token. Therefore, the Self-Decoder only performs attention calculation on the current token, similar to prefilling a token to generate a KVCache for that token. Then, the Cross-Decoder performs Decode Attention calculation to generate the next token.

2425

Effect

The figure below compares the KV cache usage and computational cost of YOCO and the original Transformer, assuming the sequence length is N, the number of transformer layers is L, and D is the hidden state dimension.

During the prefill stage, the KV cache memory requirement decreases from O(LND) to O((N+L)D). Here, O(ND) corresponds to the Global KV Cache, which has only one layer. The first N/2 layers use sliding window attention or multi-scale retention, and the corresponding KV Cache is O(DL).
The time consumption for prefilling decreases from O(LN^2D) to O(LND), meaning the time consumption changes from N-squared complexity to linear complexity. This is because the prefilling stage only needs to compute the first half of the Self-Decoder Layer to generate the KV Cache, while the second half of the Cross-Decoder Layer can be generated directly. Therefore, the computational load of the prefilling stage is reduced by half.

2426

3.4.4 CLA

CLA (Cross-Layer Attention): This method alternates the layers that generate the KV cache across layers at different depths in the model. It then extends the ideas of GQA and MQA by sharing key and value headers between adjacent layers, further reducing redundancy in the KV cache. Compared to MQA, CLA can reduce the KV cache size by a factor of 2, significantly improving memory efficiency without changing computational complexity.

The paper “Reducing Transformer Key-Value Cache Size with Cross-Layer Attention” proposes CLA (Cross-Layer Attention), which is a cross-layer attention mechanism for key-value caches, coinciding with the approach of YOCO. Unlike YOCO, CLA does not select a fixed number of layers to generate the key-value cache (as YOCO uses the first L/2 layers). Instead, it distributes the layers that generate the key-value cache alternately across layers of different depths in the model, and then neighboring layers reuse the key-value cache generated by nearby layers for attention calculation.

CLA offers several flexible and versatile methods for sharing the KV cache across layers, as shown in the left diagram below. The simplest method is sharing across every other layer, called CLA2. In practice, it can also share every three layers, called CLA3, as shown in the right diagram below. Here, C represents the number of layers sharing a single KV cache. For example, C=2 indicates CLA2 mode, where two adjacent layers share a single KV cache, thus halving the total KV cache size. That is, CLA2 reduces memory usage by 2x, and CLA3 reduces memory usage by 3x.

2427

The figure below illustrates two consecutive layers in a Transformer (left) using traditional attention and a Transformer (right) using CLA. With traditional attention, each layer computes its own individual K and V activations, which must be cached layer-by-layer during autoregressive decoding. With CLA, however, CLA computes key/value predictions only for a subset of layers in the model; attention blocks in layers without key/value projections reuse K and V activations from previous layers. Only a subset of layers with key/value projections contributes to the KV cache, resulting in reduced memory footprint compared to the traditional architecture that applies individual key/value predictions in each layer.

2428

Furthermore, CLA provides a general approach to cross-layer KV cache sharing, which is essentially not contradictory to intra-layer KV cache sharing methods such as MQA/GQA. That is, MQA and GQA share key/value headers across query headers within a single layer, while CLA shares KV caches across layers. Therefore, CLA can be used in conjunction with MQA/GQA.

3.4.5 MLKV

MLKV (Multi-Layer Key-Value) introduces a simple key-value header sharing mechanism across multiple transformer layers. MLKV uses the same single key-value header as MQA within a single layer, but it also shares this key-value header with multiple layers. This extreme strategy reduces the cache size to nearly 1% of the normal GQA strategy, and experiments show that MLKV still maintains considerable performance.

From an innovation perspective, both MLKV and YOCO are special cases of CLA. Like CLA, MLKV performs cross-layer KV cache sharing, but MLKV mainly extends MQA to a more extreme level, namely MQA + cross-layer KV cache sharing, as shown in the figure below.

2429

The diagram below illustrates the differences between MLKV and several other schemes.

MHA: The original Multi Head Attention, where each head in each layer has independent key and value.
MQA: Multi Query Attention, where all heads in each layer share key and value.
GQA: Grouped Query Attention, a compromise between MHA and MQA, where the head of each layer is divided into multiple groups, and each group shares key and value.
MLKV: Multiple layers share K and V, and it is compatible with MQA and GQA mentioned above.

2430

The image below provides a further detail.

2431

3.5 Reduce sequence length

Therefore, this area has a lot of content, and we will introduce how to reduce sequence length in the next section of this article.

0x04 Optimize according to characteristics

In this section, we will delve into the topic of “optimization based on characteristics”.

4.1 Optimization for Prefill Features

Compared to the more time-consuming Decode stage, Prefill has a more prominent problem: long context. An excessively long prompt can put pressure on GPU memory. Therefore, researchers have made optimizations to address this characteristic.

4.1.1 Chunking

Chunking divides the prompt into several chunks, feeding only one chunk to the model at a time and updating the KV cache once. Because it doesn’t compute all tokens together, chunking sacrifices the parallelism of prefill computation. However, it can alleviate memory pressure.

The image below illustrates chunking. The input prompt is [value] The cat sat on the mat and saw the dog go to, and chunk_size = cache_window = sliding_window = 4; the sizes of both the chunk and cache are consistent with the sliding window size. This is a better choice in terms of both computational logic and space utilization. The sentence is then divided into three parts: “the cat sat on”, “the mat and saw”, and “the dog went to”.

In the diagram, rows represent queries, columns represent keys and values, and the entire 0/1 data block represents a mask matrix, indicating whether attention calculations are performed on the query in the row direction and the key and value in the column direction. Below the diagram, “Past” represents the cache from the previous time step, which has been refreshed by the current cache. “Cache” represents the current actual cache, and “Current” represents the Xk and Xv values that will soon be updated into the cache.

When we iterate to a certain chunk, we retrieve the current cache and this chunk to perform attention calculation, and then update the key-value values related to this chunk into this cache in the manner of Rolling Buffer Cache.

2432

4.1.2 KV-Runahead

The main idea behind KV-Runahead is to have the earlier layers do more work and the later layers do less work, thereby accelerating the hinting phase.

motivation

A key observation of KV-Runahead is that the expansion phase generates tags faster than the hint phase due to the key-value cache (KV-cache). Therefore, KV-Runahead parallelizes the hint phase by coordinating multiple processes to populate the KV-cache and minimizing the first tagging time (TTFT).

The dual-use KV cache scheme has several key advantages.

Since KV Cache is designed to utilize causal attention graphs, we automatically minimize computation and communication.
Since the KV Cache already exists in the extension phase, KV-Runahead is easy to implement.
We further propose context-level load balancing to handle uneven KV cache generation caused by causal attention and optimize TTFT.

plan

The figure below shows a comparison between KV-Runahead and Tensor+Sequence Parallel Inference (TSP/Tensor+Sequence Parallel Inference).

TSP: The focus of TSP is parallelization. Through uniform context partitioning, (Q, K, V) are computed independently on each process. Then, a collective operation, all-gather, is performed to swap K and V across all processes, allowing QKT to be uniformly distributed. All computations within the layers are symmetric and globally synchronized, but TSP does not leverage causality in LLM inference.
KV-Runahead: It begins with uneven context partitioning, where (Q, K, V) is computed independently on each process. Therefore, each process computes a different number of entries in (Q, K, V). KV-Runahead simply populates the KV cache from each process and passes it to the next process in the chain, simulating the expansion phase. As a result, only the last process will have the complete (K, V), but each process can still output A with the same shape as Q, driving the next layer. Since subsequent processes wait for the KV cache to be ready, the layers communicate asynchronously, and the user context is unevenly partitioned (for load balancing) to minimize the TTFT. Because the KV cache itself is built on causality, KV-Runahead can automatically minimize the computation of the upper triangle and reduce the number of dot products in the QKT.

KV-Runahead achieves parallel inference by utilizing multiple processes to populate the KV cache for the last process. Therefore, KV-Runahead requires good context partitioning (offline search is recommended to find the optimal partitioning scheme, and then store the results in a partition lookup table) to balance the KV cache size of each computation process and minimize the TTFT. Once partitioning is complete, each process will compute based on the KV cache of the previous process. Specifically, the current process must wait for the required KV cache from the previous process (i.e., note the layer misalignment in Figure (b)), forming a dependency chain through local point-to-point communication rather than global synchronization via all-gather.

2433

The image below shows the coverage calculation using QKT with BLAS matrix-matrix multiplication. By linking each context partition to a KV cache, we can approximate the lower triangular portion and minimize unnecessary dot products. The upper triangular portion of the attention is masked to enforce causality.

2434

4.2 Memory Management Optimization

4.2.1 Problem

Existing Large Language Model (LLM) service systems are inefficient at managing key-value (KV) cache memory. This is primarily because they store KV caches in contiguous memory locations (most deep learning frameworks require tensors to be stored in contiguous memory). However, unlike tensors in traditional deep learning workloads, KV caches have unique characteristics (each request requires a significant amount of memory, and this memory requirement is dynamically changing):

It grows and shrinks dynamically as the model generates new words. Moreover, the key-value cache grows slowly per request, meaning that only one token is added with each iteration.
Its lifespan and length are unknown. The total size of the KV cache cannot be predicted in advance.

These characteristics make the management methods of existing systems very inefficient. In the block pre-allocation scheme of existing systems, there are several main sources of memory waste: internal fragmentation caused by over-allocation for the potential maximum sequence length, slots reserved for future tokens, and external fragmentation from memory allocators (such as buddy allocators).

This is internal fragmentation caused by over-allocation of the potential maximum sequence length. It corresponds to circle number 1 in the diagram below.
- Because the total sequence length of requests is uncertain beforehand, these systems typically predict the maximum possible sequence length of requests in order to store the key-value cache of requests in contiguous space. They then pre-allocate a contiguous block of storage for a single request to accommodate the maximum sequence length, regardless of the actual input or final output length. Since the actual length of a request is often much shorter than the maximum length, a portion of this allocation will almost certainly never be used and cannot be utilized by other requests, ultimately resulting in waste (i.e., internal memory fragmentation). This internal fragmentation wastes a significant amount of GPU memory. Consequently, these systems have poor throughput and cannot support large batch sizes.
- The key-value cache is rectangular: you allocate a large rectangular memory area, where one dimension is the batch size (the maximum number of sequences the model can process at once), and the other dimension is the maximum sequence length allowed by the user. When a new sequence arrives, you allocate an entire rectangle of memory for that user. Assuming the maximum context length for each user is predicted to be 2048, you need to pre-allocate a memory space for each user that can support a cache of 2048 tokens. If the actual context length used by the user is less than 2048, this approach will obviously result in a significant waste of storage resources.
Slots reserved in advance for future tokens. Corresponding to circle number 2 in the diagram below. Even if the actual sequence length is known in advance, pre-allocation is still inefficient. This is because the KV-cache increments by one tag for each request, and the entire memory block is retained throughout the request’s lifecycle. As memory is gradually consumed, shorter requests still cannot use the unused memory blocks.
External fragmentation and other overhead from memory allocators (such as buddy allocators). Corresponds to circle number 3 in the diagram below.
- Since the KV cache is a huge matrix and must occupy contiguous memory, the operating system only allocates large contiguous memory. If the pre-allocated size is different for different requests, a lot of small memory space will inevitably be wasted, leading to external memory fragmentation.
- Prior to Paged Attention, mainstream LLM inference frameworks in the industry all suffered from inefficiencies in KV cache management. In the HuggingFace Transformers library, the KV cache dynamically allocates GPU memory space during execution. Since GPU memory allocation time is generally longer than CUDA kernel execution time, dynamic memory allocation causes significant latency overhead and introduces memory fragmentation.
- Existing systems cannot take advantage of memory sharing opportunities. LLM services often use advanced decoding algorithms, such as parallel sampling and beam search, which generate multiple outputs for each request. In these scenarios, multiple sequences contained in a request can partially share their key-value (KV) caches. However, in existing systems, the KV caches of sequences are stored in different contiguous spaces, making memory sharing impossible.

Furthermore, while compression has been proposed as a potential solution to fragmentation, implementing compression in a performance-sensitive LLM service system is impractical due to the enormous size of the KV cache. Even with compression, the pre-allocated block space for each request hinders memory sharing for specific decoding algorithms within existing memory management systems.

2435

4.2.2 Paged Attention

PagedAttention (vLLM) borrows the paging approach from operating systems for storage, dividing the key-value cache into fixed-size blocks stored in non-contiguous physical memory. This provides vLLM with more flexible paging memory management. By using relatively small blocks and allocating them on demand, PagedAttention alleviates internal memory fragmentation. Furthermore, because all blocks are the same size, it eliminates external memory fragmentation. PagedAttention also allows different sequences of the same request, or even different requests, to share memory at the block level, achieving memory sharing.

The image below shows the memory waste caused by internal fragmentation. The shaded box represents the memory occupied by KVtoken, while the white box represents allocated but unused memory.

Orca (at the top) reserves a contiguous block of virtual memory in advance for the KV cache. The allocated size corresponds to the maximum sequence length supported by the model, such as 32K. Since the model typically generates far fewer tokens than the maximum limit, a significant amount of GPU memory is wasted due to internal fragmentation.
vLLM (bottom) mitigates fragmentation through dynamic memory allocation.

2436

accomplish

PagedAttention is inspired by virtual memory and paging techniques in operating systems. Operating systems divide memory into fixed-size pages and map logical pages of user programs to physical pages. Contiguous logical pages can correspond to non-contiguous physical memory pages, allowing user programs to access memory as if it were contiguous. PagedAttention organizes the key-value cache into fixed-size key-value blocks, similar to pages in virtual memory. vLLM implements a similar virtual memory system, managing these blocks through a complex mapping mechanism. This architecture separates logical and physical key-value blocks, achieving dynamic memory allocation and flexible block management through a block table that tracks mapping relationships and populates state. Specific features are as follows.

In order to dynamically allocate physical memory, PagedAttention changes the layout of the KV Cache from contiguous virtual memory to non-contiguous virtual memory.
PagedAttention divides the key-value cache into fixed-size, relatively small memory blocks called chunks. Each chunk can contain a fixed number of key-value pairs. These chunks are analogous to pages in a process’s virtual memory, allowing for flexible management of the key-value cache, much like how an operating system manages virtual memory. PagedAttention allocates memory for one chunk at a time. These small chunks do not need to be stored in contiguous space; they can be swapped between GPU and CPU memory and shared between different requests if necessary.
vLLM does not pre-reserve the maximum sequence length of the KV cache memory; instead, it allocates the memory required by the request only when needed. This allows for continuous addition of video memory to the user during the generation process, rather than pre-allocating it. Through token block-level video memory management, the system can accurately calculate the number of token blocks that the remaining video memory can accommodate. Combined with Dynamic Batching technology, this avoids the problem of video memory overflow.
PagedAttention implements a block table to concatenate these non-contiguous allocations. It also implements an efficient memory manager, which, through fine-grained memory management, enables the storage, retrieval, addition, and deletion of key-value vectors in physically non-contiguous memory space at extremely low cost.

PagedAttention involves two layers of address translation. First, the framework tracks the virtual memory addresses of KV Cache blocks using a block table. Second, the GPU performs the virtual-to-physical address translation using a page table managed by the operating system. See the diagram below for details.

2437

Architecture

The architecture of vLLM is shown in the figure below, which is specifically divided into the quantity part:

Scheduler. vLLM uses a centralized scheduler to coordinate the execution of distributed GPU workers and their corresponding physical key-value cache memory.
KV Cache Manager. The KV Cache Manager efficiently manages the KV cache in a paginated manner using PagedAttention.

2438

KV Cache Manager

PagedAttention first loads the model to determine how much space is left for the KV cache, and then fills that space with memory blocks. These blocks can contain up to 16 or 32 tokens.

The system dynamically maintains a linked list of video memory blocks—a mapping between the logical and physical key-value blocks for each request. Each block list entry records the corresponding physical block for the logical block and the number of padding positions, thus enabling the unified management of non-contiguous physical blocks. When a request arrives, the system allocates a small block of video memory for it, typically only enough to hold 8 characters. After the request generates 8 characters, the system appends another block of video memory, allowing the request to write the result to this new block. As the generated length increases, the system continuously allocates additional video memory blocks to the user and dynamically updates the video memory block list. Separating logical and physical key-value blocks allows the vLLM to dynamically grow the key-value cache memory without pre-reserving space for all locations, eliminating most of the memory waste in existing systems.

The following diagram illustrates the algorithm. PagedAttention divides the key-value (KV) cache of each sequence into KV blocks. Each block contains a certain number of key and value vectors for each token. For example, a key block $K_j$ contains the key vector of the j-th block, while the value block $V_j$ contains the value vector from the j-th block. During the attention calculation process, the PagedAttention kernel identifies and retrieves different key-value blocks. For example, when processing a query labeled “forth”, the kernel will retrieve the query vector $q_i$ with the key vector in each block $K_j$ multiply to calculate the attention score $A_{ij}$ . Then, it compares these scores with the value vector in each block $V_j$ multiply them to obtain the final attention output.

2439

Decoding power in basic scenarios

The diagram below illustrates how vLLM performs PagedAttention and manages memory during the decoding of a single input sequence. In general, for each decoding iteration, vLLM first selects a set of candidate sequences for batch processing and allocates physical blocks for logical blocks of new requirements. Then, vLLM concatenates all input tokens from the current iteration (i.e., all tokens requested in the cueing phase and the latest token requested in the generation phase) into a single sequence and feeds it into the LLM.

vLLM does not need to reserve memory for the maximum possible sequence length in the initial stage. Instead, it only reserves enough KV blocks to accommodate the KV cache generated during cue computation. In this example, the cue has 7 tags, so vLLM maps the first two logical KV blocks (0 and 1) to two physical KV blocks (7 and 1). In the pre-filling step, vLLM uses a conventional self-attention algorithm to generate the KV cache for the cue and the first output tag. Then, vLLM stores the KV cache for the first 4 tags in logical block 0 and the subsequent 3 tags in logical block 1. The remaining space is reserved for subsequent autoregressive generation stages. This corresponds to superscript 1 in the diagram below. In the first autoregressive decoding step, vLLM uses the PagedAttention algorithm to generate new tags on physical blocks 7 and 1. Since there is still a vacancy in the last logical block, the newly generated KV cache is stored there, and the filled record in the block table is updated. In the second autoregressive decoding step, since the last logical block is full, vLLM stores the newly generated KV cache in a new logical block; vLLM allocates a new physical block (physical block 3) for it and stores this mapping in the block table.

2440

Disadvantages

However, in order to dynamically allocate physical memory, PagedAttention also increases software complexity, leads to portability issues, redundancy, and inefficiency.

The attention core (GPU code) needs to be rewritten so that it can look up all elements of the KV cache (virtual contiguous objects) by index.
This increases software complexity and redundancy (CPU code). PagedAttention also forces developers to implement a memory manager within the service framework, making it responsible for (de)allocating the KV cache and tracking the location of dynamically allocated KV cache blocks. This approach essentially reimplements on-demand paging in user code—an operating system feature.
This introduces performance overhead. PagedAttention adds runtime overhead on the critical path. First, it requires the GPU core to execute additional code related to fetching KV cache from non-contiguous memory blocks. We show that this can slow down attention computation by more than 10% in many cases. Second, the user-space memory manager can increase CPU overhead, resulting in an additional 10% cost.

4.2.3 vAttention

The authors of vAttention believe that preserving the virtual memory continuity of the key-value cache is crucial for reducing software complexity and redundancy in LLM deployments. They advocate leveraging existing virtual memory abstractions in the operating system to manage dynamic key-value cache memory, thereby simplifying deployment and achieving higher performance. Based on these considerations, the authors of vAttention proposed vAttention, which has the following characteristics.

The key-value cache is stored in contiguous virtual memory, rather than pre-allocated physical memory. In contrast, pagedAttention is not contiguous in virtual address space.
Reserve contiguous virtual space for the KV Cache and allocate physical memory as needed.
This achieves greater portability without requiring a memory manager to be implemented in the service system, while seamlessly reusing existing GPU cores. It eliminates the need for attention core developers to explicitly support paging and avoids reimplementing memory management within the service framework, enabling seamless dynamic memory management across various attention cores.

Insight

vAttention offers some insights into LLM service systems.

Observation 1: The memory requirements of the KV Cache in each iteration are predictable. This is because autoregressive decoding generates only one tag at a time. Therefore, with each iteration, the memory usage of the requested KV Cache increases uniformly by one tag until the request is completed.
Observation 2: KV Cache does not require high memory allocation bandwidth. In all layers, the memory footprint of a single tag is typically tens to hundreds of KB. Furthermore, each iteration lasts 10 to 100 milliseconds, meaning that at most a few megabytes of memory are needed per second. While batch processing can improve system throughput, the number of tags generated per second tends to plateau after reaching a certain batch size. This means that memory allocation bandwidth requirements will also saturate with large batch sizes.

Design Overview

The goal of vAttention is to improve efficiency and portability by adding dynamic memory allocation support to the existing kernel. To achieve this, vAttention utilizes system-supported dynamic memory allocation instead of implementing paging in user space.

vAttention is built upon the ability to allocate virtual and physical memory separately. Specifically, vAttention pre-allocates a large contiguous buffer for the KV-cache in virtual memory (similar to a reservation-based allocator), while deferring physical memory allocation to runtime, i.e., allocating physical memory only when needed (similar to PagedAttention). In this way, vvAttention preserves the virtual contiguousness of the KV cache without wasting physical memory. This approach is feasible because memory capacity and fragmentation only limit physical memory, while virtual memory is plentiful; for example, modern 64-bit systems provide 128TB of user-managed virtual address space per process.

Reserved virtual memory. Since virtual memory is abundant, we pre-allocate a sufficiently large virtual memory space to accommodate the maximum batch size of the KV-cache (configurable) required to support. Number of virtual memory buffers: Each layer in each LLM maintains its own K and V tensors: we refer to them as the K-cache and V-cache, respectively. We allocate separate virtual memory buffers for the K-cache and V-cache.
Physical memory is allocated on demand. vAttention prioritizes allocating physical memory page by page, only allocating more when a request has used up all previously allocated physical memory pages.

The diagram below illustrates how vAttention dynamically manages the KV Cache memory by allocating physical memory on demand, while optimizing memory usage through a delayed eviction strategy to efficiently reuse allocated physical memory when new requests arrive. This section deals with the K-cache (or V-cache) in one layer of the model, assuming a maximum batch size of two. The remaining K-cache and V-cache buffers are managed in a similar manner across all layers. Specifically, this involves five steps:

(a): The virtual memory contains virtual tensors for two requests (R1 and R2), but no physical memory allocation has been performed yet. The light-shaded area represents the portion of the virtual tensor that is not backed by any physical pages.
(b): R1 is allocated one page of physical memory. This allocation causes a portion of the virtual tensor (light shading area) to be mapped onto the actual physical memory (dark shading area).
(c): R1 is allocated two pages of physical memory, while R2 is allocated one page of physical memory. At this point, two parts of R1’s virtual tensor are mapped to physical memory, while only one part of R2’s virtual tensor is mapped to physical memory.
(d): R1 has completed its task, but vAttention does not immediately reclaim its memory (delayed reclamation). Therefore, R1’s virtual tensor still occupies physical memory.
(e): When a new request R3 arrives, vAttention assigns the tensors of R1 to R3 because these tensors already have physical memory support.

2441

4.3 Scheduling

This section mainly focuses on scheduling strategies implemented based on the characteristics of KV Cache, primarily through three approaches:

Prefix-aware scheduling strategy.
Preemptive and fairness-oriented scheduling.
Layer-specific hierarchical scheduling methods.

The following diagram illustrates the characteristics of some common solutions.

2442

4.3.1 Prefix-aware Scheduling

In traditional LRU-based cache management systems, shared key-value contexts may be evicted prematurely or unnecessarily expanded in memory. Prefix-aware scheduling, on the other hand, schedules requests based on prefixes, meaning requests with the same prefix are grouped together for scheduling.

BatchLLM uses global prefix identification and a dynamic programming algorithm based on shared public key-value cache content to schedule requests with shared prefixes together, which maximizes key-value cache reuse and reduces cache lifecycle.
RadixAttention implements fast matching, insertion, and replacement of the cache around a radix tree structure. In RadixAttention, cached tokens and running requests share the same memory pool, which is then managed using a combination of LRU eviction and reference counting.

The following diagram illustrates the overall process of BatchLLM.

First, the large batch of input prompts is analyzed to identify common prefixes; this step is completed before request scheduling.
BatchLLM utilizes a prefix tree to identify and represent common prefixes among Prompts. However, directly using the first-level prefix of the prefix tree as the common prefix may lead to poor prefix reuse. Therefore, BatchLLM designs a prefix tree-based dynamic programming algorithm to maximize the reuse of first-level tokens.
Subsequently, to simplify key-value reuse and shorten the lifetime of common prefixes in memory, BatchLLM centrally schedules requests sharing the same prefix. Specifically, Batch LLM groups requests, with each prefix group corresponding to a set of requests sharing the same prefix. These groups become the basic unit of request scheduling. Before scheduling each group, BatchLLM also reorders the groups based on the ratio of the prefix prefill length to the estimated decoding length, and postpones groups with longer prefix prefill lengths and shorter decoding lengths.
Next, batching is implemented with full consideration of KV memory usage, aiming to better mix the Decoding steps with the prefix Prefill Chunk to increase the overall token batch size.
In addition, BatchLLM also implements the fusion of Attention Kernel to reduce the long tail effect and Kernel startup overhead.

2443

4.3.2 Preemptive and Fairness-oriented scheduling

FastServe implements a proactive key-value (KV) caching management strategy. FastServe’s features include:

Innovative scheduling strategy: FastServe introduces the Skip-Join Multi-Level Feedback Queue (MLFQ) scheduler, which effectively avoids head blocking by leveraging the semi-information-independent nature of LLM inference.
Priority skipping mechanism: Based on the input length and the time of the initialization phase, tasks are assigned to appropriate queues, reducing unnecessary degradation operations.
Efficient memory management: To cope with GPU memory limitations, FastServe has designed an active key-value cache management mechanism. When the cache of a low-priority task is about to be full, the state is actively unloaded to the host memory and reloaded when needed.
Data transmission optimization: Data transmission efficiency has been significantly improved through asynchronous memory operations and pipeline technology.

The following diagram shows the architecture of FastServe.

2444

The top of the diagram below shows MLFQ, and the bottom shows the execution time flow of each scheduling algorithm.

2445

FastSwitch introduces a fairness-oriented key-value cache scheduling system, addressing the overhead challenges of preemptive scheduling in LLM services. FastSwitch employs three key mechanisms: improving I/O utilization through intelligent cache move scheduling, minimizing GPU idle time during context switches, and eliminating redundant I/O operations in multi-turn conversations. Unlike traditional block-based key-value cache memory strategies that prioritize memory efficiency at the cost of fragmentation and granularity limitations, FastSwitch implements a balanced approach that promotes smoother context switches while maintaining efficient memory usage. This integrated scheduling method enables dynamic priority adjustments to ensure fairness while minimizing the performance impact of context switches.

The following diagram shows the overall structure of FastSwitch.

2446

4.3.3 Layer-specific and Hierarchical Scheduling

LayerKV

To address the growing Time-of-First-Token (TTFT) latency challenge in large-context LLM services, LayerKV introduces a layer-wise KV cache scheduling approach. Its contribution lies in its fine-grained, layer-specific KV cache block allocation and management strategy, unlike traditional monolithic cache management methods. By implementing layer-specific KV block scheduling and offloading mechanisms, LayerKV achieves more efficient memory utilization and reduces queuing latency typically occurring when competing for limited GPU KV cache blocks in long context windows. LayerKV also employs an SLO-aware scheduler that optimizes cache allocation decisions based on service level objectives, allowing for dynamic management of memory resources across model layers. The following figure illustrates the overall structure of LayerKV.

2447

CachedAttention

Because multi-turn dialogues reuse all previously cached data in each turn, CachedAttention introduces a hierarchical scheduling method to address this issue and minimize the offload and reload process by keeping the KVCache in GPU memory as much as possible. This method specifically includes three strategies:

Layer-specific preloading. The movement of key-value caches within the storage hierarchy is coordinated through scheduler-aware fetch and eviction policies.
Asynchronous saving. This allows I/O operations to overlap with GPU computation.
Intelligent cache placement decisions. This strategy is based on scheduler hints to ensure that frequently accessed key-value caches reside in a faster memory tier.

CachedAttention also proposes a novel positional encoding decoupling mechanism that prevents KV cache invalidation during context window overflow through an effective truncation strategy.

The following diagram illustrates the overall architecture.

2448

The image below illustrates layered preloading.

2449

The image below illustrates asynchronous saving.

2450

The figure below illustrates positional encoding decoupling.

(a) is an absolute positional encoding, where the absolute positional encoding of the Key and Query is always the same.

(b) is a relative positional encoding.

(c) is the positional encoding that can decouple the Key and Query. By simply shifting the time of caching the KV before embedding the positional encoding in RPE, CachedAttention can store KVs without embedded positional encodings in the AttentionStore. When reusing KVs in CachedAttention, the KVs are embedded with new positional encodings and are further used for the following inference.

2451

ALISA

ALISA introduces a two-layer KV cache scheduling framework that combines algorithmic sparsity with system-level optimization. At the algorithmic level, a sparse window attention mechanism identifies and prioritizes the most important tokens for attention computation, creating a hybrid of global dynamic and local static sparsity patterns, significantly reducing KV cache memory requirements. At the system level, its three-phase token-level dynamic scheduler manages KV tensor allocation and optimizes the trade-off between caching and recomputation. The scheduler dynamically determines which tokens to cache in GPU memory and which to recomputation based on token importance and system resource constraints.

The figure below illustrates the calculation of Sparse Window Attention.

2452

The diagram below illustrates the dynamic scheduling mechanism.

2453

0xFF Reference

Accelerating Large Model Inference – ALISA Dawn

Let’s talk about memory management for large model inference: CachedAttention/MLA ( Dao Dao Ning)

Y. Xiong, H. Wu, C. Shao, Z. Wang, R. Zhang, Y. Guo, J. Zhao, K. Zhang, and Z. Pan, “Layerkv: Optimizing large language model serving with layer-wise kv cache management,” 2024. [Online]. Available: https://arxiv.org/abs/2410.00428

B. Gao, Z. He, P. Sharma, Q. Kang, D. Jevdjic, J. Deng, X. Yang, Z. Yu, and P. Zuo, “Cost-efficient large language model serving for multi-turn conversations with cachedattention,” 2024. [Online]. Available: https://arxiv.org/abs/2403.19708

Y. Zhao, D. Wu, and J. Wang, “Alisa: Accelerating large language model inference via sparsity-aware kv caching,” 2024. [Online]. Available: https://arxiv.org/abs/2403.17312

R. Shahout, C. Liang, S. Xin, Q. Lao, Y. Cui, M. Yu, and M. Mitzenmacher, “Fast inference for augmented large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2410.18248

Fast Distributed Inference Service for Large Language Models ( by Huang Yu)

BatchLLM: Offline Large-Batch Scenario LLM Inference Optimization Programmer Li Xuntian

Let’s talk about FastServe’s clever flow in large-scale model inference systems.

GPU-accelerated LLM paper praised | InfiniGen efficiently implements generative inference in LLM through dynamic KV cache management: Motivation for GPU developers

LLM inference latency analysis

ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition Zhao Shangchun

AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in *Large Language Model Serving

Modular Attention Reuse for Low-Latency Inference(@yale.edu)

Paper: “Efficiently Programming Large Language Models using SGLang”

ZigZagKV and KVCache achieve an 80% compression rate, resulting in near-lossless performance.

ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition(@microsoft.com)

Efficient Prompt Caching via Embedding Similarity(@UC Berkeley)

Modular Attention Reuse for Low-Latency Inference(@yale.edu)

33x LLM inference performance improvement: Best practices from Character.AI

[ CLA] Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

[ GQA] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

[KV Cache Optimization] MQA/GQA/YOCO/CLA/MLKV Notes: Sharing DefTruth in Intra-Layer and Inter-Layer KV Cache

[LLM Inference Service Optimization] DistServe Quick Read—Prefill & Decode Decoupling, Model Parallelism Strategy & GPU Resource Allocation Decoupling ( by A-Jie)

[LLM] KV cache detailed explanation with illustrations, video memory, computational complexity analysis, and code. Don’t laugh at Fourier’s code.

[LLM Performance Optimization] A Discussion on Performance Optimization Directions for Long Text Inference ( by A-Jie)

[ MLKV] MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding

[ MQA] Fast Transformer Decoding: One Write-Head is All You Need

[Prefill Optimization][10,000 words] Principles & Diagrams of vLLM Automatic Prefix Cache (RadixAttention): First Token Latency Optimization DefTruth

[ YOCO] You Only Cache Once: Decoder-Decoder Architectures for Language Models

A Survey on Efficient Inference for Large Language Models

akaihaoshuai

Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding

Confused with four attention mechanism and their performance mentioned by paper · Issue #33 · mit-han-lab/streaming-ll

Continuous Batching: A Powerful Tool for Improving LLM Deployment Throughput (Magic Square AI)

DejaVu: KV Cache Streaming For LLM

Efficient Streaming Language Models with Attention Sinks

GPT Model and K/V Cache: A Guide to Bright Moon and Rising Sun

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

https://lmsys.org/blog/2024-01-17-sglang/

Hydragen: High-Throughput LLM Inference with Shared Prefixes (@Stanford University)

Inference without Interference: Decoupled LLM Inference Services for Mixed Downstream Workloads (Zhu Yan Jian Zhi)

How might Kimi Chat’s large model achieve a lossless context of 2 million characters? [GiantPandaCV]

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

LaViT: That works too. Microsoft proposed directly using the attention weights of the previous layer to generate the attention weights of the current layer | CVPR 2024 VincentLee

LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding

LLaMa Principles and Source Code – Deconstruction (KV-Cache, Rotary Positional Embedding, RMS Norm, Grouped Query Attention, SwiGLU) (Yuezero_)

LLM Inference Series: 3. KV caching unveiled

Exploring LLM Inference Optimization (2): Detailed Explanation of Transformer Model KV Caching Technology (Baihai IDP)

Exploring LLM Inference Optimization (3): How to Effectively Control the Memory Usage of KV Cache and Optimize Inference Speed? Baihai IDP

LLM Inference Beginner’s Guide ②: In-depth Analysis of Key-Value Caching OneFlow

LLM Inference Acceleration 02 KV Cache The Contemplative Stoic Nine

LLM Inference Acceleration 3: Summary of Inference Optimizations Mooncake/AttentionStore/vllm0.5/cache optimizations etc.

LLM reasoning acceleration research akaihaoshuai

A Brief Description of the LLM Inference Algorithm ( by Lu Chun)

Six latest works optimizing LLM KV caches, including MiniCache and PyramidInfer. (AI Talk)

MLKV: Cross-layer KV cache sharing, reducing memory footprint.

MLSys 2024 Keyformer: Select key tokens to reduce KV caching by 50% MLSys2024 [MLSys2024]

MODEL TELLS YOU WHAT TO DISCARD: ADAPTIVE KV CACHE COMPRESSION FOR LLMS

Mooncake (1): Making mooncakes on the dark side of the moon, Kimi’s split reasoning architecture centered on KVCache. ZHANG Mingxing

Mooncake (2): Kimi How to handle the “torrential traffic”? Predictive scheduling strategy under a separate architecture ZHANG Mingxing

Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving Hand-pulled Pancake Bear

Orca: How to Serve Large-scale Transformer Models WXB506

OSDI22 ORCA: A classic example of LLM Serving system optimization, featuring dynamic batching and a 36x increase in throughput. MLSys2024

OSDI24’s latest masterpiece: InfiniGen, a high-efficiency key-value cache management framework for large model inference, achieving a 3x increase in throughput and a 32% improvement in accuracy. (MLSys2024)

SnapKV: Key-Value Cache Sparsity, Zero-Fine-Tuning Accelerates Long-Sequence LLM Inference AI Casual Discussion

SnapKV: LLM Knows What You are Looking for Before Generation

Token Sparsity + Channel Sparsity: 16x LLM Inference Acceleration for AI Chat

Transformers KV Caching Explained

TriForce: KV Cache sparsity + speculative sampling, 2.3x lossless acceleration of LLM. vAttention: used to serve LLM BBuf without Paged Attention.

YOCO: Long sequence optimization, 1M long text 10x LLM inference acceleration AI chat

You Only Cache Once: Decoder-Decoder Architectures for Language Models

ZHANG Mingxing: Mooncake (1): Making mooncakes on the dark side of the moon, Kimi’s KVCache-centered split reasoning architecture

[Tearing Down an LLM-KV Cache] Why is there no Q-Cache?? ( by Xiaodonggua AIGC)

Understanding KVCache: Why Did You Kaichao

Accelerate LLM Inference with KV Cache but No Q Cache? (DefTruth)

A Comprehensive Analysis of LLM Inference Optimization: Technology, Applications, and Challenges (AI Talk)

Random Thoughts on Mooncake by Xu Xinran

Dao Dao Ning: Notes: A Brief Analysis of Llama.cpp Code (Part 1): Parallel Mechanism and KVCache

Dao Dao Ning: Notes: A Brief Analysis of Llama.cpp Code (Part Two): Data Structure and Sampling Method

Analysis; KV Cache Optimization Technique in the Autoregressive Generation Process of the GPT Model (sudit)

Illustrated Explanation of Mixtral 8 * 7b Inference Optimization Principles and Source Code Implementation by Mengyuan

Illustrated Guide to Accelerating Large Model Computation Series: Decoupled Inference Architecture 1, Starting with DistServe and the Mighty Ape

Illustrated Guide to Accelerating Large Model Computation Series: Separate Inference Architecture 2, Chunked-Prefills for Fuzzy Separation and Merging Boundaries ( Ape)

The Taizu Long Fist of Parallel Inference in Large Models: Deciphering Jeff Dean’s Outstanding MLSys 23 Paper by Fang Jiarui

Large Model Inference Optimization — Prefill Stage seq_q Segmentation kimbaol

Large Model Inference Optimization Practice: KV Cache Reuse and Speculative Sampling [Alibaba Technology]

Large Model Inference Optimization Techniques - KV Cache [Eat Jelly Without Spitting Out the Jelly Shell] https://www.zhihu.com/people/liguodong-iot

Accelerating Inference for Large Models: The Hidden Approach of KV Cache Sparsity

Accelerating Large Model Inference: Learning Key Values from Graphs and Cache from Graphs

Understanding KV Cache for Large Model Inference Performance Optimization (Young)

100x Acceleration of Large Model Inference: Sparse KV Cache ( Zhang)

Plain Language Explanation of Continuous Batching Gnuey Iup

A Review of Inference Performance Optimization for Large Language Models (Young)

Microsoft MInference: Million Token Sequence, 10x Acceleration

Critical Thinking | H2O: Solving the Long Context Problem of LLM from a Caching Perspective DSTS Center [Data Space Technology and Systems]

Building a High-Performance Large Model Inference Platform: Prefill and Decode Separation Series (Part 1): Microsoft’s New Product SplitWise Improves GPU Utilization Through PD Separation. Doraemon is Not a Dream.

Mooncakes in the Dark: Centered on KVCache, a split inference architecture boosts LLM throughput by 75%.

Extreme Exploration: In-depth Analysis of LLM Inference Speed MLSys2024

Unorthodox Approach: Modifying the KV Cache to Allow for Infinite Output of Large Models!

Dr. Fu Yao - Full-Stack Transformer Inference Optimization Season 2: Deploying Long Context Models - Translation Reinforcement Apprentice

Let’s talk about split reasoning in large-scale model reasoning. (by Dao Dao Ning)

Let’s talk about optimization problems in large model inference services. (by Dao Dao Ning)

A-Jie: In-depth analysis of vLLM & PagedAttention papers (Part 1) – Current status and optimization strategies of LLM services

Efficient Generation: An Interpretation of the Splitwise Paper

https://yaofu.notion.site/Full-Stack-Transformer-Inference-Optimization-Season-2-Deploying-Long-Context-Models-ee25d3a77ba14f73b8ae19147f77d5e2#d03b8a2a583246cea5a1fdeb8f583586

Longformer

Efficient Streaming Language Models with Attention Sinks

https://www.evanmiller.org/attention-is-off-by-one.html?continueFlag=5d0e431f4edf1d8cccea47871e82fbc4

Massive activations in large language models. arXiv preprint arXiv:2402.17762, 2024. Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu.

PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference

MiniCache: KV Cache Compression in Depth Dimension for Large Language Models

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

ZigZagKV and KVCache achieve an 80% compression rate, resulting in near-lossless compression. (Du Lingxiao)

SnapKV: LLM Knows What You are Looking for Before Generation

FastGen: Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

[ Prompt Cache ]: Modular Attention Reuse for Low-Latency Inference(@yale.edu)

[ Cache Similarity ]: Efficient Prompt Caching via Embedding Similarity (@UC Berkeley)

[ Shared Prefixes ]: Hydragen: High-Throughput LLM Inference with Shared Prefixes (@Stanford University)

[ ChunkAttention ]: ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition (@microsoft.com)

ClusterKV, a near-lossless KV cache compression method. (Du Lingxiao)

Illustrated Guide to Accelerating Large Model Computation Series: Separate Inference Architecture 2, Chunked-Prefills for Fuzzy Separation and Merging Boundaries ( Ape)

Abridged version of “A Review of LLM Acceleration Research Based on KV Cache Management” by Chang Hua and Andy

[KV Cache Optimization] MQA/GQA/YOCO/CLA/MLKV Notes: Sharing DefTruth in Intra-Layer and Inter-Layer KV Cache

Venkat Raman. Essential math & concepts for llm inference, 2024. URL https://venkat.eu/essential-math-concepts-for-llm-inference#heading-insights-from-model-latency-amp-understanding-hardware-utilization-on-modern-gpus.

Yao Fu. Challenges in deploying long-context transformers: A theoretical peak performance analysis. arXiv preprint arXiv:2405.08944, 2024

A Survey on Large Language Model Acceleration based on KV Cache Management