Exploring the Transformer Series (26) --- KV Cache Optimization: PD Separation or Merging

0x00 Overview

In the inference process of large models, the task can typically be divided into two phases: the Prefill phase processes all input tokens, generates the first output token, and builds a KVCache. The Decode phase utilizes the KVCache for multiple iterations, generating one token per iteration. Because the Prefill phase processes many tokens in parallel, it is computationally intensive, and its latency is measured by the first token latency (TTFT). In contrast, the Decode phase becomes memory intensive due to the frequent loading of the ever-growing KVCache, and its latency is measured by TPOT.

Because the Pefill and Decode phases have vastly different characteristics, it’s difficult to find a perfect solution that simultaneously satisfies the needs of both phases. Researchers have employed various ingenious methods. This article will guide you through exploring the various solutions used in academia and industry.

Note:

The complete list of articles is here. It’s estimated to eventually have around 35 articles. This list will be updated after each subsequent article is published. (Cnblogs Exploring Transformer Series: Article List)
This series is a study and interpretation of papers, blogs, and code, drawing on many articles from online friends, to whom I express my gratitude and will list them in the references. Because there are so many references in this series, there may be omissions in the citations. If the original authors find any omissions, please point them out, and I will add them to the references.

0x01 Background Knowledge

1.1 Autoregression & Iteration

When performing text generation tasks, an autoregressive language model based on the Transformer architecture accepts a text sequence as input. The model completes the sequence by generating successive output tokens, which is an iterative process. We define an iteration of the model as the operation from accepting a request input to outputting a token. For each request, the model progressively generates parts of the output sequence until a stop token is generated or the maximum sequence length is reached. This means that each time the model performs a forward pass, it obtains an additional output token. For example, if we use the sentence “What is the capital of California:” as a prompt, it will require ten forward passes to obtain a complete response, i.e., ["S", "a", "c", "r", "a", "m", "e", "n", "t", "o"]. That is, an LLM request may involve multiple iterations.

The diagram below illustrates the architecture of a three-layer GPT model, where nodes and edges represent Transformer layers and inter-layer dependencies, respectively. The model executes Transformer layers according to the node numbers, with nodes in the same layer using the same model parameters and filled with the same color (different shades). The generated output labels are fed back into the model to generate the next output label, forming a sequential reasoning process.

In the example, the reasoning process includes three iterations (corresponding to the three green labels on the diagram). These three iterations can be further divided into two stages:

Initialization Phase. This phase typically consists of only one iteration, responsible for generating the first output token by processing all input tokens in parallel. In the diagram, this corresponds to the first iteration (“iter 1”), which accepts all input tokens (“I think this”) at once and generates the next token (“is”). In fact, this iteration corresponds to the prefill phase.
Incremental Phase. This phase typically consists of multiple iterations, each of which takes the output token from the previous iteration and generates the next token (only one token can be processed per iteration). In the diagram below, this phase includes the last two iterations (“iter 2” and “iter 3”). In this example, “iter 3” is the last iteration because it generates the special end token <eos>. The output generation process terminates at this point. Each iteration here corresponds to the decode phase.

2601

In summary, we call the process of a request completing prefill and producing the first token one iteration, and each time a request is decoded, it is also called one iteration.

1.2 KV Cache

The following diagram illustrates the role of the KV Cache in the prefill and decoding stages. (a) shows an example of autoregressive LLM inference at the top and a module in the transformer layer at the bottom. (b) illustrates the operation of the KV Cache. In the prefill stage, all input tokens are processed simultaneously, and the resulting intermediate KV tensors are stored, marked in dark. s and d represent the length of the input sequence and the hidden dimension of the KV tensors, respectively. In the decoding stage, the stored dark KV tensors are retrieved. The input Q, K, and V tensors are marked in light. The input Q tensor is multiplied by the concatenation of the input K and the stored K tensor, followed by a softmax of the entire attention weights. The attention weights are further multiplied by the concatenation of the input V and the stored V tensor to generate a new result. The input K and V tensors are then stored in the KV Cache. This process is repeated for each token.

The prefill phase processes all tokens in the input prompt in parallel, outputs the first token, and generates a key-value cache for future decoding. Because the input sequence is very long, the computational overhead is high; even a small batch size will fully utilize the GPU. Increasing the batch size keeps the overhead per token in the prefill phase almost unchanged.
The decode phase (with KV Cache enabled) uses the previous KV Cache to generate new tokens incrementally in an autoregressive manner. Only one token is generated per autoregressive step in the decode phase. Although the input sequence length is always 1, the KV Cache needs to be read repeatedly, resulting in significant I/O overhead. The overhead of a single token during decoding is significantly greater than in the prefill phase. The batch size for the decode phase needs to be configured very large to potentially fully utilize the GPU, but this becomes impractical due to the high read/write overhead of the KV Cache, resulting in very low GPU utilization during the decode phase.

2602

0x02 Static Batch Processing

2.1 Scheduling Strategy

As mentioned earlier, LLM is changing the way the entire industry adopts AI technology in its services, but the cost of LLM serving remains high. To reduce serving costs, many companies are currently focusing on maximizing the overall throughput of the LLM serving system, i.e., the number of requests served per second (rps). Almost all popular LLM serving engines use throughput as a primary performance metric.

To improve throughput, batch processing techniques are employed. Batch processing is a technique that feeds multiple data samples together to the model for processing. When batch processing is enabled, the execution engine merges input tensors from multiple requests into a single large input tensor before feeding it into the model for inference. Because accelerators prefer large input tensors over small ones, batch processing can process multiple samples simultaneously in a single computation compared to processing individual samples one by one. This allows for more efficient use of computational resources and improves computational speed. In LLM inference, the advantages of batch processing are mainly reflected in the following aspects:

Reduce the number of model parameter loadings: Without batch processing, model parameters need to be loaded once for each input sequence, which typically consumes a significant amount of memory bandwidth. Batch processing allows the parameters to be loaded once and then used multiple times to process multiple input sequences, thus reducing the number of loadings.
Improve memory bandwidth and computing resource utilization: GPU memory bandwidth is a limited resource. Batch processing can make more efficient use of this memory bandwidth during the computing process, thereby making more efficient use of computing resources and improving computing speed.

Traditional batch processing methods are called static batching because the batch size remains constant until the inference requests within the batch are completed. In static batching, as new requests arrive during the execution of the current batch, the engine keeps these new requests in the request queue until all requests in the current batch are completed. That is, interaction between the service system and the execution engine only occurs in two situations: when the service system schedules the next batch of requests on an idle engine; or when the engine has finished processing the current batch. The specific process is as follows:

If the current execution engine is idle, the Scheduler retrieves a batch of requests from the queue and constructs it into a batch (as shown in the diagram), $x_1$ (“I Think”) and $x_2$ (“I love”), for number 1 in the image below.
The Scheduler hands over the batch to the Execution Engine for inference, as shown in number 2 in the diagram below.
The Execution Engine processes the received batch by running multiple iterations of the model, as shown in number 3 in the diagram below.
After the Execution Engine has processed all the requests in this batch (the Execution Engine generates “this is great” and “you” as requests respectively) $x_1$ and $x_2$ the response will be uniformly returned to the Scheduler, for example, number 4 in the diagram below.
The Scheduler will retrieve another batch of requests from the request queue and build a new batch.

2603

2.2 Problems

It’s easy to see the disadvantages of static batching: it struggles to effectively handle sequences of varying lengths. Static batching can lead to inefficient GPU utilization when different sequences within a batch have different generation lengths. This is because different sequences may be generated in different iterations within the batch, and static batching waits for all sequences to complete before processing a new one. This results in inefficient GPU utilization while waiting for the last sequence to finish. Since processing a single request via autoregression may require running the model multiple times, and different requests generate text of varying lengths, the required number of iterations may differ. This can cause some requests to complete prematurely while others are still in progress. Once a sequence with a longer generation time exists, other sequences with shorter generation times are forced to wait and cannot be immediately returned to the client, increasing latency for the client and underutilizing GPU computing resources, wasting GPU power. As shown in the diagram above, $x_2$ completed ahead of schedule, but $x_1$ the iteration is not yet complete, so $x_2$ unable to return immediately. New requests arriving after the last assignment ( $x_3$ , $x_4$ ) must also wait for the current batch. $x_1$ distributing goods can only proceed after all processing is complete, which significantly increases queuing time.

Therefore, the problems we need to solve now are as follows:

Question 1: How to handle early-finished requests? Different requests generate text of varying lengths, and short text requests can only wait for long text requests to finish processing. Because the length of generated text is difficult to predict, it’s challenging to group texts of the same length into a single batch for processing. Therefore, a mechanism is needed to remove completed requests from the batch and return their results early.
Question 2: How to release resources and handle late-joining requests? New requests in the queue can only be dispatched by the scheduler after the current batch is completed, resulting in excessively long waiting times. Therefore, a mechanism is needed to insert new requests into the current processing batch.

It is important to note that the issues of early completion and delayed addition requests do not occur during model training; the training process completes the processing of the entire batch in a single iteration using the Teacher Forcing technique.

2.3 Cause Analysis

There are two reasons for the above problems, which can be analyzed from macroscopic and microscopic perspectives respectively.

2.3.1 Macroeconomic

From a macro perspective, the core issue lies in the conflict between the multi-iteration characteristics of Transformer-based generative models and request-level scheduling strategies. Multi-iteration means that different requests may require different numbers of iterations, and some requests may have very few iterations, allowing them to be processed and returned to the user early. Corresponding to this characteristic of static batch processing, traditional inference architectures employ request-level scheduling. Request-level scheduling lacks flexibility; it cannot modify the currently processed request batch, requests that complete earlier than other requests in the batch cannot be returned to the client, and newly arriving requests must wait until the current batch is completely processed. This makes current request-level scheduling mechanisms ineffective at handling workloads with multiple iterations.

Impact on indicators

From a metrics perspective, it’s clearer: static batching ensures TPOT by sacrificing TTFT and throughput. As mentioned earlier, within a batch, the prefill phase is typically implemented as an iteration by processing all input tokens in parallel, meaning subsequent iterations are all decode phases. Because all requests in the current batch must be processed before returning, static batching ensures the Execution Engine is always performing decode operations, thus guaranteeing TPOT. However, since new requests cannot be accepted during inference within this batch, throughput decreases, leading to a drop in the TTFT for new requests (new requests can only wait in the request queue and cannot perform prefill operations immediately).

Impact on the production line

Furthermore, request-based scheduling can also increase bubbles in the pipeline. When the model size is too large, multiple GPUs are often used for parallel inference, and pipelined parallelism is one common approach. The advantage of pipelined parallelism is that it only involves communication between layer activations, resulting in relatively low bandwidth requirements. In the diagram below, the model is divided into three layers and deployed on GPUs Partition1 through Partition3. A and B are two sequences, with the index indicating which model iteration is currently in progress. Assuming that both A and B are in the decoding phase, because the decoding phase involves autoregressive token output, A and B can only enter the next iteration after the first iteration is completed. This creates idle time (bubbles) on the GPU.

2604

2.3.2 Microscopic

Because the pre-filling and decoding phases share LLM weights and working memory, existing LLM service systems typically place these two phases on a single GPU and maximize overall system throughput (the number of tokens generated per second across all users and requests) by batch processing the pre-filling and decoding steps. The root cause of the problem is that current LLM systems are unaware of the different characteristics and computational patterns exhibited by the LLM pre-filling and decoding phases, nor are they clear about the different impacts of these patterns on SLO (Solution Time Limit). The pre-filling phase resembles a computationally intensive batch job, while the decoding phase resembles a memory-intensive, latency-critical task. Static batching, which places both operations on the same GPUs, is not the optimal strategy for achieving high effective throughput.

2605

Note: This refers to the processing of prefill and decoding in static batch processing, not the fusion scheme we will discuss later.

Stage characteristics

The previous chapters analyzed the differences between prefill and decode from the perspectives of technology and process characteristics. Here, we will continue the analysis by combining metrics and resource utilization.

Prefill and Decode requests have different computational characteristics and need to meet different SLOs.

Prefill. Primarily satisfies the TTFT (Time-To-Functional-Throughput) metric. This is a compute-bounded operation with extremely high computational resource requirements. Its computational cost increases quadratically with the length of the input prompt. The time consumption increases superlinearly (throughput decreases; if the time consumption increased linearly, the average throughput would remain unchanged). Even small batches of prefill tasks, or even a single long prefill task, can saturate the GPU’s computing power, fully utilizing its resources. Once the accelerator reaches saturation, the throughput of the prefill stage remains constant (we call this the accelerator saturation threshold).
Decode primarily satisfies the TPOT metric. This is a memory-bound operation, more susceptible to GPU memory bandwidth limitations. It’s a latency-sensitive task, with resource usage increasing sublinearly with the length of the generated token. Decode computation requires large batch requests to increase computational intensity and fully utilize computing resources. As the batch size increases, the processing time increases slightly (mainly due to increased data transmission time; computation time is considered relatively constant), while throughput increases approximately linearly. With increasing batch size, the throughput in the decoding stage continues to increase, but once memory bandwidth reaches saturation, throughput peaks.

Mutual influence

Running Prefill and Decode on the same GPUs can cause them to interfere with each other. Let’s examine how Prefill and Decode affect each other’s performance when the GPU is processing Prefill and Decode requests simultaneously.

For Prefill requests, since there are other Decode requests in the same batch, the computation time on the GPU will be slightly increased compared to when there are no Decode requests, thus increasing the TTFT of Prefill requests.
For Decode requests, because there are Prefill requests within the same batch, and the computation time for Prefill requests is usually much longer than that for Decode requests (generally several times or even tens of times longer), the decoding tasks in the batch must wait for the longer prefill job to complete. This causes the Decode request to be slowed down by the Prefill request, as it must wait for the Prefill request to complete its computation before the Decode request can begin calculating the next token, thus significantly increasing the TPOT of the Decode request.

This kind of interference is a classic system problem.

Mixing pre-filled requests can cause significant slowdowns because we are adding compute-intensive jobs to already saturated hardware.
Running pre-fill and decode requests together will negatively impact both, as we are simultaneously running batch processing and delay-critical jobs.
Mixed decoding requests can lead to decreased throughput because we don’t know the memory bandwidth and capacity usage, resulting in contention and queuing blockage.

Prioritizing one stage may result in the inability to meet the latency requirements of another stage. For example, a higher arrival rate (req/s) generates more pre-filled jobs. If pre-filled jobs are prioritized, more GPU time is needed to meet TTFT requirements, which in turn will negatively impact TPOT.

Merging pre-padding and decoding also couples their GPU resource allocation and parallelism strategies, preventing the implementation of different parallelism strategies better suited to meet the specific latency requirements of each stage. However, each stage has its unique computational characteristics and latency requirements, necessitating more heterogeneous resource allocation.

GPU allocation strategy. To simultaneously meet the SLOs of both TTFT and TPOT, the system must over-configure resources to meet latency targets, especially under stringent SLO requirements. These two operations have different GPU resource requirements. For the same number of input requests, Prefill requires more GPUs to meet TTFT, while Decode can require fewer GPUs to meet TPOT. However, in the original service architecture, the Prefill and Decode phases share the same GPU allocation strategy. Therefore, to meet the TTFT of Prefill operations, more GPUs are configured, which is excessive for Decode operations. To support a certain total throughput, the number of requests processed by each GPU needs to be reduced, which also requires over-configuring GPU resources. This leads to increased deployment costs, and this “pay-to-play” approach is not a long-term solution.
Model parallelism strategies. In different scenarios, due to their drastically different computational modes and latency targets, the optimal model parallelism strategies for Prefill and Decode may differ. If the service has strict TTFT requirements and low TPOT requirements, the Prefill stage tends to use Tensor Parallelism (TP) to meet the strict latency target, while the Decode stage tends to use data or pipelined parallelism to improve throughput. However, in existing service architectures, the parallelism strategies (tensor, pipelined, or data parallelism) for prefill and decode computations are essentially coupled; the Prefill and Decode stages share the same model parallelism strategy. Therefore, it is impossible to adopt a configuration that simultaneously optimizes throughput and SLOs for different stages.

De-batching and sequentially scheduling pre-filling and decoding jobs does not mitigate the interference. Decoding jobs may experience longer queuing latency because they need to wait for ongoing pre-filling jobs on the GPU. Furthermore, batches dedicated to decoding typically lead to reduced GPU utilization. Prioritizing tasks in one stage can negatively impact the latency of another stage, thus rendering priority scheduling ineffective.

2.4 Challenges

2.4.1 Addressing the issue from a macro perspective

From a macro perspective, since the problem stems from the conflict between the “multi-iteration characteristics of Transformer-based generative models” and “request-level scheduling strategies,” some researchers have proposed using iterative-level scheduling to promote concurrent processing at each stage. Orca, published at OSDI 2022, is the first paper to address this issue.

For static batching, the batch size is determined at batch startup and remains constant until any requested inference within the batch is completed. For iterative-level scheduling, however, the batch size is dynamically determined within each iteration, rather than being fixed at the start of the inference process. This means that once a sequence in the batch has been generated, a new sequence can be immediately inserted to continue computation using the GPU. This prevents the GPU from being wasted waiting for all sequences in a batch to complete, allowing for more efficient use of computing resources, resulting in higher inference throughput and faster overall inference speed.

2.4.2 Addressing the issue from a micro perspective

From a micro perspective, the current challenge is how to balance TTFT and TPOT. Assume the request pool contains requests from different users in different states; some requests are waiting for pefill, and some are waiting for decoding, but there is only one GPU.

Challenge of Question 1

Regarding question 1: How to handle requests that complete early.

Since the entire batch is not yet complete, users can only wait. Therefore, some researchers considered: could requests that have already completed inference within a batch release resources, allowing new requests to perform prefill operations? Existing requests would continue with their current operations (prefill or decode).

Challenge of Question 2

Regarding question 2: How to handle delayed join requests.

Because the pre-filling and decoding phases have different characteristics, MaaS providers set different metrics to measure their corresponding Service Level Objectives (SLOs). Specifically, the pre-filling phase mainly focuses on the delay between the arrival of a request and the generation of the first tag, i.e., Time to First Tag (TTFT). On the other hand, the decoding phase focuses on the delay between the generation of consecutive tags for the same request, called the Time Between Tag (TBT).

How can we meet these criteria? There are different approaches.

When using pipelined parallelism, some researchers wondered if it was possible to insert new requests at the bubble level to perform prefill or decode operations? This would fill the bubble without affecting the current request.
When not using pipelined parallelism, should the next step be processing requests waiting for prefill or requests waiting for decoding? This depends on which metric—TTFT (Time-To-Fulfill) or TPOT (Total Time-to-Origin)—is sacrificed for TTFT. Suppose the request pool has several prefill requests. If the system prioritizes processing prefill requests, the first token can be returned to the user quickly. However, processing several prefill requests consecutively prevents decoding. This sacrifices TPOT for TTFT. Conversely, if the system processes several decode requests consecutively to allow the user to quickly obtain subsequent tokens, prefill requests will have to wait, thus sacrificing TTFT for TPOT.

It seems that regardless of the scheduling strategy, the TTFT and TPOT metrics are strongly coupled. Are the prefill and decode phases inherently incompatible and must be developed separately? Or can they be integrated and progress together?

2.5 Factions

Regarding the questions above, there are two main schools of thought on handling Prefill and Decode: the integration school and the separation school. Here, we quote the opinion of Fang Jiarui, a prominent figure on Zhihu (a Chinese Q&A website).

2.5.1 Fusionist School

The fusion approach involves processing prefills during the decoding process or performing inference together with a specific decode step. The pioneering work was Orca, released at OSDI in 2022, which merged prefills and decodes as a batching step for forward propagation. Following this, tools like Sarathi (https://arxiv.org/pdf/2401.08671) and FastGen (https://arxiv.org/pdf/2403.02310) broke down the prefill sequence into chunks, allowing the processing of prefills for each chunk to be inserted into the decoding stage, or even merged with the decoding stage.

The advantage of the fusion approach is that the processing time after merging the prefill chunk and decode may be similar to the processing time of decode alone. This is equivalent to prefill utilizing the computing power that was not fully utilized during decode. Moreover, since both the decode and prefill stages need to read some fixed data (such as model weights), decode can also take advantage of prefill, merging multiple data reads into one read for multiple uses.

The drawback of the fusion approach is that prefill and decode run on the same device, and their parallelism is tied. If the prompt length is limited, the prefill stage accounts for a small portion of the request processing time, and computing power can still be utilized effectively. For long sequences, the proportion of prefill increases, and the inflexibility of prefill parallelism becomes apparent.

2.5.2 Secessionists

The split approach: Considering the differences in the properties of Prefill/Decode, people began to try to put Prefill and Decode on different devices for separate processing. For example, Splitwise (https://arxiv.org/abs/2311.18677) and DistServe (https://arxiv.org/pdf/2401.09670) are typical examples.

Separating pre-filling and decoding naturally resolves interference between the two stages, allowing each stage to focus on its optimization objective—TTFT or TPOT. Different resource and parallelism strategies can be used for each instance type to meet various latency requirements. By adjusting the number of GPUs and parallelism provided for both instance types, we can maximize the throughput per device across the entire system, avoid over-provisioning resources, and ultimately achieve a reduction in the cost per query that meets Quality of Service (QoS) requirements.

The separation approach allows for different parallelism levels for prefill and decode, providing greater flexibility and enabling the system to evolve in two opposing directions: better utilization of computing power and better utilization of bandwidth. This allows for better hardware optimization. For example, different types of GPUs can be used for prefill and decode, or more GPUs can be allocated to long sequences, thereby reducing the TTFT of long sequences.

The biggest challenge faced by the split approach is how to transfer the KVCache between different devices, which leads to high network costs and requires high-specification network hardware for interconnection within the cluster. Handling long sequences of parallel processing is also a thorny issue.

Next, we’ll see how the two factions handle these issues.

0x03 Fusionists

The fusion approach involves “free-riding.” Although many inference engines employ state-of-the-art batch processing and scheduling techniques, the decoding phase doesn’t fully utilize GPU computing resources. Therefore, prefill can be processed during decoding intervals, or prefill can be performed alongside a decoding step during inference. A typical example is:

ORCA: Adjusts the scheduling granularity from the request level to the iteration level, and combines it with selective batching for optimization.
Sarathi-Serve: It uses a chunked prefill strategy to prefill prompts of different lengths by splitting them into chunks of uniform length, and uses the gaps between these chunks for decoding.

Note: Currently, the industry refers to solutions based on the ORCA principle as Continuous Batching. Continuous batching is an optimization technique that allows for dynamic adjustment of the batch size during the generation process. Specifically, once a sequence is generated in a batch, it can be immediately replaced by a new sequence, thereby improving GPU utilization. The key to this method is adapting to the current generation state in real time, rather than waiting for the entire batch of sequences to complete.

Unlike static batch processing, continuous batch processing uses iterative-level scheduling. It doesn’t wait for each sequence to finish generating in a batch before processing the next. Instead, the scheduler determines the batch size as needed in each iteration. This means that before each iteration, the scheduler checks the status of all requests. Once a sequence has finished generating in a batch, a new sequence can be immediately inserted into the same position, while completed requests are removed.

3.1 ORCA

To address the challenge of handling requests that are completed early and joined late, ORCA’s solution is to reduce idle time using iterative scheduling, that is, to control execution at the iteration level rather than the request level, and to combine this with selective batching for optimization.

3.1.1 Iterative Scheduling

Target

The goal of iterative scheduling is to detect requests that have completed inference in a timely manner and remove them from the batch so that new requests can fill the positions of old requests, so that new requests and old requests can continuously form new batches.

Scheduling

As mentioned earlier, Orca was the first paper to introduce Iteration-Level Schedule. Specifically, after each iteration (prefill or decode) of all requests within a batch, the scheduler interacts with the engine to check if any requests in the batch have completed inference, thus deciding whether to update the batch. This allows scheduling operations to be inserted between GPU inference iterations, enabling the addition and deletion of batch samples and the dynamic allocation and release of GPU memory.

The diagram below illustrates the difference between request-level scheduling and iterative-level scheduling. The former requires multiple iterations of the scheduling batch before all requests are completed, while ORCA, when scheduling tasks, submits only one iteration’s computation to the Execution Engine each time, rather than waiting for the entire request to complete. This allows ORCA to dynamically change the requests to be processed in each iteration; new requests only need to wait for a single iteration to be processed, thus avoiding early-finished requests waiting for other requests to finish. Through iterative-level scheduling, the scheduler has complete control over which requests are processed and the number processed in each iteration.

2606

plan

The diagram below illustrates the architecture and overall workflow of an ORCA system employing iterative scheduling. The ORCA system includes the following modules:

Endpoint. Used to receive inference requests and send responses.
Request Pool. Newly arriving requests are placed in the request pool, and this component is responsible for managing the lifecycle of all requests in the system.
The scheduler monitors the request pool and is responsible for the following tasks: selecting a set of requests from the pool, scheduling the execution engine to perform model iterations on these requests; receiving the execution results (i.e., output tokens) returned by the execution engine, and appending each output token to the corresponding request to update the request pool.
Execution Engine. The execution engine is an abstraction layer that performs the actual tensor operations, which can be parallelized across multiple GPUs distributed across multiple machines.

Next, let’s look at the workflow in the diagram below, where the dashed lines represent interactions between components, which occur in each iteration of the execution engine. $x_{ij}$ is the j-th token of the i-th request. The shaded token represents the input token received from the client, while the non-shaded token is generated by ORCA. For example, in the request $x_1$ initially, it had two input tags ( $x_{11}$ , $x_{12}$ ). So far, two iterations have been run, with the first and second iterations generating $x_{13}$ and $x_{14}$ . On the other hand, requests $x_3$ only includes input tags $x_{31}$ , $x_{32}$ , ask $x_4$ include $x_{41}$ , $x_{42}$ , $x_{43}$ because they haven’t run any iterations yet.

The workflow consists of the following steps:

The scheduler interacts with the request pool to determine which requests to run next. (See the diagram below for the number ➀.)
The scheduler invokes the engine for the four selected requests ( $x_1$ , $x_2$ , $x_3$ , $x_4$ ) perform one iteration. At this time, because $x_3$ and $x_4$ no iterations have been run yet, therefore the scheduler is $x_3$ transfer $x_{31}$ , $x_{32}$ for the execution engine, $x_4$ transfer $x_{41}$ , $x_{42}$ , $x_{43}$ for the execution engine. Corresponds to the icon number ➁ below.
The engine runs model iterations for the four requests, corresponding to the number ➂ in the icon below.
The engine generates an output token ( $x_{15}$ , $x_{23}$ , $x_{33}$ , $x_{44}$ ). The result is returned to the scheduler, corresponding to number ➃ in the icon below. After each engine return and receiving the execution result of that iteration, the scheduler checks whether the request is complete. If the request is complete, the request pool removes the completed request and notifies the endpoint to send a response back to the client.
For newly arriving requests, they have a chance to be processed after the current iteration has finished (i.e., the scheduler may select the new request as the next to be executed). Because newly arriving requests only need to wait for one iteration, queuing latency is significantly reduced.

ORCA does not handle canceled requests. In fact, these requests should be removed from the batch and the corresponding memory should be released in a timely manner.

2607

algorithm

The diagram below details the algorithm for selecting requests in each iteration. This algorithm doesn’t ideally control the timing of K/V cache release. The K/V cache is released immediately after request generation ends. In multi-turn dialogue scenarios, this mechanism leads to redundant computation: “Previous turn generates K/V cache -> releases K/V cache memory -> generates the previous K/V cache via the current turn’s prompt.” This worsens the First Token Time (the latency to generate the first token) metric in subsequent turns.

2608

3.1.2 Selective Batch Processing

In the ORCA era, Paged Attention did not exist, so Selective Batching was needed to decouple attention computation from batching. That is, to improve computational efficiency, a way needed to enable the engine to process any selected set of requests in batches.

question

In the preceding analysis, we made a simplified assumption: that all request sequences have the same length. This is due to the unique nature of GPUs; to execute multiple requests in batches, each request should contain the same operations and consume input tensors of the same shape. However, in reality, the lengths of request sequences vary. While the padding and masking method can solve this, it severely wastes computing power and GPU memory, which is detrimental to inference GPUs with limited computing power and GPU memory.

The above challenges are exacerbated when using iterative-level scheduling. This is because:

Requests in a request pool may have different characteristics.
The calculation methods for prefill and decode are different.
The prefill process is a long sequence of parallel computations, while the decode process is token-by-token.
The prefill process does not require reading the KV cache, but the decode process does.
For prefill, the prompt length varies for each request.
For decoding, the index of the decode token is different for different requests, which means that the mask matrix for calculating attention is also different for them.
Iterative scheduling methods may result in different processing progresses for different requests within the same batch, meaning the shape of the input tensor may vary depending on the number of tokens processed.

Let’s use the architecture diagram above as an example to analyze how, even for a pair of requests $(x_i, x_j)$ , there is no guarantee that their next iteration will merge or replace them with batch versions. There are three situations that may prevent request pairs from being merged into batch versions:

Both requests are in the initialization phase, but the number of input tokens differs (as shown in the image below). $x_3$ and $x_4$ in other words, the “length” dimension of the input tensors is not equal, so the two requests cannot be batched.
Both requests are in the incremental phase, but they are processing different token indexes (e.g., $x_1$ and $x_2$ ). Because each request processes a different token index, the tensors of attention keys and values have different shapes, making it impossible to merge batches.
The requests are in different stages: some are in the initialization stage, and some are in the incremental stage (e.g., $x_1$ and $x_3$ ). Because the number of input tokens differs across iteration phases (the initialization phase processes all input tokens in parallel to improve efficiency, while the incremental phase processes only one token per iteration), batch processing cannot be merged.

In the second case, in a typical batch processing mechanism, each iteration of the Transformer layer receives a 3D input tensor [B,L,H] composed of multiple [L,H] request input tensors from the batch, where B is the batch size, L is the number of tokens processed together, and H is the hidden size of the model.

2609

However, in the diagram below, “iter 1” (initialization phase) receives an input tensor of shape [2,2,H], while “iter 2” (increment phase) receives a tensor of shape [2,1,H]. However, in the diagram above, when the scheduler decides to perform iterations on the batch (x1, x2, x3, x4), the inputs requested in the initialization phase (x3: [2,H] and x4: [3,H]) cannot be merged into a single tensor of shape [B,L,H] because x3 and x4 have different numbers of input tokens, 2 and 3 respectively.

2610

The main problem with batching described above is that the three cases mentioned correspond to irregularly shaped input (or state) tensors that cannot be merged into a single large tensor and input into a batch operation. Therefore, not all requests can be batched together in any Iteration. Batching is only applicable when two selected requests are in the same phase and have the same number of input tokens (in the initialization phase) or the same token index (in the increment phase). This limitation significantly reduces the likelihood of performing batching in real-world workloads, as the scheduler needs to find two requests that meet the batching criteria simultaneously.

Ideas

A good approach to solving these problems is to find commonalities in the computation of these requests, maximizing computational merging. Differences can then be handled separately. Let’s first review the computations a sequence undergoes using a transformer decoder block as an example. The following diagram shows the various computation types of the decoder block. As you can see, the Transformer decoder block can be viewed computationally as the sum of six operations: pre-proj, attn, post-proj, ffn_ln1, ffn_ln2, and others (such as layer normalization, activation functions, and residual connections). The Transformer outputs a tensor of shape [B, L, H]. Here, B is the batch size, L is the input token length, and H is the model’s embedding size. The KV cache size for each token is [1, H].

2611

By refining the above introduction, we can obtain the following important information: Operations in the Transformer layer can be divided into two types: Attention and non-Attention, and the operators of these two modules have different characteristics.

preproj/postproj/FFN1/FFN2 These modules mainly contain operators such as Add, Linear, and GeLU. The characteristics of these operators are:
- There’s no need to distinguish which request the token came from. Therefore, although they are token-wise independent, batch processing can be used.
- It is independent of the length of the input sequence. This means that we can flatten all the tokens in a batch into a single line for computation (while maintaining their respective position vectors). In this way, inputs of different lengths can also be grouped into batches for computation. For example, the input tensors of x3 and x4 above can be combined into a two-dimensional tensor [ΣL,H] = [5,H], without requiring an explicit batch dimension.
- We need to read the model weights from GPU memory. Reading the model weights means we should try to increase the batch size so that a single read can benefit more requests, thereby reducing the number of I/O operations.
attn The features of this module are:
- Because the computation is affected by the differences between the sequences (e.g., different mask matrices for different sequences, whether or not the KV cache needs to be read), the sequences need to be split and processed independently, meaning the batch dimension is important. However, since the attn part itself does not involve reading weights, splitting the sequences for processing will not incur additional IO overhead in this regard.
- For attention operations, neither batching token-wise nor request-wise tackling can be performed because attention operations require the concept of a request to compute the attention between tokens of the same request.
- Not batching the Attention layer has little impact on efficiency because the operations of the Attention layer do not involve the reuse of model parameters, and batching cannot reduce GPU memory reads.

plan

In summary, within the Transformer Layer, not all operators require the inputs comprising a batch to have the same shape. Based on this, Orca proposed a second core technique: Selective batching. Instead of batching all tensor operations (attention and non-attention) that make up the model, it selectively applies batching to only a few non-attention operations. That is, it applies different types of operations to different types of requests to solve the problem, as detailed below:

Each attention operation is processed separately. That is, for operators that require the same shape to batch (such as Attention), different inputs are computed separately.
Other layers (such as the MLP layer) are processed in batches.

The diagram above illustrates the selective batch processing mechanism for the request batch (x1, x2, x3, x4) described in Figure 4. The diagram shows the system generating the first token for requests x3 and x4 (referred to as the Initiation Phase in the paper), while simultaneously generating subsequent tokens for requests x1 and x2 (referred to as the Increment Phase in the paper). Because the key-value cache only exists after the first token is generated, checking the Attention Key-Value Manager to see if there is a corresponding key-value cache for each request allows the system to determine which phase each request is in. The specific mechanism is as follows:

Apply batch processing.
This batch has 7 input tokens to process, so we set the shape of the input tensor to [7,H]. The meaning of [7,H] is: given [(s1, h), (s2, h), ...] inputs of shape, we stack them into a (sum(si), h) large matrix of shape.
Apply the Linear non-attention operation to this stacked matrix. Linear computation does not involve interactions between tokens, so the input Batch dimension and Seq dimension are reshaped into a single dimension to complete the Linear Batch computation.
The results of the dense layer are split back [(s1, h), (s2, h), ...]. Before the attention operation, we insert a split operation to distinguish the tensors of each request in the batch. (Note: Some readers may wonder why the input and output dimensions of the Linear are [7,H] and [7,3H] respectively. Why is the dimension increased? This is because matrix concatenation allows QKV to be computed simultaneously with a single Linear, meaning the output contains H-dimensional Q, H-dimensional K, and H-dimensional V, totaling 3H dimensions.)
The attention operation is run separately for each request in the split tensor. The calculation of the attention involves the interaction between tokens. OCRA’s operation is to split the input by batch dimension and calculate the attention for each sample separately.
Then, the output of the attention operation is merged back into a tensor of shape [7,H] by a merge operation to restore the batch processing capability for other operations.

Although ORCA combines the Prefill and decode phases, their computation patterns are actually quite different. Performing them separately allows for better optimization (for example, when computing Linear, the computation kernel of the Initiation Phase is closer to GEMM, while the computation kernel of the Increment Phase is closer to GEMV).

The subsequent work involves performing batch calculations on the requests in the Initiation Phase, and then merging these requests with the requests already in the decoding phase for batch calculation. Some frameworks manage this issue through hyperparameters: the ratio of requests waiting to be served to the ratio of requests waiting to be marked as ending (waiting_served_ratio).

2612

3.1.3 Parallel Production Lines

In addition, ORCA’s scheduler enables worker threads in the engine to execute pipelines for multiple batches.

ORCA’s scheduler enables the execution of workers in the engine to be pipelined across multiple batches. If the number of currently scheduled batches n_scheduled reaches the number of worker threads n_workers (lines 9-10 of Algorithm 1), the scheduler stops waiting. In this way, the scheduler maintains a constant number of concurrently running batches in the engine n_workers, meaning that each worker thread in the engine is processing a batch and is not idle.

The diagram below illustrates the execution pipeline of three ORCA worker threads, with a maximum batch size of 2. Assume request A arrives before B, B arrives before C, and so on. First, the scheduler selects requests A and B based on their arrival times, and the scheduling engine processes the batch containing A and B (called the AB batch). Worker1, Worker2, and Worker3 process this batch sequentially. The scheduler only waits for the AB batch to return after injecting two additional batches, CD and EF. Once the AB batch returns, requests A and B are selected and scheduled again because they are the earliest arriving requests in the request pool.

2613

3.2 Sarathi-Serve

3.2.1 Challenges

While ORCA is excellent, it still has two problems: low GPU utilization and the pipeline may still cause bubble problems.

GPU utilization

Let’s look at an experiment conducted by sarathi-serve. The left and right graphs depict the processing time and computational intensity of the prefill and decode stages under different batch sizes. We can observe the following:

Prefill: Increasing the batch size does not significantly change the speed. Even with a batch size of 1, it saturates GPU computation and results in a nearly constant time per token across batch sizes.
Decoding: Increasing the batch size results in a very pronounced linear decrease in processing speed. This is because decoding is memory-bound, and the computational power used in the decoding stage is often insufficient. Therefore, increasing the batch size not only utilizes more computational power but also merges multiple reads into a single read, thus reducing processing speed. When the batch size reaches a certain threshold, the rate of speed reduction also reaches its bottleneck. With increasing batch size, the incremental cost of linear operators used for decoding becomes almost zero. Attention costs do not benefit from batch size adjustments because they are memory-constrained.

2614

assembly line

Although Orca can improve the problem of air bubbles in PP to some extent, it can still cause air bubble problems. Let’s take the following figure as an example:

2615

There are three types of bubbles in the diagram:

PB1 The bubble is caused by the inconsistent lengths of the prefill sequences in two consecutive micro-batches.
PB2 The bubble is caused by the difference in computation time between the prefill and decode stages.
PB3 The bubble is caused by the difference in decoding computation time between different micro-batches. This is because the length of the KV cache to be read is different when different micro-batches are decoding, which also leads to the difference in the time spent reading data.

3.2.2 Cause Analysis

The reason for the above problems is that the ORCA assembly batching process is relatively random. The number of requests for prefilling and decoding in a batch is uncertain; it is simply dynamically assembled according to a roughly first-come, first-served principle. This causes some issues.

If a batch has a large number of prefill requests or encounters very long system prompts, the prefill tokens will consume a lot of computing resources, making the entire batch compute-bound.
If a batch has a large number of decoding requests (for example, when all requests have not finished inference, or when there are no new sequences to schedule in the request queue), the batch may become memory-bound.

Therefore, we need a mechanism to ensure a balance between prefill and decode requests in each batch. In other words, assuming the number of tokens a batch can hold (i.e., the upper limit of GPU computing power) is fixed, we need to find a scheme to allocate prefill and decode tokens in a certain proportion, maximizing batch performance by overlapping these two types of requests and resolving the trade-off between throughput and latency. Under this approach, the prefill sequence must be broken down; this is what Sarathi proposed as Chunked-prefills.

3.2.3 Chunked-prefills

The core idea of the Chunked-prefills scheme is to split long sequences of prefill requests into smaller chunks of roughly equal size, and then construct a hybrid batch consisting of prefill chunks and decoders. In other words, the Chunked Prefill strategy prefills prompts of varying lengths by splitting them into chunks of uniform length, thus preventing long prompts from blocking other requests. Simultaneously, it utilizes the gaps between these chunks for decode insertion/piggyback operations, thereby reducing latency and improving overall throughput.

The overhead of the decode phase comes not only from retrieving the KV cache from GPU memory but also from extracting model parameters. However, through this piggyback approach, the decode phase can reuse the model parameters extracted during prefilling, almost transforming the decode phase from a memory-intensive operation to a computation-intensive one. Therefore, this constructed hybrid batch has near-uniform computational requirements (and increases computational intensity), allowing us to create a balanced micro-batch schedule, mitigating imbalances between iterations, minimizing GPU pipeline bubbles, and improving GPU utilization. It also minimizes the impact of computing new prefilling on the ongoing decoding’s TBT (Time-Between-Tokens), thus achieving high throughput and low TBT latency.

contrast

The figure below shows the timelines for the traditional approach, the ORCA approach, and the Sarathi approach.

The default mechanism (Figure (a) below) only performs batching at the request level. In this case, ready requests are processed in batches, but the scheduler only receives a new batch before all requests in the currently being processed batch have completed. Since requests may have a long token generation phase, this can result in long waiting times for requests to arrive in between, leading to high TTFT and high E2E latency.
Continuous batching (Figure (b) below) is an optimization of the default mechanism. In this case, scheduling decisions are made before each forward pass of the model. However, any given batch either contains only requests in the prompt phase or only requests in the decode phase. The prompt phase is considered more important because it affects the TTFT. Therefore, waiting prompt requests can preempt the decode phase. While this results in a shorter TTFT, it significantly increases the TBT, thereby increasing E2E latency.
Hybrid batching (Figure (c) below) is the Sarathi scheme. With this batching, scheduling decisions are made at each forward propagation, allowing the prefill and decode phases to run together. This reduces the impact on TBT, but doesn’t completely eliminate it, as the decode phase, scheduled alongside the prompt phase, will experience a longer runtime.

2616

Note: Here, ORCA refers to subsequent ORCA-like optimization schemes.

accomplish

To use pre-padding to accompany decoding, we need to be aware of two things.

First, we need to determine the maximum possible batch size of the decoding that can be carried, and the number of pre-padded tokens that make up the pre-padded block. Secondly, in order to truly utilize the GPU saturation pre-filling computation of hybrid batches to improve decoding efficiency, we need to merge the pre-filling block and the linear computation of batch decoding into a single operation. In addition, the key to dynamic segmentation is to divide the long pre-padding into smaller chunks, thereby combining the pre-padding chunks with multiple decoding tasks to form batches and fully utilizing the GPU. This process is called piggybacking.

To use pre-padding alongside decoding, we need to be mindful of two things. First, we need to determine the maximum possible batch size for the decoding that can be carried and the number of pre-padding tokens that make up the pre-padding block. Second, to truly leverage the GPU-saturated pre-padding computation of hybrid batches to improve decoding efficiency, we need to merge the linear computation of the pre-padding block and batch decoding into a single operation.

A crucial aspect of this implementation is determining the chunk size. Sarathi offers two chunk size strategies: “fixed” and “dynamic”.

Fixed Strategy: This strategy calculates the maximum number of tokens per batch that maximizes GPU utilization based on hardware and profilling experiments. This is the total token quota for the batch, which is kept as constant as possible during operation. The number of prefill tokens changes with the increase or decrease of decode tokens, but since the number of decode tokens is generally not large, the number of prefill tokens and the overall batch token quota will not differ significantly.
Dynamic Strategy: This strategy aims to reduce the number of prefill tokens for a request as the number of iterations increases. This is because if a prompt is particularly long, it will consume a lot of computing resources in each iteration, thus affecting the accumulated decode sequence and new requests. Therefore, for such long-sequence requests newly entering the batch, Sarathi will initially allocate a larger prefill token quota, and then reduce this quota as the number of iterations increases, reducing its impact on other iterations.

The operation process of the two-stage pipeline with chunked-prefills is as follows:

2617

The specific steps are as follows:

The maximum number of tokens that can be processed in each batch is determined based on the GPU’s performance. This corresponds to number 1 in the diagram above. The diagram shows four requests (A, B, C, D), which are each split into chunks.
When the system first starts, the batch contains only the prefill sequence. The system pre-allocates key-value cache space based on the length of the entire prefill, ensuring sufficient key-value cache space in subsequent iterations of this prefill. (Corresponds to number 2 in the diagram above.)
Add sequences that need decoding to the batch until the KV cache space is insufficient (because decoding operations require a corresponding KV cache). This corresponds to icon 3.1 in the diagram. Simultaneously, based on the remaining tokens in this batch, chunk the sequences that need prefilling and add the corresponding prefill tokens to the batch. This corresponds to icon 3.2 in the diagram.

For example, the above image shows $A_{d1}$ this requires iterative decoding. For further processing $A_{d1}$ at this point, a quota of 1 token needs to be allocated to A’s batch. $A_{d1}$ at the same time, it also needs to find request C in the waiting queue according to the FCFS (First-Come, First-Served) principle, and split C into segments according to the quota ratio. $C_{p1}$ and $C_{p2}$ , and then $C_{p1}$ add it to the batch.

After each step of the reasoning process, the scheduler reassembles the batch because Sarathi-Serve still uses iteration-level schedules.

Let’s revisit the diagram above and examine the differences between Orca and Sarathi after A has completed its prefill.

When hardware resources allow, Orca will have CD perform prefill while AB continues decoding. However, because the complete sequence of decoding and prefill is bound together, the overall decoding computation time increases. Therefore, this can actually be considered a kind of decoding pause.
sarathi-serve also allows decoding and prefilling to be done together, but it achieves almost no latency in the decoding phase by properly controlling the number of prefill tokens in each batch. This balances latency and throughput.

analyze

In the Linear layer, Decode and Prefill can be executed together, but separately in the Attention phase. However, in many scenarios, the Attention time is relatively small. This makes Decode and Prefill a computationally intensive task. In this case, Decode should aim for the largest possible batch size. However, this also introduces higher latency for Decode. Therefore, a balance needs to be struck between throughput and latency. In Prefill, after reaching the GPU computing bottleneck, longer sequences cause a sharp increase in Prefill latency. Therefore, after this inflection point, Prefill throughput no longer increases. This value is the Chunk Size we need to find. The total Token Size of Decode + Prefill should be less than or equal to this Chunk Size, but this Chunk Size is not easy to find. Furthermore, the request density of Decode and Prefill may not achieve the perfect ratio. The ideal ratio is where Decode can be batch-executed in each round of Prefill execution. If the Decode density is high, Decode might run alone. If the prefill density is high, then the prefill will run independently.

Additionally, chunked prefilling slightly increases the overhead of prefilling. This is because when computing subsequent chunks, the key-value pairs corresponding to these chunks need to be continuously read from GPU memory into the kernel. Without chunked prefilling, these key-value pairs remain in the kernel. The advantage of chunked prefilling is that by piggybacking the decoding to the chunk’s buffer, the model parameters loaded in the prefilling stage can be directly reused. This almost transforms decoding from a memory-bound operation into a compute-bound operation.

3.2.4 Chunked-prefills vs. Separate Inference Architecture

We can see that by using a chunked-prefills strategy and appropriately dividing the ratio of prefill tokens to decode tokens, it seems possible to simultaneously preserve TTFT and TPOT/TBT, making good use of the GPU. So, under these circumstances, what advantages does a split inference architecture still offer?

In fact, with the introduction of Chunked Prefill, the boundary between Prefill and Decode nodes has become blurred. The Mooncake paper points out that the necessity and best practices for designing a separate resilient prefill pool are still controversial. With the introduction of chunked prefill, is this separation still necessary? Because Chunked Prefill has two obvious advantages: 1) Without separation, all nodes are treated equally, making scheduling easier; 2) Embedding chunked prefill in the decoding batch can improve the computational intensity of the decoding batch, thereby achieving better MFU.

Mooncake retains its separate architecture (using independent Prefill nodes). Request prefilling is only inlined into the decoding batch if the request can be forwarded without chunking (e.g., for very short prompts that can be added directly to the decode’s continue batch to improve MFU) and without affecting TBT SLO. This decision has two main reasons: 1) Prefill nodes require different cross-node parallel settings to handle long contexts. 2) It provides a unique opportunity to conserve VRAM.

2618

There may also be the following reasons:

Block pre-filling incurs computational overhead, so choosing a block size significantly lower than the GPU saturation point will prolong the execution time of the pre-filling task.
There’s still a possibility that the prefill stage might not maximize MFU (Maximum Validity). This is because in chunk-prefill, we only estimate the maximum token quota for a batch on a specific device using profiling; these tokens include both prefill and decode tokens. This size is for the whole, not for each individual prefill or decode token. Furthermore, if the sequence length is not a perfect division of the tile size, additional computational overhead will occur.
Even if the block size is optimized to almost maximize GPU utilization, block pre-filling significantly increases the memory access volume of the pre-filling task because the KV cache needs to be loaded from the GPU’s HBM into SRAM for each subsequent block. Furthermore, long sequences can persistently occupy KV cache storage space and GPU computing resources.
As for TPOT, merging pre-filling and decoding in a batch will actually slow down all of these decoding tasks.

In summary, while block prefilling can help maximize overall throughput, dynamic segmentation cannot completely decouple prefilling and decoding operations, leading to resource contention and compromises between TTFT and TPOT. Decoupling becomes a better option when applications cannot make trade-offs between TTFT and TPOT but must comply with both simultaneously.

Therefore, based on these conjectures about the shortcomings of the chunked-prefills strategy, perhaps using a separate architecture and developing a separate strategy for the prefill phase could more effectively address these issues. Of course, this also depends on the specific implementation of each strategy, the business scenario, and the actual experimental results.

Next, let’s take a look at the split inference architecture.

Note: This author is not suggesting that a separate inference architecture is necessarily better than a converged one; rather, each has its advantages and disadvantages. For example, the paper “Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation” points out that the separation of PD (Programmable Detection) in LLM serving systems leads to a serious waste of GPU resources. Specifically, the utilization of HBM (Hardware Bus Memory) capacity and bandwidth of the GPU is low during the computationally intensive Prefill phase. The memory-intensive Decode phase also faces the problem of low utilization of computing resources. The paper also proposes an improvement solution, Adrenaline.

0x04 Separation Method

4.1 Overall Approach

The logic behind the separation scheme is simple: decoupling compute-bound and memory-bound pre-filling into different GPUs allows for independent optimization in both hardware allocation and parallelization strategies, handling TTFT and TPOT/TBT separately, while simultaneously improving throughput and reducing latency. This eliminates the need for constant trade-offs between the two, as in merged inference architectures, naturally solving the two problems mentioned above.

There is no interference between prefilling and decoding, allowing both stages to reach their respective SLOs more quickly. Prefilling and decoding do not affect TTFT and TPOT.
Resource allocation and parallelism strategies are decoupled, allowing each stage to scale independently by tailoring resource allocation (number of GPUs), parallelism strategies, and optimization strategies for pre-filling and decoding, thereby ensuring maximum throughput for each GPU. For example, more GPUs can be configured for long sequences, and fewer GPUs for short sequences.

2619

We will use a decoupled architecture as a starting point to discuss the benefits of decoupling the prefill and decode processes:

4.1.1 Framework

So what does a decoupled framework look like? Let’s look directly at a simplified architecture diagram provided by DistServe, which illustrates how requests are processed in this decoupled system. In this architecture, prefill and decode no longer share a single GPU/group of GPUs, but are distributed across different GPUs, each belonging to a different instance (Prefill instance and Decoding instance). A prefill instance contains a complete model, which may occupy one or more GPUs. The decode instance has a similar configuration, except that it no longer shares GPUs with the prefill instance. The processing flow of this service architecture mainly consists of the following three steps:

Prefill: When a request enters the system, it is first sent to a prefill instance. The prefill instance performs a prefill operation on the newly arrived request, obtaining the first tokens and the KVCache corresponding to the prompt;
Migrate: Migrate the prefilled request and its corresponding KVCache to the Decode instance;
Decode: Decode instance performs Decode operations on migration requests in a loop. Once generation is complete, the request leaves the system.

The three stages combined form a pipelined parallelism. GPUs from different stages are placed in different partitions for use, and data is then fed from prefill to the KV cache, and then to decode. The KV cache is used to pass intermediate results. From a unified perspective (treating the KV cache as memory management), this is a distributed system.

2620

Intuitively, a discrete architecture, compared to a merged architecture, loads an additional copy of the model (consuming GPU memory) and involves KV cache transfers between GPUs (which is time-consuming), seemingly making it worse than a merged architecture. So why do people still use this architecture? Let’s answer that question.

4.1.2 Argumentation

Next, we’ll examine how the researchers argued their case from different perspectives. We’ll start by looking at the observations and insights of DistServe and TetriInfer, and then analyze it from the perspective of KV Cache.

DistServe

The following diagram illustrates how P90 (achieving 90% SLO completion rate), TTFT, and TPOT change as the request rate increases when providing services using an existing system. The specific experiment used a single A100 80G GPU to run a 13B LLM, with input_length = 512 and output_length = 64. (See the diagram.)

The horizontal axis represents the number of requests arriving at this GPU per second, denoted as rps (requests per second).
The blue line uses a PD merging architecture for experimentation. If we simultaneously achieve the set TTFT SLO and TPOT SLO, our maximum RPS = 1.6, which we also denote as goodput.
Yellow line: When only the GPU handles prefill requests, its goodput = 5.6.
Green line: When only the GPU is allowed to handle decode requests, its goodput = 10.

It can be observed that the goodput of a card performing only prefill or only decoding is higher than the goodput of a card performing both prefill and decoding. Therefore, we can estimate that if we allocate 2 GPUs for prefill and 1 GPU for decoding, we can effectively provide the model with an overall throughput (goodput) of 10 rps, or an average goodput of 3.3 rps per GPU, which is 2.1 times higher than the combined goodput (value of 1.6).

To put it bluntly from another perspective: in a discrete architecture, our three cards can handle 10 reqs/s of traffic; while in a consolidated architecture, since a single card can only handle 1.6 reqs/s, we need six cards to handle 10 reqs/s of traffic. This shows that in order to meet latency requirements, a consolidated architecture must over-provision computing resources.

2621

The above experiments demonstrate that putting prefill and decode together leads to strong mutual interference, coupled resource allocation between them, and prevents the implementation of different parallelism strategies that are better suited to meet the specific latency requirements of each stage.

TetriInfer

There are many ways to interact with an LLM, ranging from simple chat to more complex downstream tasks such as document summarization and content creation. These interactions often differ in nature. For example, summarization tasks typically have longer input hints and shorter generated tokens, while context creation tasks are the opposite. Token lengths for different downstream tasks can vary by more than two orders of magnitude. To study the performance of these inference requests running concurrently, the authors of TetriInfer categorized inference requests into two dimensions (pre-padded and decode length) and one attribute (lightweight or heavyweight), resulting in four different request types: heavyweight pre-padded, lightweight pre-padded, heavyweight decode, and lightweight decode. Here, heavyweight pre-padded refers to longer token lengths, while lightweight pre-padded refers to shorter token lengths. Lightweight decode refers to decode requests that generate a small number of tokens, such as fewer than 100. Heavyweight decode refers to decode requests that generate a large number of tokens, such as more than 512 tokens.

This paper conducted extensive testing by mixing prefill and decode requests of varying lengths, observing severe mutual interference across all combinations. The figure below illustrates the effects of combinations such as prefill and prefill, prefill and decode, and decode and decode.

Hybrid pre-fill requests. When the total number of tokens in a batch exceeds the accelerator saturation threshold, running pre-fill requests will cause severe slowdowns because we are continuing to add compute-intensive jobs to already saturated hardware.
Prefill and Decoding. Mixing prefill and decoding requests negatively impacts both because we’re running batch processing and latency-critical jobs simultaneously. Even a single other heavyweight prefill request in the same consecutive batch can increase decoding latency by 5x! Once more than 7 lightweight decoding requests are running concurrently, prefill latency increases. Both latency increase by approximately 2.5x.
Decoding and mixed decoding requests. Mixed decoding requests can lead to decreased throughput because we lack information about memory bandwidth and capacity usage, resulting in contention and queuing congestion. Furthermore, adding heavyweight decoding requests can significantly impact throughput and latency compared to batches of all lightweight decoding requests.

2622

The fundamental reason is simple: current LLM systems are completely unaware of the distinct characteristics exhibited by the LLM pre-filling and decoding phases. The pre-filling phase resembles a computationally intensive batch processing job, while the decoding phase resembles a memory-intensive, latency-critical task. Trying to reason about both in the same way will only cause them to interfere with each other.

KV Cache

Readers may wonder: because of the decoupling, intermediate state (i.e., KV Cache) needs to be transferred between the prefill and decoding GPUs. This seems like a bottleneck, because when the model is too large, a single machine may not be able to fully accommodate a Prefill Worker and a Decode Worker. For such a model, moving the KV Cache from the prefill instance to the decoding instance incurs significant overhead, and the migration time increases dramatically.

We will now analyze this using DistServe. Regarding the KV Cache transfer issue, to avoid increased migration overhead, workers need to be strategically placed. The DistServe authors effectively hide the KV Cache transfer overhead by carefully placing pre-filling and decoding workers, leveraging high-bandwidth networks. The specific strategy is: because the KV Cache is stored layer by layer, the model can be divided into multiple segments, each placed on a different machine. Different segments of the model use parallel processing (PP), and then a single-machine model parallelism strategy is used to search for models within the same segment. This ensures that:

Cross-machine transmission only occurs between PP layers.
KVCaches at the same layer as Prefill and Decode Worker are located on the same machine, and Prefill and Decode Worker can use the intra-node NVLINK bandwidth for transmission.

This significantly reduces transmission latency. Furthermore, the larger the model, the longer the sequence, and the higher the bandwidth of multi-card communication devices, the lower the proportion of KVCache migration overhead.

The left image below shows the latency breakdown when using DistServe to provide OPT-175B on the ShareGPT dataset. The right image shows the CDF performance of KV Cache transfer time for the three OPT models. It can be seen that the transfer problem has been greatly alleviated. Through proper placement, the KV Cache transfer overhead can be effectively minimized, even to less than the time of one decoding step.

2623

Configure separately

Prefill and Decode have very different characteristics, so many of their parameter configurations differ. These include batching strategies, quantization types, total throughput (TP) size, total output (PP) size, task timeout, retry time, memory allocation strategies, RDMA hardware and software queue sizes, and whether rollback is possible. Let’s look at how to configure these using Alibaba RTP-LLM experience.

Batch size. After separation, the batch size of the prefill itself will still not be very large. To fully utilize the capabilities of the decoder, we should configure more prefills and reduce the number of decoder instances, enabling the decoder to perform large batch inference.
Resource allocation: For Prefill, choose a card with strong computing power; for Decode, prioritize a card with large video memory.
Quantization approach: FP16/W8A8 model files can be deployed on the Prefill machine to achieve relatively good Prefill performance; while the Decode machine can freely choose the W4A16/W4A8 approach to achieve better overall performance.

We will continue our study using several papers.

DistServe eliminates interference between pre-filling and decoding computations by allocating them to different GPUs. To address the application’s overall latency and per-token runtime requirements, DistServe tailors resource allocation and parallelism optimizations for each stage. Furthermore, DistServe determines how to deploy these two stages based on the service cluster’s bandwidth to minimize communication caused by task decomposition.
SplitWise features a distributed scheduling strategy, PD separation, hierarchical KV cache transmission, and adds a third machine pool specifically for handling mixed batch processing in the Prefill and Decode stages, and can flexibly adjust its size according to real-time computing needs.
MemServe: Manages KV Cache from a unified distributed perspective.
TetriInfer combines block prefilling and two-phase splitting with a predictive two-phase scheduling algorithm to optimize resource utilization, and only performs block processing on prefill. The static parallelism and partitioning strategies of SplitWise, DistServe, and TetriInfer are inflexible and cannot handle dynamic workloads.
Mooncake has built a PD separation scheduling cluster and inference architecture centered on KVCache storage and based on RDMA, forming a Prefill/Decode Pool and a distributed heterogeneous media KVCache Pool; it integrates cache awareness, load balancing and a service level objective (SLO) oriented decision-making mechanism.

4.2 DistServe

DistServe has the following features:

By assigning pre-filling and decoding computations to different GPUs, interference between the two is eliminated.
Based on the application’s TTFT and TPOT requirements, DistServe has customized a common optimization strategy for resource allocation and parallelism for each stage.
The deployment of these two phases is determined based on the bandwidth of the service cluster to minimize communication caused by task decomposition.

4.2.1 Analysis

Let’s first look at the requirements of pre-padding and decoding for parallel strategies.

Parallel partitioning

Common parallel techniques are as follows:

Data parallelism: weight replication, input data is distributed across different machines.
Model parallelism involves splitting the weights, and sometimes the input data also needs to be split along with the weights.
Pipeline parallelism: Running different subgraphs on different devices together to form a pipeline. This requires splitting the input data into batches.

Alpa proposed another partitioning dimension, namely the inter-operator and intra-operator parallel partitioning method, which distinguishes different parallelisms by “whether the tensor dimension is partitioned”.

Intra-operator parallelism: This refers to the parallelism methods that divide the tensor dimension, including data parallelism and operator parallelism.
Parallelism between operators: without splitting the tensor, the subgraphs are simply arranged and distributed in different ways, including pipeline parallelism.

2624

prefill example

For prefill instances, our optimization goal is to meet the TTFT latency requirements of the service using minimal resources.

The pre-filling step is typically computationally intensive. Once the GPU becomes computationally constrained, adding more requests to the batch will not improve GPU utilization but will proportionally increase the overall batch processing time, unintentionally delaying all requests. For pre-filling instances, it is necessary to analyze the LLM and GPU beforehand to determine the critical input length threshold. Exceeding this threshold will cause the pre-filling stage to become computationally constrained. Adding more requests to the batch should only be considered when the planned input length is below the threshold.

Therefore, the appropriate parallel strategy for prefill instances is as follows: at lower rates (req/s), execution time is the primary factor, so intra-operator parallelism is more efficient. However, as the arrival rate increases, queuing delays become more significant, at which point inter-operator parallelism is superior.

decode instance

For decoding instances, our optimization goal is to meet the application’s TPOT requirements with minimal computational resources.

Since individual decoding jobs are severely bandwidth-constrained, batching is key to avoiding low GPU utilization (and therefore high throughput per GPU). Disaggregation can distribute multiple pre-filled instances to a single decoding instance. This approach allows for the accumulation of larger decoding phase batch sizes on dedicated GPUs without sacrificing TPOT. After disaggregation, the decoding batch size may be limited by GPU memory capacity because a KV cache needs to be maintained for all active requests.

Intra-op reduces latency, but this decreases as communication and utilization decline after partitioning, while inter-op scales throughput almost linearly. Therefore, the appropriate parallel strategy for decode instances is: when TPOT and SLO requirements are stringent, intra-op is crucial for reducing TPOT to meet latency targets, while inter-op is better suited for linearly increasing throughput.

4.2.2 Optimization Directions

Because of the different characteristics of prefill and decode, different optimization directions exist within a separate framework. The first column of the table below shows the characteristics of prefill, the second column shows the characteristics of decode, and the third column shows the optimization directions. For example, for storage, we can use different hardware models; for batch strategies, we can use different strategies for prefill and decode, optimizing them separately.

prefill	decode	Optimization direction
After calculating the KV cache and sending it to the decode stage, a strategy can be used to clear the KV cache.	The key-value cache is frequently read during the token generation process, so it is necessary to save the key-value cache as much as possible.	Independent optimization of compute and storage
Because the prefill stage is compute-bound, the throughput growth trend tends to level off as the batch size increases.	Because the decode phase is memory-bound, the throughput increases more significantly with increasing batch size. Increasing the batch size increases computational intensity, thereby increasing throughput.	Batch strategy independent optimization
Under different conditions, there is a preference for parallel processing: when the rate (`req/s`) is small, TP is suitable; when the rate is large, PP is suitable.	As the number of GPUs increases, the PP method can produce higher throughput; the TP method can produce lower latency (processing individual requests faster).	Parallel Strategy Independent Optimization

Therefore, given the model, workload characteristics, latency requirements, and SLO achievement target, DistServe needs to determine:

Parallelism strategy for pre-filling and decoding instances
Number of deployments per instance type
How to place them on a physical cluster

The ultimate goal is to find a configuration strategy that maximizes the throughput of each GPU.

4.2.3 Algorithm

Next, let’s see how DistServe searches for the optimal parallel configuration of Prefill and Decode models.

Given different cluster settings, a key design consideration is managing communication between the pre-filling and decoding phases of the decomposition process. DistServe uses two algorithms: one for clusters with high-speed cross-node networks, and another for environments lacking such infrastructure, which introduces additional constraints. The general idea is to list the possible TP and PP configurations based on the constraints and given existing hardware resources such as the number of GPUs. Then, a modeler is used to calculate the Goodput of each card under each possibility, and the TP and PP configuration that maximizes the Goodput of each worker is selected.

How to place on a cluster with high node affinity

On clusters with high node affinity, typically equipped with InfiniBand, the transmission overhead of the KV Cache between nodes is negligible, allowing DistServe to efficiently deploy pre-filled and decoded instances without constraints. We propose a two-layer placement algorithm for this situation: first, we optimize the parallelism configuration of the pre-filled and decoded instances to achieve maximum per-GPU throughput at the stage level; then, we use replication to match the overall traffic rate.

Algorithm 1 outlines the process. We enumerate all feasible parallel configurations of prefill and decode instances under cluster capacity constraints. For example, for a specific prefill stage configuration, we use simu_prefill to simulate and find their maximum throughput (similar to using simu_decode for decoding). After determining the optimal parallel configuration of prefill and decode instances, we replicate them based on their throughput to achieve the overall throughput rate required by the user.

2625

Placement of low node affinity clusters

The algorithm above assumes that pre-padding and decoding can be placed between any two nodes in the cluster, and that KV cache transfers utilize high bandwidth. However, in many real-world clusters, the GPUs within nodes access data via high-bandwidth NVLINK, and the GPUs’ bandwidth is limited. This paper will next develop an algorithm to address this constraint.

The key to the algorithm is that the transmission of intermediate states only occurs between the corresponding layers of the pre-padding and decoding instances. The algorithm is as follows:

We first enumerate inter-layer parallelism to obtain all possible instance segments. Leveraging inter-layer parallelism, we group layers into stages and divide each instance into segments called instance segments, each segment maintaining a specific inter-layer stage. By placing pre-filled and decoded segments of the same stage on the same node, we force intermediate state transfers to occur only via NVLINK. However, for large models, an 8-GPU node might not be able to handle even a single pre-filled and decoded instance pair. We treat this as an additional placement constraint and optimize it along with model parallelism.
Given the limitation that each node typically has only 8 GPUs, for each segment, we enumerate all possible in-node configurations by calling get_intra_node_configs.
Then, we use the simulator to find the optimal configuration and replicate it to meet the target traffic rate.

2626

4.3 SplitWise

SplitWise and DistServe share similar ideas, as detailed below.

Phase identification: LLM inference requests are divided into two phases: prompt computation and token generation. The different phases are executed on different machines.
Hardware adaptation: Select the most suitable hardware for each stage and optimize resource usage.
Cluster design: Homogeneous and heterogeneous clusters were designed to optimize throughput, cost, and power consumption.
Scheduling strategy: A two-level scheduling system is used, including a cluster-level scheduler (CLS) and a machine-level scheduler (MLS), to optimize request allocation and machine resource management.

Therefore, we will only look at its special features, or rather, only look at the parts that we did not introduce in DistServe, in order to fill in the gaps.

4.3.1 Scheduling Scheme

When the lengths of the prompts arriving at a request differ, an imbalance in pressure arises between prefilling and decoding. Since overall throughput depends on the global resource utilization of P and D, neither cost nor performance is optimal when P is overloaded but D is idle, or vice versa. Therefore, load balancing between P and D needs to be considered. This can be achieved either by directly switching the roles of P and D at the entire node level, or by enabling P and D nodes to handle some mixed requests, such as through chunked prefilling.

First, let’s look at the solution the author provided. The following diagram shows the overall architecture of SplitWise.

First, the architecture diagram shows three resource pools: the Prompt Pool, the Token Pool, and the Mixed Pool. The Prompt Pool performs only the Prefill phase, and the Token Pool performs only the Decode phase, achieving separate deployment of Prefill and Decode. The Mixed Pool executes batches of Prefill and Decode in a merged deployment manner. When request load changes drastically, instances cannot scale quickly enough, or roles change, the Mixed Pool executes the mixed batches. Secondly, the architecture diagram also includes two levels of schedulers: the cluster-level scheduler (labeled 1 in the diagram) and the machine-level scheduler (labeled 2 in the diagram). The cluster-level scheduler (labeled 1 in the diagram) is responsible for routing incoming requests to specific machines and reallocating them. The machine-level scheduler (labeled 2 in the diagram) maintains the queue of requests to be processed and manages the batch processing of requests on each machine.

2627

Cluster-level scheduling

The cluster-level scheduler (labeled 1 in the diagram) is responsible for routing incoming requests to specific machines and reallocating machines. Its main functions are as follows:

Request routing. CLS schedules requests using a “Join the Shortest Queue (JSQ)” routing strategy. When a request arrives, CLS determines which instance should be prefilled and then decoded. Each machine reports any changes to its memory capacity or pending queue to CLS.
Machine Management. CLS primarily maintains two machine pools: an instant machine pool and a token machine pool. Splitwise initially assigns machines to a pool based on the input/output token distribution and expected load (i.e., requests per second). If these values deviate significantly from the initial assumptions, Splitwise performs coarse-grained reallocation of machines, moving them between the instant and token machine pools.
Hybrid Machine Pool. To meet SLOs and avoid performance catastrophic drops under high load, Splitwise maintains a special hybrid pool, which acts as an intermediate buffer. Machines in the hint or token pool can be dynamically moved in and out of the hybrid pool. The hybrid pool dynamically grows and shrinks based on request rate and the distribution of input/output tokens, without noticeable pool switching latency. If CLS attempts to allocate immediate and token machines to requests using JSQ and finds that the queue for the selected machine exceeds a threshold, it will look for a target machine in the hybrid pool. Machines in the hybrid pool are identical to non-Splitwise machines, running in a hybrid batch processing manner. Once the hybrid request queue is finished, CLS transitions the machine back to its original pool. For example, when the queue is too long, we can move an immediate machine into the hybrid pool to run tokens; once the machine has finished running tokens, we transition it back to the immediate pool.

Machine-level scheduling

The machine-level scheduler (labeled 2 in the diagram) runs on each machine and is responsible for tracking GPU memory utilization, maintaining the queue of pending requests, selecting the batch size and batch requests for each iteration, and reporting the relevant status to the CLS.

Prompt machines. MLS uses a First-Come, First-Served (FCFS) scheduling algorithm to schedule requests. Because throughput begins to decline after a certain threshold is reached, MLS limits the batch processing of multiple immediate requests to a total of 2048 tokens. This is a configurable value and may vary for different models or hardware.
Token machines. MLS uses the FCFS scheduling algorithm to schedule tokens and batch processing as much as possible. Token generation throughput increases with batch size until the machine runs out of memory. Therefore, when the machine is close to running out of memory, MLS starts putting tokens into a queue.
Mixed machines. To meet the SLO of TTFT, MLS needs to prioritize immediate requests and immediately schedule any new immediate requests into the processing queue. If a machine is running the decode phase and has no capacity, MLS will preempt tokens. To avoid decode phase starvation due to preemption, we increase the priority of older tokens and limit the number of preemptions that can be performed per token.

4.3.2 KV Cache Transmission

In Splitwise, we need to transfer the KV Cache from the prompt machine to the token machine for inference. The KV Cache is generated during the request prompt phase and grows continuously during the token generation phase. Therefore, the authors propose that the biggest overhead introduced by their split architecture is the cost of migrating the KV Cache from the Prefill phase to the Decode phase. This transfer latency is the main overhead associated with Splitwise. In this section, we will discuss the impact of KV cache transfer and how to optimize it.

2628

There are two ways to transfer active KV Cache from prefill to decode instance: by layer or by request.

The request-based method transmits the KV Cache after the prefill phase is complete. The left side of the diagram above uses an unoptimized, serially running linear KV Cache transmission. The time experienced by a single request batch includes the prompt time, the time waiting for the KV Cache transmission, and the time for token generation. The KV Cache transmission only begins after the immediate phase is complete and the first token is generated. Furthermore, it needs to be completed before the next output token can be generated. This directly impacts the maximum TBT and end-to-end latency of inference. The transmission time depends on the size of the KV Cache (proportional to the number of immediate tokens) and the interconnect bandwidth between the immediate machine and the token machine. Even with fast InfiniBand connections between machines, the overhead of the KV Cache can easily become a significant part of the TBT for large immediate sizes.

The layer-by-layer approach transmits the KV cache after computation is completed at each layer. The right side shows the asynchronous KV cache transmission optimized by SplitWise. Because pre-filling is processed layer by layer, the KV cache for that layer can be sent out after computation at that layer is completed. Therefore, the transmission and dumping of the KV cache can be overlapped with computation. Before attention computation begins at each layer, the model waits for the asynchronous loading of that layer’s KV cache to complete and triggers the asynchronous loading of the next layer’s KV cache. After attention computation is completed, asynchronous storage of that layer’s KV cache is initiated. This transmission overlap allows the execution time of pre-filled instances to be approximately equal to the KV cache loading time or the standard pre-filling time, depending on the relative proportion of the prefix cache. Splitwise found that the layer-by-layer approach is superior to the request-by-request approach because the layer-by-layer approach overlaps computation and communication.

Since the number of tokens in a batch is known at the start of the computation, Splitwise selects the optimal KV Cache transfer technique.

On the one hand, for larger prompts, Splitwise uses layer-by-layer transfer. Layer-by-layer KV cache transfer is performed in parallel with the fast computation of the next layer. This requires fine-grained synchronization at each layer to ensure correctness. Therefore, it may introduce performance interference and increase TTFT.
Therefore, for small prompts, Splitwise uses serialized KV cache transmission. This is because the KV cache for small prompts is small, eliminating the need to pay the overhead of fine-grained layer synchronization required for each layer of transmission.

4.4 MemServe

MemServe manages distributed CPU DRAM and GPU HBM resources across service instances through its MemPool system. It manages active and historical KV caches via a comprehensive distributed memory pool API. MemServe provides a token-based fast indexing layer for historical KV cache retrieval, abstracts away cross-instance data exchange mechanisms for hardware heterogeneity, and includes a global scheduler implementing a localized policy based on hint trees to enhance cache reuse. These policies work together to significantly improve job completion time and first-token performance.

Because technologies such as context caching and distributed inference have optimized and extended the lifespan and domain of KV caches, LLM services have transitioned from stateless to stateful systems, necessitating a new architectural approach. To address this issue, the authors propose MemServe (Memory-enhanced model Servin), which handles optimizations between and within requests within a unified system. While context caching and distributed inference primarily optimize from a pipeline perspective, the MemServe approach manages the KV cache as distributed memory. This provides a unified distributed system perspective, where both the decode and prefill machines interact with the pool, moving beyond a purely pipelined logic.

MemServe has the following features:

To address the challenges of managing key-value caches across distributed instances, MemServe introduces a cross-service elastic memory pool, or MemPool, for managing distributed memory and key-value caches. It serves as the foundation for managing all cluster memory, including CPU DRAM and GPU HBM. In other words, MemServe abstracts the MemPool component and then builds distributed inference as a use case for MemPool.
MemPool provides a rich set of APIs for managing distributed memory and key-value caches. Using these APIs, MemServe combines context caching with distributed inference.
To maximize the reuse of the KV Cache, MemServe employs a global scheduler that uses a novel local awareness strategy based on a global hint tree to enhance cache reuse.

4.4.1 Motivation

Techniques for optimizing KV cache using dependencies in LLM can be divided into two types: cross-request and intra-request.

In-request optimization: This type of optimization leverages dependencies within a single request to enhance performance. Two notable examples are distributed inference (splitting a request into two sub-requests for better scheduling) and sequence parallelism (splitting a request into multiple sub-requests to distribute the load).
Inter-request optimization: This type of optimization leverages dependencies between requests to achieve better performance. Context caching is the only known technique in this category. To build a context cache, the model stores and reuses a key-value cache from a self-attention mechanism to avoid redundant computations between similar or duplicate requests. This is particularly useful when multiple requests share a common prefix or context.

A common theme among these dependency exploitation techniques is that they require novel logic to manage and transport KV caches.

Within the request-based method, the scope of the KV cache needs to be extended from a single instance to distributed instances. This requires an efficient mechanism to transfer the KV cache between instances, as is the case with distributed inference and sequence parallelism.
The inter-request approach needs to extend the lifecycle of the KV cache from a single request to potentially indefinite, i.e., across requests. This relies on two mechanisms: First, an index is needed to find dependencies between requests, thereby identifying the retained KV cache. Second, the inference engine and attention core need to be modified to reuse historical KV caches.

The current solution lacks a mechanism to manage the intermediate KV cache data of the LLM, which presents two key problems.

LLM service systems cannot simultaneously apply any existing inter-request and intra-request dependency utilization optimizations. Current context caching (inter-request) approaches were not designed with intra-request scenarios in mind. Therefore, separate inference (intra-request) cannot benefit from context caching because it lacks a mechanism to leverage the KV cache to return from decoding to a pre-populated instance for future reuse. Similarly, sequence parallelism distributes the KV cache across multiple instances, but lacks the mechanisms and algorithms needed to retain and reuse it. This problem arises because intra-request techniques decompose a tightly coupled request into multiple loosely coupled sub-requests, complicating KV cache management in distributed environments.
The LLM service system lacks a comprehensive, top-down design to effectively leverage existing inter-request techniques. Context caching reuses historical key-value caches by running requests with the same prefix within the same service instance. However, the current LLM service system schedules requests across multiple service instances based on load or session ID, which fails to maximize key-value cache reuse across sessions.

These problems arise because existing LLM service systems are built on the assumption that the KV cache is merely intermediate data for a single request on a single instance. With the emergence of new dependency exploitation technologies, the lifecycle of the KV cache has lengthened, and its management scope has expanded to distributed settings. This paradigm shift necessitates a fundamental rethinking of the LLM service architecture.

4.4.2 Scheme

MemServe is designed as a large-scale LLM service system capable of efficiently handling inter-request and intra-request optimizations. It comprises three main components: a global scheduler, multiple types of inference instances, and a resilient memory pool (MemPool). MemServe describes itself as follows: “We propose Memory-enhanced model Serving, or MemServe, to handle inter-request and intra-request optimizations within a unified system.”

Here are a few key points:

Unified system. This involves several instance types: P-only, D-only, and PD-colocated machine types (PD means prefill-decode). At the same time, during operation, it is divided into (1) PD-colocated (only using the third type of instance), (2) PD-colocated with caching, (3) PD-disaggregated (using the first two types of instance, or using them together), and (4) PD-disaggregated with caching.
Memory-enhanced. A resilient memory pool, MemPool, is introduced under the distributed inference setting to manage distributed memory and KVCache across service instances.

2629

Elastic memory pool

MemPool is the core component of MemServe, providing three types of APIs: memory, indexing, and distributed data transfer. It runs within each inference instance, using a fixed-size memory allocator to manage all local memory.

The memory pool manages all memory in the inference cluster, including CPU DRAM and GPU HBM. The memory pool runs on each inference instance and collectively provides a set of distributed memory pool APIs. It manages the active KV cache used by ongoing requests and the historical KV cache retained after requests are completed.

The diagram below illustrates the use cases enabled by MemPool. Circle 1 represents context caching. Circle 2 represents decomposed inference. Circle 3 represents sequence parallelism. The gray solid line represents MemPool index API calls. The solid line represents the MemPool distributed API. MemPool supports all use cases on a single platform.

2630

index

As the KVCache grows larger, the layout within the memory pool becomes a significant issue. Therefore, MemPool uses internal indexes to map hint tokens to the KV history cache, managing the active KV cache for ongoing requests as well as the historical KV cache retained after requests are completed, ensuring efficient retrieval of cached data.

With an index, memory retrieval relies on it; essentially, it adds an ID to each page in pagedattention, allowing other retrieval mechanisms to optimize the process. The paper presents a global prompt trees structure for this optimization.

Whenever the engine invokes operations such as insert, match, or delete, MemPool traverses the index. There are three indexing methods in the LLM service world: tag-based, session-based, and document ID-based. Tag-based indexing is highly versatile because it works for any shared hint prefix situation. Session and document ID indexing are simpler, but shared hints can only be reused within a chat session or when the same document is used across sessions. This paper adopts a tag-based indexing method for broader applicability. To implement this index, MemPool utilizes the radix tree proposed by SGLang.

Scheduling

MemServe’s global scheduler is responsible for the entire framework and scheduling, and also maintains global prompt trees and locality-aware scheduling policies to optimize memory management and KVCache management. Each node maintains a distributed global tree-structured cache. During scheduling, the requested prompt word is queried globally across all types of trees. The policy module then selects the instance with the longest common prefix (i.e., the largest retained historical KVCache) based on the distributed load, achieving optimal retrieval and access efficiency.

2631

Transport API

MemPool provides a simple data transfer API that abstracts away three heterogeneities: parallelism, network, and memory media. MemServe bridges the gap between context caching (between requests) and distributed inference (within requests) in four steps using the MemPool API: (a) first, distributed inference is replicated using the distributed API; (b) then, a cache is added to pre-populated instances using the indexing API; (c) the same cache is applied to decode-only instances; and (d) finally, decode-to-pre-populated data transfer is enabled. The diagram below illustrates how the MemPool API enhances decomposed inference through context caching. The engine box refers to a tuned inference engine, such as vLLM. The circled numbers represent the steps taken to build the solution. A-KV is the active KV cache. H-KV is the historical KV cache.

2632

Next, we’ll look at how to gradually build a complete design using four different levels of caching design mechanisms.

PD-Basic. This is a naive/vanilla mechanism, essentially the basic de-aggregation inference architecture proposed by DistServe and Splitwise. It serves as an optimized baseline.
PD-Caching-1 enables caching only on P-only instances. Active KV caches are retired to historical KV caches so that future inference can utilize the saved data to reduce recomputation (step 2 in the diagram above). This caching design only retains historical KV caches generated during the prefilling phase, not those from the decoding phase, making it suitable for workloads sharing long common prefix hints, such as system prefixes. The main drawback of this design is that in multi-turn dialogue scenarios (e.g., document QA), the prefilling instance needs to repeatedly forward the same set of active KV caches to the decoding instance, wasting bandwidth and impacting the timing of the second tag. Therefore, we propose the next design to address this issue.
PD-Caching-2 caches data on D-only decoder nodes. This design enables caching within the decoding instance to reduce redundant data movement. We made two key changes to PD-Caching-1. First, the prefilling instance now calls transfer_with_insert instead of transfer, allowing the decoding instance to insert the transferred KV cache generated during the prefilling phase into its local index. Second, after a request is completed, the decoding instance calls insert to retain the KV cache generated during the decoding phase in its local index. Thanks to locally aware scheduling, the prefilling instance now only needs to incrementally transfer new KV cache data. While this design reduces data movement from the prefilling instance to the decoding instance, it does not improve the context caching of the prefilling instance because it lacks historical KV cache from the decoding phase. Therefore, in multi-turn dialogue scenarios, the benefits of context caching remain unchanged as cues increase. Therefore, we propose the next design to address this issue.
PD-Caching-3 allows P-only and D-only machines to transfer and maintain indexes on both sides. This design implements a comprehensive context cache for the de-aggregated inference architecture. We made a change to PD-Caching-2: after a request is completed, the decoding instance calls transfer_with_insert to transfer the KV cache generated during the decoding phase to the pre-fill instance (step 5 in the diagram above). Therefore, as the historical KV cache retained by the pre-fill instance grows, the benefits of context caching increase linearly with the number of rounds. We can actually understand the Caching Pool here as an asynchronous sharing mechanism that does not completely synchronize all data; various types of nodes need to maintain their own caches and also push and pull some remote data on demand.

Transmission operation

There are two ways to transfer the active KV Cache from prefill to the decoding instance: by layer or by request. The by-layer method transfers the KV Cache after computation is completed at one layer. The by-request method transfers the KV Cache after the prefill phase is complete. Splitwise found that the by-layer method is superior to the by-request method because it overlaps computation and communication, thus speeding up the time to the second token (or TTST).

The MemServe authors also observed the same phenomenon when the load was low. However, as the load increased, both incurred very high overhead, the root causes of which were (1) a discrete memory layout and (2) a lack of network primitives.

Paging-based dynamic memory management, such as PagedAttention, is the de facto standard in current LLM service systems. Regardless of whether the paging mechanism is implemented in the engine or the driver, the KV cache is partitioned and stored in fixed-size memory blocks. The block size is configurable, typically in units of the number of tokens. Existing engines manage the KV cache in a fine-grained manner. For example, vLLM allocates two blocks per LLM layer. For an LLM with L layers and 8 tokens per block, the engine needs 2 * L blocks to store the KV cache for 8 tokens. While paging improves utilization, the discrete memory layout presents significant challenges when implementing distributed inference using existing AI network stacks.

The network stack technologies in AI are mostly collection communication libraries such as NCCL. These libraries are best suited for typical AI workloads using tensor or pipelined parallel processing, but they perform poorly in in-request optimizations that support LLM services, such as distributed inference or sequence parallelism. These new patterns of LLM services require efficient point-to-point, collection, and distribution primitives between HBM or DRAM. However, because KV caches are discrete, the number of network API calls equals the number of discrete memory blocks, regardless of whether a layer-by-layer or per-request approach is used. This is the fundamental reason why both incur overhead as the load increases.

To address the challenges posed by pagination and the lack of network primitives, we propose reducing fragmentation by aggregating smaller key-value blocks into larger blocks, similar to using large pages. Specifically, we aggregate two blocks per level into one block: the new block size is equal to 2 * L smaller blocks. This effectively reduces the number of network API calls by a factor of 2 * L. This optimization only applies to per-request methods, as per-level methods inevitably require at least [number] network API calls.

2633

The figure above compares the cross-memory layout and transport timelines for layer-based, request-based, and request-based aggregation (the optimization proposed in the paper). The paper’s tests show that under low load, layer-based achieves the lowest JCT (Job Completion Time), but under high load, bylayer-aggregation outperforms layer-based under reduced network calls.

4.5 TetriInfer

The paper “Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads” proposes TetriInfer. The idea behind TetriInfer is to stack space like Tetris blocks, avoiding leaving gaps randomly.

4.5.1 Motivation

A simple solution to avoid interference is to statically allocate resources for each downstream task. However, given the high cost of LLM service infrastructure, this solution is impractical. Therefore, TetriInfer built a distributed LLM inference service system that addresses these issues by carefully scheduling and grouping tasks based on request characteristics. The authors of TetriInfer’s approach are as follows.

TetriInfer separates prefill and decode operations onto different hardware. These are virtual concepts with dedicated prefill and decode instances, each capable of independent scaling and role reversal as load changes. TetriInfer schedules prefill requests only to the prefill instance, and the same applies to decode requests. The prefill instance then transfers its prefilled KV cache to the decode instance.
Limiting the prefill size means finding a batch size that is exactly at the critical point between compute-bound and memory-bound.
- To avoid interference from the pre-filling phase, the number of tokens processed in a single pre-filling iteration should be limited to fully utilize the hardware without incurring additional penalties. Therefore, TetriInfer segments and fills the input hints into fixed-size blocks, allowing pre-filling to run within a fixed-size computational unit so that the accelerator always approaches its computational saturation limit.
- Sarathi also proposed chunked-prefills for the same purpose, but Sarathi mixed prefilling and decoding together. In contrast, TetriInferj only involves running prefilled blocks because TetriInferj has decomposed the prefilling and decoding of LLM into separate instances.
Decode phase tasks are scheduled using output-based length prediction to avoid scheduling hotspots. TetriInfer utilizes an LLM-based length prediction model to infer the number of tokens generated for decoding requests and then schedules them accordingly.
- The length predictor runs on each pre-padded instance, allowing it to make informed decisions based on the load of the decoding instances, ensuring sufficient resources are available for certain decoding requests. The length predictor uses a small LLM model for prediction and continues to use fixed-size batches instead of chunked pre-padded. This is because the model’s small size means it doesn’t have the obvious computational saturation threshold of larger models.
- Use an intelligent two-level scheduling algorithm, combined with predicted resource usage, to avoid decoding scheduling hotspots.

4.5.2 Implementation

2634

TetriInfer is divided into four main modules: centralized control plane, pre-filled instances, decoding instances, and length prediction model.

A centralized control plane. It consists of a global scheduler and a cluster monitor. The global scheduler sends requests to pre-population instances based on load and receives streaming output from decoding instances. The cluster monitor collects statistics from pre-population and decoding instances and periodically broadcasts load information to pre-population instances. It adds, removes, and flips pre-population or decoding instances.
Pre-filled instances. They only run the pre-filling phase of LLM inference requests. Each pre-filled instance has a local scheduler, a length predictor, a main LLM engine, and a scheduler.
- To avoid interference between pre-fill requests, we use a pre-fill scheduler and chunked pre-fill to sort and partition all prompts into fixed-size chunks.
- The scheduler for pre-populated instances is crucial for improving latency and throughput during the pre-population phase. The scheduler maintains a raw request queue to store requests from the global scheduler, and a scheduled queue to store ordered requests. In this work, we designed and implemented three scheduling strategies: First-Come, First-Served (FCFS), Shortest Job First (SJF), and Longest Job First (LJF). We can use the latter two strategies because we can accurately estimate the pre-population time of requests based on the number of flags in the hints.
Decoding instances. These are virtually separate from pre-filled instances and only run the decoding phase of LLM inference requests. Each decoding instance can receive requests from any pre-filled instance. It runs a local scheduler with three predefined policies for selecting decoding requests to run in the main LLM engine.
Length Prediction Model. The prediction model is a small, offline-fine-tuned LLM model used to predict the generated length of LLM inference requests. TetriInfer’s pre-filled scheduler and the local scheduler for decoding instances utilize speculative information to schedule decoding instances, avoiding the hotspot problem measured in §2.2.3.

4.6 Mooncake

Mooncake’s core concept is to trade storage resources for computational efficiency. It adopts a split architecture centered on KVCache, separating the prefetch and decode clusters (decomposing different parts of the LLM service into dedicated resource pools) and using RDMA to build a cross-node shared cache. It also utilizes the CPU, DRAM, and SSD resources of the GPU cluster to implement tiered caching of KVCache.

4.6.1 Insight

The core objectives and key insights of this paper are as follows:

Because the need for long contexts will always exist, the capacity of KVCache will remain high for a long time.
Reuse KVCache as much as possible to reduce the required computing resources and the time spent transferring data between nodes.
Maximize the number of tokens in each batch to improve model FLOPs utilization.
Remote calls to KVCache will extend the TTFT, and a large batch size will result in an even larger TBT.
KVCache stored on lower-level storage can cause more problems.

4.6.2 Implementation

Decoupled architecture

As shown in the diagram below, Mooncake employs a split architecture. This split has two meanings: not only does it separate the pre-filling and decoding nodes, but it also groups the CPU, DRAM, SSD, and RDMA resources of the GPU cluster to achieve a separate KVCache. This split cache utilizes underutilized resources, providing ample cache capacity and transmission bandwidth.

The computational resources for Prefill and Decode are separated. The optimization goal of the Prefill stage is to leverage the opportunity of common prefixes among requests to reuse the KVCache as much as possible, while satisfying TTFT, maximizing MFU, and ensuring that the KVCache is less than the CPU memory limit. The optimization goal of Decode is to maximize throughput, satisfy TBT, and ensure that the KVCache is less than the GPU memory limit.

Mooncake separates the KVCache from computation. It groups the CPU, DRAM, SSD, and RDMA resources of the GPU cluster into a Distributed KVCache Pool, improving hit rates and reducing redundant computations through shared caching. The KVCache is also managed in a paged manner, with blocks partitioned.

Therefore, Mooncake breaks down and reorganizes the resources of a single homogeneous GPU cluster into three resource pools that can be independently and elastically scaled.

The prefill pool processes user input and is primarily responsible for the TTFT. Because prefilling is relatively computationally intensive, this part also plays a role in improving overall resource utilization.
The decoding pool. After prefill processing, the corresponding KVCache is sent to the Decode Pool for autoregressive streaming output. Although we aim to aggregate as many tokens as possible in a batch to improve MFU, this part is primarily responsible for TBT.
KV Cache Pool: A global Prefix Cache is created on each HGX machine. This global Prefix Cache, through global scheduling, significantly improves reuse rates, thereby increasing overall throughput. This raises a series of questions regarding how to schedule, allocate, and replicate the KV Cache, and it must be designed in conjunction with Prefill/Decode scheduling. Therefore, we call Mooncake’s architecture a KV Cache-centric architecture.

2635

Next, let’s look at several design ideas for pools.

prefill pool

To design a separate pre-filled node pool to seamlessly handle dynamic allocation of context lengths, Mooncake employs chunked pipeline parallelism (CPP) to extend the processing of a single request across multiple nodes. Compared to solutions based on traditional sequence parallelism, CPP reduces network overhead and simplifies reliance on frequent elastic scaling. Furthermore, with further layer-wise prefilling, KVCache streaming can overlap latency. This is essential for reducing the TTFT of long context inputs.

Layer-wise Prefill

Prefilling is accomplished in blocks/layers, i.e., a block/layer-wise design. This can be understood as a refinement of the page in pagedattention, and also as using a pool with a unified perspective to uniformly maintain the storage of KVCache.

The main goal of the pre-filling phase is to reuse the KVCache as much as possible to avoid redundant computations. Therefore, the most frequently used KVCache blocks should be replicated to multiple nodes to avoid congestion, while less frequently used blocks should be swapped out to reduce reservation costs.

Because preloading is performed layer by layer and is computationally constrained, the transport and dumping of KVCache can be overlapped with computation, further reducing its overhead and effectively lowering the latency of long context requests. In Mooncake, KVCache loading and storage are performed asynchronously through start and wait operations. Before attention computation begins at each layer, the model waits for the asynchronous loading of that layer’s KVCache to complete and triggers the asynchronous loading of the next layer’s KVCache. After attention computation is complete, asynchronous storage of that layer’s KVCache is initiated. Once computation at all layers is complete, the process waits for all asynchronous storage operations to complete. Transport overlap makes the execution time of preloaded instances roughly equivalent to the KVCache loading time or the standard preload time, depending on the prefix cache ratio relative to the input length.

Multi-node Prefill

Because of the rich parallelism inherent in long context pre-filling, many parallel strategies have been proposed.

Data parallelism: The batch size of long sequence prefill is 1, so data parallelism is not possible.
Tensor parallelism: While it is desirable to process long context requests in parallel using multiple GPUs, scaling tensor parallelism beyond a single node requires two expensive, RDMA-based global reduction operations per layer, which significantly reduces the MFU (Mean Functionality Fully Required) of pre-filled nodes. Therefore, tensor parallelism is difficult to scale beyond a single node.
Sequence Parallelism (SP). Sequence parallelism partitions the requested input sequence across different nodes to achieve speedup. These SP methods leverage the associative properties of attention operators, requiring at least one cross-node communication per layer in the implementation of Ring Attention or Striped Attention. This significantly reduces network overhead and improves MFU (Mean Functionality). However, using SP still results in worse MFU than using a single-node TP (Translation Principle). Moreover, SP inference requires copying model parameters on each card, which is a heavy burden for large models. Furthermore, the communication between each SP layer consumes valuable network bandwidth that should ideally be reserved for KV cache transmission.

Ideally, a deployment should organize pre-populated nodes into two groups: one group using only TP (Transaction Processing) and the other using SP (Service Packet). Requests are only assigned to the SP group when a TTFT SLO (Time To See Requests and Requests for Transaction Optimization) needs to be met. However, this further decomposition leads to problems when dynamically adjusting the number of nodes in each group, as a static parallel setup can result in reduced utilization of the entire cluster.

To address this issue, Mooncake leverages the autoregressive properties of the decoder-only model and implements block-based pipelined parallelism (CPP) for long context pre-filling. Mooncake groups every X nodes in the pre-filling cluster into a pipelined pre-filling node group. For each request, its input tokens are divided into blocks, each block not exceeding a pre-filled block. Different blocks of the same request can be processed simultaneously by different nodes, thus achieving parallelization of processing and reducing TTFT. CPP offers two main benefits: 1) Similar to pipelined parallelism in training, it only requires cross-node communication at the boundaries of each pipeline stage, which can easily overlap with computation. This leads to better MFU and less network resource contention with KVCache transfers. 2) It is naturally applicable to both short and long contexts, with no significant overhead for short context pre-filling, and avoids frequent dynamic adjustments to node partitions.

This pipeline-based acceleration method has been explored in training systems, but Mooncake is applying it for the first time in the inference phase.

4.6.3 Decode Pool

Decode pools are the most common practice in the industry, so we will skip that here.

4.6.4 KVCache pool

The distributed KVCache design supports shared caching across multiple conversations, avoiding redundant storage and computation; it also provides tiered storage management: high-frequency KVCache resides in DRAM, while low-frequency data is moved to SSD.

Direct PD (Programmer, Controller, and Decoder) involves the pre-filling node directly sending the KV cache to the decoding node. Its advantage is low latency, but it couples the P and D nodes together. If the decoding node fails, rescheduling cannot only schedule the D node; the entire pre-filling and decoding process must be re-executed, resulting in high costs. Using a KV Cache Store/Pool adds an intermediate storage between the P and D nodes. The pre-filling node writes the KV cache to the intermediate storage, and the decoding node reads from it. This involves an extra data transmission, increasing latency and implementation complexity. However, it decouples the P and D nodes, improving fault tolerance. This intermediate storage can also be used for prefix caching during the pre-filling phase.

Depending on the request pattern, the KVCache pool can be sized using cache eviction algorithms such as LRU (Least Recently Used), LFU (Least Frequently Used), or cache eviction algorithms based on request characteristics. The transfer of these KVCache blocks between the CPU and GPU is handled by a separate (GPUDirect) RDMA-based component, Messenger. This architecture also allows us to provide a context caching API to external users for greater reuse of the KVCache.

The diagram below illustrates the KVCache pool in CPU memory, along with the storage and transfer logic for KVCache blocks. Each block is assigned a hash value, determined by its own hash value and a prefix used for deduplication.

2636

4.6.5 Workflow

To schedule all these distributed components, Mooncake implements a global scheduler called Conductor. Conductor is responsible for dispatching requests based on the current KVCache distribution and workload. It also copies or swaps certain KVCache blocks if it is beneficial for future reasoning.

Mooncake aims to fully utilize the underutilized CPU, DRAM, and SSD resources within its GPU cluster. Simply put, it tries to maximize the use of GPU, CPU, and memory resources. Therefore, it needs to determine the current inference bottleneck. If it’s a computational bottleneck, but all GPUs are fully utilized, can the computation be moved to the CPU? If it’s a storage bottleneck, but the GPU memory is full, can the computation be moved to system memory? If it’s a transmission bottleneck, can quantization or other methods be used?

For each new request, the Conductor estimates the corresponding execution time based on the request length and prefix_len (which varies by instance). The estimated wait times for that request are then summed to obtain the TTFT on that instance. The Conductor assigns the request to the instance with the shortest TTFT and updates the cache and queuing times for that instance accordingly. To predict the computation time of the request prefilling phase, Mooncake uses a prediction model derived from offline test data. This model estimates the prefilling computation time based on the request length and prefix cache hit length. The queuing time for requests is calculated by aggregating the prefilling times of all queued requests.

Due to high instance load, requests may not always be directed to the prefilled instance with the longest prefix cache length. In this case, if the estimated additional prefilling time is less than the transmission time, the Conductor will forward the cache location and request to an alternative instance. If the optimal remote prefix match length is no greater than the current locally reusable prefix multiplied by a threshold, we prefer to directly compute the input token.

2637

The diagram above illustrates a typical request workflow. When a request arrives, the Conductor selects a pair of preprocessing nodes and a decoding node, and initiates a four-step workflow:

KVCache reuse. This is a KVCache-centric scheduling mechanism. The selected prefilled nodes (groups) receive a request consisting of three parts: the original input, the reusable prefix cache block ID, and the full cache block ID assigned to the request. The Conductor loads the prefix cache from remote CPU memory to GPU memory based on the prefix cache block ID to initiate the request. If the prefix cache does not exist, this step is skipped. This selection balances three objectives: reusing KVCache as much as possible, balancing the workload of different prefilled nodes, and guaranteeing TTFT SLO.
Incremental prefilling. Prefilling nodes (groups) complete the prefilling phase using a prefix cache and store the newly generated incremental KVCache back to CPU memory. If the number of uncached input tokens exceeds a certain threshold (prefill_chunk), the prefilling phase is divided into multiple chunks and executed in a pipelined manner. This threshold is set to a value that can fully utilize the computing power of the corresponding GPU, typically greater than 1000 tokens.
KVCache Transfer. A Messenger service is deployed on each node to manage and transfer these caches. Each Messenger runs as an independent process within its corresponding inference instance. The Messenger streams the KVCache generated for each model layer to the CPU memory of the target decoding node. This KVCache transfer is performed asynchronously and overlaps with the incremental pre-filling step described above, thus reducing latency.
Decoding. Once the decoder node’s CPU DRAM has received all the KVCache data, Mooncake adds the next batch of requests in a continuous batch process. The Conductor pre-selects decoder nodes based on their current load to ensure that the TBT SLO is not violated. However, the local scheduler checks this SLO again because the expected load may have changed after the prefill phase. This double check may result in request rejection, in which case the corresponding prefill cost is wasted.

4.6.6 Scheduling

Mooncake implements a request scheduling algorithm based on KVCache that balances instance load and user experience.

Algorithm 1 below details the cache-aware prefill scheduling mechanism. For each new request, its input token is divided into several blocks, and a hash key is calculated for each block. This involves generating a token hash key within a block that is concatenated with the hash key of the previous block (if available). The request’s block key is then compared one by one with the cache key of each prefill instance to identify the prefix match length (prefix_len).

Why can global scheduling be performed? This is because for each X-byte KVCache, the computing power required to regenerate it is proportional to X * hd multiplied by a large constant, where hd is the hidden dimension of the model. Therefore, as long as the ratio of computing power per card to communication bandwidth per card is less than hd multiplied by this constant, transmitting the KVCache from a remote location not only reduces the amount of computation but also reduces the TTFT compared to recompiling in-place.

Additionally, an early rejection strategy is discussed below. Mooncake estimates the load in both phases and directly rejects or delays processing requests that might exceed the capacity before processing begins, thus optimizing server overload issues. The early rejection strategy reduces wasted computational resources in overload scenarios. If no decoding slots are available after the pre-filling phase, we need to reject some requests early to save wasted computational resources.

2638

0xFF Reference

Orca: A Distributed Serving System for Transformer-Based Generative Models
SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
Hand-caught cake bear: MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
Hand-pulled Pancake Bear: CachedAttention (formerly AttentionStore)
Hand-pulled Pancake Bear: Attention Optimization for Large Model Prefix Scenes (Part 3)
Let’s talk about optimization issues in large model inference services. (by Daodao Ning)
Breakthrough in Large-Scale Inference: Exploration and Practice of Distributed Inference Technology (GeekBang Technology InfoQ)
Five Tiger Generals of the Large Model Inference Separation Architecture Zhao Shangchun
Some Thoughts on Deepseek’s Use of the EP Inference Method (Yang Pengcheng)
Understanding the computational characteristics of prefill and decode based on chunked prefill: Chayenne Zhao
https://zhuanlan.zhihu.com/p/718715866
The Architectural Issues Behind the Separation of LLM and PD - Geek Boge
https://zhuanlan.zhihu.com/p/27836625742
Splitwise: Efficient generative llm inference using phase splitting
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
https://hao-ai-lab.github.io/blogs/distserve/
akaihaoshuai: LLM Reasoning Acceleration Research
Inference without interference: Disaggregate llm inference for mixed downstream workloads
Mooncake Reading Notes: In-depth Study of Cache-Centric Scheduling Concepts, Writing a New Chapter in Cost Reduction and Efficiency Improvement for LLM Services (by Fang Jiarui)
Mooncake: Design and Analysis of LLM Service Architecture Based on KVCache (by Chang Hua and Andy)
What are the latest updates to vLLM? OLDPAN
Breakthrough in Large-Model Inference: Exploration and Practice of Distributed Inference Technology (QCon [InfoQ])
Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation (Yunkai Liang, Zhangyu Chen, Pengfei Zuo, Zhi Zhou, Xu Chen, Zhou Yu)