MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
Abstract

LLM serving has evolved from stateless to stateful systems, utilizing techniques like context caching and disaggregated inference. These optimizations extend the lifespan and domain of the KV cache, necessitating a new architectural approach. We present MemServe, a unified system that integrates both inter-request and intra-request optimizations. MemServe introduces MemPool, an elastic memory pool managing distributed memory and KV caches across serving instances. Using MemPool APIs, MemServe combines context caching with disaggregated inference for the first time, supported by a global scheduler that enhances cache reuse through a global prompt-tree-based locality-aware policy. Tests show that MemServe significantly improves job completion time and time-to-first-token.

1 Introduction

Large language models (LLMs) and their underlying transformer architecture have revolutionized AI, becoming foundational to many emerging applications and a crucial workload in data centers. While high-quality models are essential, it is equally important to serve these models at a massive scale and a reasonably low cost. As a result, numerous approaches have been proposed to enhance the cost-efficiency of LLM serving, such as context caching [37, 40], disaggregated inference [23, 41], and sequence parallelism [17].

Consequently, LLM serving has evolved from a stateless to a stateful system, leveraging dependencies inherent in inference requests. These dependency-exploiting techniques can be classified into two types: inter-request and intra-request. Inter-request techniques exploit dependencies across requests. The notable one is context caching [40], which reuses the KV cache for requests that share the same prompt prefix, thereby speeding up the prefill phase. Intra-request techniques, on the other hand, exploit dependencies within a single request. Two prominent examples are disaggregated inference, which splits a request into two sub-requests for better scheduling [23], and sequence parallelism, which divides a request into multiple sub-requests to distribute load [17].

A common theme in these dependency-exploiting techniques is that they require novel logic to manage and transfer the KV cache, the intermediate data produced during LLM inference. Inter-request methods preserve the KV cache after requests finish, extending its lifespan from a single request to multiple requests; intra-request methods manage the KV cache across multiple inference instances, extending its domain from a single instance to distributed instances. However, deploying a stateful LLM serving system with these optimizations is challenging due to conflicting or missing mechanisms for managing the LLM's intermediate KV cache data. We have identified two key problems.

The first problem is that LLM serving systems cannot simultaneously apply existing inter-request and intra-request dependency-exploiting optimizations. Current context caching (inter-request) methods are designed without considering intra-request scenarios. As a result, disaggregated inference (intra-request) cannot benefit from context caching because it lacks the mechanisms to return the KV cache from decode instances back to prefill instances for future reuse. Similarly, sequence parallelism distributes the KV cache across multiple instances but lacks the mechanisms and algorithms needed to preserve and reuse it. This issue arises because intra-request techniques break a tightly coupled request into multiple loosely coupled sub-requests, complicating KV cache management in a distributed setting.

The second problem is that LLM serving systems lack a holistic, top-down design that effectively utilizes existing inter-request techniques. Context caching benefits from reusing historical KV cache by running requests that share a common prefix in the same serving instance. However, current LLM serving systems schedule requests across multiple serving instances based on load or session IDs, which fails to maximize KV cache reuse across sessions.

These issues arise because existing LLM serving systems are built on the assumption that the KV cache is merely intermediate data scoped to a single request on a single instance. With emerging dependency-exploiting techniques, the lifespan of the KV cache has been extended, and its management has expanded to a distributed setup. This paradigm shift calls for a fundamental rethinking of LLM serving architectures.

In this work, we propose Memory-enhanced model Serving, or MemServe, to handle inter-request and intra-request optimizations within a unified system. To tackle the challenges of managing the KV cache across distributed instances, MemServe introduces an elastic memory pool, or MemPool, which is a substrate for managing all cluster memory, including CPU DRAM and GPU HBM.
MemPool offers a rich set of APIs for managing distributed memory and KV cache. Utilizing these APIs, MemServe implements context caching over standard prefill-decode-colocated (PD-colocated) instances [40] and disaggregated inference [12, 41]. Moreover, MemServe enhances disaggregated inference with context caching, reaping both benefits. Finally, to maximize KV cache reuse, MemServe employs a global scheduler that incorporates a locality-aware policy using novel global prompt trees for best-effort routing.

The MemPool is a core component of MemServe, providing three types of APIs: memory, indexing, and distributed data transfer. It runs within each inference instance, managing all local memory with a fixed-size memory allocator. The indexing APIs are crucial for building context caching. MemPool uses an internal index to map prompt tokens to the KV cache, managing both the active KV cache for ongoing requests and the historical KV cache retained after requests are completed. The MemPool offers a simple data transfer API that abstracts three heterogeneities: parallelism, network, and memory medium. As a unified platform, MemPool supports all known inter-request and intra-request optimizations as well as any combinations (see Figure 3).

MemServe bridges the gap between context caching (inter-request) and disaggregated inference (intra-request) in four steps using MemPool APIs: (a) we first use a distributed API to reproduce disaggregated inference, (b) we then add caching to prefill-only instances using index APIs, (c) we apply the same caching to decode-only instances, and (d) finally we enable decode-to-prefill data transfer, as illustrated in Figure 4. However, it is challenging to hit two birds with one stone. We observed increasing overheads due to naive discrete memory layouts and point-to-point network primitives from existing AI network stacks. To address this, MemServe proposes co-optimizing memory layout and network transfer using huge pages.

We implement MemPool and the global scheduler from scratch, in 5.6K SLOC of Python and 1.6K SLOC of C++. We modify vLLM [14] to build context caching with disaggregated inference, in 200 SLOC of Python and 400 SLOC of CUDA C++. We use NCCL send and recv pairs for data transmission between GPUs, and sockets if either side is DRAM.

We run all tests atop a single server with eight H800-80G GPUs. We evaluate MemServe across four settings: (1) PD-colocated, (2) PD-colocated with caching, (3) PD-disaggregated, and (4) PD-disaggregated with caching. The first setting runs vanilla vLLM. The last three settings are MemServe running adapted vLLM using MemPool APIs. While running the ShareGPT workload [25], the PD-disaggregated-with-caching setting outperforms the others. Specifically, MemPool-based disaggregated inference improves JCT by up to 42% compared to PD-colocated. Enhancing disaggregated inference with context caching can further improve JCT by 29%. When executing the LooGLE dataset, which features extended prompts and relatively short generation lengths, disaggregated inference boosts JCT by up to 10.8% compared to PD-colocated setups. Additionally, context caching offers further enhancements, potentially improving JCT by 26.9%.

In summary, we make the following contributions:

• We propose MemPool, an elastic memory pool designed for LLM serving with a rich set of APIs.
• We build the first disaggregated inference with context caching in MemServe based on MemPool APIs.
• We propose a novel prompt-tree-based locality-aware policy for scheduling LLM requests.

2 Background

Generative LLM Inference. LLM inference involves generating a sequence of output tokens in response to an input prompt. This process consists of two distinct phases: prefill and decode. During the prefill phase, the model processes the prompt to generate the key-value (KV) cache, which comprises the key-value pairs produced by the self-attention mechanism. In the decode phase, the model uses the KV cache to generate tokens iteratively. The size of the KV cache grows linearly with the number of generated tokens.

Inter-Request Optimization. This type of optimization exploits dependencies among requests for better performance. Context caching is the only known technique in this category. To build context caching, the system stores and reuses the KV cache from the self-attention mechanism to avoid redundant computation across similar or repeated requests. This is useful in scenarios where multiple requests share common prefixes or contexts. Two mechanisms are essential: first, an index to find dependencies among requests and consequently locate the preserved KV cache (see Table 2); second, a modified inference engine and attention kernel to reuse the historical KV cache (see SGLang [40]).

Intra-Request Optimization. This type of optimization exploits dependencies within a request to enhance performance. Two notable examples are disaggregated inference [12, 23, 27, 41] and sequence parallelism [17]. Generally, disaggregating prefill from decode reduces interference between these two stages and allows each to scale independently with heterogeneous hardware. However, this breaks a single request into two sub-requests and requires rigorous KV cache transmission from prefill to decode. The same goes for sequence parallelism, in which distributed instances need to exchange the outputs of self-attention in a rigorous manner. Overall, intra-request optimization demands efficient mechanisms for transferring the KV cache among instances.
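As a back-of-the-envelope illustration of this linear growth, the KV cache footprint can be estimated from the model shape. The dimensions below are illustrative assumptions, not the paper's configuration:

```python
def kv_cache_bytes(tokens, num_layers=32, num_kv_heads=32,
                   head_dim=128, dtype_bytes=2):
    """Estimate KV cache size: two tensors (K and V) per layer, each
    holding [tokens, num_kv_heads, head_dim] elements of dtype_bytes."""
    return 2 * num_layers * tokens * num_kv_heads * head_dim * dtype_bytes

# Each generated token adds a fixed increment, so the cache grows linearly:
per_token = kv_cache_bytes(1)   # 524288 bytes, i.e. 0.5 MiB per token here
assert kv_cache_bytes(1000) == 1000 * per_token
```

Under these assumed dimensions, a 4K-token context already occupies about 2 GiB of HBM, which is why retaining historical KV cache across requests and instances is a memory management problem in its own right.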
Figure 1. MemServe Architecture. It supports three types of inference instances: prefill-only, decode-only, and PD-colocated.
Each inference engine runs over one or multiple AI servers, depending on the parallelism configuration.
3 MemServe Overview

MemServe is designed as a large-scale LLM serving system that efficiently handles inter-request and intra-request optimizations. It comprises three main components: a global scheduler, multiple types of inference instances, and an elastic memory pool (MemPool), as shown in Figure 1. The MemPool offers a set of APIs for memory allocation, index management, and distributed transfer (§4). MemServe builds context caching atop both regular and disaggregated inference architectures using MemPool APIs (§5). The global scheduler forwards inference requests from users to the right inference instance. It uses locality-aware policies based on novel distributed prompt trees, maximizing KV cache reuse (§6).

Table 1. Elastic Memory Pool APIs. Type can be HBM-only, DRAM-only, or mixed. Each address encodes an instance ID. Transfer flags can control on-demand allocation.

  alloc_mem(size, type, id): allocate a certain type of memory on a given instance (@id); returns addrList
  free_mem(addrList): free memory
  insert(tokenList, addrList, flags): insert a prompt-token-to-KV-cache address mapping into the local index
  match(tokenList): find a prompt's cached data, if any; returns addrList
  delete(tokenList): delete a prompt's cached data, if any
  swap_out(num_blocks): swap a given number of blocks from HBM to DRAM
  swap_in(addrList): swap blocks with the given addresses from DRAM to HBM
  transfer(id, srcAddrList, dstAddrList, flags, private): transfer data to the specified instance (@id); dstAddrList is optional, flags control behaviors at the destination, and private carries user data
  transfer_with_insert(id, tokenList, srcAddrList, dstAddrList, flags, private): transfer tokenList and its cached data to the specified instance; the receiver will call an extra insert

4 Elastic Memory Pool

The MemPool manages all memory in the inference cluster, including CPU DRAM and GPU HBM. MemPool runs within each inference instance, collectively offering a set of distributed memory pool APIs (§4.1). It manages both the active KV cache used by ongoing requests and the historical KV cache retained after requests are completed. An indexing layer maps prompt tokens to the historical KV cache (§4.2), ensuring efficient retrieval of cached data. The MemPool has efficient mechanisms for exchanging data between instances, alleviating inference engines from dealing with heterogeneous hardware (§4.3). Overall, this design makes MemPool a versatile and generic platform capable of supporting both intra-request and inter-request optimizations within a unified system (§4.5).

4.1 API

We show MemPool APIs in Table 1, broadly divided into three categories: memory block, index, and distributed transfer. The inference engine can use memory block APIs to allocate fixed-size memory blocks for storing KV cache or other data. The engine can also call the index APIs for context caching solutions. For example, once requests are finished, the engine can call insert to transition the active KV cache into the historical KV cache and create a mapping from prompt tokens to the KV cache. The engine can invoke distributed APIs, such as transfer, to exchange the KV cache across instances when building inter-request optimizations.
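As a rough sketch of how an engine might drive these APIs, the toy MemPool below models only the index bookkeeping: fixed-size blocks, an insert call after a request finishes, and a longest-prefix match. It is our own illustrative stand-in, not the paper's implementation:

```python
class MemPool:
    """Toy model of MemPool's index: maps prompt-token blocks to
    KV-cache block addresses (illustrative sketch, names follow Table 1)."""

    def __init__(self, block_tokens=16):
        self.block_tokens = block_tokens
        self.index = {}        # tuple(prompt tokens) -> addrList
        self.next_addr = 0

    def alloc_mem(self, num_blocks):
        # stand-in for alloc_mem(size, type, id): hand out fake block addresses
        addrs = list(range(self.next_addr, self.next_addr + num_blocks))
        self.next_addr += num_blocks
        return addrs

    def insert(self, token_list, addr_list):
        # map every full-block prefix of the prompt to its KV-cache addresses,
        # so later requests can reuse any shared prefix
        for i in range(len(addr_list)):
            block_prefix = tuple(token_list[: (i + 1) * self.block_tokens])
            self.index[block_prefix] = addr_list[: i + 1]

    def match(self, token_list):
        # find the longest cached prefix, at block granularity
        n = (len(token_list) // self.block_tokens) * self.block_tokens
        while n > 0:
            addrs = self.index.get(tuple(token_list[:n]))
            if addrs is not None:
                return addrs
            n -= self.block_tokens
        return []

# After a request finishes, the engine transitions its active KV cache into
# historical KV cache by inserting the prompt -> address mapping:
pool = MemPool()
prompt = list(range(48))          # 48 tokens = 3 blocks of 16
addrs = pool.alloc_mem(3)
pool.insert(prompt, addrs)
# A later request sharing only the first 32 tokens hits 2 cached blocks:
assert pool.match(list(range(32)) + [999] * 16) == addrs[:2]
```

A real implementation would attach actual HBM/DRAM block handles to these addresses and evict cold entries; the sketch only shows the insert/match contract the engine programs against.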
Overall, the MemPool provides a rich framework for managing distributed memory and implementing efficient context caching.

Table 2. Comparing Indexing Methods. MemServe's MemPool uses prompt tokens for their generality.

  Token ID: use prompt tokens; universally applicable
  Session ID: use the client-server session ID; limited scope
  Document ID: use the document file ID; limited scope

Figure 2. MemPool Transfer API. The left shows the workflow of transfer and transfer_with_insert. The right shows asymmetric parallelism and memory medium.

4.2 Indexing

The MemPool has an index layer to map prompt tokens to the historical KV cache. MemPool traverses the index whenever engines call insert, match, delete, etc. The LLM serving world has three indexing methods: token, session, and document IDs (see Table 2). Token-based indexing is known for its generality, as it works for any shared-prompt-prefix case [40]. Session and document ID indexing are simpler but can only reuse shared prompts within a chat session or across sessions using the same document [10, 29]. We adopt the token-based indexing method for broad applicability. To implement this index, MemPool utilizes the radix tree proposed by SGLang [40], with two key extensions. First, because MemPool manages both GPU HBM and CPU DRAM, we enable the radix tree to reference data located anywhere in the system. Second, since we also use the radix tree to build the global prompt tree in the global scheduler (§6), we add a field to indicate which inference instance holds the data. Note that while mixed indexing methods are possible, we will explore this in future work.

To minimize data reshaping overhead, we maintain the original memory layout when transitioning the active KV cache to the historical KV cache before inserting it into MemPool. Consequently, MemPool's indexing granularity aligns with the inference engine's configuration. For example, in our tests with vLLM, which uses a block size of 16 tokens, our radix tree nodes point to KV cache blocks of 16 tokens.
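A minimal sketch of such a prefix tree, keyed by 16-token blocks and carrying the per-node instance field described above; the class and field names are ours, not the paper's:

```python
class Node:
    def __init__(self):
        self.children = {}    # token block (tuple of 16 tokens) -> Node
        self.addr = None      # KV-cache block address (HBM or DRAM)
        self.instance = None  # which inference instance holds the data

class PromptTree:
    """Prefix tree over 16-token blocks, matching vLLM's block size.
    The instance field lets the same structure serve as the scheduler's
    global prompt tree (illustrative sketch)."""
    BLOCK = 16

    def __init__(self):
        self.root = Node()

    def _blocks(self, tokens):
        usable = len(tokens) - len(tokens) % self.BLOCK
        return [tuple(tokens[i:i + self.BLOCK])
                for i in range(0, usable, self.BLOCK)]

    def insert(self, tokens, addrs, instance):
        node = self.root
        for block, addr in zip(self._blocks(tokens), addrs):
            node = node.children.setdefault(block, Node())
            node.addr, node.instance = addr, instance

    def match(self, tokens):
        """Walk the longest cached prefix; return its addresses and the
        instance that owns the last matched block."""
        node, addrs, inst = self.root, [], None
        for block in self._blocks(tokens):
            if block not in node.children:
                break
            node = node.children[block]
            addrs.append(node.addr)
            inst = node.instance
        return addrs, inst

# The global scheduler can route a request to the instance owning the
# longest matching prefix (best-effort locality-aware policy):
tree = PromptTree()
tree.insert(list(range(32)), ["a0", "a1"], instance="prefill-0")
addrs, inst = tree.match(list(range(32)) + [7] * 16)
assert addrs == ["a0", "a1"] and inst == "prefill-0"
```

A production radix tree would compress chains of single-child nodes and track reference counts for eviction; the sketch keeps one node per block to stay short.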
4.3 Distributed Transfer

The MemPool exposes distributed APIs for exchanging data among inference instances. They serve as the building blocks for intra-request and inter-request dependency-exploiting techniques. Our design rationale is to expose simple APIs that mask the underlying heterogeneity from inference engines.

Figure 2 shows the transfer workflow and our approach to handling heterogeneity. We break the workflow into three steps: allocation, transmission, and insertion. When the sender inference instance initiates a transfer, it sends a request to the receiver inference instance. Upon receiving this request, the receiver invokes alloc_mem locally to allocate HBM or DRAM based on the type specified by the sender. The receiver then returns the allocated address list and its parallelism configuration to the sender, completing the allocation step. Then, the sender transmits the KV cache to the receiver using the fastest available path. Once all data is received, the receiver notifies the sender, completing the transmission step. Next, the receiver checks whether this is a transfer_with_insert call. If so, it invokes the insert function locally to insert the newly transmitted prompt tokens and historical KV cache into its local index, completing the insertion step. Finally, the sender completes the transfer API call once the receiver returns ok.

We propose transfer_with_insert because it avoids an extra network round-trip for establishing the mapping, which is particularly useful for transferring historical KV cache from a decode-only instance to a prefill-only instance. Additionally, users can call the transfer API with a specific destination address list, allowing them to skip the initial allocation step. This feature is particularly useful for constructing layer-by-layer transmissions in disaggregated inference (see Figure 5).

The transmission step is the most challenging, as it must deal with three types of heterogeneity between the sender and the receiver: parallelism, memory, and network. To manage asymmetric parallelism, the sender first checks how the KV cache is partitioned along tensor-parallel or pipeline-parallel dimensions. Once determined, the sender partitions its local cache and invokes the appropriate network primitives (top-right in Figure 2). Memory asymmetry can occur if the historical KV cache has been swapped out to DRAM (bottom-right in Figure 2). MemPool always tries to transmit data using the fastest link with the fewest data copies, but this is highly hardware-dependent. On the latest hardware, such as an NVIDIA SuperPOD where all HBM and DRAM are connected by high-speed NVLink, handling memory asymmetry is as simple as performing a memory copy. However, on regular GPU servers, additional memory copies in the data path are inevitable. While implementing ...
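The three-step workflow above (allocation, transmission, insertion) can be sketched as follows; the Instance class and the in-process copy standing in for the real NCCL/socket transmission are our own illustrative assumptions:

```python
class Instance:
    """Toy sender/receiver modeling the transfer workflow:
    1) allocation at the receiver, 2) transmission, 3) optional insertion."""

    def __init__(self, name):
        self.name = name
        self.mem = {}        # addr -> KV block payload
        self.index = {}      # tuple(prompt tokens) -> addrList
        self.next_addr = 0

    def alloc_mem(self, num_blocks):
        addrs = list(range(self.next_addr, self.next_addr + num_blocks))
        self.next_addr += num_blocks
        return addrs

    def insert(self, tokens, addrs):
        self.index[tuple(tokens)] = addrs

def transfer(sender, receiver, src_addrs, tokens=None, with_insert=False):
    # Step 1 (allocation): the receiver allocates and returns dstAddrList.
    dst_addrs = receiver.alloc_mem(len(src_addrs))
    # Step 2 (transmission): the sender copies each block to the receiver;
    # a real system would pick NCCL for HBM-to-HBM or sockets for DRAM.
    for src, dst in zip(src_addrs, dst_addrs):
        receiver.mem[dst] = sender.mem[src]
    # Step 3 (insertion): transfer_with_insert piggybacks the index update,
    # saving an extra round-trip to establish the token -> address mapping.
    if with_insert:
        receiver.insert(tokens, dst_addrs)
    return dst_addrs  # the sender completes once the receiver returns ok

# Historical KV cache flowing from a decode-only back to a prefill-only
# instance, with the receiver-side insert piggybacked:
decode, prefill = Instance("decode-0"), Instance("prefill-0")
decode.mem = {0: "kv-block-0", 1: "kv-block-1"}
dst = transfer(decode, prefill, [0, 1], tokens=[1, 2, 3], with_insert=True)
assert [prefill.mem[a] for a in dst] == ["kv-block-0", "kv-block-1"]
assert prefill.index[(1, 2, 3)] == dst
```

Passing a precomputed dstAddrList into such a helper would skip step 1 entirely, which mirrors the layer-by-layer transmission option described above.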
(Figure: global scheduler, prompt tree, prefill-only and decode-only engines, and their MemPool interactions via insert, match, transfer, and transfer_with_insert, including per-layer K/V transfer between instances.)

We show how to gradually build towards a full-fledged design in four design milestones in Table 4, utilizing five key MemPool APIs as highlighted in Figure 4.
(a) PD-Basic. This is the basic disaggregated inference architecture proposed by DistServe [41] and Splitwise [23]. To realize this design, we make minor changes to an existing inference engine (e.g., vLLM [14]). As a result, the prefill instance will call MemPool's transfer API to transmit [...]

[...] to store the KV cache of 8 tokens. Although paging improves [...]
(Workload CDFs: (c) prompt and generation lengths; (d) percentage of shared prefix, for the ShareGPT, ReAct, and LooGLE datasets.)

We compare four settings: (1) PD denotes PD-colocated. (2) PD-CC denotes PD-colocated with context caching. (3) 1P1D denotes disaggregated inference with a single prefill-only and a single decode-only instance; the numbers can vary. (4) 1P1D-CC denotes 1P1D with context caching (PC-caching-3). Note that PD-colocated runs vanilla vLLM; the other three settings run with MemServe. The request rate is calculated per instance. Assume a 5 req/s rate; then a 1P1D setup will [...]
(Figure 8 panels (g) through (r): average and P99 JCT, TTFT, and TPOT for LooGLE and ReAct at request rates of 1 to 5 req/s.)
Figure 8. End-to-End Evaluation. The x-axis is the request rate per inference instance; 1P1D counts as two instances.
Figure 9. MemPool API Study. (a) The latency of memory APIs with varied numbers of blocks. (b) The latency of key index APIs with varied cache ratios and numbers of blocks.

Figure 10. Caching Study. PD-colocated. Hash is vanilla vLLM. Radix is an adapted vLLM with MemPool.

Figure 11. Network and Memory Layout Optimization Study. T is short for threads; C is short for NCCL communicators. The right figure compares the performance and HBM usage with varied NCCL buffer sizes. The default is 4 MB.

Block Aggregation Study. We study how the proposed memory aggregation helps. We compare two settings: (1) the original discrete memory layout (Original) and (2) the proposed aggregated memory layout (Agg_Block). The test transmits [...] We tune several key NCCL parameters: communicator, stream, buffer size, and threads. Figure 11 presents the results. First, the aggregation method outperforms the vanilla memory layout by a large margin. Second, a single communicator is enough when the memory block is large. When the memory block is smaller, multiple communicators are required for better performance, but as the right figure shows, increasing the number of communicators consumes extra HBM.

By-Req-Agg Study. We run a 1024-prompt-32-decode workload to understand these mechanisms. We vary the request rate and show the results in Figure 12. The proposed by-req-agg outperforms both by-layer and by-req.
Figure 12. Compare By-Layer, By-Req, and By-Req-Agg.

Figure 13. Context Caching Cost Model. All figures have the cached-ratio as the x-axis. Each line represents a different prompt length. All y-axes represent the TTFT improvement over the no-caching case. (a) studies the prompt-length factor. (b) studies the batch-size factor. (c) studies the block-size factor. (d) studies the cached-location factor.

Figure 14. Global Scheduler Policy. Share Ratio represents the ratio of the number of identical requests. The compared policies include RR, Prefix, and Session.

Context Caching Cost Model. Figure 13 presents the results with several key takeaways. (1) The benefit of caching improves with a larger cached-ratio. (2) For the same cached-ratio, longer prompts see higher improvement. (3) Batch size effectively translates to prompt length; hence, we need to consider batch size along with the cached-ratio. (4) When the historical KV cache data is located in DRAM, we must swap it into HBM before using it during prefill. Yet, the benefit of reduced computation largely offsets the cost of data movement. Regardless of where the data is located, TTFT improves once the cached-ratio exceeds a certain threshold.

Global Scheduler Study. We compare the policies listed in Table 6. We selected 80 sessions from LooGLE, roughly 250 requests. We propose a share ratio; a share ratio of 2 means duplicating this set of sessions. While running a 3P1D setup, Figure 14 shows that compared to intra-session scheduling, prompt-tree-based scheduling improves P99 TTFT by 59%, since it maximizes KV cache reuse.

9 Related Work

Our work is unique in proposing a standalone MemPool module and developing a holistic serving system, MemServe, using MemPool APIs.

Disaggregated Inference. Four papers propose the disaggregated inference idea concurrently within a short three-month span: Splitwise [23], TetriServe [12], DistServe [41], and Déjàvu [27]. Generally, disaggregating prefill from decode reduces interference between these two stages and allows each to scale independently with heterogeneous hardware. More recently, LoongServe [29] takes a step further by enabling dynamic scaling. All prior work builds disaggregated inference by modifying the inference engine in an ad-hoc manner. Our work takes a different approach by first abstracting out the MemPool component and then building disaggregated inference as a use case of MemPool.

Context Caching. Caching reduces recomputation, hence reducing TTFT and improving throughput. The benefits of context caching are well studied in Pensieve [37], CacheGen [18], SGLang [40], and Prompt Cache [9]. More recently, Google started a commercial offering of context caching for their Gemini models [10]. All prior work builds context caching in a PD-colocated setup. Using MemPool APIs, we take a step-by-step approach to building the first-ever context caching solution atop disaggregated inference.

Scheduling. Scheduling plays a key role in improving serving efficiency. At the local layer, Orca [36] proposes iteration-level scheduling to reduce bubbles; Sarathi [1, 2] proposes chunked prefill to overcome suboptimal prefill processing; and FastServe [30] utilizes a multi-level priority feedback queue to minimize JCT. At the global layer, MuxServe [7] formulates a multiplexing problem and proposes a novel placement algorithm and adaptive batch scheduling strategy to identify optimal colocations in LLM serving.

Generic Memory Optimization. Many works optimize memory usage, for example, using quantization [5, 6, 8, 13, 16, 26, 31, 34] to compress model weights into lower precision, using paging to reduce fragmentation [14], and applying low-level algorithm and kernel optimizations [3, 4, 11, 19, 28, 35, 39]. We refer readers to [38, 42] for more details.
10 Conclusion

In this paper, we presented MemServe, a novel system designed to enhance the efficiency of LLM serving by unifying inter-request and intra-request optimizations. The core of MemServe is a distributed MemPool that manages KV caches across distributed instances. MemServe builds context caching, disaggregated inference, and their combination using MemPool APIs. End-to-end results show MemServe can improve JCT, TTFT, and TPOT by a large margin.

References

[1] Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming throughput-latency tradeoff in llm inference with sarathi-serve. arXiv preprint arXiv:2403.02310, 2024.
[2] Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, and Ramachandran Ramjee. Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills. arXiv preprint arXiv:2308.16369, 2023.
[3] Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
[4] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 2022.
[5] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.
[6] Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. Spqr: A sparse-quantized representation for near-lossless llm weight compression. arXiv preprint arXiv:2306.03078, 2023.
[7] Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, and Hao Zhang. Muxserve: Flexible multiplexing for efficient multiple llm serving. arXiv preprint arXiv:2404.02015, 2024.
[8] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Optq: Accurate quantization for generative pre-trained transformers. In The [...]
[15] Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang. Loogle: Can long-context language models understand long contexts? arXiv preprint arXiv:2311.04939, 2023.
[16] Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, and Joey Gonzalez. Train big, then compress: Rethinking model size for efficient training and inference of transformers. In International Conference on Machine Learning, 2020.
[17] Bin Lin, Tao Peng, Chen Zhang, Minmin Sun, Lanbo Li, Hanyu Zhao, Wencong Xiao, Qi Xu, Xiafei Qiu, Shen Li, et al. Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache. arXiv preprint arXiv:2401.02669, 2024.
[18] Yuhan Liu, Hanchen Li, Kuntai Du, Jiayi Yao, Yihua Cheng, Yuyang Huang, Shan Lu, Michael Maire, Henry Hoffmann, Ari Holtzman, et al. Cachegen: Fast context loading for language model applications. arXiv preprint arXiv:2310.07240, 2023.
[19] Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. Rammer: Enabling holistic deep learning compiler optimizations with rTasks. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), 2020.
[20] NVIDIA. NCCL. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/overview.html.
[21] NVIDIA. RDMA Verbs. https://docs.nvidia.com/networking/display/rdmaawareprogrammingv17/rdma+verbs+api.
[22] NVIDIA. TensorRT-LLM. https://github.com/NVIDIA/TensorRT-LLM.
[23] Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Aashaka Shah, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. arXiv preprint arXiv:2311.18677, 2023.
[24] Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, and Ashish Panwar. vattention: Dynamic memory management for serving llms without pagedattention. arXiv preprint arXiv:2405.04437, 2024.
[25] ShareGPT teams. https://sharegpt.com/.
[26] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. In International Conference on Machine Learning, 2023.
[27] Foteini Strati, Sara Mcallister, Amar Phanishayee, Jakub Tarnawski, and Ana Klimovic. Déjàvu: Kv-cache streaming for fast, fault-tolerant [...]
Eleventh International Conference on Learning Representations, 2022. generative llm serving, 2024.
[9] In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khan-
[28] Xiaohui Wang, Ying Xiong, Yang Wei, Mingxuan Wang, and Lei Li.
delwal, and Lin Zhong. Prompt cache: Modular attention reuse for Lightseq: A high performance inference library for transformers. arXiv
low-latency inference. Proceedings of Machine Learning and Systems, preprint arXiv:2010.13887, 2020.
6:325–338, 2024. [29] Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe
[10] Google. Context Caching. https://ai.google.dev/gemini-api/docs/ Liu, and Xin Jin. Loongserve: Efficiently serving long-context large
caching?lang=python. language models with elastic sequence parallelism. arXiv preprint
[11] Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun arXiv:2404.09526, 2024.
Liu, Kangdi Chen, Hanyu Dong, and Yu Wang. Flashdecoding++: [30] Bingyang Wu, Yinmin Zhong, Zili Zhang, Gang Huang, Xuanzhe Liu,
Faster large language model inference on gpus. arXiv preprint and Xin Jin. Fast distributed inference serving for large language
arXiv:2311.01282, 2023. models. arXiv preprint arXiv:2305.05920, 2023.
[12] Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang [31] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth,
Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, and Song Han. Smoothquant: Accurate and efficient post-training
et al. Inference without interference: Disaggregate llm inference for quantization for large language models. In International Conference
mixed downstream workloads. arXiv preprint arXiv:2401.11181, 2024. on Machine Learning, 2023.
[13] Berivan Isik, Hermann Kumbong, Wanyi Ning, Xiaozhe Yao, Sanmi
[32] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W
Koyejo, and Ce Zhang. Gpt-zip: Deep compression of finetuned large Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa:
language models. In Workshop on Efficient Systems for Foundation A dataset for diverse, explainable multi-hop question answering. arXiv
Models@ ICML2023, 2023. preprint arXiv:1809.09600, 2018.
[14] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin [33] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik
Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting
Efficient memory management for large language model serving with in language models. arXiv preprint arXiv:2210.03629, 2022.
pagedattention. In Proceedings of the 29th Symposium on Operating
Systems Principles, 2023.
12
[34] Zhewei Yao, Cheng Li, Xiaoxia Wu, Stephen Youn, and Yuxiong He. A [39] Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang
comprehensive study on post-training quantization for large language Zhang, Zizhong Chen, Xin Liu, and Yibo Zhu. Bytetransformer: A
models. arXiv preprint arXiv:2303.08302, 2023. high-performance transformer boosted for variable-length inputs. In
[35] Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, 2023 IEEE International Parallel and Distributed Processing Symposium
Conglong Li, and Yuxiong He. Zeroquant: Efficient and affordable (IPDPS), 2023.
post-training quantization for large-scale transformers. Advances in [40] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue
Neural Information Processing Systems, 2022. Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E
[36] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, Gonzalez, et al. Efficiently programming large language models using
and Byung-Gon Chun. Orca: A distributed serving system for sglang. arXiv preprint arXiv:2312.07104, 2023.
{Transformer-Based} generative models. In 16th USENIX Sympo- [41] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xu-
sium on Operating Systems Design and Implementation (OSDI 22), 2022. anzhe Liu, Xin Jin, and Hao Zhang. Distserve: Disaggregating prefill
[37] Lingfan Yu and Jinyang Li. Stateful large language model serving with and decoding for goodput-optimized large language model serving,
pensieve. arXiv preprint arXiv:2312.05516, 2023. 2024.
[38] Pai Zeng, Zhenyu Ning, Jieru Zhao, Weihao Cui, Mengwei Xu, Liwei [42] Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao
Guo, Xusheng Chen, and Yizhou Shan. The cap principle for llm Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, et al. A
serving. arXiv preprint arXiv:2405.11299, 2024. survey on efficient inference for large language models. arXiv preprint
arXiv:2404.14294, 2024.
13