0% found this document useful (0 votes)

50 views13 pages

Memserve: Context Caching For Disaggregated LLM Serving With Elastic Memory Pool

Uploaded by

dr.huangxi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

50 views13 pages

Memserve: Context Caching For Disaggregated LLM Serving With Elastic Memory Pool

Uploaded by

dr.huangxi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 13

MemServe: Context Caching for Disaggregated LLM

Serving with Elastic Memory Pool

Cunchen Hu2,3 , Heyang Huang2,3 , Junhao Hu4 , Jiang Xu1 , Xusheng Chen1 , Tao Xie4 ,
Chenxi Wang2,3 , Sa Wang2,3 , Yungang Bao2,3 , Ninghui Sun2,3 , Yizhou Shan1

1 Huawei Cloud, 2 UCAS 3 ICT, CAS 4 Peking University

Abstract cache across requests, extending its lifetime from a single
Large language model (LLM) serving has transformed from request to potentially infinite. Intra-request methods transfer
arXiv:2406.17565v2 [cs.DC] 26 Jun 2024

stateless to stateful systems, utilizing techniques like context the KV cache across multiple inference instances, extending
caching and disaggregated inference. These optimizations ex- its domain from a single instance to distributed instances.
tend the lifespan and domain of the KV cache, necessitating a However, deploying a stateful LLM serving system with these
new architectural approach. We present MemServe, a unified optimizations is challenging due to conflicting or missing
system that integrates both inter-request and intra-request mechanisms for managing the LLM’s intermediate KV cache
optimizations. MemServe introduces MemPool, an elastic data. We have identified two key problems.
memory pool managing distributed memory and KV caches The first problem is that LLM serving systems cannot
across serving instances. Using MemPool APIs, MemServe simultaneously apply any existing inter-request and intra-
combines context caching with disaggregated inference for request dependency-exploiting optimizations. Current con-
the first time, supported by a global scheduler that enhances text caching (inter-request) methods are designed without
cache reuse through a global prompt tree-based locality- considering intra-request scenarios. As a result, disaggre-
aware policy. Tests show that MemServe significantly im- gated inference (intra-request) cannot benefit from context
proves job completion time and time-to-first-token. caching because it lacks the mechanisms to utilize the KV
cache from decode back to prefill instances for future reuse.
Similarly, sequence parallelism distributes the KV cache
1 Introduction across multiple instances and lacks the mechanisms and
Large language models (LLMs) and their underlying trans- algorithms needed to preserve and reuse it. This issue arises
former architecture have revolutionized AI, becoming foun- because intra-request techniques break a tightly coupled
dational to many emerging applications and a crucial work- request into multiple loosely coupled sub-requests, compli-
load in data centers. While high-quality models are essential, cating KV cache management in a distributed setting.
it is equally important to serve these models on a massive The second problem is that LLM serving systems lack a
scale at a reasonably low cost. As a result, numerous ap- holistic, top-down design to effectively utilize existing inter-
proaches have been proposed to enhance the cost-efficiency request techniques. Context caching benefits from reusing
of LLM serving, such as context caching [37, 40], disaggre- historical KV cache by running requests that share a com-
gated inference [23, 41], and sequence parallelism [17]. mon prefix in the same serving instance. However, current
As a result, LLM serving has evolved from a stateless to a LLM serving systems schedule requests across multiple serv-
stateful system, leveraging dependencies inherent in infer- ing instances based on load or session IDs, which fails to
ence requests. These dependency-exploiting techniques can maximize KV cache reuse across sessions.
be classified into two types: inter-request and intra-request. These issues arise because existing LLM serving systems
Inter-request techniques exploit dependencies across requests. are built on the assumption that the KV cache is merely inter-
The notable one is context caching [40], which reuses the KV mediate data scoped to a single request on a single instance.
cache for requests that share the same prompt prefix, thereby With emerging dependency-exploiting techniques, the lifes-
speeding up the prefill phase. Intra-request techniques, on pan of the KV cache has been extended, and its management
the other hand, exploit dependencies within a single request. has expanded to a distributed setup. This paradigm shift calls
Two prominent examples are disaggregated inference, which for a fundamental rethinking of LLM serving architectures.
splits a request into two sub-requests for better schedul- In this work, we propose Memory-enhanced model Serv-
ing [23], and sequence parallelism, which divides a request ing, or MemServe, to handle inter-request and intra-request
into multiple sub-requests to distribute load [17]. optimizations within a unified system. To tackle the chal-
A common theme in these dependency-exploiting tech- lenges of managing the KV cache across distributed instances,
niques is that they require novel logic to manage and transfer MemServe introduces an elastic memory pool, or MemPool,
the KV cache, which is the intermediate data produced dur- which is a substrate for managing all cluster memory, in-
ing LLM inference. Inter-request methods preserve the KV cluding CPU DRAM and GPU HBM. MemPool offers a rich
1
set of APIs for managing distributed memory and KV cache. and relatively short generation lengths, disaggregated infer-
Utilizing these APIs, MemServe implements context caching ence boosts JCT by up to 10.8% compared to PD-colocated
over standard prefill-decode-colocated (PD-colocated) in- setups. Additionally, context caching offers further enhance-
stances [40] and disaggregated inference [12, 41]. Moreover, ments, potentially improving JCT by 26.9%.
MemServe enhances disaggregated inference with context In summary, we make the following contributions:
caching, reaping both benefits. Finally, to maximize KV cache
reuse, MemServe employs a global scheduler that incorpo- • We propose MemPool, a memory pool designed for
rates a locality-aware policy using novel global prompt trees LLM serving with a rich set of APIs.
for best-effort routing. • We build the first disaggregated inference with context
The MemPool is a core component of MemServe, provid- caching in MemServe based on MemPool APIs.
ing three types of APIs: memory, indexing, and distributed • We propose a novel prompt trees-based locality-aware
data transfer. It runs within each inference instance, man- policy for scheduling LLM requests.
aging all local memory with a fixed-size memory allocator.
The indexing APIs are crucial for building context caching. 2 Background
MemPool uses an internal index to map prompt tokens to the Generative LLM Inference LLM inference involves gen-
KV cache, managing both the active KV cache for ongoing erating a sequence of output tokens in response to an input
requests and the historical KV cache retained after requests prompt. This process consists of two distinct phases: prefill
are completed. The MemPool offers a simple data transfer and decode. During the prefill phase, the model processes the
API that abstracts three heterogeneities: parallelism, net- prompt to generate the key-value (KV) cache. The KV cache
work, and memory medium. As a unified platform, MemPool comprises key-value pairs generated in the self-attention
supports all known inter-request and intra-request optimiza- mechanism. In the decode phase, the model uses the KV
tions as well as any combinations (see Figure 3). cache to generate tokens iteratively. The size of the KV cache
MemServe bridges the gap between context caching (inter- grows linearly with increasing number of generated tokens.
request) and disaggregated inference (intra-request) in four Inter-Request Optimization This type of optimization
steps using MemPool APIs: (a) we first use a distributed exploits dependencies among requests for better performance.
API to reproduce disaggregated inference, (b) we then add Context caching is the only known technique in this cate-
caching to prefill-only instances using index APIs, (c) we gory. To build context caching, the model stores and reuses
apply the same caching to decode-only instances, (d) finally the KV cache from the self-attention mechanism to avoid
we enable decode-to-prefill data transfer, as illustrated in redundant computations across similar or repeated requests.
Figure 4. However, it is challenging to hit two birds with one This is useful in scenarios where multiple requests share
stone. We observed increasing overheads due to naive dis- common prefixes or contexts. Two mechanisms are essen-
crete memory layouts and point-to-point network primitives tial. First, an index is required to find dependencies among
from existing AI network stacks. To address this, MemServe requests and consequently find the preserved KV cache (see
proposes co-optimizing memory layout and network transfer Table 2). Second, a modified inference engine and attention
using huge pages. kernel to reuse the historical KV cache (see SGLang [40]).
We implement MemPool and global scheduler from scratch, Intra-Request Optimization This type of optimization
5.6K SLOC in Python and 1.6K SLOC in C++ We modify exploits dependencies within a request to enhance perfor-
vLLM [14] to build context caching with disaggregated in- mance. Two notable examples are disaggregated inference [12,
ference, 200 SLOC in Python and 400 SLOC in CUDA C++ 23, 27, 41] and sequence parallelism [17]. Generally, disag-
We use NCCL send and recv pairs for data transmission gregating prefill from decode reduces interference between
between GPUs and socket if any side is DRAM. these two stages and allows each to scale independently with
We run all tests atop a single server with eight H800-80G heterogeneous hardware. However, this breaks a single re-
GPUs. We evaluate MemServe across four settings: (1) PD- quest into two sub-requests and requires rigorous KV cache
colocated, (2) PD-colocated with caching, (3) PD-disaggregated, transmission from prefill to decode. The same goes for se-
and (4) PD-disaggregated with caching. The first setting runs quence parallelism, in which distributed instances need to
a vanilla vLLM. The last three settings are MemServe run- exchange the outputs of self-attention in a rigorous man-
ning adapted vLLM using MemPool APIs. While running ner. Overall, intra-request optimization demands efficient
ShareGPT workload [25], the PD-disaggregated with caching mechanisms for transferring the KV cache among instances.
setting outperforms others. Specifically, MemPool-based dis-
aggregated inference improves JCT by up to 42% compared
to PD-colocated. Enhancing disaggregated inference with 3 MemServe Overview
context caching can further improve JCT by 29%! When exe- MemServe is designed as a large-scale LLM serving system
cuting the LooGLE dataset, which features extended prompts that efficiently handles inter-request and intra-request op-
timizations. It comprises three main components: a global
2
Inference Request Inference Response

Global Prompt Tree

Global
Request Streaming Cluster
Locality-Aware Match Update
Scheduler Scheduling Output Management Heartbeat

Prefill-Only
Prefill-OnlyInstance
Instance Decode-Only
Decode-OnlyInstance
Instance PD-Colocated
PD-ColocatedInstance
Instance
CPU CPU CPU
CPU CPU CPU
DRAM DRAM DRAM
DRAM DRAM DRAM
Infer Mem H-KV Infer Mem H-KV Infer Mem H-KV
Infer Mem H-KV Infer Mem H-KV Infer Mem H-KV
Engine Pool Engine Pool Engine Pool
Engine Pool A-KV Engine Pool A-KV Engine Pool A-KV
A-KV A-KV A-KV

HBM
HBM Transfer HBM
HBM Transfer HBM
HBM
HBM
HBM HBM
HBM HBM
HBM
GPU Historial
Historial Active
Active KV GPU Historial
Historial Active
Active KV GPU Historial
Historial Active
Active
GPU
GPU Historial
Historial
Historial Active
Active
Active GPU
GPU Historial
Historial
Historial Active
Active
Active GPU
GPU Historial
Historial
Historial Active
Active
Active
GPU KV
Historical
KV KV
Active
KV GPU KV
Historical
KV KV
Active
KV GPU KV
Historical
KV KV
Active
KV
KV
KV
KV KV
KV
KV KV
KV
KV KV
KV
KV KV
KV KV
KV
KV KV KV KV KV KV

Interconnect (NVLINK/PCIe/RoCE/HCCS/UB)

Figure 1. MemServe Architecture. It supports three types of inference instances: prefill-only, decode-only, and PD-colocated.
Each inference engine runs over one or multiple AI servers, depending on the parallelism configuration.

scheduler, multiple types of inference instances, and an elas- Table 1. Elastic Memory Pool APIs. Type can be HBM-only,
tic memory pool (MemPool), as shown in Figure 1. The Mem- DRAM-only, or mixed. Each address encodes instance ID.
Pool offers a set of APIs for memory allocation, index man- Transfer flags can control on-demand allocation.
agement, and distributed transfer (§4). MemServe builds con-
text caching atop both regular and disaggregated inference API Parameters Description
architectures using MemPool APIs (§5). The global scheduler alloc_mem size, type, id alloc a certain type of memory on a
forwards inference requests from users to the right infer- given instance (@id), return addrList
free_mem addrList free memory
ence instance. It uses locality-aware policies based on novel
insert tokenList, ad- insert prompt token and KV cache
distributed prompt trees, maximizing KV cache reuse (§6). drList, flags address mapping into local index
match tokenList find prompt’s cached data if any, re-
4 Elastic Memory Pool turn addrList
delete tokenList delete prompt’s cached data if any
The MemPool manages all memory in the inference clus-
swap_out num_blocks swap a given number of blocks from
ter, including CPU DRAM and GPU HBM. MemPool runs HBM to DRAM
within each inference instance, collectively offering a set of swap_in addrList swap blocks with given address from
distributed memory pool APIs (§4.1). It manages both the DRAM to HBM
active KV cache used by ongoing requests and the historical transfer id, srcAddrList, transfer data to the specified in-
KV cache retained after requests are completed. An index- dstAddrList, stance (@id), dstAddrList is optional,
flags, private flags control behaviors at the desti-
ing layer maps prompt tokens to the historical KV cache
nation, and private carries user data
(§4.2), ensuring efficient retrieval of cached data. The Mem- transfer_ id, tokenList, transfer tokenList and its cached
Pool has efficient mechanisms for exchanging data between with_insert srcAddrList, data to the specified instance. The
instances, alleviating inference engines from dealing with dstAddrList, receiver will call an extra insert.
heterogeneous hardware (§4.3). Overall, this design makes flags, private
MemPool a versatile and generic platform capable of sup-
porting both intra-request and inter-request optimizations
within a unified system (§4.5). The engine can also call the index APIs for context caching
solutions. For example, once requests are finished, the en-
4.1 API gine can call insert to transition the active KV cache into
We show MemPool APIs in Table 1, broadly divided into three the historical KV cache and create a mapping from prompt
categories: memory block, index, and distributed transfer. tokens to the KV cache. The engine can invoke distributed
The inference engine can use memory block APIs to allocate APIs, such as transfer, to exchange the KV cache across in-
fixed-size memory blocks for storing KV cache or other data. stances when building inter-request optimizations. Overall,
3
Table 2. Compare Indexing Methods. MemServe’s MemPool Sender Receiver Sender Receiver
(TP=1) (TP=2)
uses prompt tokens for its generality.
alloc_mem scatter
Indexing Description DstAddrList / TP

transfer-with-insert
Token ID Use prompt tokens. Universally applicable.

transfer
Transmit
Session ID Use client-server session ID. Limited scope. (b)
Document ID Use document file ID. Limited scope.
Sender Receiver

HBM HBM
insert
the MemPool provides a rich framework for managing dis-
tributed memory and implementing efficient context caching DRAM DRAM

and data exchange mechanisms.

(a) (c)

4.2 Indexing
Figure 2. MemPool Transfer API. The left shows the work-
The MemPool has an index layer to map prompt tokens
flow of transfer and transfer_with_insert. The right
to the historical KV cache. MemPool traverses the index
shows asymmetric parallelism and memory medium.
whenever engines call insert, match, delete, etc. The LLM
serving world has three indexing methods: token, session,
and document IDs (see Table 2). Token-based indexing is
known for its generality, as it works for any shared prompt- and its parallelism configuration to the sender, completing
prefix cases [40]. The session and document ID indexing are the allocation step. Then, the sender transmits the KV cache
simpler but can only reuse shared prompts within a chat to the receiver using the fastest available path. Once all data
session or across sessions using the same document [10, is received, the receiver notifies the sender, completing the
29]. We adopt the token-based indexing method for broad transmission step. Next, The receiver checks whether this is
applicability. To implement this index, MemPool utilizes the a transfer_with_insert call. If so, it invokes the insert
radix tree proposed by SGLang [40], with two key extensions. function locally to insert the newly transmitted prompt to-
First, because MemPool manages both GPU HBM and CPU kens and historical KV cache into its local index, completing
DRAM, we enable the radix tree to reference data located the insertion step. Finally, the sender completes the transfer
anywhere in the system. Second, since we also use the radix API call once the receiver returns ok.
tree to build the global prompt tree in the global scheduler We propose the transfer_with_insert as it can avoid an
(§6), we add a field to indicate which inference instance extra network round-trip for establishing mapping, which is
holds the data. Note that while mixed indexing methods are particularly useful for transferring historical KV cache from
possible, we will explore this in future work. a decode-only instance to a prefill-only instance.
To minimize data reshaping overhead, we maintain the Additionally, users can call the transfer API with a spe-
original memory layout when transitioning the active KV cific destination address list, allowing them to skip the ini-
cache to the historical KV cache before inserting it into Mem- tial allocation step. This feature is particularly useful for
Pool. Consequently, MemPool’s indexing granularity aligns constructing layer-by-layer transmissions in disaggregated
with the inference engines’ configuration. For example, in inference (see Figure 5).
our tests with vLLM, which uses a block size of 16 tokens, The transmit step is the most challenging as it must deal
our radix tree nodes point to KV cache blocks of 16 tokens. with three types of heterogeneities between the sender and
the receiver: parallelism, memory, and network. To man-
4.3 Distributed Transfer age asymmetric parallelism, the sender first checks how the
The MemPool exposes distributed APIs for exchanging data KV cache is partitioned along tensor-parallel or pipeline-
among inference instances. They serve as the building blocks parallel dimensions. Once determined, the sender partitions
for intra-request or inter-request dependency-exploiting tech- its local cache and invokes the appropriate network primi-
niques. Our design rationale is to expose simple APIs that tives (top-right in Figure 2). Memory asymmetry can occur
mask the underlying heterogeneity from inference engines. if the historical KV cache has been swapped out to DRAM
Figure 2 shows the transfer workflow and our approach to (bottom-right in Figure 2). MemPool always tries to transmit
handling heterogeneity. We break down the workflow into data using the fastest link with the least data copies. But
three steps: allocation, transmission, and insertion. When this is highly hardware-dependent. If MemPool uses the lat-
the sender inference instance initiates a transfer, it sends a est hardware, such as NVIDIA SuperPOD, where all HBM
request to the receiver inference instance. Upon receiving and DRAM are connected by high-speed NVLINK, handling
this request, the receiver invokes alloc_mem locally to al- memory asymmetry is as simple as performing a memory
locate HBM or DRAM based on the type specified by the copy. However, on regular GPU servers, additional memory
sender. The receiver then returns the allocated address list copies in the data path are inevitable. While implementing
4
Global Distributed Index APIs
Prefill-Only Decode-Only
Scheduler Prompt Tree
Dist APIs Engine Engine
Insert
2 Insert 4
2 3 match
PD Prefill-Only Decode-Only SP SP
MemPool 1 transfer
MemPool
Instance Instance Instance Instance Instance
A-KV A-KV
Caching 1 Caching Caching Caching Caching index
3 transfer_w_insert index
H-KV 5 transfer_w_insert H-KV

Elastic Memory Pool

KV $ KV $ KV $ GPU HBM
Index Figure 4. Enhancing Disaggregated Inference with Context
KV $ KV $ KV $ CPU DRAM Caching using MemPool APIs. The engine box means an
adapted inference engine such as vLLM. Circled numbers
Figure 3. Use Cases Enabled By MemPool. Circle 1 is con- mean steps taken to build the solution. A-KV is active KV
text caching. Circle 2 is disaggregated inference. Circle 3 is cache. H-KV is historical KV cache.
sequence parallelism. Solid gray lines mean MemPool index
API calls. Solid red lines mean MemPool distributed APIs.
MemPool enables all use cases in one platform. Table 4. Towards Full-Fledged Context Caching in Disag-
gregated Inference. Refer to Figure 4 for step numbers.

Design Steps Description

Table 3. Atomic Serving Scenarios Supported by MemPool. PD-Basic 1 Basic PD, no caching
As a unified platform, MemPool supports any combo of inter- PD-Caching-1 1+2 Caching at P
request or intra-request optimizations in one system. PD-Caching-2 1+2+3+4 Caching at D
PD-Caching-3 1+2+3+4+5 Full-fledged caching

Scenario Type APIs Used

Context Caching inter index (insert,match,delete,evict,etc) 4.5 Use Cases
Disagg. Inference intra dist (transfer,transfer_with_insert)
Sequence Parallel intra dist (transfer)
MemPool is a versatile and generic platform designed to
Request Migration N/A dist (transfer) support both inter-request and intra-request dependency-
exploiting techniques. Figure 3 illustrates how various ex-
isting inter-request and intra-request optimizations can be
integrated into a unified system using MemPool APIs. Table 3
MemPool distributed APIs, we find existing network primi- lists the advanced APIs employed in these optimizations. To
tives ill-fit for handling emerging LLM inference workloads. build context caching atop regular PD-colocated inference
We will discuss this in §7. instances, one can use index APIs such as insert and match.
To build disaggregated inference, one can call transfer API
4.4 Failure Handling and Scaling to send active KV cache from a prefill-only to a decode-only
We discuss how MemPool handles failures and dynamic scal- instance. To build sequence-parallel (SP) inference, one can
ing during runtime. As Figure 1 shows, MemPool is deployed use the transfer API to exchange attention outputs among
as part of an inference instance. Hence, the failure and scaling instances, akin to InfiniteLLM [17].
granularity is a single instance, which can be one or multiple What sets MemPool apart is its ability to seamlessly enable
AI servers, depending on the parallelism configuration. these optimizations within a single system using a common
MemServe has a cluster management (CM) module as set of APIs. Next, we will showcase how to enhance dis-
shown in Figure 1, which is a centralized service for maintain- aggregated inference with context caching. We leave other
ing cluster configuration. The CM is responsible for adding combinations for future work.
or removing instances and monitoring cluster health. In our
current design, memory block and distributed transfer APIs 5 Caching for Disaggregated Inference
can modify the states of remote instances. When an instance Context caching exploits dependency across requests, while
fails, any in-flight requests from other instances will time disaggregated inference exploits dependency within a re-
out. The CM detects such failures through regular heartbeats quest. However, they fail to coexist due to missing mecha-
and broadcasts updated cluster information to all running nisms around KV cache management. We enhance disaggre-
instances. Upon receiving this notification, each instance gated inference with context caching using MemPool APIs.
releases any memory blocks allocated by the failed instance To the best of our knowledge, this is the work introducing
to prevent memory leaks. caching to disaggregated inference.
5
5.1 Design By Layer By Req By Req-Agg

We show how to gradually build towards a full-fledged de- layer 1 K V layer 1 K V layer 1
layer 2 K V layer 2 K V layer 2
KV
sign in four design milestones in Table 4, utilizing five key .. ..
layer N K V layer N K V layer N
MemPool APIs as highlighted in Figure 4.
(a) PD-Basic. This is the basic disaggregated inference P D P D P D
architecture proposed by DistServe [41] and Spliwise [23].
TTFT
To realize this design, we make minor changes to an ex-
isting inference engine (e.g., vLLM [14]). As a result, the
prefill instance will call MemPool’s transfer API to trans- TTST

fer the active KV cache produced after the prefill phase to

the decode instance. We carry essential metadata the decode
By-layer By-Req By-Req-Agg
instance requires in transfer’s private field, such as request Num. of Transfer API 2*L 1 1
Num. of Network API 2*L 2*L 1
ID, sampling parameters, prompt tokens, etc.
(b) PD-Caching-1. This is the first caching-enhanced
Figure 5. Optimize Network&Memory for Disagg. Inference.
disaggregated inference design. We enable caching at the
prefill-only instance by calling insert to retire the active KV
cache as the historical KV cache such that future inferences
can utilize the saved data to reduce recomputation (step 2 and the benefit of context caching increases linearly with
in Figure 4). This caching design only preserves historical the number of turns.
KV cache produced by the prefill phase but none from the In all, we illustrate how MemPool’s simple APIs can be
decode phase, so it works well for workloads that share long used to build a range of advanced solutions, from basic disag-
common prefix prompts, e.g., system prefix [40]. The major gregated inference to full-fledged context caching. Neverthe-
drawback of this design is that in a multi-turn chat scenario less, MemPool only provides primitves for transferring data
(e.g., document QA [15]), the prefill-only instance needs to among inferences and managing local data. It is up to the
repeatedly forward the same set of active KV cache to the users (e.g., inference engines) to decide the memory layout
decode-only instance, wasting bandwidth and affecting the and the number of API calls. Next, we will discuss design
time-to-second-token. We therefore propose the next design challenges around memory and network.
milestone to address this issue.
5.2 Memory and Network Optimization
(c) PD-Caching-2. This design enables caching at the
decode-only instance to reduce repeated data movement. We discuss how memory and network play a key role in
We make two key changes atop PD-Caching-1. First, the building context caching with disaggregated inference. As
prefill-only instance now calls transfer_with_insert in- Splitwise [23] points out, there are two ways to transfer ac-
stead of transfer such that the decode-only instance will tive KV cache from prefill to decode instance: by-layer or
insert the transmitted KV cache produced by the prefill phase by-request. The by-layer approach transfers the KV cache
into its local index (§4.3). Second, after a request finishes, once a layer has finished computation. The by-request ap-
the decode-only instance calls insert to preserve the KV proach transfers the KV cache once the prefill phase is com-
cache produced by the decode phase into its local index. With pleted. Splitwise found that by-layer outperforms by-request
the help of locality-aware scheduling (§6), the prefill-only because it overlaps computation and communication, hence
instance now only needs to transfer new KV cache data incre- speeding up time-to-second token (or TTST). We make the
mentally. Though this design reduces data movement from same observation when the load is low. However, both incur
prefill-only to decode-only instances, it does not improve non-trivial overhead with increasing load due to excessive
context caching at the prefill-only instance since it lacks the network transfers. We find the root causes are (1) discrete
historical KV cache from the decode phase. As a result, the memory layout and (2) inadequate network primitives.
benefit of context caching stays flat with increasing prompt Paging-based dynamic memory management introduced
in a multi-turn chat scenario. We therefore propose the next by PagedAttention [14] is now the de facto standard in LLM
design milestone to address this issue. serving systems, e.g., in vLLM [14], TensorRT-LLM [22]. Re-
(d) PD-Caching-3. This design enables full-fledged con- gardless of where the paging mechanism is implemented
text caching for disaggregated inference architecture. We (engine [14] or driver [24]), the KV cache is partitioned and
make one change atop PD-Caching-2: after a request finishes, stored in fixed-sized memory blocks. The block size is con-
the decode-only instance calls transfer_with_insert to figurable, usually in the number of tokens, e.g., 8 tokens
transmit the KV cache produced by the decode phase to worth of KV cache per block. Existing engines manage the
the prefill-only instance (step 5 in Figure 4). As a result, the KV cache in a fine-grained manner. For example, vLLM allo-
prefill-only instance’s preserved historical KV cache grows, cates two blocks per LLM layer. Given an LLM model with 𝐿
layers and 8 tokens per block, the engine needs 2 ∗ 𝐿 blocks
6
Table 5. Context Caching Cost Model Factors.
prompt tokenizer
IDs
Factor Description
Global Prompt Trees
Prompt-Length The length of the current prompt.
Prefill-Only Decode-Only PD-Colocated
Cached-Ratio The ratio of cached tokens Trees Trees Trees
Streaming
Cached-Locations Historical KV cache locations. Output
Update
Batch-Size The running batch size.
Block-Size Size of paging blocks.

AddrList
Responses
Load
to store the KV cache of 8 tokens. Although paging improves Table
Policy Module

utilization [14], the discrete memory layout presents huge

challenges when implementing disaggregated inference us- (Instance, KV-AddrList)

ing existing AI network stacks.

The de facto network stack in AI is collective libraries Figure 6. Global Scheduler Architecture. We highlight the
such as NCCL [20]. These libraries work best for typical global prompt trees-based locality-aware scheduling, it con-
AI workloads using tensor or pipeline parallelism, but they sists of a lookup path (left) and an update path (right).
fall short in supporting LLM serving’s intra-request opti-
mizations such as disaggregated inference [12] or sequence
parallel [17]. These new patterns require efficient point-to- Table 6. Global Request Scheduling Policies. We compare
point, gather, and scatter primitives between HBM or DRAM, whether they improve inter-session or intra-session context
similar to RDMA verbs [21]. As discussed in §7, we imple- caching. A session can be an HTTP session.
ment transfer using NCCL send and recv APIs, and each
Name Intra-Session Inter-Session
call only transmits a single block. Since the KV cache is dis-
crete, the number of network API calls equals the number of Least Load N N
Session-ID-Based Y N
discrete memory blocks, regardless of whether the by-layer
Prompt-Tree-Based Y Y
or by-request approach is used. This is the root cause of why
both incur overhead with increasing load.
To address challenges caused by paging and poor network the instance(s) to run the request and locates the historical
primitives, we propose to reduce fragmentation by aggregat- KV cache. Next, requests arrive at a specific instance for
ing smaller KV blocks into large ones, akin to using huge prefill. We call MemPool’s match API to check whether the
pages. Specifically, instead of having two blocks per layer, new request has locally cached data. If so, the match returns
we aggregate them into one block; the new block size equals a list of local block addresses. Along with the historical KV
2 ∗ 𝐿 smaller blocks. This effectively reduces the number of cache list sent from the global scheduler, we now know
network API calls by 2 ∗ 𝐿 times. This optimization works the cache ratio and cache locations for a specific request.
only for the by-request approach, as the by-layer approach Then, instead of invoking the cost model for each request
inevitably needs to call the network APIs at least 𝐿 times. Our individually, the instance’s local request scheduler groups
test shows this technique improves network performance multiple requests to consider the Batch Size factor for a single
alone by a large margin, as shown in Figure 11. cost model invocation. This is because the same cache ratio
We compare by-layer, by-request, and by-request-agg (pro- has varied benefits depending on the batch’s prompt length.
posed optimization) in Figure 5 across memory layout and Finally, assuming the cost model decides to reuse context
transmission timeline. Our test shows that under low load, caching for a batch of requests, the instance’s scheduler
by-layer achieves the lowest JCT, but under high load, by- moves missing data from its DRAM or remote instances to
layer-agg outperforms by-layer thanks to reduced network its HBM using MemPool APIs. The instance starts the prefill
calls, as shown in Figure 12. phase once all historical KV cache is in the HBM.

5.3 Cost Model for Context Caching 6 Locality-Aware Global Scheduling

We design a cost model for prefill-only instances (and PD- In this section, we describe MemServe’s global scheduler (GS).
colocated instances) to decide whether context caching is The GS routes requests from external services to underlying
beneficial. The cost model concerns factors listed in Table 5. inference instances and returns generated responses in a
Our cost model employs fitted curves derived from experi- streaming fashion. To improve context caching at a large
mental data in Figure 13. scale, we propose global prompt trees and a locality-aware
Requests arrive at MemServe go through the following scheduling policy. Figure 6 shows GS’s architecture.
steps. First, requests pass through the global scheduler’s Global Prompt Trees. Since MemServe runs three types
locality-aware scheduler (described in §6), which determines of inferences, the GS employs three types of prompt trees, for
7
prefill-only, decode-only, and PD-colocated instances. Each Table 7. Workloads Used in Our Work.
tree type has a set of radix trees, the same as the ones used
by MemPool with one extra field per tree node pointing to Type Workload Description
the instance storing the KV cache. The global prompt trees Chat ShareGPT Chat history with ChatGPT
support regular insert and match APIs as listed in Table 1. QA LooGLE Long document QA
Agent ReAct Agent with acting & reasoning
For now, both GS’s global prompt tree and each inference’s
local prompt tree share the same indexing granularity.
Scheduling. When a request arrives at the GS, it goes we only have a single AI machine. As we’ve mentioned ear-
through the following steps. First, the GS runs a tokenizer to lier, NCCL is a collective library designed for typical tensor
turn prompt strings into token IDs. Second, the GS queries and pipeline parallel AI workloads but not ideal for point-to-
the global prompt tree by concurrently calling match against point communication. Specifically, its send and recv APIs
all types of trees. Third, the GS sends query results along only specify source addresses but no destination addresses.
with current load info to a policy module. The policy module Hence, ensuring ordering between a sender and a receiver is
chooses an instance with the longest common prefix (i.e., the challenging, especially if we aim to achieve high parallelism
largest preserved historical KV cache). Once the instance is using multiple threads. As a result, we end up using a single
chosen, the GS checks whether there exist instances storing thread per NCCL communicator to ensure ordering. Addi-
extra historical KV cache that is not present in the chosen tionally, as NCCL has no gather or scatter APIs, we call the
instance. If so, the policy engine also outputs a list of such send-recv API pair multiple times to transmit data across
instances and the corresponding token IDs. Finally, the GS heterogeneous parallelism instances (Figure 2).
sends the request and metadata to the chosen instance. We Context Caching with Disaggregated Inference. We
update the global prompt trees when instances return re- adapt vLLM [14] to using MemPool APIs. Specifically, we re-
sponses back to callers. place its original cache engine and hash-based prefix caching
Discussion. (1) Our proposed prompt-tree-based policy with MemPool. To realize block aggregation, we modify sev-
is a best-effort scheduling policy. It tries to maximize context eral CUDA kernels such as the paged_attention, swap_blocks,
caching reusing opportunities. Since the GS only updates reshape_and_cache.
its prompt tree when responses pass through it, the GS is
unaware of local eviction events in underly instances. There- 8 Evaluation
fore, the GS’s local prompt tree can be outdated. We address
this issue by configuring the global prompt trees with a time- 8.1 Setup
to-live (TTL), commonly in minutes. (2) We compare three We describe the physical server, baseline systems, and LLM
global request scheduling policies in Table 6 across two di- model used in our evaluation.
mensions. The least-load policy means selecting the instance Server. We run all tests on a single NVIDIA DGX H800
with the least load, unaware of any locality. The session-ID- server. It has 8 H800-80GB GPUs interconnected by NVLink
based policy routes requests based on a connection ID (e.g., (400GB/s bandwidth). It has 192-core Intel Xeon Platinum
HTTP session). This policy enables context caching within CPUs @2.4 GHz and 2 TB DRAM. We use Ubuntu 20.04 with
a session. Our prompt-tree-based policy can exploit caching Linux kernel 5.16.7 and CUDA 12.2.
opportunities across sessions, reusing most context caching. Baseline. We use vLLM-0.4.0 as our baseline for running
PD-colocated instances. MemServe uses the version for build-
ing context caching enhanced disaggregated inference.
Model. We use Llama2-13B with tensor parallel (TP) con-
7 MemServe Implementation figured as 2 for all our tests. We use this model size mainly
MemServe has three key parts: MemPool, context caching because it allows us to create four inference instances within
with disaggregated inference, and global scheduler. We im- a single server; a larger model would lead to fewer instances.
plement MemPool from scratch, 5K SLOC in Python and Metrics. For end-to-end benchmarking, we report the fol-
1.6K SLOC in C++. We modify vLLM [14] to build context lowing metrics: time-to-first-token (TTFT), job completion
caching with disaggregated inference, 200 SLOC in Python time (JCT), and time-per-output-token (TPOT).
and 400 SLOC in CUDA C++. The global scheduler and the
cluster management have 600 SLOC in Python. 8.2 Workloads
MemPool. It has two parts: a Python-based library that ex- We describe workloads used for our end-to-end tests. We
poses API to the inference engine and a C++ core part that ex- use three representative workloads as listed in Table 7. (1)
ecutes data transmission. Currently, MemPool uses NCCL’s ShareGPT [25] is a real-world dataset containing user-shared
send and recv point-to-point APIs to transmit data between ChatGPT conversations. Requests from the same conversa-
HBM and uses socket API if any side contains DRAM. We tion form a session and share causal dependencies: clients
have not implemented RDMA-based transmission because send a request to the system only after they receive the
8
(a) Prompt (b) Generation 8.3 End-to-End Applications
1.0 1.0
0.8 0.8 We study the benefits of context caching, disaggregated infer-
0.6 0.6 ence, and when both are combined. We create four different
CDF

CDF
0.4 0.4 settings: (1) PD denotes PD-colocated. (2) PD-CC denotes
ShareGPT ShareGPT PD-colocated with context caching. (3) 1P1D denotes dis-
0.2 ReAct 0.2 ReAct
0.0 LooGLE
0.0 LooGLE aggregated inference with a single prefill-only and a single
0 1000 2000 3000 4000 0 250 500 750 1000 decode-only instance. The numbers can vary. (4) 1P1D-CC
Num. of tokens Num. of tokens
(c) Prompt / Generation (d) Percentage of shared prefix denotes 1P1D with context caching (PC-caching-3). Note
1.0 1.0 ShareGPT that PD-colocated runs vanilla vLLM. The other three set-
0.8 0.8 ReAct
LooGLE tings are run with MemServe. The request rate is calculated
0.6 0.6 per instance. Assume a 5 req/s rate, then a 1P1D setup will
CDF

CDF

0.4 0.4 take 10 req/s. We ensure an equal number of instances in

ShareGPT
0.2 ReAct 0.2 all tests. Also, we use prompt-tree-based scheduling and
LooGLE
0.0 0.0 memory aggregation (by-req-agg). All results are in Figure 8.
0 100 200 300 400 0.00 0.25 0.50 0.75 1.00
Ratio Percentage ShareGPT. Compared to PD-colocated, disaggregated in-
ference (1P2D over PD) improves average and P99 JCT by
Figure 7. Workload Statistics. (a) Prompt length. (b) genera- 30% and 42%, respectively. Enhancing disaggregated infer-
tion length. (c) The ratio of prompt-len over generated-len. ence (e.g., 1P2D) with context caching (e.g., 1P2D-CC) further
(d) The percentage of the shared prefix in each workload. improves average and P99 JCT by 17% and 29%, respectively.
It also improves average and P99 TTFT by 58% and 45%, re-
spectively. Since ShareGPT has the longest generation length
of all three workloads, compared to 2P1D, 1P2D improves
JCT because it improves TPOT but at the cost of heavily
response to the conversation’s previous request. We will re- loaded prefill instances, hurting TTFT.
place the given response with the LLM-generated content. (2) LooGLE and ReAct. Both have long prompts and rela-
LooGLE [15] is an evaluation benchmark for LLM long con- tively short generation lengths. For LooGLE, disaggregated
text understanding, containing long documents with QAs. inference improves average and P99 JCT by 10.3% and 10.8%,
Similar to ShareGPT, requests constructed from the same respectively. Context caching further improves average and
document form a session and share causal dependencies. (3) P99 JCT by 26.9% and 22.5%, average and P99 TTFT by 56.2%
ReAct [33] is an agent acting and reasoning framework. We and 45.2%. For ReAct, disaggregation increases average and
use traces generated from running the ReAct agent with P99 JCT by 40.8% and 53.1%. Caching further enhances these
HotpotQA [32] dataset. metrics by 26.7% and 21.4%, and average and P99 TTFT by
Workload Statistics. We study the above workloads in 78.5% and 84.9%.
Figure 7. We show the distribution of prompt and generated
token length, their ratio, and the percentage of shared pre-
fixes representing potential context caching benefits. Gener-
ally, ShareGPT exhibits uniform distribution across all four 8.4 Microbenchmarks
dimensions. LooGLE has long prompts, short generation MemPool API Study. We first study the main MemPool
lengths, and a large percentage of shared prefixes because APIs in Figure 9. Without loss of generality, we show a few
each request has a long document in its prompt. Since the key APIs. Memory APIs’ latency increases linearly with the
document exceeds our model’s context window, we only number of blocks, taking roughly 800 ns per block. For index
take the first 1k tokens of the document and keep the first APIs, we mainly run insert and match. We vary the num-
five associated questions. ReAct also has long prompts and ber of blocks. A 256 block equals 4K tokens. The latency
a large percentage of shared prefixes because each request mostly stays flat with an increasing cached ratio. It takes at
has a long two-shot example in its prompt. Unlike LooGLE, most 0.7 ms to insert a 4K prompt. In all, MemPool APIs are
requests from ReAct have relatively long generation lengths lightweight and fast.
because of the long and thorough reasoning and actions MemPool Caching Study. We compare vanilla vLLM’s
generated from LLMs. hash-based index with MemPool’s radix-based index. We run
Arrival Pattern. None of the above workloads has arrival both on a PD-colocated instance with no cached data. We
patterns. We simulate a request’s arrival time by sampling it record the prefill time, which consists of two parts: check in-
from a Poisson distribution under different request rates. We dex and model forward. Figure 10 shows that vanilla vLLM’s
maintain the causal dependency for requests belonging to hash-based prefix mechanism incurs a huge overhead as the
the same session: a request is only sent to the system after prompt length increases. In all, using MemPool for basic
receiving the response to the session’s previous request. context caching incurs minimal overhead.
9
PD PD-CC 1P1D 1P1D-CC 1P2D 1P2D-CC 2P1D 2P1D-CC
(a) ShareGPT Average JCT (b) ShareGPT P99 JCT (c) ShareGPT Average TTFT (d) ShareGPT P99 TTFT (e) ShareGPT Average TPOT (f) ShareGPT P99 TPOT
12 40 1.75 0.04 0.14
1.50 8
10 0.12
30
8 1.25 6 0.03 0.10
Time (s)

20 1.00 0.08
6 4 0.02
0.75 0.06
4 0.50
10 2 0.01 0.04
2 0.25 0.02
0 0 0.00 0 0.00 0.00
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
(g) LooGLE Average JCT (h) LooGLE P99 JCT 0.5
(i) LooGLE Average TTFT (j) LooGLE P99 TTFT (k) LooGLE Average TPOT (l) LooGLE P99 TPOT
1.4 0.14
0.8 0.8 0.0150
1.2 0.4 0.12
0.0125
0.6 1.0 0.6 0.10
Time (s)

0.3 0.0100
0.8 0.08
0.4 0.2 0.4 0.0075 0.06
0.6
0.4 0.0050 0.04
0.2 0.1 0.2
0.2 0.0025 0.02
0.0 0.0 0.0 0.0 0.0000 0.00
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
(m) ReAct Average JCT (n) ReAct P99 JCT (o) ReAct Average TTFT (p) ReAct P99 TTFT (q) ReAct Average TPOT (r) ReAct P99 TPOT
8 0.08
12 35 3.5 0.30
30 3.0
10 6 0.06 0.25
25 2.5
Time (s)

8 0.20
20 2.0 4 0.04
6 1.5 0.15
15
4 10 1.0 0.10
2 0.02
2 5 0.5 0.05
0 0 0.0 0 0.00 0.00
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
Request rate (req/s) Request rate (req/s) Request rate (req/s) Request rate (req/s) Request rate (req/s) Request rate (req/s)

Figure 8. End-to-End Evalution. The x-axis is the request rate per inference instance, 1P1D counts as two instances.

150
(a) Memory API (b) Index API 120
(a) Memory Layout Study 12
(b) Comm. Cost Study
allocate insert_256 Agg_Block 1MB 120
125 free 600 match_256 100 Orignal 10 2MB 100

Comm Size (GB)

insert_128 4MB
match_128 80 8
Time (ms)

Time (ms)
100 8MB 80
Time (us)

Time (us)

insert_64
400 match_64 60 6 16MB
75 60
40 4 40
50
200 20 2 20
25
0 0 0
0 0 1 2 4 8 16
Num. of Comms
1T2C
1T4C
2T2C
4T4C
C
1T2C
1T4C
1 C
1T1T8C
2T26C
4T4C
C
108TT8C
10C

0 32 64 96 128 160 192 224 256 0 10 20 30 40 50 60 70 80 90 100

1T1

Num. of Blocks Cache Ratio (%)

Figure 9. MemPool API Study. (a) The latency of Memory Figure 11. Network and Memory Layout Optimization Study.
APIs with varied numbers of blocks. (a) The latency of key T is short for threads. C is short for NCCL communicators.
Index APIs with varied cache ratio and number of blocks. The right figure compares the performance and HBM usage
with varied NCCL buffer sizes. The default is 4 MB.

(a) Prefill Time (b) Index Time

Hash 102 Hash
Radix Radix
101
the KV cache generated from a 2048-token prompt. We tune
Time (ms)

Time (ms)

102
several key NCCL parameters: communicator, stream, buffer
100 size, and threads. Figure 11 presents the results. First, the
aggregation method outperforms the vanilla memory layer
101 10 1
32 64 128 256 512 102420484096 32 64 128 256 512 102420484096
by a large margin. Second, a single communicator is enough
Num. of Tokens Num. of Tokens when the memory block is large. When the memory block
is smaller, multiple communicators are required for better
Figure 10. Caching Study. PD-Colocated. Hash is vanilla
performance, but as the right figure shows, increasing the
vLLM. Radix is an adapted vLLM with MemPool.
number of communicators consumes extra HBM.
By-Req-Agg Study. We run a 1024-prompt-32-decode
workload to understand these mechanisms. We vary the
Block Aggregation Study. We study how the proposed request rate and show results in Figure 12. The proposed
memory aggregation helps. We compare two settings: (1) by-req-agg outperforms both by-layer and by-req.
original discrete memory layout (Original) and (2) proposed Context Caching Cost Model. Figure 13 presents the
aggregated memory layout (Agg_Block). The test transmits result with several key takeaways. (1) The benefit of caching
10
5 (a) P50 Latency 10 (b) P99 Latency (a) P50 TTFT (b) P99 TTFT
0.12 RR 0.30 RR
4 8 Prefix Prefix
0.10 Session 0.25 Session
3 6
Time (s)

Time (s)
0.08 0.20

Time (s)

Time (s)
2 Req-P50 4 Req-P99 0.06 0.15
1 Layer-P50 2 Layer-P99 0.04 0.10
MM-Block-P50 MM-Block-P99
0 0 0.02 0.05
5 10 20 40 80 5 10 20 40 80
Req/s Req/s 0.00 0.00
2 3 2 3
Share Ratio Share Ratio
Figure 12. Compare By-Layer, By-Req, and By-Req-Agg.
Figure 14. Global Scheduler Policy. Share Ratio represents
the ratio of the number of identical requests.

120 (a) Prefix Cache Profit 125 (b) Profit in Different BatchSize
TTFT: Cache/Origin (%)

TTFT: Cache/Origin (%)

100 100
80
75
9 Related Work
p1024_bs1
60 p32 p2048_bs1
p4096_bs1 Our work is unique in proposing a standalone MemPool
p128 50 p512_bs2
40 p512 p1024_bs2
module and developing a holistic serving system MemServe
p1024 p2048_bs2
20 p2048 25 p256_bs4
p4096
p512_bs4
p1024_bs4
using MemPool APIs.
010 20 30 40 50 60 70 80 90 100 010 20 30 40 50 60 70 80 90 100 Disaggregated Inference. Four papers propose the disag-
Cache Ratio (%) Cache Ratio (%)
gregated inference idea concurrently within a short 3-month
125 (c) Profit in Different BlockSize 120 (d) Profit in Different Location
p1024_b8
p2048_b8 span: Splitwise [23], TetriServe [12], DistServe [41], and De-
TTFT: Cache/Origin (%)

TTFT: Cache/Origin (%)

100 p4096_b8
p1024_b16
100
p2048_b16 javu [27]. Generally, disaggregating prefill from decode re-
p4096_b16 80
75 p1024_b32
p2048_b32 duces interference between these two stages and allows each
p4096_b32 60 p1024_dram
50 p1024_hbm to scale independently with heterogeneous hardware. More
40 p2048_dram
p2048_hbm recently, LoongServe [29] takes a step further by enabling
25 20 p4096_dram
p4096_hbm dynamic scaling. All prior work builds disaggregated infer-
010 20 30 40 50 60 70 80 90 100 010 20 30 40 50 60 70 80 90 100
Cache Ratio (%) Cache Ratio (%) ence by modifying the inference engine in an ad-hoc manner.
Our work takes a different approach by first abstracting out
Figure 13. Context Caching Cost Model. All figures have the MemPool component and then building disaggregated
cached-ratio has the x-axis. Each line represents a different inference as a use case of MemPool.
prompt length. All y-axis represent the TTFT improvement Context Caching. Caching reduces recomputation, hence
over the no-caching case. (a) studies the prompt-len factor. reducing TTFT and improving throughput. The benefits of
(b) studies the batch-size factor. (c) studies the block-size context caching are well-studied in Pensieve [37], Cache
factor. (d) studies the cached-location factor. Gen [18], SGLang [40], and Prompt Cache [9]. More recently,
Google started a commercial offering of context caching
for their Gemini models [10]. All prior work builds context
caching in a PD-colocated setup. Using MemPool APIs, we
take a step-by-step approach to building the first-ever con-
improves with a larger cached-ratio. (2) For the same cached- text caching solution atop disaggregated inference.
ratio, longer prompts have higher improvement. (3) Batch Scheduling. Scheduling plays a key role in improving
size effectively translates to prompt length. Hence, we need serving efficiency. At the local layer: Orca [36] proposes
to consider batch size along with cached-ratio. (4) When iterative-level scheduling to reduce bubbles. Sarathi [1, 2]
the historical KV cache data is located in DRAM, we must proposes chunked-prefill to overcome suboptimal prefill pro-
swap it into HBM before using it during prefill. Yet, the cessing. FastServe[30] utilizes a multi-level priority feedback
benefit of reducing computation largely offsets the cost of queue to minimize JCT. At the global layer: MuxServe [7]
data movement. Regardless of where data is located, TTFT formulates a multiplexing problem and proposes a novel
improves once the cached-ratio exceeds a certain threshold. placement algorithm and adaptive batch scheduling strategy
Global Scheduler Study. We compare policies as listed to identify optimal colocations in LLM serving.
in Table 6. We selected 80 sessions from LooGLE, roughly 250 Generic Memory Optimization. Many works try to
requests. We propose a share ratio. A share ratio of 2 means optimize memory usage. For example, using quantization[5,
duplicating this set of sessions. While running a 3P1D setup, 6, 8, 13, 16, 26, 31, 34] to compress the model weights into
Figure 14 shows that compared to intra-session scheduling, lower precision, using paging to reduce fragmentation [14],
prompt-tree-based scheduling improves P99 TTFT by 59% and low-level algorithm and kernel optimizations [3, 4, 11,
since it maximizes KV cache reuse. 19, 28, 35, 39]. We refer readers to [38, 42] for more details.
11
10 Conclusion [15] Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang. Loogle:
Can long-context language models understand long contexts? arXiv
In this paper, we presented MemServe, a novel system de- preprint arXiv:2311.04939, 2023.
signed to enhance the efficiency of LLM serving by unify- [16] Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan
ing inter-request and intra-request optimizations. The core Klein, and Joey Gonzalez. Train big, then compress: Rethinking model
of MemServe is a distributed MemPool that manages KV size for efficient training and inference of transformers. In Interna-
caches across distributed instances. MemServe builds con- tional Conference on machine learning, 2020.
[17] Bin Lin, Tao Peng, Chen Zhang, Minmin Sun, Lanbo Li, Hanyu Zhao,
text caching, disaggregated inference, and their combo using
Wencong Xiao, Qi Xu, Xiafei Qiu, Shen Li, et al. Infinite-llm: Efficient
MemPool APIs. End-to-end results show MemServe can im- llm service for long context with distattention and distributed kvcache.
prove JCT, TTFT, TPOT by a large margin. arXiv preprint arXiv:2401.02669, 2024.
[18] Yuhan Liu, Hanchen Li, Kuntai Du, Jiayi Yao, Yihua Cheng, Yuyang
References Huang, Shan Lu, Michael Maire, Henry Hoffmann, Ari Holtzman, et al.
Cachegen: Fast context loading for language model applications. arXiv
[1] Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun preprint arXiv:2310.07240, 2023.
Kwatra, Bhargav S Gulavani, Alexey Tumanov, and Ramachandran [19] Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei
Ramjee. Taming throughput-latency tradeoff in llm inference with Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. Ram-
sarathi-serve. arXiv preprint arXiv:2403.02310, 2024.
mer: Enabling holistic deep learning compiler optimizations with
[2] Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, {rTasks}. In 14th USENIX Symposium on Operating Systems Design
Bhargav S Gulavani, and Ramachandran Ramjee. Sarathi: Efficient and Implementation (OSDI 20), 2020.
llm inference by piggybacking decodes with chunked prefills. arXiv [20] NVIDIA. NCCL. https://docs.nvidia.com/deeplearning/nccl/user-
preprint arXiv:2308.16369, 2023. guide/docs/overview.html.
[3] Tri Dao. Flashattention-2: Faster attention with better parallelism and [21] NVIDIA. RDMA Verbs. https://docs.nvidia.com/networking/display/
work partitioning. arXiv preprint arXiv:2307.08691, 2023. rdmaawareprogrammingv17/rdma+verbs+api.
[4] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. [22] NVIDIA. TensorRT-LLM. https://github.com/NVIDIA/TensorRT-LLM.
Flashattention: Fast and memory-efficient exact attention with io- [23] Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Aashaka
awareness. Advances in Neural Information Processing Systems, 2022. Shah, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient genera-
[5] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. tive llm inference using phase splitting. arXiv preprint arXiv:2311.18677,
Llm. int8 (): 8-bit matrix multiplication for transformers at scale. arXiv 2023.
preprint arXiv:2208.07339, 2022. [24] Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee,
[6] Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis and Ashish Panwar. vattention: Dynamic memory management for
Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, serving llms without pagedattention. arXiv preprint arXiv:2405.04437,
Torsten Hoefler, and Dan Alistarh. Spqr: A sparse-quantized 2024.
representation for near-lossless llm weight compression. arXiv [25] Sharegpt teams. https://sharegpt.com/.
preprint arXiv:2306.03078, 2023. [26] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max
[7] Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and
Zhang, Dahua Lin, Ion Stoica, and Hao Zhang. Muxserve: Flexi- Ce Zhang. Flexgen: High-throughput generative inference of large
ble multiplexing for efficient multiple llm serving. arXiv preprint language models with a single gpu. In International Conference on
arXiv:2404.02015, 2024. Machine Learning, 2023.
[8] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Optq: [27] Foteini Strati, Sara Mcallister, Amar Phanishayee, Jakub Tarnawski,
Accurate quantization for generative pre-trained transformers. In The and Ana Klimovic. Déjàvu: Kv-cache streaming for fast, fault-tolerant
Eleventh International Conference on Learning Representations, 2022. generative llm serving, 2024.
[9] In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khan-
[28] Xiaohui Wang, Ying Xiong, Yang Wei, Mingxuan Wang, and Lei Li.
delwal, and Lin Zhong. Prompt cache: Modular attention reuse for Lightseq: A high performance inference library for transformers. arXiv
low-latency inference. Proceedings of Machine Learning and Systems, preprint arXiv:2010.13887, 2020.
6:325–338, 2024. [29] Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe
[10] Google. Context Caching. https://ai.google.dev/gemini-api/docs/ Liu, and Xin Jin. Loongserve: Efficiently serving long-context large
caching?lang=python. language models with elastic sequence parallelism. arXiv preprint
[11] Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun arXiv:2404.09526, 2024.
Liu, Kangdi Chen, Hanyu Dong, and Yu Wang. Flashdecoding++: [30] Bingyang Wu, Yinmin Zhong, Zili Zhang, Gang Huang, Xuanzhe Liu,
Faster large language model inference on gpus. arXiv preprint and Xin Jin. Fast distributed inference serving for large language
arXiv:2311.01282, 2023. models. arXiv preprint arXiv:2305.05920, 2023.
[12] Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang [31] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth,
Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, and Song Han. Smoothquant: Accurate and efficient post-training
et al. Inference without interference: Disaggregate llm inference for quantization for large language models. In International Conference
mixed downstream workloads. arXiv preprint arXiv:2401.11181, 2024. on Machine Learning, 2023.
[13] Berivan Isik, Hermann Kumbong, Wanyi Ning, Xiaozhe Yao, Sanmi
[32] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W
Koyejo, and Ce Zhang. Gpt-zip: Deep compression of finetuned large Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa:
language models. In Workshop on Efficient Systems for Foundation A dataset for diverse, explainable multi-hop question answering. arXiv
Models@ ICML2023, 2023. preprint arXiv:1809.09600, 2018.
[14] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin [33] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik
Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting
Efficient memory management for large language model serving with in language models. arXiv preprint arXiv:2210.03629, 2022.
pagedattention. In Proceedings of the 29th Symposium on Operating
Systems Principles, 2023.
12
[34] Zhewei Yao, Cheng Li, Xiaoxia Wu, Stephen Youn, and Yuxiong He. A [39] Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang
comprehensive study on post-training quantization for large language Zhang, Zizhong Chen, Xin Liu, and Yibo Zhu. Bytetransformer: A
models. arXiv preprint arXiv:2303.08302, 2023. high-performance transformer boosted for variable-length inputs. In
[35] Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, 2023 IEEE International Parallel and Distributed Processing Symposium
Conglong Li, and Yuxiong He. Zeroquant: Efficient and affordable (IPDPS), 2023.
post-training quantization for large-scale transformers. Advances in [40] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue
Neural Information Processing Systems, 2022. Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E
[36] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, Gonzalez, et al. Efficiently programming large language models using
and Byung-Gon Chun. Orca: A distributed serving system for sglang. arXiv preprint arXiv:2312.07104, 2023.
{Transformer-Based} generative models. In 16th USENIX Sympo- [41] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xu-
sium on Operating Systems Design and Implementation (OSDI 22), 2022. anzhe Liu, Xin Jin, and Hao Zhang. Distserve: Disaggregating prefill
[37] Lingfan Yu and Jinyang Li. Stateful large language model serving with and decoding for goodput-optimized large language model serving,
pensieve. arXiv preprint arXiv:2312.05516, 2023. 2024.
[38] Pai Zeng, Zhenyu Ning, Jieru Zhao, Weihao Cui, Mengwei Xu, Liwei [42] Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao
Guo, Xusheng Chen, and Yizhou Shan. The cap principle for llm Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, et al. A
serving. arXiv preprint arXiv:2405.11299, 2024. survey on efficient inference for large language models. arXiv preprint
arXiv:2404.14294, 2024.

Emerging Architectures For LLM Applications - Andreessen Horowitz
No ratings yet
Emerging Architectures For LLM Applications - Andreessen Horowitz
15 pages
AEC - Q100 - What Changed From Rev H To Rev J
No ratings yet
AEC - Q100 - What Changed From Rev H To Rev J
40 pages
Tutorial T2: Fundamentals of Memory Subsystem Design For HPC and AI
No ratings yet
Tutorial T2: Fundamentals of Memory Subsystem Design For HPC and AI
105 pages
Efficient Memory Management For LLM Model Serving With Paged Attention Sep 2023
No ratings yet
Efficient Memory Management For LLM Model Serving With Paged Attention Sep 2023
16 pages
Hygen: Efficient LLM Serving Via Elastic Online-Offline Request Co-Location
No ratings yet
Hygen: Efficient LLM Serving Via Elastic Online-Offline Request Co-Location
15 pages
Fast Switch
No ratings yet
Fast Switch
13 pages
A Survey of Efficient LLM Inference Serving
No ratings yet
A Survey of Efficient LLM Inference Serving
20 pages
NanoFlow - Towards Optimal Large Language Model Serving Throughput
No ratings yet
NanoFlow - Towards Optimal Large Language Model Serving Throughput
19 pages
MemGPT - Towards LLMs As Operating Systems - 2310.08560
No ratings yet
MemGPT - Towards LLMs As Operating Systems - 2310.08560
15 pages
Roadmap To Learn AI Agents
No ratings yet
Roadmap To Learn AI Agents
10 pages
Droidspeak: KV Cache Sharing For Cross-Llm Communication and Multi-Llm Serving
No ratings yet
Droidspeak: KV Cache Sharing For Cross-Llm Communication and Multi-Llm Serving
17 pages
P C: M A R L - L I: Rompt Ache Odular Ttention Euse FOR OW Atency Nference
No ratings yet
P C: M A R L - L I: Rompt Ache Odular Ttention Euse FOR OW Atency Nference
17 pages
Scaling Memcache at Facebook - Slides
No ratings yet
Scaling Memcache at Facebook - Slides
28 pages
2505.21919v1
No ratings yet
2505.21919v1
3 pages
A Survey On Large Language Model Acceleration Based On KV Cache Management
No ratings yet
A Survey On Large Language Model Acceleration Based On KV Cache Management
43 pages
Exploring DRAM Cache Prefetching For Pooled Memory
No ratings yet
Exploring DRAM Cache Prefetching For Pooled Memory
12 pages
Infinite-Llm: Efficient LLM Service For Long Context With Distattention and Distributed Kvcache
No ratings yet
Infinite-Llm: Efficient LLM Service For Long Context With Distattention and Distributed Kvcache
14 pages
F I: E C A E LLM I S: Lash Nfer Fficient and Ustomizable Ttention Ngine For Nference Erving
No ratings yet
F I: E C A E LLM I S: Lash Nfer Fficient and Ustomizable Ttention Ngine For Nference Erving
20 pages
Cachegen: KV Cache Compression and Streaming For Fast Large Language Model Serving
No ratings yet
Cachegen: KV Cache Compression and Streaming For Fast Large Language Model Serving
19 pages
PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving
No ratings yet
PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving
16 pages
Fairness in Serving Large Language Models
No ratings yet
Fairness in Serving Large Language Models
18 pages
Multicore Cache Hierarchies
No ratings yet
Multicore Cache Hierarchies
151 pages
4 Memory Models
No ratings yet
4 Memory Models
19 pages
Fast Distributed Inference Serving For Large Language Models
No ratings yet
Fast Distributed Inference Serving For Large Language Models
15 pages
02 Lecf 13 Map Reduce
No ratings yet
02 Lecf 13 Map Reduce
81 pages
LSERVER
No ratings yet
LSERVER
14 pages
Lecture: Cache Hierarchies: Topics: Cache Innovations (Sections B.1-B.3, 2.1)
No ratings yet
Lecture: Cache Hierarchies: Topics: Cache Innovations (Sections B.1-B.3, 2.1)
20 pages
Cimple: Instruction and Memory Level Parallelism: A DSL For Uncovering ILP and MLP
No ratings yet
Cimple: Instruction and Memory Level Parallelism: A DSL For Uncovering ILP and MLP
16 pages
Stanford Advanced Caches
No ratings yet
Stanford Advanced Caches
46 pages
Osdi24 Lin Chaofan
No ratings yet
Osdi24 Lin Chaofan
18 pages
Ragcache: Efficient Knowledge Caching For Retrieval-Augmented Generation
No ratings yet
Ragcache: Efficient Knowledge Caching For Retrieval-Augmented Generation
14 pages
The System Design
No ratings yet
The System Design
135 pages
Class 6 - Caching
No ratings yet
Class 6 - Caching
3 pages
Topics: Cache Innovations (Sections 2.4, B.4, B.5), Virtual Memory Intro
No ratings yet
Topics: Cache Innovations (Sections 2.4, B.4, B.5), Virtual Memory Intro
20 pages
Smart Memory
No ratings yet
Smart Memory
19 pages
Pie: Pooling CPU Memory For LLM Inference: Yi Xu, Ziming Mao, Xiangxi Mo, Shu Liu, Ion Stoica
No ratings yet
Pie: Pooling CPU Memory For LLM Inference: Yi Xu, Ziming Mao, Xiangxi Mo, Shu Liu, Ion Stoica
13 pages
Info
No ratings yet
Info
1 page
Web Server Software Architectures: IEEE Internet Computing December 2003
No ratings yet
Web Server Software Architectures: IEEE Internet Computing December 2003
17 pages
Hu Et Al. - 2024 - Inference Without Interference Disaggregate LLM I
No ratings yet
Hu Et Al. - 2024 - Inference Without Interference Disaggregate LLM I
15 pages
Section 080 Adding Memory To LLM Apps
No ratings yet
Section 080 Adding Memory To LLM Apps
12 pages
DroidSpeak
No ratings yet
DroidSpeak
18 pages
Resource-Aware Hierarchical Federated Learning For Video Caching in Wireless Networks
No ratings yet
Resource-Aware Hierarchical Federated Learning For Video Caching in Wireless Networks
7 pages
Thin Servers With Smart Pipes: Designing Soc Accelerators For Memcached
No ratings yet
Thin Servers With Smart Pipes: Designing Soc Accelerators For Memcached
12 pages
Fast Distributed Inference Serving For Large Language Models
No ratings yet
Fast Distributed Inference Serving For Large Language Models
14 pages
Mu Cache
No ratings yet
Mu Cache
19 pages
CSD Final Report
No ratings yet
CSD Final Report
8 pages
Multiprocessors and Linux: Krzysztof Lichota Lichota@mimuw - Edu.pl
No ratings yet
Multiprocessors and Linux: Krzysztof Lichota Lichota@mimuw - Edu.pl
30 pages
EMBA: Efficient Memory Bandwidth Allocation To Improve Performance On Intel Commodity Processor
No ratings yet
EMBA: Efficient Memory Bandwidth Allocation To Improve Performance On Intel Commodity Processor
12 pages
Get Java Version
No ratings yet
Get Java Version
7 pages
Event LangServe Mistral7B
No ratings yet
Event LangServe Mistral7B
66 pages
Serving LLM 2312.15234
No ratings yet
Serving LLM 2312.15234
32 pages
Abcsdgdsfg
No ratings yet
Abcsdgdsfg
1 page
Module4 CAche Performance
No ratings yet
Module4 CAche Performance
40 pages
Towards Pareto Optimal Throughput in Small Language Model Serving
No ratings yet
Towards Pareto Optimal Throughput in Small Language Model Serving
8 pages
AsliConf Mongodb
No ratings yet
AsliConf Mongodb
46 pages
OS Mini
No ratings yet
OS Mini
20 pages
Module - 6
No ratings yet
Module - 6
89 pages
Memos: An Operating System For Memory-Augmented Generation (Mag) in Large Language Models (Short Version)
No ratings yet
Memos: An Operating System For Memory-Augmented Generation (Mag) in Large Language Models (Short Version)
10 pages
Cache Coherence Protocols For Sequential Consistency: Computer Science and Artificial Intelligence Lab M.I.T
No ratings yet
Cache Coherence Protocols For Sequential Consistency: Computer Science and Artificial Intelligence Lab M.I.T
41 pages
Memory Hierarchy Presentation Detailed
No ratings yet
Memory Hierarchy Presentation Detailed
24 pages
C13-Computational Performance
No ratings yet
C13-Computational Performance
45 pages
CATCH: A Cost Analysis Tool For Co-Optimization of Chiplet-Based Heterogeneous Systems
No ratings yet
CATCH: A Cost Analysis Tool For Co-Optimization of Chiplet-Based Heterogeneous Systems
13 pages
(Ebooks PDF) Download 3D Microelectronic Packaging From Fundamentals To Applications 1st Edition Yan Li Full Chapters
100% (2)
(Ebooks PDF) Download 3D Microelectronic Packaging From Fundamentals To Applications 1st Edition Yan Li Full Chapters
55 pages
Session11 Papers
No ratings yet
Session11 Papers
13 pages
Virtual GPU Packaging and Licensing Guide
No ratings yet
Virtual GPU Packaging and Licensing Guide
22 pages
High-Bandwidth Chiplet Interconnects For Advanced Packaging Technologies in AI ML Applications Challenges and Solutions
No ratings yet
High-Bandwidth Chiplet Interconnects For Advanced Packaging Technologies in AI ML Applications Challenges and Solutions
14 pages
HC28.21.130 High Bandwidth KEVIN - TRAN SKHYNIX VERSION - FINAL DCRP t1 4
No ratings yet
HC28.21.130 High Bandwidth KEVIN - TRAN SKHYNIX VERSION - FINAL DCRP t1 4
22 pages
FusionServer V7 Server GPU Card Operation Guide 08
No ratings yet
FusionServer V7 Server GPU Card Operation Guide 08
170 pages
Poweredge Server Gpu Matrix
No ratings yet
Poweredge Server Gpu Matrix
4 pages
Ga100-883 Pcie Gen4 Hbm2 Tesla Passive P1001-B02: Description Description Description
No ratings yet
Ga100-883 Pcie Gen4 Hbm2 Tesla Passive P1001-B02: Description Description Description
2 pages
IR STD 20250509E - Rev04
No ratings yet
IR STD 20250509E - Rev04
163 pages
JSSC24 SHKim A Single-Ended Impedance-Matched Transmitter With Single Ring-Oscillator-Based Time-Domain ZQ Calibration For Memory Interfaces
No ratings yet
JSSC24 SHKim A Single-Ended Impedance-Matched Transmitter With Single Ring-Oscillator-Based Time-Domain ZQ Calibration For Memory Interfaces
12 pages
Furiosa Introduction Confidential
No ratings yet
Furiosa Introduction Confidential
21 pages
What Is Intel Agilex FPGA
No ratings yet
What Is Intel Agilex FPGA
6 pages
V11 Highlighted Chip Releases - Digital and Machine Learning Processors
No ratings yet
V11 Highlighted Chip Releases - Digital and Machine Learning Processors
111 pages
Micron Quality Manual
No ratings yet
Micron Quality Manual
21 pages
Nvidia Ampere Architecture Whitepaper
No ratings yet
Nvidia Ampere Architecture Whitepaper
83 pages
Enflame - Final Deck
No ratings yet
Enflame - Final Deck
27 pages
E VK Com Englishmagazines 152
No ratings yet
E VK Com Englishmagazines 152
52 pages
Poweredge Server Gpu Matrix
No ratings yet
Poweredge Server Gpu Matrix
2 pages
2025 Ieee International Solid-State Circuits Conference February16-20 Conference Theme-The Silicon Engine Driving The Ai Revolution San Francisco
No ratings yet
2025 Ieee International Solid-State Circuits Conference February16-20 Conference Theme-The Silicon Engine Driving The Ai Revolution San Francisco
106 pages
Amd Instinct Mi300x Platform Data Sheet
No ratings yet
Amd Instinct Mi300x Platform Data Sheet
2 pages
Performance Evaluation of A Next-Generation SX-Aurora TSUBASA Vector Supercomputer
No ratings yet
Performance Evaluation of A Next-Generation SX-Aurora TSUBASA Vector Supercomputer
19 pages
Gfs - AI CoWoS Update
No ratings yet
Gfs - AI CoWoS Update
5 pages
How CUDA Programming Works - 1647539841016001sz6e
No ratings yet
How CUDA Programming Works - 1647539841016001sz6e
101 pages
Gem5 X TechnicalManual Wireless
No ratings yet
Gem5 X TechnicalManual Wireless
24 pages
A 4ns Settling Time FVF-Based Fast LDO Using Bandwidth Extension Techniques For HBM3
No ratings yet
A 4ns Settling Time FVF-Based Fast LDO Using Bandwidth Extension Techniques For HBM3
3 pages
ECTC2025 EuichulChung 821
No ratings yet
ECTC2025 EuichulChung 821
7 pages

Memserve: Context Caching For Disaggregated LLM Serving With Elastic Memory Pool

Uploaded by

Memserve: Context Caching For Disaggregated LLM Serving With Elastic Memory Pool

Uploaded by

MemServe: Context Caching for Disaggregated LLM

Serving with Elastic Memory Pool

1 Huawei Cloud, 2 UCAS 3 ICT, CAS 4 Peking University

Global Prompt Tree

and data exchange mechanisms.

Elastic Memory Pool

Design Steps Description

Scenario Type APIs Used

fer the active KV cache produced after the prefill phase to

utilization [14], the discrete memory layout presents huge

ing existing AI network stacks.

5.3 Cost Model for Context Caching 6 Locality-Aware Global Scheduling

0.4 0.4 take 10 req/s. We ensure an equal number of instances in

Comm Size (GB)

0 32 64 96 128 160 192 224 256 0 10 20 30 40 50 60 70 80 90 100

Num. of Blocks Cache Ratio (%)

(a) Prefill Time (b) Index Time

TTFT: Cache/Origin (%)

TTFT: Cache/Origin (%)

You might also like