AttentionStore: Cost-Effective Attention Reuse Across Multi-Turn Conversations in Large Language Model Serving
Bin Gao1,* , Zhuomin He2,* , Puru Sharma1 , Qingxuan Kang1 , Djordje Jevdjic1 , Junbo Deng3 ,
Xingkun Yang3 , Zhou Yu3 , and Pengfei Zuo3,†
1 National University of Singapore 2 Shanghai Jiaotong University 3 Huawei Cloud
*Work done during their internship at Huawei Cloud.
†Corresponding author: Pengfei Zuo ([email protected]).

Abstract

Interacting with humans through multi-turn conversations is a fundamental feature of large language models (LLMs). However, existing LLM serving engines for executing multi-turn conversations are inefficient due to the need to repeatedly compute the key-value (KV) caches of historical tokens, incurring high serving costs. To address the problem, this paper proposes AttentionStore, a new attention mechanism that enables the reuse of KV caches (i.e., attention reuse) across multi-turn conversations, significantly reducing the repetitive computation overheads. AttentionStore maintains a hierarchical KV caching system that leverages cost-effective memory/storage mediums to save KV caches for all requests. To reduce KV cache access overheads from slow mediums, AttentionStore employs layer-wise pre-loading and asynchronous saving schemes to overlap the KV cache access with the GPU computation. To ensure that the KV caches to be accessed are placed in the fastest hierarchy, AttentionStore employs scheduler-aware fetching and eviction schemes to consciously place the KV caches in different layers based on the hints from the inference job scheduler. To avoid the invalidation of the saved KV caches incurred by context window overflow, AttentionStore enables the saved KV caches to remain valid via decoupling the positional encoding and effectively truncating the KV caches. Extensive experimental results demonstrate that AttentionStore significantly decreases the time to the first token (TTFT) by up to 87%, improves the prompt prefilling throughput by 7.8× for multi-turn conversations, and reduces the end-to-end inference cost by up to 70%. For long sequence inference, AttentionStore reduces the TTFT by up to 95% and improves the prompt prefilling throughput by 22×.

1 Introduction

With impressive performance on a wide variety of tasks, large language models (LLMs) have ushered in a new era of generative applications [32, 47, 48]. However, serving these generative applications with LLMs is very expensive because LLM inference employs a large number of GPUs. Given the high demand for generative applications, reducing the cost of inference becomes crucial.

Engaging in multi-turn conversations with humans is an essential capability of LLMs [53, 57]. These multi-turn conversations help LLMs comprehend context, user intent, and emotional nuances, enhancing their ability to respond appropriately. Based on the ShareGPT data [37], a widely-used real dataset collected from ChatGPT, 73% of conversations involve multiple turns, as analyzed in Section 2.3.

However, executing multi-turn conversations in current LLM serving engines is highly inefficient, as it requires a large number of repetitive computations, incurring high serving costs. During a single turn of conversation, the LLM engine stores intermediate data, key-value (KV) pairs [4, 22, 38], in the limited high-bandwidth memory (HBM) on GPUs. When that conversation ends and the conversation session becomes inactive, the LLM engine generally discards the KV cache associated with that session to free up space in the HBM for other active sessions. When the session becomes active again, i.e., the user sends the next message in the conversation, the LLM engine computes the whole KV cache again. This leads to repetitive computation of the same KV cache, wasting valuable GPU computation resources. As the number of conversation turns increases, the repetitive computation overhead grows linearly. Our analysis based on ShareGPT shows that up to 99% of the prefilling cost comes from repetitive computation for the KV cache, as presented in Section 2.3.

To reduce the serving cost and improve the inference performance, this paper proposes AttentionStore, a new attention mechanism that enables the reuse of KV caches (i.e., attention reuse) across multi-turn conversations rather than discarding them. When a conversation session becomes inactive, AttentionStore saves the corresponding KV cache in a KV caching system. Upon the resumption of the same session, AttentionStore loads and reuses the saved KV cache from the KV caching system, thereby eliminating the overhead of the repetitive computation.
However, building such an efficient KV caching system for multi-turn conversations presents significant challenges.

Firstly, the KV caching system serves as external storage for GPUs and is attached to the GPUs via low-speed links. The use of the KV caching system brings about significant access overhead due to the need to transfer KV caches between HBMs and the KV caching system. The access overhead of KV caches is on the critical path of inference execution, because GPUs can only perform the computation of an inference job after successfully loading its corresponding KV cache into HBMs. Likewise, subsequent inference jobs need to wait until the KV caches from the previous jobs are moved out of the HBMs if the HBM space is not enough. To reduce the KV cache loading overheads, AttentionStore uses a layer-wise pre-loading scheme to overlap the loading of the KV cache with the inference computation, layer by layer. To reduce the KV cache saving overheads, AttentionStore develops an asynchronous saving scheme that overlaps the saving of KV caches with the inference computation.

Secondly, the KV caches occupy a large amount of storage space that continuously expands during conversations. Prior works have attempted to reduce the inefficiency of repetitive KV computation by retaining the KV caches across multi-turn conversations in HBMs [19, 67]. However, this quickly exhausts the limited HBM capacity. We present an example of LLaMA-65B in Section 2.3, which shows that the KV caches fully occupy the free space within the HBMs in 14 seconds. To address this challenge, AttentionStore explores and exploits slower but larger-capacity storage hierarchies than HBMs, including host memory and disks, to provide adequate storage space for caching KV caches.

Thirdly, since disks have much larger capacity than the host memory (tens of TBs vs. several hundreds of GBs), most KV caches are retained in disks for AttentionStore. As conversation requests arrive randomly, their corresponding KV caches are more likely to be located in disks, resulting in poor access performance. To address this problem, AttentionStore uses a scheduler-aware KV cache fetching scheme. This scheme pre-fetches the KV caches that are likely to be accessed from disks to the host memory, by utilizing the hints received from the inference job scheduler. When the free space of the host memory is not enough, AttentionStore also adopts a scheduler-aware eviction scheme to efficiently identify the most suitable KV caches in memory and evict them to disks or out of the system.

Finally, when a conversation session surpasses the limit of the context window of LLMs, e.g., 4K in LLaMA-2 [49], LLMs generally truncate the oldest tokens and limit the context to the most recent tokens [33]. This truncation makes all saved KV caches of that conversation in AttentionStore invalid, since the positional information of all tokens embedded in the KV cache is changed. To overcome this issue, AttentionStore decouples the positional encoding from the KV caches when saving them and re-embeds the positional encoding into the KV caches when loading them. After decoupling, truncation can be directly applied to the KV caches, thereby ensuring the reusability of the saved KV caches.

We implement AttentionStore and evaluate it using the real ShareGPT dataset [37]. Extensive experimental results demonstrate that AttentionStore significantly decreases the time to the first token (TTFT) by up to 87% and improves the prompt prefilling throughput by 7.8× for multi-turn conversations. It also reduces the end-to-end inference cost by up to 70%. For long sequence inference, AttentionStore reduces the TTFT by up to 95% and improves the prompt prefilling throughput by 22×. To summarize, this paper makes the following contributions:

• We investigate the recomputation overheads of KV caches in LLMs across conversation turns and identify the challenges associated with retaining KV caches across multi-turn conversations.

• We propose AttentionStore, a new attention mechanism that allows the reuse of the KV caches for any ensuing conversation turns of the same session, achieving a significant reduction in the recomputation overhead of KV caches in LLMs.

• To improve the efficiency of AttentionStore, we design overlapped KV cache access, hierarchical KV cache placement, and positional encoding decoupled KV cache truncation schemes.

• We thoroughly evaluate AttentionStore with real datasets to demonstrate its efficacy and efficiency.

2 Background and Motivation

This section begins with an overview of the fundamentals of generative LLM inference. It then delves into the inefficiencies that exist in LLMs during multi-turn conversations. The section ends with a discussion of the design opportunities for dealing with these inefficiencies and the challenges faced during the design of such a system.

2.1 Generative LLM Inference Basics

Transformer Architecture. The transformer has emerged as the widely accepted standard in generative LLM inference. Widely used LLMs like the GPTs [32] and LLaMAs [48, 49] are built upon the autoregressive transformer architecture [17, 51]. During inference, these models process the prompt of the users and generate a response. The prompt is processed as a sequence of input tokens, and the response is generated by the model predicting the probability of subsequent tokens using the context of all the prior tokens.
Figure 1: Prefilling and decoding phases. (a) Two-phase illustration. (b) Execution latency measured for LLaMA-70B with batch size 8 on 4 A100 GPUs.

Figure 2: (a) Distribution of the conversation turn number in ShareGPT [37]. (b) The session length distribution of ShareGPT. For better display, the statistics exclude conversations with over 40 turns or sessions that exceed a length of 32K.
The transformer model consists of a chain of l transformer layers. Each transformer layer comprises two steps: self-attention and a feed-forward network (FFN).

For the input token list X = [x1, x2, ..., xs], each layer applies

Q = W_Q X,  K = W_K X,  V = W_V X

Subsequently, attention scores are computed via Q, K, and V:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

where d_k is the dimension of the key vectors. Finally, the projection operation applies a linear transformation to the attention scores, and the projected result is handed to the FFN. The output of the FFN is passed to the next transformer layer as input. After the input has been processed through all l transformer layers, the output is a probability vector that marks out the most probable output tokens.

KV Cache: Within the entire process above, each token produces intermediate K and V tensors. When generating subsequent tokens, all KV tensors of preceding tokens are necessary for computing the self-attention. These K and V tensors are generally cached in GPUs, referred to as the KV cache. The KV cache typically has a large footprint. For example, GPT-3 [11, 32] generates a 4.5MB KV cache for each token. The size of the KV cache increases linearly with the number of prompt tokens. A conversation session containing thousands of tokens will produce several GBs of KV cache.

2.2 Autoregressive Generation

As illustrated in Figure 1a, transformer-based generation can logically be divided into two distinctive phases [1].

The prefilling phase. Given a request prompt, the generation takes the prompt token list X = [x1, x2, ..., xs] as input and then proceeds to compute the token x_{s+1}. This process generates a series of KVs, specifically forming the KV cache ranging from 1 to s, which is used for the decoding phase.

Figure 3: Comparison of recomputation and AttentionStore.

The decoding phase. The decoding phase generates output tokens with autoregressive iterations. It takes token s+1 and the KV cache [1:s] from the prefilling phase as input to compute the KV cache s+1 and the token s+2. The generation process iteratively continues until the generated token is <eos> or the iteration number reaches the maximum allowed generation number. The decoding phase only happens sequentially due to the heavy data dependency on the previous iteration.

The two phases present significantly different characteristics in terms of execution time. The prefilling phase computes the KV cache in parallel. The duration of this phase is closely tied to the number of prompt tokens provided as input. As shown in Figure 1b, the execution time of the prefilling phase increases as the number of input tokens grows. In contrast, the decoding phase only performs computation for a single token in each iteration, which makes the computation time for each iteration relatively constant.

2.3 Multi-turn Conversation Inference

Engaging humans in multi-turn conversations is a fundamental feature of modern LLMs. A multi-turn conversation session consists of a series of continuous conversations, denoted as D = [d1, d2, ..., dN]. In each conversation d_j, a user inputs a new question or command q_j and then awaits the response a_j from the LLM. To maintain a coherent context and understanding of the conversation session, the LLM generates a_{N+1} based on both the historical tokens from all previous conversation turns d[1:N] and the input tokens of the current turn, denoted as q1 a1 q2 a2 ... qN aN q_{N+1}.
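To make the role of the KV cache in the two phases concrete, the following PyTorch fragment is a minimal single-layer, single-head sketch (the dimension, random weights, and prompt length are placeholders rather than configurations from this paper): prefilling builds the K and V rows for all prompt tokens at once, decoding appends one row per generated token, and a later conversation turn must either recompute or reuse exactly those rows.

```python
# Minimal prefill/decode sketch with a KV cache (illustration only).
import torch

d = 64                                   # head dimension d_k (placeholder)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def attention(q, K, V):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    scores = q @ K.T / d ** 0.5
    return torch.softmax(scores, dim=-1) @ V

def prefill(x):                          # x: [s, d] prompt embeddings
    K, V = x @ Wk, x @ Wv                # KV cache [1:s], computed in parallel
    return attention(x[-1:] @ Wq, K, V), (K, V)

def decode_step(x_new, kv):              # x_new: [1, d] newest token embedding
    K, V = kv
    K = torch.cat([K, x_new @ Wk])       # one extra K/V row per generated token
    V = torch.cat([V, x_new @ Wv])
    return attention(x_new @ Wq, K, V), (K, V)

# Turn 1: prefill the prompt once, then decode token by token on top of the cache.
_, kv = prefill(torch.randn(1024, d))
for _ in range(16):
    _, kv = decode_step(torch.randn(1, d), kv)

# Without attention reuse, turn 2 prefills the whole history (q1 a1 q2) again,
# rebuilding exactly the K and V rows already held in `kv`.
```

The sketch also makes the storage pressure easy to see: every token in the session keeps its K and V rows alive for the rest of the session, which is why a long conversation accumulates GBs of cache.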
Figure 4: Recomputation inefficiencies. (a) The average numbers of historical tokens and new tokens in different turns of ShareGPT [37]. (b) The GPU time for prefilling all tokens and only the new input tokens in ShareGPT with Mistral-7B [20] on 1 A100 GPU.

Based on the analysis of ShareGPT [37, 42], a real dataset collected from ChatGPT that includes more than 90K conversations, we observe that 73% of conversations are multi-turn, as shown in Figure 2a. Moreover, 30% of conversations have more than 4K tokens, as shown in Figure 2b.

However, executing multi-turn conversations in current LLM serving engines is inefficient due to the repetitive computation of KV caches across multiple conversation turns. As shown in Figure 3a, in conversation turn 1, the LLM serving engine generates the KV cache of q1 and a1. After finishing turn 1, the LLM serving engine discards the KV cache to reclaim the HBM space. In turn 2, the LLM serving engine re-generates the KV cache of q1 and a1. In turn 3, the KV cache of q1, a1, q2, and a2 is re-generated. As the session expands, the historical tokens keep accumulating and the amount of repetitive computation significantly increases. As shown in Figure 4a, as the number of conversation turns grows, the percentage of historical tokens exceeds 99% in a new conversation. The repetitive computation time occupies 99% of the prefilling time (a.k.a. the time to the first token) in the new conversation, as shown in Figure 4b.

2.4 Opportunities and Challenges

Based on the analysis above, we observe that if the KV caches can be reused across multiple turns of conversations, up to 98% of the prefilling cost can be reduced. Specifically, the KV caches of historical conversations can be saved in a KV caching system outside of the GPUs. Upon the reactivation of a conversation session, GPUs load the associated KV caches from the KV caching system and reuse them for the new-turn conversation. Nevertheless, to build an efficient KV caching system, there exist many significant challenges.

1) High KV cache access overheads. During the inference, the computation of GPUs can be blocked due to waiting for the KV caches to be loaded from the KV caching system. The block time is non-negligible compared to the repetitive computation time of the KV cache, making the KV caching less beneficial. We measure the prefilling time of the LLaMA-65B model using 4 NVIDIA A100 GPUs and observe that prefilling 2K tokens of a prompt consumes about 360 ms. In contrast, loading the KV cache of the 2K tokens (5GB) from host memory to GPUs consumes about 192 ms (the GPU system with 16 lanes of PCIe Gen4 has about 26GB/s of effective data transmission bandwidth).

2) High storage capacity requirement of KV caches. Storing the KV cache for each request consumes a substantial amount of storage space. For instance, when using 4 A100 GPUs each with 80GB HBM to run LLaMA-65B, prefilling 2K tokens consumes about 360 ms. This process generates 5GB of KV cache, indicating that the generation speed of the KV cache is about 13.9GB/s. As 130GB of HBM space is allocated to store the model, the remaining 190GB of free HBM space will be fully occupied by the KV cache within 14 seconds. If spilling the KV cache to the host memory (e.g., 512GB of space), the host memory will be filled in less than 1 minute. Using disks to save the KV cache can extend the storage space. However, this incurs worse access performance, as presented below.

3) Suitable placement of KV caches in different hierarchies. Disks provide much larger capacity than the host memory (tens of TBs vs. several hundreds of GBs). Thus most KV caches are retained in disks. However, the disks have an access bandwidth of less than 5GB/s. As conversation requests arrive randomly, their corresponding KV caches are more likely to be located in disks when being accessed, resulting in poor inference performance. It is essential to ensure that the KV cache to be accessed in the immediate future is always placed in the host memory instead of disks.

4) Unexpected invalidation of the saved KV caches. As the number of conversation turns increases, the historical tokens can exceed the context window limitation. LLM serving engines generally perform token truncation [16, 33] to reduce the input prompt. The truncation has no impact on previous LLM serving engines since they always recompute the KV cache based on the input prompt following truncation. However, the truncation makes the KV caches saved in the KV caching system invalid, since the position of each token is changed after truncation and thus no longer matches the old embedded positional encoding in the saved KV cache. Such a context window overflow can occur with a high probability. As shown in Figure 2b, 47% and 30% of conversation sessions have a context longer than 2K and 4K, respectively. This means that when using the LLaMA-2 family with a 4K context window [49], the context window overflow occurs in 30% of conversation sessions. When using the OPT family with a 2K context window [63], the context window overflow occurs in 47% of conversation sessions.
Figure 5: Overview of AttentionStore. The controller hosts the job scheduler with its job queue and the KV cache manager, which handles KV access (load, §3.2.1; save, §3.2.2) and KV placement (fetch, §3.3.1; evict, §3.3.2) between the HBMs of the GPU cluster and the KV cache storage built from host memory and disks.

identifies the least valuable KV caches and evicts them to disks or out of the caching system (§3.3).

For Challenge 4, to deal with the invalidation of KV caches saved in AttentionStore due to context window overflow, we utilize a positional encoding decoupled truncation scheme to save the KV caches without positional encoding embedded, and hence support truncation directly on the KV caches. When loading the KV cache, AttentionStore re-embeds the new positional encoding into the KV caches (§3.4).

Figure 6: Layer-wise KV cache pre-loading. Blue blocks indicate the execution of each transformer layer. Red blocks indicate the KV cache loading of each transformer layer. (a) Baseline: KV cache loading without concurrent operations. (c) Layer-wise pre-loading with buffer.

Figure 7: Layer-wise KV cache pre-loading. (a) Layer-wise pre-loading with imperfect overlapping. (b) Perfect pre-loading with a customized larger buffer.

Figure 8: (a) Baseline: KV cache saving without concurrent operations.

observe that a gap still exists between the last job and the first layer of the current job, since the loading can only commence once the HBM execution buffer is available, i.e., the last job is finished. To further mitigate the gap between the last job and the first layer of the current job, AttentionStore reserves an HBM read buffer to eliminate the gap. Specifically, as shown in Figure 6c, with the read buffer, the read stream does not have to wait for the release of the execution buffer from the last job. The read stream can start the pre-loading while the last job is running.

However, pre-loading may fail to fully overlap with the computation if the KV cache loading time is longer than the prefilling computation time. As shown in Figure 7a, multiple gaps exist between the computation of layers because the KV cache fetching time for each layer exceeds the computation time for each layer, resulting in imperfect overlapping. The overhead can be further minimized by employing a customized larger pre-loading buffer. With the larger buffer, pre-loading can be issued with an earlier start. For instance, as shown in Figure 7b, with the larger buffer, pre-loading is allowed to pre-load the KV cache for more layers and thus the gaps between layers can be overlapped. Let T_load, T_pref, L_hist, and L_new denote the access time of the KV cache for a token, the prefilling time for a token, the length of historical tokens in a session, and the length of new input tokens in the conversation, respectively. Imperfect overlapping happens when T_load·L_hist > T_pref·L_new, which indicates that the transmission time is larger than the partial prefilling time. The buffer is used to fill up the time gap T_load·L_hist − T_pref·L_new. Combined with the PCIe bandwidth B, the buffer size can be set by the following formula: S_buf = B(T_load·L_hist − T_pref·L_new).

3.2.2 Asynchronous Saving from HBMs to Memory

AttentionStore needs to save KV caches to host memory to enable the reuse of the KV caches across conversations. A baseline method to save the KV caches is to write all produced KV caches together after the round of conversation ends. This method, however, potentially delays the execution of the next scheduled jobs since the KV saving time is on the critical path of inference, as shown in Figure 8a. To reduce this overhead, AttentionStore incorporates an asynchronous KV cache saving scheme to overlap the KV cache write-back with the computation, which also considers the different characteristics of the prefilling and decoding phases to perform different overlapping mechanisms.

Specifically, the generation speeds of KV caches at the prefilling and decoding phases are different. The prefilling phase processes tokens concurrently, thus generating substantial volumes of KV cache within a restricted timeframe. In contrast, the decoding phase generates the KV cache of one token at a time. As shown in Figure 8b, for the prefilling phase, as each self-attention operation can produce a significant amount of KV cache, the write stream retains the KV cache layer by layer, and the write-back of the KV cache produced by the prefilling phase can be overlapped with the decoding phase. For the decoding phase, as the KV cache is iteratively produced, the write stream writes back the KV cache layer by layer while decoding. To avoid getting stuck if the KV cache is not fully written back when the decoding has already finished, we also reserve an HBM write buffer to cover such cases, similar to the read buffer used in KV cache pre-loading. The unfinished KV caches are temporarily moved to the write buffer to avoid blocking the execution of the next job.
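The two overlapping schemes above can be sketched as follows. The snippet sizes the read buffer with S_buf = B(T_load·L_hist − T_pref·L_new) and overlaps per-layer loading, computation, and write-back; the thread pool standing in for the read/write streams and the function names are illustrative assumptions rather than the actual implementation.

```python
# Sketch of layer-wise pre-loading and asynchronous saving (assumed structure).
from concurrent.futures import ThreadPoolExecutor

def read_buffer_size(B, T_load, L_hist, T_pref, L_new):
    # S_buf = B * (T_load * L_hist - T_pref * L_new); no extra buffer is needed
    # when per-layer loading already hides behind per-layer prefilling.
    return max(0.0, B * (T_load * L_hist - T_pref * L_new))

def run_prefill(layers, load_layer_kv, compute_layer, save_layer_kv):
    io = ThreadPoolExecutor(max_workers=2)       # stand-ins for the read/write streams
    pending = io.submit(load_layer_kv, 0)        # start pre-loading before compute begins
    for i, layer in enumerate(layers):
        kv_hist = pending.result()               # blocks only if loading lags behind
        if i + 1 < len(layers):
            pending = io.submit(load_layer_kv, i + 1)   # overlap the next layer's load
        kv_full = compute_layer(layer, kv_hist)         # prefill new tokens of layer i
        io.submit(save_layer_kv, i, kv_full)            # asynchronous write-back
    io.shutdown(wait=True)                       # leftover saves drain via the write buffer
```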
3.3 Hierarchical KV Cache Placement

AttentionStore leverages both host memory and disks to expand the available space for KV cache storage. The access speed of host memory, i.e., DRAM, is much higher than that of disks, i.e., SSDs (tens of GB/s vs. several GB/s). If the KV caches to be accessed are always found in the host memory instead of disks, the access performance of KV caches will be optimal. To achieve this, AttentionStore applies a scheduler-aware fetching scheme to pre-fetch the KV caches from disks to host memory, ensuring KV cache access at the optimal speed (§3.3.1), and a scheduler-aware eviction scheme to evict suitable KV caches from host memory to disks (§3.3.2).

Figure 9: Scheduler-aware KV cache fetching and eviction.

3.3.1 Scheduler-aware Fetching from Disks to Memory

Since disks have much larger capacity than the host memory (tens of TBs vs. several hundreds of GBs), most KV caches are retained in disks for AttentionStore. As conversation requests arrive randomly, their corresponding KV caches are more likely to be located in disks, resulting in poor access performance.

To address the problem, we leverage a scheduler-aware KV cache fetching scheme to pre-fetch the KV caches to be accessed from disks to the host memory. This is done by utilizing the hints from the inference job scheduler. Specifically, the job scheduler maintains a job queue, thus having full knowledge of the waiting jobs. AttentionStore applies a look-ahead prefetching window to watch for the waiting jobs to be executed. If the KV cache of a waiting job is hit in the disks, AttentionStore pre-fetches it from the disks to the host memory before that job is executed. The length of the look-ahead prefetching window is determined by the available capacity in the host memory. Given the available memory capacity for prefetching C_mem and the average KV cache size of a session S_kv, the prefetching window length is L_pw = C_mem / S_kv.

A scheduler-aware fetching example is shown in Figure 9. As Job 1 is executing, the KV cache manager applies a look-ahead window size of 2 (the host memory has 2 KV cache slots for the KV cache fetching) to check the KV cache hit status of the waiting Jobs 2-3. The KV cache for Job 2 is hit in the host memory but the KV cache for Job 3 is not in the host memory. The KV cache fetching threads therefore start fetching the KV cache for Job 3 from disks to the host memory.

Note that AttentionStore includes a host memory buffer that allows for seamless fetching of KV caches from disks to memory, preventing any delays when the host memory is full. When the capacity of the free memory reaches a defined threshold, AttentionStore triggers a KV eviction from host memory to disks to ensure the constant availability of the host memory buffer. The eviction process from host memory to disks is presented in the next subsection.

3.3.2 Scheduler-aware Eviction from Memory to Disks

When the free space in the host memory is exhausted, we need to evict some KV caches from the host memory to disks. Meanwhile, if the disks are full, we also need to evict some KV caches stored in the disks out of the system. Therefore, it is important to carefully choose suitable KV cache candidates to be evicted for achieving a high cache hit rate.

Different from existing cache eviction strategies, such as least-recently-used (LRU) [50], first-in-first-out (FIFO) [9], and their variants, which solely rely on the historical access information of the KV caches, AttentionStore presents a scheduler-aware eviction scheme which can further leverage the future access information of KV caches to achieve a higher cache hit rate. The job queue in the job scheduler gives us the opportunity to achieve this. Specifically, AttentionStore maintains a look-ahead eviction window in the job queue. The maximum length of the look-ahead eviction window is determined by the total storage capacity of the KV caching system. Assume the total available capacity in the disks is C_disk. The look-ahead eviction window length is (C_mem + C_disk) / S_kv. When AttentionStore attempts to evict one item out of the KV caching system, if the item is found in the look-ahead eviction window, the item is exempted. When AttentionStore evicts one item from the host memory to disks, the item located at the tail of the look-ahead eviction window has a higher priority to be evicted. Note that one item corresponds to all KV caches associated with a conversation session, which is the minimal eviction and fetching granularity in AttentionStore. This is because the KV cache in the same conversation session is either all used or not used at all.

A scheduler-aware eviction example is shown in Figure 9. When the KV cache of Job 3 is chosen to be migrated to the host memory, the buffer will be utilized. To maintain a buffer in the host memory, AttentionStore needs to evict KV caches from the host memory to the disks. AttentionStore employs a look-ahead eviction window of size 6 to monitor the KV cache status of the jobs. First, it finds that the KV caches in the host memory all have an associated job in the job queue. It then continues scanning the look-ahead eviction window
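The window sizing and victim selection described in this section can be sketched as follows; the session and job bookkeeping is simplified, and all names are illustrative assumptions rather than the system's actual data structures.

```python
# Sketch of scheduler-aware prefetching/eviction windows and victim selection.

def prefetch_window_len(C_mem, S_kv):
    return int(C_mem // S_kv)                 # L_pw = C_mem / S_kv

def eviction_window_len(C_mem, C_disk, S_kv):
    return int((C_mem + C_disk) // S_kv)      # (C_mem + C_disk) / S_kv

def pick_demotion_victim(in_memory_sessions, job_queue, window_len):
    """Choose which session's KV cache to move from host memory to disks."""
    window = job_queue[:window_len]
    def next_use(session):
        # Sessions whose next job sits closer to the tail of the eviction window
        # (or has no job in it at all) are better victims.
        return window.index(session) if session in window else len(window)
    return max(in_memory_sessions, key=next_use)

def pick_discard_victim(on_disk_sessions, job_queue, window_len):
    """Choose which session to drop from the system entirely; sessions with a
    job inside the eviction window are exempted when possible."""
    window = set(job_queue[:window_len])
    candidates = [s for s in on_disk_sessions if s not in window]
    return candidates[0] if candidates else None
```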
Figure 10: Illustration of managing context window overflow. Context window size: 4K, truncation ratio: 2K. (a) Baseline. (b) AttentionStore. (KC: K cache; VC: V cache.)
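The truncation-then-reuse path in the figure relies on saving the K cache without positional information and re-embedding it at load time. The sketch below illustrates the idea with a simplified rotary positional encoding (RoPE [46]); the per-layer cache shape, the helper, and the 4K/2K sizes mirror the figure but are otherwise assumptions, not the paper's kernels.

```python
# Positional-encoding-decoupled truncation with a simplified RoPE (illustration).
import torch

def rope(x, positions, base=10000.0):
    # x: [seq, dim]; rotate each (even, odd) channel pair by a position-dependent angle.
    dim = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    ang = positions[:, None].float() * inv_freq[None, :]      # [seq, dim/2]
    cos, sin = torch.cos(ang), torch.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Save path: keep the per-layer K cache *without* positional encoding embedded.
k_raw = torch.randn(4096, 128)           # 4K-token session, head dim 128 (placeholder)

# Context window overflow: truncate the oldest 2K tokens directly on the saved cache.
k_kept = k_raw[2048:]

# Load path: re-embed the positional encoding for the new positions 0..2047 so the
# truncated cache matches the truncated prompt; V carries no positional encoding.
k_loaded = rope(k_kept, torch.arange(k_kept.shape[0]))
```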
Figure 13: Cache hit rate. Figure 14: Time to first token. Figure 15: Prefill throughput. Figure 16: GPU time.
computation with proactive swapping. Separate IO threads migrate data between the host memory and the disks, overlapping the execution with the KV cache migrations. Continuous batching [61] is enabled throughout the experiments.

Models. The experiments evaluate the open-sourced LLaMA-1 65B model [48], the LLaMA-2 models [49] with 13B and 70B parameters, and Falcon-40B [36]. The intermediate activation uses FP16, aligned with prior systems [52, 61]. We also implement Mistral-7B [20] with a 32K context window. Unless specified otherwise, LLaMA-13B operates on two GPUs with 24 batches, while LLaMA-65B, LLaMA-70B, and Falcon-40B run on four GPUs, handling 24 batches each.

Workloads. The workload is integrated from the ShareGPT dataset [37, 66]. As there is no public request arrival timestamp available in the dataset, we generate request arrival times based on the Poisson distribution with various arrival rates, following prior works [22, 56]. We set the number of different sessions arriving per second according to a Poisson distribution (with λ = 1.0). 9K conversation sessions are used in the experiments.
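One way to materialize the arrival process described above is shown below; the generator and seed are illustrative assumptions, since the exact procedure is not specified in the paper.

```python
# Poisson arrivals: inter-arrival times are exponential with rate lambda.
import numpy as np

rng = np.random.default_rng(0)
lam = 1.0                                  # lambda = 1.0 sessions per second
num_sessions = 9000                        # 9K conversation sessions
inter_arrival = rng.exponential(1.0 / lam, size=num_sessions)
arrival_times = np.cumsum(inter_arrival)   # timestamp assigned to each session
```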
Baseline. We compare AttentionStore (AS) with recomputation (RE). RE only keeps the historical tokens of conversation sessions. It discards KV caches after serving a conversation and does not keep the KV cache while the conversation session is inactive. When a conversation associated with a particular session becomes active again, RE leverages the historical tokens from that session to recompute their KV caches. When the historical tokens exceed the context window limitation, RE applies token truncation, the same as general LLM services [33]. For simplicity, the token truncation ratio is set to 0.5, implying that when an overflow occurs, the system will discard the earliest half of the tokens.

4.2 End-to-end Performance

In the end-to-end experiments, we use 9K conversations from ShareGPT [37]; the average number of turns in these conversations is 5.75, so the total number of conversation turns is about 52K. We warm up the KV caching system using the first 10K conversation turns and then evaluate the performance on the following 42K turns.

Cache hit rate. We first present the cache hit rate in AS since other performance metrics are closely related to it. Figure 13 shows the KV cache hit rates in AS for various LLMs. AS exhibits high hit rates of around 86%, 71%, 89%, and 90% for LLaMA-13B, LLaMA-65B, LLaMA-70B, and Falcon-40B, respectively. The relatively low hit rate of LLaMA-65B arises from the larger storage space it requires for saving KV caches. Given the same available storage space, AS accommodates fewer sessions for LLaMA-65B, thereby limiting the hit rate. Specifically, LLaMA-65B necessitates 2.5MB of space for each token in the KV cache and LLaMA-13B requires 0.78MB, while LLaMA-70B and Falcon-40B require only 0.31MB and 0.12MB of space per token due to using group query attention with a GQA factor of 8 and 16, respectively.

Time to first token (TTFT). TTFT is an important metric for quality of service in LLM serving [3, 35]. It indicates how quickly users start seeing the output of LLMs after entering their prompt. As shown in Figure 14, AS significantly reduces the TTFT by 85%, 61%, 87%, and 86% for LLaMA-13B, LLaMA-65B, LLaMA-70B, and Falcon-40B respectively, in comparison to RE. This is because AS eliminates a large amount of repetitive computation for generating the KV caches of historical tokens in the prefilling phase. Upon cache hits, the TTFT of AS only relies on the number of newly input tokens in the new conversation turn.

Prefilling throughput. Prefilling throughput is the metric to evaluate the speed of processing the prompt. Figure 15 shows the measured prefilling throughput. We observe that AS delivers remarkable speedups of 6.8×, 2.6×, 7.8×, and 7.2× for LLaMA-13B, LLaMA-65B, LLaMA-70B, and Falcon-40B respectively, when compared to RE. The improvement of AS on prefilling throughput comes from the reduced prefilling time. AS only prefills the new input of the new conversation. Moreover, AS can load and reuse the historical KV caches from the KV caching system with the layer-wise pre-loading optimization; the historical KV cache loading occurs simultaneously with the prefilling of the new input tokens.

GPU time. Figure 16 shows the end-to-end GPU time to finish all inference jobs in the workload. We observe that AS achieves speedups of 4.0×, 1.9×, 3.3×, and 3.4× for LLaMA-13B, LLaMA-65B, LLaMA-70B, and Falcon-40B respectively, compared to RE. The performance improvements
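The per-token KV cache sizes quoted above follow directly from each model's attention shape: 2 (K and V) × layers × KV-projection width × bytes per element. The quick check below uses the publicly reported model configurations, which are assumptions here rather than numbers taken from the paper.

```python
# Back-of-the-envelope per-token KV cache size in FP16.
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

MB = 1 << 20
print(kv_bytes_per_token(80, 64, 128) / MB)   # LLaMA-65B (MHA):      ~2.5 MB
print(kv_bytes_per_token(40, 40, 128) / MB)   # LLaMA-13B (MHA):      ~0.78 MB
print(kv_bytes_per_token(80, 8, 128) / MB)    # LLaMA-70B (GQA x8):   ~0.31 MB
print(kv_bytes_per_token(60, 8, 64) / MB)     # Falcon-40B (GQA x16): ~0.12 MB
```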
Figure 17: Inference cost. Figure 18: Recomputation vs. AttentionStore. Figure 19: AS with no pre-loading vs. AS pre-loading with various buffer sizes. Figure 20: Performance impact of using write overlap.
Figure 21: Comparison of the eviction algorithms under various storage settings. (a) Impact on hit rate. (b) Impact on GPU time.

Figure 22: Context overflow impact. (a) Impact on hit rate. (b) Impact on GPU time.
Figure 23: Impact of storage capacity and the number of distinct sessions. (a) Impact on hit rate. (b) Impact on throughput.

Figure 24: Impact of caching storage mediums. (a) Impact of caching storage mediums on hit rates. (b) Impact of caching storage mediums on GPU time.
Table 1: PPL comparison of different methods.

Benchmark    Model        RE      AS      NKVT
WikiText-2   LLaMA-7B     5.47    5.48    2198.7
WikiText-2   LLaMA-13B    4.91    4.90    1647.7
PTB          LLaMA-7B     8.48    8.49    2543.5
PTB          LLaMA-13B    7.61    7.60    1865.8
C4           LLaMA-7B     6.96    6.98    2343.5
C4           LLaMA-13B    6.44    6.45    1745.6

Table 2: Accuracy of different methods.

Benchmark   Model        AS       TT       NKVT
MMLU        LLaMA-7B     43.7%    43.4%    21.8%
MMLU        LLaMA-13B    52.3%    53.2%    29.6%
LongEval    LLaMA-7B     66.0%    65.9%    12.0%
LongEval    LLaMA-13B    68.0%    68.0%    14.0%
PIQA        LLaMA-7B     77.1%    77.2%    48.9%
PIQA        LLaMA-13B    80.5%    80.4%    50.2%

caches. Directly truncating the KV caches would scramble the coupled positional information, resulting in the models' failure to maintain a low PPL.

Accuracy. To analyze the accuracy of the models in answering questions after truncation, we conduct experiments using the MMLU [18], LongEval [23, 58], and PIQA [7] benchmarks. Specifically, we first input a long text to simulate the overflow of historical inputs and trigger the truncation operation, and then append the questions from the benchmarks as new inputs. As shown in Table 2, both AS and TT provide comparably high accuracy. TT achieves high accuracy by paying the recomputation cost for context window overflow, while AS avoids this cost and still maintains high accuracy. In contrast, NKVT has much lower accuracy than AS and TT because the coupled positional encoding after KV cache truncation is miscoded, which results in more disruption to new inputs.

4.3.8 Performance for Long Sequence Inference

Modern LLMs continue to incorporate longer context windows to accommodate a greater amount of information, empowering long sequence inference applications (e.g., document understanding [60] and code understanding [29]). We assess the efficacy of AttentionStore with models designed for long sequence inference in these applications. Specifically, we deploy the Mistral-7B model [20] with a maximum 32K context window on one A100 GPU with 80GB HBM, employing a GQA factor of 8 [2, 20]. We evaluate a documentation analysis application as an example. In this application, users submit a series of analysis tasks for the same document, forming a multi-turn conversation session. The size of the document varies from 4K to 28K. Each session consists of 6 analysis tasks, with each task requiring an input of 256 tokens and producing an output of 64 tokens. In the experiments, we use a batch size of 1 and evaluate the performance of the second and subsequent turns.

Figure 25: Prefilling performance of long sequence inference. (a) Impact on TTFT. (b) Impact on prefill throughput.

TTFT. Figure 25a shows the average TTFT for different sequence lengths. The TTFT of RE gradually increases to 2.5s when the sequence length reaches 28K. In contrast, AS has only about 0.12s of TTFT, resulting in a 95% reduction in TTFT compared to RE.

Prefilling throughput. Figure 25b illustrates the measured prefilling throughput. AS significantly improves the prefilling throughput, achieving a speedup of up to 22× compared to RE. The prefilling throughput of RE cannot improve as the sequence length grows because it is bounded by the computational capability. AS observes a continuous increase in the prefilling throughput as the sequence length grows by efficiently reusing the historical KV cache.

GPU time. Figure 26a shows the average GPU time to complete each analysis task. AS demonstrates consistent GPU time savings compared to RE, regardless of the sequence length. When the sequence length increases, RE requires more prefilling time to recompute the KV caches, which accounts for a substantial portion of the total GPU time, i.e., 41% for a sequence length of 28K. In contrast, AS efficiently reduces prefilling costs by reusing KV caches, resulting in only 1.2% of the GPU time being allocated to prefilling.

Output throughput. The overall output throughput is calculated as the number of generated tokens divided by the total processing time of a task, as shown in Figure 26b. As the sequence length increases, both RE and AS experience a decrease in throughput. This is attributed to the increased computational demands of computing attention over the lengthy sequence, resulting in a longer decoding time for each token. Notably, AS consistently surpasses RE in all scenarios, demonstrating an improvement in output throughput of up to 67%. These improvements in output throughput primarily stem from the elimination of KV cache recomputation.
Figure 26: Overall performance of long sequence inference. (a) Impact on GPU time. (b) Impact on output throughput.

5 Related Work

6 Conclusion

This paper proposes AttentionStore, a new attention mechanism that allows the reuse of the KV caches for any ensuing turns of the same conversation, achieving a significant reduction in the recomputation overhead of KV caches in LLMs. To improve the efficiency of AttentionStore, we design overlapped KV cache access, hierarchical KV cache placement, and positional encoding decoupled KV cache truncation schemes. Extensive experimental results demonstrate that AttentionStore significantly decreases the TTFT by up to 87% and improves the prompt prefilling throughput by 7.8× for multi-turn conversations. It reduces the end-to-end inference cost by up to 70%. It also decreases the TTFT by up to 95% and enhances the prompt prefilling throughput by 22× for long sequence inference.
[8] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.

[9] Asit Dan and Don Towsley. An approximate analysis of the lru and fifo buffer replacement schemes. In Proceedings of the 1990 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 143–152, 1990.

[10] Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, and Subhabrata Mukherjee. Skipdecode: Autoregressive skip decoding with batching and caching for efficient llm inference. arXiv preprint arXiv:2307.02628, 2023.

[11] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. In Proceedings of Advances in Neural Information Processing Systems, NeurIPS, 2022.

[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[13] Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801, 2023.

[14] Hila Gonen, Srini Iyer, Terra Blevins, Noah A Smith, and Luke Zettlemoyer. Demystifying prompts in language models via perplexity estimation. arXiv preprint arXiv:2212.04037, 2022.

[15] Chengyue Gong, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. Frage: Frequency-agnostic word representation. In Proceedings of Advances in Neural Information Processing Systems, NeurIPS, 2022.

[16] Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. Lm-infinite: Simple on-the-fly length generalization for large language models. arXiv preprint arXiv:2308.16137, 2023.

[17] Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. Transformer in transformer. In Proceedings of Advances in Neural Information Processing Systems, NeurIPS, 2021.

[18] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

[19] InternLM. Lmdeploy. https://github.com/InternLM/lmdeploy.

[20] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.

[21] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.

[22] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM Symposium on Operating Systems Principles, SOSP, 2023.

[23] Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. How long can context length of open-source llms truly promise? In NeurIPS Workshop, 2023.

[24] Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. Accelerating distributed moe training and inference with lina. In Proceedings of the USENIX Annual Technical Conference, ATC, 2023.

[25] Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. arXiv preprint arXiv:2305.17118, 2023.

[26] Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, et al. Deja vu: Contextual sparsity for efficient llms at inference time. In Proceedings of the International Conference on Machine Learning, ICML, 2023.

[27] Mitch Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of english: The penn treebank. Computational Linguistics, 19(2):313–330, 1993.

[28] Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, and Mehrdad Farajtabar. Relu strikes back: Exploiting activation sparsity in large language models. arXiv preprint arXiv:2310.04564, 2023.
[29] Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. Using an llm to help with code understanding. In Proceedings of the IEEE/ACM International Conference on Software Engineering, ICSE, 2024.

[30] NVIDIA. Fastertransformer. https://github.com/NVIDIA/FasterTransformer.

[31] NVIDIA. Nvidia collective communications library (nccl). https://developer.nvidia.com/nccl.

[35] Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Aashaka Shah, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. arXiv preprint arXiv:2311.18677, 2023.

[36] Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.

[37] philschmid. Sharegpt raw. https://huggingface.co/datasets/philschmid/sharegpt-raw/tree/main/sharegpt_90k_raw_dataset.

[38] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. In Proceedings of Machine Learning and Systems, MLSys, 2023.

[39] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 2020.

[40] Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale. In Proceedings of the International Conference on Machine Learning, ICML, 2022.

[41] Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC, 2021.

[45] Benjamin Spector and Chris Re. Accelerating llm inference with staged speculative decoding. arXiv preprint arXiv:2308.04623, 2023.

[46] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

[47] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

[48] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[49] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

[50] Valentin Touzeau, Claire Maïza, David Monniaux, and Jan Reineke. Fast and exact analysis for lru caches. Proceedings of the ACM on Programming Languages, 3(POPL):1–29, 2019.
[51] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of Advances in Neural Information Processing Systems, NeurIPS, 2017.

[52] vLLM Project. vllm: Easy, fast, and cheap llm serving with pagedattention. https://github.com/vllm-project/vllm/.

[53] Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. Mint: Evaluating llms in multi-turn interaction with tools and language feedback. arXiv preprint arXiv:2309.10691, 2023.

[54] Yiding Wang, Kai Chen, Haisheng Tan, and Kun Guo. Tabi: An efficient multi-level inference system for large language models. In Proceedings of the European Conference on Computer Systems, EuroSys, 2023.

[55] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.

[56] Bingyang Wu, Yinmin Zhong, Zili Zhang, Gang Huang, Xuanzhe Liu, and Xin Jin. Fast distributed inference serving for large language models. arXiv preprint arXiv:2305.05920, 2023.

[57] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023.

[58] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.

[59] Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mt5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934, 2020.

[60] Lu Ye, Ze Tao, Yong Huang, and Yang Li. Chunkattention: Efficient self-attention with prefix-aware kv cache and two-phase partition. arXiv preprint arXiv:2402.15220, 2024.

[61] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for transformer-based generative models. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, OSDI, 2022.

[62] Lingfan Yu and Jinyang Li. Stateful large language model serving with pensieve. arXiv preprint arXiv:2312.05516, 2023.

[63] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

[64] Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. Ernie: Enhanced language representation with informative entities. arXiv preprint arXiv:1905.07129, 2019.

[65] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. arXiv preprint arXiv:2306.14048, 2023.

[66] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.

[67] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Efficiently programming large language models using sglang. arXiv preprint arXiv:2312.07104, 2023.