
AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in

Large Language Model Serving

Bin Gao1,* , Zhuomin He2,* , Puru Sharma1 , Qingxuan Kang1 , Djordje Jevdjic1 , Junbo Deng3 ,
Xingkun Yang3 , Zhou Yu3 , and Pengfei Zuo3,†
1 National University of Singapore 2 Shanghai Jiaotong University 3 Huawei Cloud
arXiv:2403.19708v2 [cs.CL] 16 Apr 2024

*Work done during their internship at Huawei Cloud.
†Corresponding author: Pengfei Zuo ([email protected]).

Abstract

Interacting with humans through multi-turn conversations is a fundamental feature of large language models (LLMs). However, existing LLM serving engines for executing multi-turn conversations are inefficient due to the need to repeatedly compute the key-value (KV) caches of historical tokens, incurring high serving costs. To address the problem, this paper proposes AttentionStore, a new attention mechanism that enables the reuse of KV caches (i.e., attention reuse) across multi-turn conversations, significantly reducing the repetitive computation overheads. AttentionStore maintains a hierarchical KV caching system that leverages cost-effective memory/storage mediums to save KV caches for all requests. To reduce KV cache access overheads from slow mediums, AttentionStore employs layer-wise pre-loading and asynchronous saving schemes to overlap the KV cache access with the GPU computation. To ensure that the KV caches to be accessed are placed in the fastest hierarchy, AttentionStore employs scheduler-aware fetching and eviction schemes to consciously place the KV caches in different layers based on the hints from the inference job scheduler. To avoid the invalidation of the saved KV caches incurred by context window overflow, AttentionStore enables the saved KV caches to remain valid via decoupling the positional encoding and effectively truncating the KV caches. Extensive experimental results demonstrate that AttentionStore significantly decreases the time to the first token (TTFT) by up to 87%, improves the prompt prefilling throughput by 7.8× for multi-turn conversations, and reduces the end-to-end inference cost by up to 70%. For long sequence inference, AttentionStore reduces the TTFT by up to 95% and improves the prompt prefilling throughput by 22×.

1 Introduction

With impressive performance on a wide variety of tasks, large language models (LLMs) have ushered in a new era of generative applications [32, 47, 48]. However, serving these generative applications with LLMs is very expensive because LLM inference employs a large number of GPUs. Given the high demand for generative applications, reducing the cost of inference becomes crucial.

Engaging in multi-turn conversations with humans is an essential capability of LLMs [53, 57]. These multi-turn conversations help LLMs comprehend context, user intent, and emotional nuances, enhancing their ability to respond appropriately. Based on the ShareGPT data [37], a widely-used real dataset collected from ChatGPT, 73% of conversations involve multiple turns, as analyzed in Section 2.3.

However, executing multi-turn conversations in current LLM serving engines is highly inefficient, as it requires a large number of repetitive computations, incurring high serving costs. During a single turn of conversation, the LLM engine stores intermediate data, key-value (KV) pairs [4, 22, 38], in the limited high-bandwidth memory (HBM) on GPUs. When that conversation ends and the conversation session becomes inactive, the LLM engine generally discards the KV cache associated with that session to free up space in the HBM for other active sessions. When the session becomes active again, i.e., the user sends the next message in the conversation, the LLM engine computes the whole KV cache again. This leads to repetitive computation of the same KV cache, wasting valuable GPU computation resources. As the number of conversation turns increases, the repetitive computation overhead increases linearly. Our analysis based on ShareGPT shows that up to 99% of the prefilling cost comes from repetitive computation for the KV cache, as presented in Section 2.3.

To reduce the serving cost and improve the inference performance, this paper proposes AttentionStore, a new attention mechanism that enables the reuse of KV caches (i.e., attention reuse) across multi-turn conversations rather than discarding them. When a conversation session becomes inactive, AttentionStore saves the corresponding KV cache in a KV caching system. Upon the resumption of the same session, AttentionStore loads and reuses the saved KV cache from the KV caching system, thereby eliminating the overhead of the repetitive computation.
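To make this reuse flow concrete, the following is a minimal sketch of attention reuse across turns; it is not the paper's implementation. The `SessionKVStore` class, the `llm.prefill`/`llm.decode` interfaces, and the flat in-memory dictionary are illustrative assumptions standing in for AttentionStore's hierarchical KV caching system.

```python
# Minimal sketch of attention reuse across conversation turns (illustrative only).
# `llm` is an assumed serving-engine handle with:
#   llm.prefill(tokens, past_kv) -> KV cache covering past_kv plus `tokens`
#   llm.decode(kv_cache)         -> (response_tokens, kv_cache incl. the response)

class SessionKVStore:
    """Keeps the KV cache of inactive sessions instead of discarding it."""

    def __init__(self):
        self._store = {}  # session_id -> KV cache of all historical tokens

    def save(self, session_id, kv_cache):
        self._store[session_id] = kv_cache

    def load(self, session_id):
        return self._store.get(session_id)  # None on a cache miss


def serve_turn(llm, store, session_id, history_tokens, new_tokens):
    past_kv = store.load(session_id)
    if past_kv is None:
        # Cache miss: fall back to recomputing the KV cache of the whole history.
        past_kv = llm.prefill(history_tokens, past_kv=None)
    # Partial prefilling: only the newly input tokens of this turn are computed.
    kv = llm.prefill(new_tokens, past_kv=past_kv)
    response, kv = llm.decode(kv)
    # The session becomes inactive again; keep its KV cache for the next turn.
    store.save(session_id, kv)
    return response
```

On a cache hit, only the tokens of the new turn are prefilled, which is exactly the partial prefilling that the evaluation later credits for the TTFT and prefilling-throughput gains.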
However, building such an efficient KV caching system for multi-turn conversations presents significant challenges.

Firstly, the KV caching system serves as the external storage for GPUs and is attached to the GPUs via low-speed links. The use of the KV caching system brings about significant access overhead due to the need to transfer KV caches between HBMs and the KV caching system. The access overhead of KV caches is in the critical path of inference execution. This is because GPUs can only perform the computation of an inference job after successfully loading its corresponding KV cache into HBMs. Likewise, the subsequent inference jobs need to wait until the KV caches from the previous jobs are moved out of the HBMs if the HBM space is not enough. To reduce the KV cache loading overheads, AttentionStore uses a layer-wise pre-loading scheme to overlap the time of loading the KV cache with the inference computation layer by layer. To reduce the KV cache saving overheads, AttentionStore develops an asynchronous saving scheme that overlaps the time of saving KV caches with the inference computation.

Secondly, the KV caches occupy a large amount of storage space that continuously expands during conversations. Prior works have attempted to reduce the inefficiency of repetitive KV computation by retaining the KV caches across multi-turn conversations in HBMs [19, 67]. However, this quickly exhausts the limited HBM capacity. We present an example of LLaMA-65B in Section 2.3, which shows the KV caches fully occupy the free space within the HBMs in 14 seconds. To address this challenge, AttentionStore explores and exploits slower but larger-capacity storage hierarchies than HBMs, including host memory and disks, to provide adequate storage space for caching KV caches.

Thirdly, since disks have much larger capacity than the host memory (tens of TBs vs. several hundreds of GBs), most KV caches are retained in disks for AttentionStore. As conversation requests arrive randomly, their corresponding KV caches are more likely to be located in disks, resulting in poor access performance. To address this problem, AttentionStore uses a scheduler-aware KV cache fetching scheme. This scheme pre-fetches the KV caches that are likely to be accessed from disks to the host memory, by utilizing the hints received from the inference job scheduler. When the free space of the host memory is not enough, AttentionStore also adopts a scheduler-aware eviction scheme to efficiently identify the most suitable KV caches in memory and evict them to disks or out of the system.

Finally, when a conversation session surpasses the limit of the context window of LLMs, e.g., 4K in LLaMA-2 [49], LLMs generally truncate the oldest tokens and limit the context to the most recent tokens [33]. This truncation makes all saved KV caches of that conversation in AttentionStore invalid, since the positional information of all tokens embedded in the KV cache is changed. To overcome this issue, AttentionStore decouples the positional encoding from the KV caches when saving them. It re-embeds the positional encoding into the KV caches when loading them. After decoupling, truncation can be directly applied to the KV caches, thereby ensuring the reusability of the saved KV caches.

We implement AttentionStore and evaluate it using the real ShareGPT dataset [37]. Extensive experimental results demonstrate that AttentionStore significantly decreases the time to the first token (TTFT) by up to 87% and improves the prompt prefilling throughput by 7.8× for multi-turn conversations. It also reduces the end-to-end inference cost by up to 70%. For long sequence inference, AttentionStore reduces the TTFT by up to 95% and improves the prompt prefilling throughput by 22×. To summarize, this paper makes the following contributions:

• We investigate the recomputation overheads of KV caches in LLMs across conversation turns and identify the challenges associated with retaining KV caches across multi-turn conversations.

• We propose AttentionStore, a new attention mechanism that allows the reuse of the KV caches for any ensuing conversation turns of the same session, achieving a significant reduction in the recomputation overhead of KV caches in LLMs.

• To improve the efficiency of AttentionStore, we design overlapped KV cache access, hierarchical KV cache placement, and positional encoding decoupled KV cache truncation schemes.

• We thoroughly evaluate AttentionStore with real datasets to demonstrate its efficacy and efficiency.

2 Background and Motivation

This section begins with an overview of the fundamentals of generative LLM inference. It then delves into the inefficiencies that exist in LLMs during multi-turn conversations. The section ends with a discussion of the design opportunities for dealing with these inefficiencies and the challenges faced during the design of such a system.

2.1 Generative LLM Inference Basics

Transformer Architecture. The transformer has emerged as the widely accepted standard in generative LLM inference. The widely used LLMs like GPTs [32] and LLaMAs [48, 49] are built upon the autoregressive transformer architecture [17, 51]. During inference, these models process the prompt of the users and generate a response. The prompt is processed as a sequence of input tokens, and the response is generated by the model predicting the probability of subsequent tokens using the context of all the prior
tokens. The transformer model consists of a chain of l transformer layers. Each transformer layer is comprised of two steps, self-attention and feed-forward network (FFN).

Figure 1: Prefilling and decoding phases. (a) Two-phase illustration. (b) Execution latency. Latency measured for LLaMA-70B of batch size 8 on 4 A100 GPUs.

Figure 2: (a) Distribution for conversation turn number in ShareGPT [37]. (b) The session length distribution of ShareGPT. For better display effect, the statistics exclude conversations with over 40 turns or sessions that exceed a length of 32K.

For the input token list X = [x_1, x_2, ..., x_s], each layer applies a series of projections on each token in X using the weights W_Q, W_K, W_V. This generates the elements in the set of queries, keys, and values, referred to as Q, K, and V respectively:

Q = W_Q X,   K = W_K X,   V = W_V X

Subsequently, attention scores are computed via Q, K, and V:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

where d_k is the dimension of the key vector k. Finally, the projection operation applies a linear transformation on attention scores. This projected result is handed to the FFN layer. The result from FFN is passed on to the next transformer layer as input. Finally, after the input has been processed through all l transformer layers, the output is a probability vector that marks out the most probable output tokens.

KV Cache: Within the entire process above, each token produces intermediate K and V tensors. When generating subsequent tokens, all KV tensors of preceding tokens are necessary for computing the self-attention. These K and V tensors are generally cached in GPUs, referred to as the KV cache. The KV cache typically has a large footprint. For example, GPT-3 [11, 32] generates a 4.5MB KV cache for each token. The size of the KV cache linearly increases with the number of prompt tokens. A conversation session containing thousands of tokens will produce several GBs of KV cache.

2.2 Autoregressive Generation

As illustrated in Figure 1a, transformer-based generation can logically be identified as two distinctive phases [1].

The prefilling phase. Given a request prompt, the generation takes the prompt token list X = [x_1, x_2, ..., x_s] as input and then proceeds to compute the token x_{s+1}. This process generates a series of KVs, specifically forming the KV cache ranging from 1 to s, which are used for the decoding phase.

The decoding phase. The decoding phase generates output tokens with autoregressive iterations. The decoding phase takes token s+1 and the KV cache [1:s] from the prefilling phase as input to compute the KV cache s+1 and the token s+2. The generation process iteratively continues until the generated token is <eos> or the iteration number reaches the maximum allowed generation number. The decoding phase only happens sequentially due to the heavy data dependency on the previous iteration.

The two phases present significantly different characteristics in terms of execution time. The prefilling phase computes the KV cache in parallel. The duration of this phase is closely tied to the number of prompt tokens provided as input. As shown in Figure 1b, the execution time of the prefilling phase increases as the number of input tokens grows. In contrast, the decoding phase only performs computation for a single token in each iteration, which makes the computation time for each iteration relatively constant.

Figure 3: Comparison of recomputation and AttentionStore. (a) Recomputation. (b) AttentionStore.

2.3 Multi-turn Conversation Inference

Engaging humans in multi-turn conversations is a fundamental feature of modern LLMs. A multi-turn conversation session consists of a series of continuous conversations, denoted as D = [d_1, d_2, ..., d_N]. In each conversation d_j, a user inputs a new question or command q_j and then awaits the response a_j from the LLM. To maintain a coherent context and understanding of the conversation session, the LLM generates a_{N+1} based on both the historical tokens from all previous
conversation turns d[1:N] and the input tokens of the current turn, denoted as q_1 a_1 q_2 a_2 ... q_N a_N q_{N+1}.

Figure 4: Recomputation inefficiencies. (a) The average numbers of historical tokens and new tokens in different turns of ShareGPT [37]. (b) The GPU time for prefilling all tokens and only new input tokens in ShareGPT with Mistral-7B [20] on 1 A100 GPU.

Based on the analysis of ShareGPT [37, 42], a real dataset collected from ChatGPT that includes more than 90K conversations, we observe that 73% of conversations are multi-turn, as shown in Figure 2a. Moreover, 30% of conversations have more than 4K tokens, as shown in Figure 2b.

However, executing multi-turn conversations in current LLM serving engines is inefficient due to the repetitive computation of KV caches across multiple conversation turns. As shown in Figure 3a, in conversation turn 1, the LLM serving engine generates the KV cache of q_1 and a_1. After finishing turn 1, the LLM serving engine discards the KV cache to reclaim the HBM space. In turn 2, the LLM serving engine re-generates the KV cache of q_1 and a_1. In turn 3, the KV cache of q_1, a_1, q_2, and a_2 is re-generated. As the session expands, the historical tokens keep accumulating and the amount of repetitive computation significantly increases. As shown in Figure 4a, as the number of conversation turns increases, the percentage of historical tokens grows to more than 99% in a new conversation. The repetitive computation time occupies 99% of the prefilling time (a.k.a. the time to the first token) in the new conversation, as shown in Figure 4b.

2.4 Opportunities and Challenges

Based on the analysis above, we observe that if the KV caches can be reused across multiple turns of conversations, up to 98% of the prefilling cost can be reduced. Specifically, the KV caches of historical conversations can be saved in a KV caching system out of GPUs. Upon the reactivation of a conversation session, GPUs load the associated KV caches from the KV caching system and reuse them for the new-turn conversation. Nevertheless, to build an efficient KV caching system, there exist many significant challenges.

1) High KV cache access overheads. During the inference, the computation of GPUs can be blocked due to waiting for the KV caches to be loaded from the KV caching system. The block time is non-negligible compared to the repetitive computation time of the KV cache, making the KV caching solution lose efficacy. For example, we evaluate the inference time of the LLaMA-65B model using 4 NVIDIA A100 GPUs and observe that prefilling 2K tokens of a prompt consumes about 360 ms. In contrast, loading the KV cache of the 2K tokens (5GB) from host memory to GPUs consumes about 192 ms (the GPU system with 16 lanes of PCIe Gen4 has about 26GB/s of effective data transmission bandwidth).

2) High storage capacity requirement of KV caches. Storing the KV cache for each request consumes a substantial amount of storage space. For instance, when using 4 A100 GPUs each with 80GB HBM to run LLaMA-65B, prefilling 2K tokens consumes about 360 ms. This process generates 5GB of KV cache, indicating the generation speed of the KV cache is about 13.9GB/s. As 130GB of HBM space is allocated to store the model, the remaining 190GB of free HBM space will be fully occupied by the KV cache within 14 seconds. If spilling the KV cache to the host memory (e.g., 512GB space), the host memory will be filled in less than 1 minute. Using disks to save the KV cache can extend the storage space. However, this incurs worse access performance, as presented below.

3) Suitable placement of KV caches in different hierarchies. Disks provide much larger capacity than the host memory (tens of TBs vs. several hundreds of GBs). Thus most KV caches are retained in disks. However, the disks have an access bandwidth of less than 5GB/s. As conversation requests arrive randomly, their corresponding KV caches are more likely to be located in disks when being accessed, resulting in poor inference performance. It is essential to ensure that the KV cache to be accessed in the immediate future is always placed in the host memory instead of disks.

4) Unexpected invalidation of the saved KV caches. With the number of conversation turns increasing, the historical tokens can exceed the context window limitation. LLM serving engines generally perform token truncation [16, 33] to reduce the input prompt. The truncation has no impact on previous LLM serving engines since they always recompute the KV cache based on the input prompt following truncation. However, the truncation makes the KV caches saved in the KV caching system invalid, since the position of each token is changed after truncation. Thus they cannot match the old embedded positional encoding in the saved KV cache. Such a context window overflow can occur with a high probability. As shown in Figure 2b, 47% and 30% of conversation sessions have a context longer than 2K and 4K, respectively. It means that when using the LLaMA-2 family with a 4K context window [49], the context window overflow occurs in 30% of conversation sessions. When using the OPT family with a 2K context window [63], the context window overflow occurs in 47% of conversation sessions.
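To make the recomputation overhead of Section 2 concrete, here is a hedged sketch that contrasts re-prefilling the whole history with reusing a saved KV cache through the generic `past_key_values` interface of Hugging Face Transformers. The `gpt2` checkpoint and the placeholder strings are illustrative only; they are not the models or workloads evaluated in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the paper evaluates LLaMA/Falcon-family models.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

history = "..."   # all previous turns (q1 a1 ... qN aN), placeholder text
new_turn = "..."  # the new question q_{N+1}, placeholder text

with torch.no_grad():
    # (a) Recomputation: the prefill runs over history + new turn every time.
    full_ids = tok(history + new_turn, return_tensors="pt").input_ids
    out_full = model(full_ids, use_cache=True)

    # (b) Attention reuse: the KV cache of the history was produced earlier
    # (here recomputed once for illustration) and saved across turns ...
    hist_ids = tok(history, return_tensors="pt").input_ids
    saved_kv = model(hist_ids, use_cache=True).past_key_values

    # ... so the new turn only needs a partial prefill over its own tokens,
    # while still attending to the full history through saved_kv.
    new_ids = tok(new_turn, return_tensors="pt").input_ids
    out_new = model(new_ids, past_key_values=saved_kv, use_cache=True)
```

In case (b) the prefill cost scales with the new tokens only, which is the gap between the "prefill all" and "prefill new" curves in Figure 4b.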
AttentionStore
Controller
GPU Cluster identifies the least valuable KV caches and evicts them to
Job Scheduler GPU 0 GPU 1 disks or out of the caching system (§3.3).
HBM 0 HBM 1
Job For Challenge 4, to deal with the invalidation of KV caches
Queue Load (§3.2.1) Save (§3.2.2)
saved in AttentionStore due to context window overflow, we
utilize a positional encoding decoupled truncation scheme to
KV Cache Manager Host memory save the KV caches without positional encoding embedded,
KV Access Fetch (§3.3.1) Evict (§3.3.2) and hence support the truncation directly on KV caches. When
(§3.2)
loading the KV cache, AttentionStore re-embeds the new
KV Placement Disks positional encoding into the KV caches (§3.4).
(§3.3)
KV Cache Storage

3.2 Overlapped KV Cache Access


Figure 5: The system architecture of AttentionStore.
The use of slower memory/storage hierarchies results in sig-
nificant access overhead because KV caches need to be trans-
3 The AttentionStore Design ferred between HBMs and the slower mediums, blocking the
inference and causing a waste of computational resources. To
3.1 Overview reduce the KV cache loading overheads from host memory to
HBMs, AttentionStore uses a layer-wise pre-loading scheme
We propose a new attention mechanism, called AttentionStore, to overlap the loading of the KV cache with the inference
to save the KV caches for all conversations, enabling the reuse computation layer by layer (§3.2.1). To reduce the KV cache
of historical KV caches across multi-turn conversations, in- saving overheads, AttentionStore develops an asynchronous
stead of discarding them as in conventional attention mecha- saving scheme that overlaps the saving of KV caches with the
nisms. Specifically, AttentionStore saves the KV cache in a inference computation (§3.2.2).
KV caching system when the associated conversation session
is inactive. If the same conversation session is activated in the
3.2.1 Layer-wise Pre-loading from Memory to HBMs
future, its KV cache is fetched from the KV caching system
and reused for inference. By doing so, AttentionStore only AttentionStore loads KV caches from the host memory to
executes partial prefilling, on just the new tokens input in the HBMs, resulting in high data access overhead. The access
new turn of conversation, rather than prefilling all historical process is in the critical path of the inference execution as
tokens. As shown in Figure 3b, when executing the inference shown in Figure 6a, since GPUs must rely on the KV cache
of Turn 3, the KV cache of q1 , a1 , q2 , and a2 is reused and to execute the inference computation. This overhead becomes
only q3 needs to be prefilled. AttentionStore effectively elim- more significant as the size of the KV cache increases, as
inates the repetitive computation overhead of the historical discussed in Section 2.4. To eliminate this overhead, Atten-
tokens, thereby reducing the prefilling cost. tionStore employs a layer-wise pre-loading scheme to miti-
Figure 5 shows the architectural overview of Attention- gate the impact. The main idea is to overlap the loading of
Store. It maintains a hierarchical KV caching system with the KV cache with the prefilling computation of new input
efficient KV cache access, placement, and truncation tech- tokens for the conversation. In particular, the LLM model is
niques to address the challenges mentioned in Section 2.4. chained by multiple transformer layers, each with its own KV
For Challenge 1, to reduce the overhead of KV cache load- cache. As the GPU executes a layer, the KV cache needed by
ing from the KV caching system into HBMs, AttentionStore the subsequent layers can be loaded from the host memory
leverages a layer-wise pre-loading scheme to overlap the KV concurrently. By doing so, when the GPU starts computing
cache loading with the inference computation. To reduce the the self-attention for a layer, the corresponding KV cache of
KV cache saving overhead from HBMs to host memory, At- the layer is already in the HBM execution buffer.
tentionStore leverages an asynchronous saving scheme to Figure 6b illustrates how the layer-wise pre-loading scheme
overlap the saving with the inference computation. (§3.2). overlaps the KV cache fetching time with the computation
For Challenges 2 and 3, to enlarge the available storage time. The example applies a 3-layer model for simplicity.
space for caching KV caches, AttentionStore employs multi- Before initiating the computation of Layer 1, the KV cache
tier cost-effective storage mediums, i.e., host memory and for this layer must first be prepared in the HBM. The read
disks. To reduce the impact of accessing slow disks on the stream first issues a KV cache loading operation to read the
inference performance, we present a scheduler-aware fetching KV cache for Layer 1 into the HBM execution buffer. The
scheme that leverages the hints from the job scheduler to execution stream then starts computing Layer 1. While the
prefetch KV caches to be accessed from disks to host memory. execution stream is computing one layer, the read stream
Meanwhile, to efficiently leverage the limited host memory concurrently loads the KV cache for the next layers. Thus,
space, we present a scheduler-aware eviction scheme that the loading is overlapped with the computation. However, we

5
prefill decode
Execution
Stream
Last job Gap Layer 1 Layer 2 Layer 3 Execution
Stream
Layer 1 Layer 2 Layer 3 L1 L2 L3 L1 L2 L3 Gap
Read
Layer 1 Layer 2 Layer 3
Stream Write write
Stream

(a) Baseline: KV cache loading without concurrent operations. (a) Baseline: KV cache saving without concurrent operations.
Execution prefill decode
Stream Last job Layer 1 Layer 2 Layer 3
Execution
Read Stream Layer 1 Layer 2 Layer 3 L1 L2 L3 L1 L2 L3
Layer 1 Layer 2 Layer 3
Stream
Write
Stream

(b) Layer-wise pre-loading without buffer.


(b) Asynchronous KV cache saving with overlapping.
Execution Layer 1 Layer 2 Layer 3
Last job
Stream
Read Layer 1 Layer 2 Layer 3
Figure 8: Asynchronous KV cache saving.
Stream

(c) Layer-wise pre-loading with buffer. the prefilling time for a token, the length of historical tokens
in a session, and the length of new input tokens in the con-
Figure 6: Layer-wise KV cache pre-loading. Blue blocks
versation, respectively. Imperfect overlapping happens when
indicate the execution of each transformer layer. Red blocks
Tload Lhist > Tpre f Lnew , which indicates that the transmission
indicate the KV cache loading of each transformer layer.
time is larger than the partial prefilling time. The buffer is
used to fill up the time gap Tload Lhist − Tpre f Lnew . Combined
Execution
Stream Last job Layer 1 Layer 2 Layer 3 with the PCIe bandwidth B, the buffer size can be set by the
Read
Stream
Layer 1 Layer 2 Layer 3 following formula: Sbu f = B(Tload Lhist − Tpre f Lnew ).

(a) Layer-wise pre-loading with imperfect overlapping. 3.2.2 Asynchronous Saving from HBMs to Memory
Execution
Stream Last job Layer 1 Layer 2 Layer 3
Read
AttentionStore needs to save KV caches to host memory to
Layer 1 Layer 2 Layer 3
Stream enable the reuse of the KV caches across conversations. A
baseline method to save the KV caches is to write all produced
(b) Perfect pre-loading with a customized larger buffer. KV caches together after the round of conversation ends. This
Figure 7: Layer-wise KV cache pre-loading. method however potentially delays the execution of the next
scheduled jobs since the KV saving time is on the critical
path of inference, as shown in Figure 8a. To reduce this over-
observe that a gap still exists between the last job and the first head, AttentionStore incorporates an asynchronous KV cache
layer of the current job, since the loading can only commence saving scheme to overlap the KV cache write-back with the
once the HBM execution buffer is available, i.e., the last job is computation, which also considers the different characteris-
finished. To further mitigate the gap between the last job and tics of prefilling and decoding phases to perform different
the first layer of the current job, AttentionStore reserves an overlapping mechanisms.
HBM read buffer to eliminate the gap. Specifically, as shown Specifically, the generation speeds of KV caches at the
in Figure 6c, with the read buffer, the read stream doesn’t prefilling and decoding phases are different. The prefilling
have to wait for the release of the execution buffer from the phase processes tokens concurrently, thus generating substan-
last job. The read stream can start the pre-loading while the tial volumes of KV cache within a restricted timeframe. In
last job is running. contrast, the decoding phase generates the KV cache of one
However, pre-loading may fail to fully overlap with the token at a time. As shown in Figure 8b, for the prefilling
computation if the KV cache loading time is longer than the phase, as each self-attention operation can produce a signif-
prefilling computation time. As shown in Figure 7a, multi- icant amount of KV cache, the write stream retains the KV
ple gaps exist between the computation of layers because cache layer by layer. The KV cache produced by the prefilling
the KV cache fetching time for each layer exceeds the com- phase can be overlapped with the decoding phase. For the
putation time for each layer, resulting in imperfect overlap- decoding phase, as the KV cache is iteratively produced, the
ping. The overhead can be further minimized by employing a write stream writes back the KV cache layer by layer while
customized larger pre-loading buffer. With the larger buffer, decoding. To avoid getting stuck if the KV cached is not fully
pre-loading can be issued with an earlier start. For instance, written back when the decoding is already finished, we also
as shown in Figure 7b, with the larger buffer, pre-loading is reserve an HBM write buffer to cover such cases similar to the
allowed to pre-load KV cache for more layers and thus the read buffer used in the KV cache prefetching. The unfinished
gaps between layers can be overlapped. Let Tload , Tpre f , Lhist KV caches are temporarily moved to the write buffer to avoid
and Lnew denote the access time of the KV cache for a token, blocking the execution of the next job.

6
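A hedged PyTorch sketch of the overlapping in §3.2 is shown below: a read stream pre-loads the KV cache of layer i+1 from pinned host memory while the default execution stream computes layer i, and a write stream copies each layer's newly produced KV entries back to host memory asynchronously. The `compute_layer` callback, the per-layer tensor layout, and the pre-sized pinned buffers are assumptions for illustration, not AttentionStore's actual code.

```python
import torch

def run_with_overlap(layers, host_kv, compute_layer):
    """
    layers:        per-layer module handles (assumed).
    host_kv:       per-layer KV tensors kept in pinned host memory, pre-sized
                   to hold the updated cache of each layer (assumed).
    compute_layer: callback (layer, kv_on_gpu) -> (hidden, new_kv) (assumed;
                   hidden states feeding the next layer are omitted here).
    """
    read_stream, write_stream = torch.cuda.Stream(), torch.cuda.Stream()
    device_kv = [None] * len(layers)
    ready = [torch.cuda.Event() for _ in layers]  # "KV of layer i is on the GPU"

    def preload(i):
        with torch.cuda.stream(read_stream):
            device_kv[i] = host_kv[i].to("cuda", non_blocking=True)
            ready[i].record(read_stream)

    preload(0)  # layer 0 must be resident before computation starts
    for i, layer in enumerate(layers):
        if i + 1 < len(layers):
            preload(i + 1)  # overlap: fetch layer i+1 while computing layer i

        # The execution stream only waits for this layer's copy, not later ones.
        torch.cuda.current_stream().wait_event(ready[i])
        hidden, new_kv = compute_layer(layer, device_kv[i])

        # Asynchronously write the new KV entries of layer i back to host memory.
        write_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(write_stream):
            host_kv[i].copy_(new_kv, non_blocking=True)

    torch.cuda.synchronize()
```

Pre-loading more than one layer ahead plays the role of the larger read buffer in Figure 7, whose size the text bounds as S_buf = B(T_load L_hist − T_pref L_new).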
3.3 Hierarchical KV Cache Placement eviction window, size: 6

Job
AttentionStore leverages both host memory and disks to ex- Job 9 Job 8 Job 7 Job 6 Job 5 Job 4 Job 3 Job 2 head
Queue
pand the available space for KV cache storage. The access
cache use sys use prefetching window, size: 2
speed of host memory, i.e., DRAM, is much higher than disks,
DRAM KV 4 KV 2 KV 1 buf execution Job 1
i.e., SSDs, (tens of GB/s v.s. several GB/s). If the KV caches
to be accessed are always found in the host memory instead disks -> dram KV 3
of disks, the access performance of KV caches will be opti- disks KV 9 KV 8 KV 7 KV 3 dram -> disks KV 4
mal. To achieve this, AttentionStore applies a scheduler-aware
cache use Timeline
fetching scheme to pre-fetch the KV caches from disks to
host memory, ensuring KV cache access at the optimal speed
(§3.3.1), and a scheduler-aware eviction scheme to evict suit- Figure 9: Scheduler-aware KV cache fetching and eviction.
able KV caches from host memory to disks (§3.3.2).
3.3.2 Scheduler-aware Eviction from Memory to Disks
3.3.1 Scheduler-aware Fetching from Disks to Memory When the free space in the host memory is exhausted, we
need to evict some KV caches from the host memory to disks.
Since disks have much larger capacity than the host memory Meanwhile, if the disks are full, we also need to evict some
(tens of TBs v.s. several hundreds of GBs), most KV caches KV caches stored in the disks out of the system. Therefore,
are retained in disks for AttentionStore. As conversation re- it is important to carefully choose the suitable KV cache
quests arrive randomly, their corresponding KV caches are candidates to be evicted for achieving a high cache hit rate.
more likely to be located in disks, resulting in poor access
Different from existing cache eviction strategies, such
performance.
as the least-recently-used (LRU) [50], first-in-first-out
To address the problem, we leverage a scheduler-aware (FIFO) [9], and their variants, which solely rely on the his-
KV cache fetching scheme to pre-fetch the KV caches to torical access information of the KV caches, AttentionStore
be accessed from disks to the host memory. This is done by presents a scheduler-aware eviction scheme which can fur-
utilizing the hints from the inference job scheduler. Specif- ther leverage the future access information of KV caches
ically, the job scheduler maintains a job queue, thus having to achieve a higher cache hit rate. The job queue in the job
the full knowledge of waiting jobs. AttentionStore applies a scheduler gives us the opportunity to achieve this. Specifi-
look-ahead prefetching window to watch for the waiting jobs cally, AttentionStore maintains a look-ahead eviction window
to be executed. If the KV cache of the waiting jobs is hit in in the job queue. The maximum length of the look-ahead evic-
the disks, AttentionStore will pre-fetch the KV cache of wait- tion window is determined by the total storage capacity of
ing jobs from the disks to host memory before these waiting the KV caching system. Assume the total available capacity
jobs are executed. The length of the look-ahead prefetching in the disks is Cdisk . The look-ahead eviction window length
window is determined by the available capacity in the host is (Cmem +Cdisk )/Skv . When AttentionStore attempts to evict
memory. Given the available memory capacity for prefetching one item out of the KV caching system, if finding the item
Cmem and the average KV size of a session Skv , the prefetching to be evicted in the look-ahead eviction window, the item is
window length is L pw = Cmem /Skv . exempted. When AttentionStore evicts one item from the host
A scheduler-aware fetching example is shown in Figure 9. memory to disks, the item located at the tail of the look-ahead
As Job 1 is executing, the KV cache manager applies a look- eviction window has a higher priority to be evicted. Note that
ahead window size of 2 (the host memory has 2 KV cache one item corresponds to all KV caches associated with a con-
slots for the KV cache fetching) to check the KV cache hit versation session, which is the minimal eviction and fetching
status of the waiting Jobs 2-3. The KV cache for Job 2 is hit in granularity in AttentionStore. This is because the KV cache
the host memory but the KV cache for Job 3 is not in the host in the same conversation session is either all used or none of
memory. Then the KV cache fetching threads start fetching it is used.
the KV cache for Job 3 from disks to the host memory. A scheduler-aware eviction example is shown in Figure 9.
Note that AttentionStore includes a host memory buffer When the KV cache of Job 3 is chosen to be migrated to the
that allows for seamless fetching of KV caches from disks host memory, the buffer will be utilized. To maintain a buffer
to memory, preventing any delays when the host memory is in the host memory, AttentionStore needs to evict KV caches
full. When the capacity of the free memory reaches a defined from the host memory to the disks. AttentionStore employs
threshold, AttentionStore triggers a KV eviction from host a look-ahead eviction window of size 6 to monitor the KV
memory to disks to ensure the constant availability of the host cache status of the jobs. First, it finds that the KV caches in
memory buffer. The eviction process from host memory to the host memory all have an associated job in the job queue.
disks is presented in the next subsection. It then continues scanning the look-ahead eviction window

7
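The scheduler-aware placement of §3.3 can be sketched as follows, assuming the window sizes given in the text (prefetch window L_pw = C_mem / S_kv and eviction window (C_mem + C_disk) / S_kv). The `job.session_id` and `cache` interfaces are hypothetical names introduced only for this illustration.

```python
class SchedulerAwarePlacer:
    """Sketch of scheduler-aware KV cache fetching and eviction (illustrative)."""

    def __init__(self, mem_capacity, disk_capacity, avg_session_kv_size):
        self.prefetch_window = int(mem_capacity // avg_session_kv_size)
        self.eviction_window = int((mem_capacity + disk_capacity) // avg_session_kv_size)

    def prefetch(self, job_queue, cache):
        # Look ahead at the waiting jobs; pull their KV caches from disk to DRAM
        # before the jobs reach the GPU, so loads hit host memory at full speed.
        for job in job_queue[: self.prefetch_window]:
            if cache.location(job.session_id) == "disk":
                cache.move_to_dram(job.session_id)

    def pick_eviction_victim(self, job_queue, dram_sessions):
        # Sessions appearing earlier in the look-ahead window are needed sooner.
        # Evict the DRAM-resident session whose next use is farthest away, or
        # one that does not appear in the window at all.
        window = [job.session_id for job in job_queue[: self.eviction_window]]

        def next_use(session_id):
            return window.index(session_id) if session_id in window else len(window)

        return max(dram_sessions, key=next_use)
```

The whole KV cache of a session is moved as one unit, matching the minimal eviction and fetching granularity described above.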
4K 4K

Prompt
KC VC KC VC
Truncated truncation New Tokens New Tokens
Prompt k v q KC VC
KV cache new KV cache New KV cache truncation KV cache New KV cache
wk wv wq k v q k v q
2K 2K wk wv wq wk wv wq
(a) baseline (b) AttentionStore
Input Input Input
Figure 10: Illustration of managing context window overflow. (a) (b) (c)
positional
KC K cache VC V cache
Context window size: 4K, truncation ratio: 2K. (a) Baseline. encoding

Token truncation [33]. (b) KV cache truncation.


Figure 11: (a) absolute positional encoding. (b) relative po-
sitional encoding. (c) KV cache with decoupled positional
from tail to head and considers the jobs near the tail to have
encoding.
higher priority to be evicted. Therefore, the KV cache for Job
4 is selected to be evicted from the host memory to the disks. to infer
pos [0:2048] pos [0:1536] Position
Since the disks are also full, the scanning process identifies + + (on key)
that the last arrived Job 9 in the job queue is the most suitable KV
Storing Fetching
candidate to be evicted. Finally, the KV cache for Job 4 is
DRAM
moved to the location previously occupied by Job 9.
KV cache truncation

Figure 12: Illustration of KV cache truncation with Attention-


3.4 Decoupled KV Cache Truncation Store.
When the historical tokens exceed the limitation of the con-
text window, LLM serving engines generally perform token further used for the following inference.
truncation [33]. As shown in Figure 10a, the context window Figure 12 provides an example of how AttentionStore sup-
size is 4K. Once the context window overflows, the LLM ports KV cache truncation. AttentionStore stores the KV
serving engines cut off the first 2K tokens of the prompt. The cache without the positional encodings. In the cases where
truncation has no impact on previous LLM serving engines KV cache truncation becomes necessary, the LLM engine re-
since they always recompute the KV cache based on the in- trieves the truncated KV cache (i.e., KV [0:1536]) and loads it
put prompt, regardless of truncation. However, the truncation to the HBM. The new positional encodings are subsequently
makes the KV caches stored in AttentionStore invalid, signifi- applied to the KV cache.
cantly reducing the efficiency of AttentionStore. This is due Note that AttentionStore also allows for selective preser-
to the positional encoding embedded in the KV caches. On vation of certain KV cache with important scores, e.g., the
performing token truncation on the prompt, the position of initial tokens [58] or important tokens [13, 25, 65], to further
each token is changed. The positional encoding embedded in improve the generation quality of LLMs.
the KV caches cannot be modified to match the positions of
tokens in the prompt, making the KV caches invalid. 4 Performance Evaluation
To address this problem, AttentionStore enables the KV
caches after truncation to be still valid via decoupling the 4.1 Experimental Setup
positional encoding. AttentionStore needs to work with the
relative position encoding (RPE) [46, 48, 58]. Unlike the ab- Testbeds. All our experiments are performed on 4 NVIDIA
solute positional encoding (APE) in which positional encod- A100 GPUs, each with 80GB HBM. The system is equipped
ings are added to the input, RPE directly embeds the posi- with 128GB DRAM and 10TB SSDs. GPUs are connected to
tional encodings in the query (Q) and key (K) vectors, as the host via PCIe Gen 4.
shown in Figure 11b. Extensive research shows that RPE We implement AttentionStore in Pytorch and Python. The
allows LLMs to learn from longer data sequences than host memory and disks are managed in the form of blocks
APE [12, 51, 64]. Therefore, RPE is widely used in modern to improve storage utilization, similar to [22]. Our internal
LLMs, e.g., LLaMA [48], T5 [59], Falcon [36], Mistral [20], storage allocator allocates and deallocates storage blocks on
Mixtral [21] and Transformer-XL [8]. By simply moving the demand. For the model executor, AttentionStore integrates the
time of caching KVs before embedding positional encodings implementation of popular LLMs such as LLaMA [48] and
in RPE as shown in Figure 11c, AttentionStore can store Falcon [36] using Pytorch [34] and Transformers [55]. NCCL
the KVs without embedded positional encodings in the KV library [31] is applied for synchronization of the parallel GPU
caching system. When reusing the KVs in AttentionStore, the workers. Dedicated CUDA streams are used for moving data
KVs are embedded with the new positional encodings and between the GPUs and the host memory, overlapping the

8
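Since the models listed in §3.4 (e.g., LLaMA, Falcon, Mistral) use rotary position embeddings, a form of relative positional encoding, the decoupling can be sketched as storing keys before the rotary rotation is applied and re-applying the rotation with fresh positions after truncation. The `apply_rope` helper below is a generic RoPE rotation written for illustration and is not AttentionStore's implementation; which side of the cache is dropped follows the serving engine's truncation policy.

```python
import torch

def apply_rope(x, positions, base=10000.0):
    """Rotate key (or query) vectors x of shape (seq, dim) at the given positions."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = positions[:, None].float() * inv_freq[None, :]   # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.empty_like(x)
    rotated[..., 0::2] = x1 * cos - x2 * sin
    rotated[..., 1::2] = x1 * sin + x2 * cos
    return rotated

# Saving: keep the *unrotated* keys, so no absolute position is baked in.
def save_kv(store, session_id, k_unrotated, v):
    store[session_id] = (k_unrotated, v)

# Loading after a context-window overflow: truncate first, then re-embed
# positions 0..len-1 into the keys that survive the truncation.
def load_kv(store, session_id, keep_last):
    k_unrotated, v = store[session_id]
    k_unrotated, v = k_unrotated[-keep_last:], v[-keep_last:]
    positions = torch.arange(k_unrotated.shape[0])
    return apply_rope(k_unrotated, positions), v
```

Because the rotation is applied only at load time, the truncated cache stays consistent with the new token positions, which is why the saved KV caches remain valid after overflow.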
RE AS RE AS RE Prefill
RE Decode
RE Overflow
AS Prefill
AS Decode

Prefill Throughput (t/s)


100 1000
40 4.02X 1.93X 3.28X 3.42X
75 1.0 750
Hit Rate (%)

30

Time (H)
TTFT (s)
50 500 20
0.5
25 250 10
0 Llama-13B Llama-65B Llama-70B Falcon-40B 0.0 Llama-13B Llama-65B Llama-70B Falcon-40B 0 Llama-13B Llama-65B Llama-70B Falcon-40B 0 Llama-13B Llama-65B Llama-70B Falcon-40B
Model Model Model Model

Figure 13: Cache hit rate. Figure 14: Time to first token. Figure 15: Prefill throughput. Figure 16: GPU time.

computation with proactive swapping. Separate IO threads since other performance metrics are closely related to it. Fig-
migrate data between the host memory and the disks, overlap- ure 13 shows the KV cache hit rates in AS for various LLMs.
ping the execution with the KV cache migrations. Continuous AS exhibits high hit rates around 86%, 71% 89%, and 90%
batching [61] is enabled through experiments. for LLaMa-13B, LLaMA-65B, LLaMA-70B and Falcon-40B,
Models. The experiments evaluate the open-sourced respectively. In contrast, we observe a relatively low hit rate of
LLaMA-1 model with 65B [48], LLaMA-2 models [49] with LLaMA-65B. This discrepancy arises due to the larger storage
13B, 70B, and Falcon 40B [36]. The intermediate activation space required by LLaMA-65B for saving KV caches. Given
uses FP16, aligned with prior systems [52, 61]. We also im- the same available storage space, AS accommodates fewer
plement Mistral-7B [20] with a 32K context window. Unless sessions for LLaMA-65B, thereby limiting the hit rate. Specif-
specified otherwise, LLaMA-13B operates on two GPUs with ically, LLaMA-65B necessitates 2.5MB of space for each to-
24 batches, while LLaMA-65B, LLaMA-70B, and Falcon- ken in the KV cache, LLaMA-13B requires 0.78MB, LLaMA-
40B run on four GPUs, handling 24 batches each. 70B and Falcon-40B require only 0.31MB and 0.12MB of
Workloads. The workload is integrated from the ShareGPT space per token due to using the group query attention with a
dataset [37, 66]. As there is no public request arrival times- GQA factor of 8 and 16, respectively.
tamp available in the dataset, we generate request arrival times Time to first token (TTFT). TTFT is an important metric
based on the Poisson distribution with various arrival rates, for quality of service in LLM serving [3, 35]. It indicates
following prior works [22, 56]. We set the number of different how quickly users start seeing the output of LLMs after
sessions arriving per second according to a Poisson distribu- entering their prompt. As shown in Figure 14, AS signifi-
tion (with λ = 1.0). 9K conversation sessions are used in the cantly reduces the TTFT by 85%, 61%, 87% and 86% for
experiments. LLaMA-13B, LLaMA-65B, LLaMA-70B and Falcon-40B
Baseline. We compare AttentionStore (AS) with re- respectively, in comparison to RE. This is because AS elimi-
computation (RE). RE only keeps historical tokens of conver- nates a large amount of repetitive computation for generating
sation sessions. It discards KV caches after serving a conver- the KV caches of historical tokens in the prefilling phase.
sation and does not keep the KV cache while the conversation Upon cache hits, the TTFT of AS only relies on the number
session is inactive. When a conversation associated with a of newly input tokens in the new conversation turn.
particular session becomes active again, RE leverages the his- Prefilling throughput. Prefilling throughput is the metric
torical tokens from that session to recompute their KV caches. to evaluate the speed of processing the prompt. Figure 15
When the historical tokens exceed the context window limita- shows the measured prefilling throughput. We observe that AS
tion, RE applies token truncation, same as the general LLM delivers remarkable speedups of 6.8×, 2.6×, 7.8× and 7.2×
services [33]. For simplicity, the token truncation ratio is set for LLaMA-13B, LLaMA-65B, LLaMA-70B, and Falcon-
to 0.5, implying that when an overflow occurs, the system 40B respectively, when compared to RE. The improvement of
will discard the earliest half of the tokens. AS on prefilling throughput comes from the reduced prefilling
time. AS only prefills the new input of the new conversation.
Moreover, AS can load and reuse the historical KV caches
4.2 End-to-end Performance
from the KV caching system with layer-wise pre-loading
In the end-to-end experiments, we use 9K conversations from optimization. The historical KV cache loading simultaneously
ShareGPT [37] and the average number of turns in these occurs with the prefilling on the new input tokens.
conversations is 5.75. Thus the total number of conversation GPU time. Figure 16 shows the end-to-end GPU time to
turns is about 52K. We warm up the KV caching system finish all inference jobs in the workload. We observe that
using the first 10K conversation turns and then evaluate the AS achieves speedups of 4.0×, 1.9×, 3.3×, and 3.4× for
performance on the following 42K turns. LLaMA-13B, LLaMA-65B, LLaMA-70B and Falcon-40B re-
Cache hit rate. We first present the cache hit rate in AS spectively, compared to RE. The performance improvements

9
RE-GPU AS-Storage RE AS-Prefill Computation Load KV Cache Computation AS

Total Prefilling Time (s)


Total Prefilling Time (s)
AS-GPU AS-Load KV Cache AS Save KV Cache
3 1.00 7.5

Execution Time (s)


800
600 0.75
2 5.0
Cost ($)

400 0.50
1 2.5
200 0.25
0 LLaMA LLaMA LLaMA Falcon 0 500/500 600/400 700/300 800/200 900/100
0.00 NO-PL PL-B0 PL-B5 PL-B10PL-B15 0.0 1000 1200 1400 1600
-13B -65B -70B -40B His Length / Prompt Length x Prompt Length

Figure 17: Inference cost. Figure 18: Recomputation v.s. Figure 19: AS with no pre- Figure 20: Performance im-
AttentionStore. loading v.s. AS pre-loading pact of using write overlap.
with various buffer sizes.

of AS are from two aspects, which are mitigation of recom-


total cost in AS for LLaMA-13B, LLaMA-65B, LLaMA-70B,
puting KV caches of the historical tokens, and the mitigation
and Falcon-40B, respectively.
of recomputing KV caches after context overflow. Regarding
the mitigation of re-prefilling, AS efficiently saves the KV
cache in the KV caching system and loads it when necessary 4.3 Ablation Studies
for historical tokens. On the other hand, RE discards the KV
cache once a job is finished, requiring the redoing of prefilling 4.3.1 Recomputation v.s. AttentionStore
for every job to reproduce the KV cache. In terms of miti- We investigate the prefilling performance of different methods
gating the recomputation of KV caches after context flow, under varying historic and new token ratios. Different meth-
RE applies token truncation which invalidates the KV cache ods prefill the same 1K tokens under the batch size of 16 on
for each truncation due to the embedded position encoding an A100 GPU for LLaMA-13B. RE computes the KV cache
in the KV caches. This prompts RE to recompute the KV for all tokens, while AS loads the KV cache of historical to-
cache based on the truncated historical tokens. In contrast, kens from the KV caching system and partially prefills the
AS decouples the position information from the KV caches, new input tokens. For example, the setting 600/400 means AS
allowing direct truncation of the KV cache. This approach loads the KV cache of 600 tokens and computes the KV cache
avoids the recomputation of KV caches that RE requires. for 400 tokens. Overall, AS outperforms RE in all tested set-
Note that AttentionStore also not only aids in minimizing the tings, as shown in Figure 18. This advantage becomes more
prefilling time but also helps reduce the decoding time [61]. pronounced as the percentage of newly input tokens decreases
Specifically, under continuous batching, each newly arrived (from 500 to 100), as depicted by the middle bar of each bar
job must complete prefilling before it can join other decod- group. Although the KV cache loading time for AS gradually
ing jobs. This process blocks the execution of decoding jobs, increases with the percentage of historical tokens (from 500 to
resulting in prolonged decoding time. However, Attention- 900), the layer-wise pre-loading scheme effectively eliminates
Store mitigates this issue by minimizing prefilling time for this loading time, as demonstrated by the third bar of each bar
newly arrived jobs. The reduction in prefilling time reflects group. Note that when the KV cache loading time exceeds
the decoding time improvement observed in RE, as depicted the prefilling time (e.g., the second bar of setting 900/100),
in Figure 16. AS can conceal the KV cache loading time by enabling a read
Inference cost. We evaluate the resource cost based on buffer. The impact of the read buffer is evaluated in the next
the on-demand price of AWS EC2 instances [5, 6], i.e., subsection.
$5/hour per A100 GPU, $0.0088/hour/GB for DRAM and
$0.000082/hour/GB for SSD. Figure 17 shows the total costs 4.3.2 Overlapped KV cache Access
of different methods for completing the workload. Compared
to RE, AS achieves significant cost savings for LLaMA-13B, This subsection evaluates the effectiveness of the proposed
LLaMA-65B, LLaMA-70B, and Falcon-40B, amounting to overlapping access techniques for loading and saving KV
70%, 43%, 66%, and 68%, respectively. These cost savings caches. The model used is LLaMA-13B with a single GPU
primarily stem from the reduced GPU time, as AS effectively and the batch size is set to 16.
reduces redundant prefilling for historical tokens and recom- Layer-wise KV cache pre-loading. In the experiments,
putation costs during context overflow, as depicted in Fig- we set the length of historical tokens to 1K and the length of
ure 16. AS employs cost-effective storage mediums including newly input tokens to 100 for investigating the effectiveness
host memory and disks to cache the KV caches during inactive of the lay-wise pre-loading scheme. The first bar in Figure 19,
conversation sessions. The storage cost from the host memory i.e., NO-PL, shows the time of prefilling without the pre-
and disks constitutes 16.4%, 9.0%, 9.0%, and 9.0% of the loading scheme that includes two parts: the KV cache loading

10
AS-DRAM LRU-DRAM FIFO-DRAM AS LRU FIFO OF AS OF AS
AS-SSD LRU-SSD FIFO-SSD
100 30 100
15
Hit Rate (%)

Hit Rate (%)


75 20 75

Time (H)

Time (H)
10
50 50
10 5
25 25
0 128G/2T 128G/5T 128G/10T 0 128G/2T 128G/5T 128G/10T 0 LLaMA LLaMA LLaMA Falcon 0 LLaMA LLaMA LLaMA Falcon
DRAM / SSD DRAM / SSD -13B -65B -70B -40B -13B -65B -70B -40B

(a) Impact on hit rate. (b) Impact on GPU time. (a) Impact on hit rate. (b) Impact on GPU time.

Figure 21: Comparison of the eviction algorithms under vari- Figure 22: Context overflow impact.
ous storage settings.

Analyzing the breakdown of hit rate, for the configuration of


time and the computation time of the newly input tokens. The 128G/2T, LRU and FIFO only achieve 0.5% and 0.5% (too
following bars in Figure 19 show the prefilling time when tiny to display) DRAM hit rates, with the remaining 12.4%
the layer-wise pre-loading scheme has different sizes of read and 9.0% being disk hit rates. Even with the overall hit rate
buffers. For clarity, we use the number of layers to represent increasing to 58% and 48% respectively for the larger capac-
the buffer size, e.g., PL-B0 indicates no read buffer and PF- ity of 128G/10T, LRU and FIFO still exhibit limited DRAM
B5 indicates a read buffer size of 5 layers of KV cache. We hit rates of approximately 0.6% and 0.5% respectively. This
observe although there is no read buffer, i.e., PL-B0, the pre- is because LRU and FIFO lack awareness of future KV cache
loading scheme reduces the prefilling time by 35% compared information and cannot pre-fetch KV caches from disks to
to NO-PL. PF-B15 perfectly overlaps the KV cache loading host memory, thereby limiting their ability to improve DRAM
time and reduces the prefilling time by 61% compared to hit rates. In contrast, AS achieves a cache hit rate of up to
NO-PL. 86%, with over 99.6% of the hits occurring in DRAM due to
Asynchronous KV cache saving. In the experiments, we its scheduler-aware policy.
set different prompt lengths ranging from 1K to 1.6K and the
number of decoding steps to 20 for investigating the effec-
tiveness of the asynchronous saving scheme. As shown in 4.3.4 Performance of Decoupled KV Cache Truncation
Figure 20, we observe that the saving time increases as the When the context window exceeds its limit, AttentionStore
prompt length grows, since the size of the KV cache to be truncates the KV cache directly, thus avoiding the need for
saved increases. To mitigate the saving overhead, Attention- re-computation and reducing overhead. We evaluate the ef-
Store employs the asynchronous saving scheme that allows fectiveness of the way AS used to manage the context over-
the KV cache saving to overlap with the execution of the in- flow. Specifically, we compare a baseline approach overflow
ference, reducing the overall execution time by 13% to 15%. (OF) that embeds positional encoding within the KV caches,
leading to the invalidation of KV caches in the KV caching
4.3.3 Scheduler-aware Fetching and Eviction system. OF relegates context overflow to be managed by re-
computation. This experiment uses 128GB DRAM and 10TB
We investigate the effectiveness of the scheduler-aware fetch- SSD. As evident from Figure 22a, comparing OF with AS,
ing and eviction in AttentionStore upon improving the cache the hit rates decrease by 17.6%, 41.5%, 18.1%, 18.4% for
hit rate. We compare the overall cache hit rates, DRAM hit LLaMA-13B, LLaMA-65B, LLaMA-70B, and Falcon-40B,
rates, and disk hit rates of AS and existing eviction policies respectively. This decline is attributed to the fact that if ap-
(including LRU and FIFO) across various storage configura- plying OF, every instance of context overflow necessitates
tions. context truncation, thereby invalidating the KV caches in the
As shown in Figure 21a, for the configuration of 128G/2T KV caching system. This decrease in hit rate subsequently
that indicates 128GB DRAM and 2TB SSD, AS outperforms translates to the longer GPU time as shown in Figure 22b.
LRU and FIFO in the overall cache hit rate by 27% and 31%, AttentionStore ensures the validity of the saved KV caches
respectively. With the increased SSD capacity (128G/10T), in the system when context overflow happens and promises a
AS achieves a remarkable hit rate of 86%, surpassing LRU higher hit rate and reduced GPU time. OF of LLaMA-65B ex-
(58%) and FIFO (48%). AS achieves high overall hit rates be- periences a low hit rate due to its limited 2K context window.
cause AS is aware of the future KV cache access information After serving a conversation of the first turn, the session easily
to avoid evicting the KV caches that will be used in the future. reaches the context window limit, consequently making the
The higher hit rates are translated to the reduced GPU time as associated KV cache invalidate. Subsequently, the following
shown in Figure 21b, where AS achieves speedup up to 2.7×. conversations in the same session face KV cache miss.

Figure 23: Impact of storage capacity and the number of distinct sessions. (a) Impact on hit rate. (b) Impact on throughput.

Figure 24: Performance under various caching configurations. (a) Impact of caching storage mediums on hit rates. (b) Impact of caching storage mediums on GPU time.
4.3.5 The Cache Capacity Requirement
In this subsection, we investigate how much cache capacity AttentionStore needs to achieve a high cache hit rate. The required cache capacity is related to the maximum number of distinct conversation sessions served by an LLM serving system per unit time (denoted as DSpUT). The larger the DSpUT value, the more distinct sessions the system handles per unit time, and the more KV cache storage space is required. Moreover, due to the limitation of the maximum context window, the maximum KV cache capacity required by one conversation session is fixed, i.e., equal to the length of the maximum context window multiplied by the KV size of each token, which is denoted as CCpS. Thus, the required maximum cache capacity per unit time is CCpUT = DSpUT × CCpS.
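As a rough, back-of-the-envelope illustration of this formula, consider an assumed LLaMA-13B-like configuration (40 layers, 40 KV heads, head dimension 128, FP16 KV entries) with a 2K context window and an assumed DSpUT of 1,000 sessions; these numbers are chosen only for the example and are not measurements from this paper.

```python
# Back-of-the-envelope calculation of CCpS and CCpUT (illustrative assumptions).
def kv_bytes_per_token(num_layers=40, num_kv_heads=40, head_dim=128, dtype_bytes=2):
    # One K and one V vector per head per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def max_cache_per_session(context_window=2048):
    # CCpS: maximum KV cache capacity of one conversation session.
    return context_window * kv_bytes_per_token()

def required_capacity_per_unit_time(dsput=1000):
    # CCpUT = DSpUT * CCpS.
    return dsput * max_cache_per_session()

print(kv_bytes_per_token() / 1e3)                 # ~819 KB of KV per token
print(max_cache_per_session() / 1e9)              # ~1.7 GB per session
print(required_capacity_per_unit_time() / 1e12)   # ~1.7 TB for 1,000 sessions
```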
In AttentionStore, the KV cache of each session has a TTL (time to live) that indicates its maximum saving time since the last access. The TTL is set to the unit time mentioned above. By configuring the cache capacity of AttentionStore as CCpUT, we can achieve a cache hit rate of 100% if newly arrived conversations are not considered. Nevertheless, to achieve a high cache hit rate in real scenarios, we do not need to configure such a large capacity, since the hotness of cached items differs.
To figure out the relationship between the required cache capacity (RCC) and CCpUT, we evaluate the cache hit rate and the decoding throughput under different ratios of RCC to CCpUT. In this experiment, we set the TTL to one hour. As shown in Figure 23a, when the ratio RCC/CCpUT is 0.1, we achieve a cache hit rate of 51%. When the ratio RCC/CCpUT is 0.25, we achieve a cache hit rate of 98%. As the hit rate reaches its peak, the throughput also reaches its peak, as shown in Figure 23b.
4.3.6 Impact of Caching Storage Mediums

Some existing works [19, 30, 67] employ only the HBM space for caching the KV caches of multi-turn conversations. We here compare the performance of mechanisms that cache KVs on HBMs with that of AttentionStore, which caches KVs on DRAM and SSDs. In the experiments, we configure the size of the HBM cache as 10GB, the size of DRAM as 128GB, and the size of SSDs as 10TB. Figure 24a shows the cache hit rates and inference performance of the different mechanisms. The hit rate of the HBM-only caching method is nearly 0% for all models due to the limited capacity of HBM. Using HBM with DRAM improves the cache hit rate to 3.4%, 1.7%, 7.7%, and 19.1% for LLaMA-13B, LLaMA-65B, LLaMA-70B, and Falcon-40B, respectively. In contrast, by further extending the cache capacity with SSDs, AttentionStore improves the cache hit rate to 86%, 71%, 89%, and 90% for LLaMA-13B, LLaMA-65B, LLaMA-70B, and Falcon-40B, respectively. With higher hit rates, AttentionStore significantly improves the inference performance compared to the HBM-only and HBM+DRAM policies, as shown in Figure 24b.

4.3.7 Accuracy of Decoupled Positional Encoding

To maintain the validity of the stored KV caches, AttentionStore decouples the positional encoding from the KV caches and embeds new positional encodings when reusing the stored KV caches, as presented in Section 3.4. We evaluate the impact of the different schemes, including AS, token truncation (TT), and naive KV cache truncation (NKVT), on the perplexity (PPL) and the accuracy of LLMs using widely used benchmarks. When the number of historical tokens exceeds the context window limit, TT removes the historical tokens and recomputes the KV caches for the remaining tokens, while NKVT directly discards the KV caches associated with the positional encoding and utilizes the truncated KV caches instead.
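The following is a minimal sketch of the AS-style reuse path under these assumptions: the KV cache is stored without positions applied, the oldest entries are dropped on overflow, and rotary positional encoding (RoPE) [46] is re-applied to the cached keys with fresh positions at load time. Tensor layouts and helper names are illustrative, not the actual AttentionStore code.

```python
# Minimal sketch of reusing a position-free KV cache after truncation
# (illustrative only; values are not rotated under RoPE).
import torch

def rope_angles(positions, head_dim, base=10000.0):
    # positions: (seq,) -> angles: (seq, head_dim/2)
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return torch.outer(positions.float(), inv_freq)

def apply_rope(x, positions):
    # x: (seq, num_heads, head_dim), stored WITHOUT positional encoding.
    ang = rope_angles(positions, x.shape[-1])             # (seq, head_dim/2)
    cos, sin = ang.cos()[:, None, :], ang.sin()[:, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def load_session_kv(k_cache, v_cache, context_window):
    # On overflow, drop the oldest tokens' entries (here we simply keep the
    # most recent ones); the remaining cache stays valid because no
    # positional information is baked into it.
    if k_cache.shape[0] > context_window:
        k_cache = k_cache[-context_window:]
        v_cache = v_cache[-context_window:]
    # Re-embed positions 0..n-1 for the (possibly truncated) cached keys.
    new_pos = torch.arange(k_cache.shape[0])
    return apply_rope(k_cache, new_pos), v_cache
```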

PPL. PPL is a metric used to evaluate the quality of a model in generating tokens [14, 58]. Lower PPL values indicate that the model is better at predicting the text and demonstrates a greater understanding of the language. Table 1 shows the PPL comparison of LLaMA-7B and LLaMA-13B under the AS, TT, and NKVT settings using the WikiText-2 [15], C4 [39], and PTB [27] datasets. TT consistently achieves a low PPL by recomputing the KV caches after context window overflow. AS also maintains a low PPL, comparable to TT (with a difference of < 0.02), by incorporating new positional encodings into the KV caches after truncation. In contrast, NKVT exhibits a high PPL (> 10³) due to the coupling of positional encoding within its KV caches. Directly truncating such KV caches scrambles the coupled positional information, so the models fail to maintain a low PPL.

Table 1: PPL comparison of different methods.

Dataset      Model       AS      TT      NKVT
WikiText-2   LLaMA-7B    5.47    5.48    2198.7
             LLaMA-13B   4.91    4.90    1647.7
PTB          LLaMA-7B    8.48    8.49    2543.5
             LLaMA-13B   7.61    7.60    1865.8
C4           LLaMA-7B    6.96    6.98    2343.5
             LLaMA-13B   6.44    6.45    1745.6
Accuracy. To analyze the accuracy of the models in answering questions after truncation, we conduct experiments using the MMLU [18], LongEval [23, 58], and PIQA [7] benchmarks. Specifically, we first input a long text to simulate the overflow of historical inputs and trigger the truncation operation, and then append the questions from the benchmarks as new inputs. As shown in Table 2, both AS and TT provide comparably high accuracy. TT achieves high accuracy by paying the recomputation cost for context window overflow, while AS avoids this cost and still maintains high accuracy. In contrast, NKVT has a much lower accuracy than AS and TT because the coupled positional encoding after KV cache truncation is miscoded, which causes more disruption to the new inputs.

Table 2: Accuracy of different methods.

Benchmark   Model       AS       TT       NKVT
MMLU        LLaMA-7B    43.7%    43.4%    21.8%
            LLaMA-13B   52.3%    53.2%    29.6%
LongEval    LLaMA-7B    66.0%    65.9%    12.0%
            LLaMA-13B   68.0%    68.0%    14.0%
PIQA        LLaMA-7B    77.1%    77.2%    48.9%
            LLaMA-13B   80.5%    80.4%    50.2%
4.3.8 Performance for Long Sequence Inference

Modern LLMs continue to incorporate longer context windows to accommodate a greater amount of information, empowering long sequence inference applications (e.g., document understanding [60] and code understanding [29]). We assess the efficacy of AttentionStore with models designed for long sequence inference in these applications. Specifically, we deploy the Mistral-7B model [20] with a maximum 32K context window on one A100 GPU with 80GB HBM, employing a GQA factor of 8 [2, 20]. We evaluate a documentation analysis application as an example. In this application, users submit a series of analysis tasks for the same document, forming a multi-turn conversation session. The size of the document varies from 4K to 28K tokens. Each session consists of 6 analysis tasks, with each task requiring an input of 256 tokens and producing an output of 64 tokens. In the experiments, we use a batch size of 1 and evaluate the performance of the second and subsequent turns.
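For concreteness, the workload described above can be driven by a loop like the sketch below, assuming a generic serving API with session identifiers; the engine interface, make_task_prompt helper, and timing logic are assumptions for illustration, not the actual benchmark harness.

```python
# Minimal sketch of the documentation-analysis workload driver (illustrative only).
import time

def make_task_prompt(n_tokens):
    # Placeholder analysis-task prompt of the requested length (assumption).
    return ["<task>"] * n_tokens

def run_session(engine, document_tokens, num_tasks=6, input_len=256, output_len=64):
    session_id = engine.new_session()
    ttfts, gen_tokens, elapsed = [], 0, 0.0
    for turn in range(num_tasks):
        prompt = make_task_prompt(input_len)
        if turn == 0:
            prompt = document_tokens + prompt   # first task carries the long document
        start = time.time()
        first_token_at, output = engine.generate(
            prompt, max_new_tokens=output_len, session_id=session_id)
        end = time.time()
        if turn >= 1:                           # measure 2nd and subsequent turns only
            ttfts.append(first_token_at - start)
            gen_tokens += len(output)
            elapsed += end - start
    avg_ttft = sum(ttfts) / len(ttfts)
    output_throughput = gen_tokens / elapsed    # generated tokens / total task time
    return avg_ttft, output_throughput
```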
Figure 25: Prefilling performance of long sequence inference. (a) Impact on TTFT. (b) Impact on prefill throughput.

TTFT. Figure 25a shows the average TTFT for different sequence lengths. The TTFT of RE gradually increases to 2.5s when the sequence length reaches 28K. In contrast, AS has only about 0.12s of TTFT, resulting in a 95% reduction in TTFT compared to RE.

Prefilling throughput. Figure 25b illustrates the measured prefilling throughput. AS significantly improves the prefilling throughput, achieving a speedup of up to 22× compared to RE. The prefilling throughput of RE cannot improve as the sequence length grows because it is bounded by the computational capability. In contrast, AS observes a continuous increase in the prefilling throughput as the sequence length grows by efficiently reusing the historical KV cache.

GPU time. Figure 26a shows the average GPU time to complete each analysis task. AS demonstrates consistent GPU time savings compared to RE, regardless of the sequence length. When the sequence length increases, RE requires more prefilling time to recompute the KV caches, which accounts for a substantial portion of the total GPU time, i.e., 41% for a sequence length of 28K. In contrast, AS efficiently reduces prefilling costs by reusing KV caches, resulting in only 1.2% of the GPU time being allocated to prefilling.

Output throughput. The overall output throughput is calculated as the number of generated tokens divided by the total processing time of a task, as shown in Figure 26b. As the sequence length increases, both RE and AS experience a decrease in throughput. This is attributed to the increased computational demands for computing attention over the lengthy sequence, resulting in a longer decoding time for each token. Notably, AS consistently surpasses RE in all scenarios, demonstrating an improvement in output throughput of up to 67%. These improvements primarily stem from the elimination of KV cache recomputation.

Figure 26: Overall performance of long sequence inference. (a) Impact on GPU time. (b) Impact on output throughput.

5 Related Work

KV Cache Management. Within a single-turn conversation, the KV cache is widely used for improving the performance of the decoding phase [10, 45, 52, 54, 61, 67]. To reduce the storage overhead of the KV cache on HBMs, existing work employs quantization and compression techniques on KV caches [13, 25, 58, 65]. To reduce the memory waste incurred by fragmentation, vLLM [22] takes inspiration from virtual memory to allow the KV cache to use fine-granularity non-continuous memory. These techniques are orthogonal to AttentionStore, which focuses on multi-turn conversations.

LMDeploy [19] is an LLM inference framework that caches the KV caches of multi-turn conversations on HBMs. RadixAttention [67], ChunkAttention [60], and Pensieve [62] are inference techniques that were developed concurrently with AttentionStore. RadixAttention and ChunkAttention optimize inference tasks that share prompt prefixes: tasks with the same prompt prefixes share the same KV caches to reduce the KV computation. Pensieve utilizes both GPU and CPU memory to store KV caches for multi-turn conversations. Different from all these works, AttentionStore exploits slower but larger storage hierarchies to save the KV caches to achieve high cache hit rates, as presented in Section 4.3.6, and focuses on designing systemic techniques to address the challenges of offloading to slower mediums.

Inference Parameter Offloading. FlexGen [43] offloads both model weights and the KV cache to DRAM and disks to support offline inference of LLMs. DeepSpeed Inference [4, 40, 41] offloads model weights to the DRAM and disks and fetches them on demand. Lina [24] offloads infrequently used expert weights of LLMs to the host memory to improve the memory efficiency. PowerInfer [44] and LLM in a flash [3] utilize sparsity [26, 28] in FFN computation to offload most of the inactive weights to the host memory or disks to reduce both memory usage and computation. FastServe [56] schedules the KV caches to the host memory for optimizing the job completion time. In contrast, AttentionStore exploits KV cache offloading to reduce the recomputation overhead in multi-turn conversations.

6 Conclusion

This paper proposes AttentionStore, a new attention mechanism that allows the reuse of the KV caches for any ensuing turns of the same conversation, achieving a significant reduction in the recomputation overhead of KV caches in LLMs. To improve the efficiency of AttentionStore, we design overlapped KV cache access, hierarchical KV cache placement, and positional-encoding-decoupled KV cache truncation schemes. Extensive experimental results demonstrate that AttentionStore significantly decreases the TTFT by up to 87% and improves the prompt prefilling throughput by 7.8× for multi-turn conversations. It reduces the end-to-end inference cost by up to 70%. It also decreases the TTFT by up to 95% and enhances the prompt prefilling throughput by 22× for long sequence inference.

References

[1] Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, and Ramachandran Ramjee. Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills. arXiv preprint arXiv:2308.16369, 2023.

[2] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.

[3] Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. Llm in a flash: Efficient large language model inference with limited memory. arXiv preprint arXiv:2312.11514, 2023.

[4] Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, et al. Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC, 2022.

[5] AWS. Amazon ec2 p4d pricing. https://aws.amazon.com/ec2/instance-types/p4/.

[6] AWS. Amazon ec2 pricing. https://aws.amazon.com/ec2/pricing/.

[7] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, AAAI, 2020.
[8] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.

[9] Asit Dan and Don Towsley. An approximate analysis of the lru and fifo buffer replacement schemes. In Proceedings of the 1990 ACM SIGMETRICS conference on Measurement and modeling of computer systems, pages 143–152, 1990.

[10] Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, and Subhabrata Mukherjee. Skipdecode: Autoregressive skip decoding with batching and caching for efficient llm inference. arXiv preprint arXiv:2307.02628, 2023.

[11] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Gpt3.int8(): 8-bit matrix multiplication for transformers at scale. In Proceedings of Advances in Neural Information Processing Systems, NeuIPS, 2022.

[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[13] Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801, 2023.

[14] Hila Gonen, Srini Iyer, Terra Blevins, Noah A Smith, and Luke Zettlemoyer. Demystifying prompts in language models via perplexity estimation. arXiv preprint arXiv:2212.04037, 2022.

[15] Chengyue Gong, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. Frage: Frequency-agnostic word representation. In Proceedings of Advances in Neural Information Processing Systems, NeuIPS, 2022.

[16] Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. Lm-infinite: Simple on-the-fly length generalization for large language models. arXiv preprint arXiv:2308.16137, 2023.

[17] Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. Transformer in transformer. In Proceedings of Advances in Neural Information Processing Systems, NeuIPS, 2021.

[18] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

[19] InternLM. Lmdeploy. https://github.com/InternLM/lmdeploy.

[20] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.

[21] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.

[22] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of ACM Symposium on Operating Systems Principles, SOSP, 2023.

[23] Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. How long can context length of open-source llms truly promise? In Workshop in Proceedings of Advances in Neural Information Processing Systems, NeuIPS Workshop, 2023.

[24] Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. Accelerating distributed moe training and inference with lina. In Proceedings of USENIX Annual Technical Conference, ATC, 2023.

[25] Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. arXiv preprint arXiv:2305.17118, 2023.

[26] Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, et al. Deja vu: Contextual sparsity for efficient llms at inference time. In Proceedings of International Conference on Machine Learning, ICML, 2023.

[27] Mitch Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 19(2):313–330, 1993.

[28] Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, and Mehrdad Farajtabar. Relu strikes back: Exploiting activation sparsity in large language models. arXiv preprint arXiv:2310.04564, 2023.

[29] Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. Using an llm to help with code understanding. In Proceedings of IEEE/ACM International Conference on Software Engineering, ICSE, 2024.

[30] NVIDIA. Fastertransformer. https://github.com/NVIDIA/FasterTransformer.

[31] NVIDIA. Nvidia collective communications library (nccl). https://developer.nvidia.com/nccl.

[32] OpenAI. https://openai.com/blog/chatgpt, 2024.

[33] OpenAI. https://platform.openai.com/docs/assistants/how-it-works/managing-threads-and-messages, 2024.

[34] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Proceedings of Advances in Neural Information Processing Systems, NeuIPS, 2019.

[35] Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Aashaka Shah, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. arXiv preprint arXiv:2311.18677, 2023.

[36] Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.

[37] philschmid. Sharegpt raw. https://huggingface.co/datasets/philschmid/sharegpt-raw/tree/main/sharegpt_90k_raw_dataset.

[38] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. In Proceedings of Machine Learning and Systems, MLSys, 2023.

[39] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 2020.

[40] Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale. In Proceedings of International Conference on Machine Learning, ICML, 2022.

[41] Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC, 2021.

[42] ShareGPT. Sharegpt. https://sharegpt.com/.

[43] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E Gonzalez, et al. Flexgen: High-throughput generative inference of large language models with a single gpu. In Proceedings of International Conference on Machine Learning, ICML, 2023.

[44] Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. Powerinfer: Fast large language model serving with a consumer-grade gpu. arXiv preprint arXiv:2312.12456, 2023.

[45] Benjamin Spector and Chris Re. Accelerating llm inference with staged speculative decoding. arXiv preprint arXiv:2308.04623, 2023.

[46] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

[47] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

[48] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[49] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

[50] Valentin Touzeau, Claire Maïza, David Monniaux, and Jan Reineke. Fast and exact analysis for lru caches. Proceedings of the ACM on Programming Languages, 3(POPL):1–29, 2019.

[51] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of Advances in Neural Information Processing Systems, NeuIPS, 2017.

[52] vLLM Project. vllm: Easy, fast, and cheap llm serving with pagedattention. https://github.com/vllm-project/vllm/.

[53] Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. Mint: Evaluating llms in multi-turn interaction with tools and language feedback. arXiv preprint arXiv:2309.10691, 2023.

[54] Yiding Wang, Kai Chen, Haisheng Tan, and Kun Guo. Tabi: An efficient multi-level inference system for large language models. In Proceedings of the European Conference on Computer Systems, EuroSys, 2023.

[55] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.

[56] Bingyang Wu, Yinmin Zhong, Zili Zhang, Gang Huang, Xuanzhe Liu, and Xin Jin. Fast distributed inference serving for large language models. arXiv preprint arXiv:2305.05920, 2023.

[57] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023.

[58] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.

[59] Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mt5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934, 2020.

[60] Lu Ye, Ze Tao, Yong Huang, and Yang Li. Chunkattention: Efficient self-attention with prefix-aware kv cache and two-phase partition. arXiv preprint arXiv:2402.15220, 2024.

[61] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for transformer-based generative models. In Proceedings of USENIX Symposium on Operating Systems Design and Implementation, OSDI, 2022.

[62] Lingfan Yu and Jinyang Li. Stateful large language model serving with pensieve. arXiv preprint arXiv:2312.05516, 2023.

[63] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

[64] Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. Ernie: Enhanced language representation with informative entities. arXiv preprint arXiv:1905.07129, 2019.

[65] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. arXiv preprint arXiv:2306.14048, 2023.

[66] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.

[67] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Efficiently programming large language models using sglang. arXiv preprint arXiv:2312.07104, 2023.
