DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving
Yinmin Zhong1 Shengyu Liu1 Junda Chen3 Jianbo Hu1 Yibo Zhu2 Xuanzhe Liu1
Xin Jin1 Hao Zhang3
token (TPOT), which represents the average time taken to generate a token for each request (except for the first token)¹. Different applications place varying demands on each metric. For example, real-time chatbots [1] prioritize low TTFT for response promptness, while TPOT only remains important until it is faster than human reading speed (i.e., 250 words/min). Conversely, document summarization emphasizes low TPOT for faster generation of the summary.

¹The overall request latency equals TTFT plus TPOT times the number of generated tokens in the decoding phase.

Hence, given the application's TTFT and TPOT requirements, an effective LLM serving system should balance these needs and maximize per-GPU goodput, defined as the maximum request rate that can be served adhering to the SLO attainment goal (say, 90%) for each GPU provisioned – higher per-GPU goodput directly translates into lower cost per query.

As the prefill and decoding phases share the LLM weights and working memory, existing LLM serving systems typically colocate both phases on GPUs and maximize the overall system throughput – tokens generated per second across all users and requests – by batching the prefill and decoding steps across requests [31, 54]. However, to meet latency requirements, we find these systems must over-provision compute resources. To see this, Figure 1 illustrates how the P90 TTFT and TPOT shift with increasing request rates when serving a 13B LLM using existing systems [32], with the workload pattern and two latency constraints set to emulate using an LLM to generate a short summary for an article. Under an SLO attainment of 90%, the maximum achievable goodput on a single A100 GPU, which is constrained by the more stringent of the TTFT and TPOT requirements, is about 1.6 requests per second (rps). The performance contrasts sharply when each phase is served independently on a separate GPU, shown by the orange and green curves, which achieve per-GPU goodputs of 5.6 rps for the prefill phase and 10 rps for decoding. Ideally, by allocating 2 GPUs for prefill and 1 GPU for decoding, we can effectively serve the model with an overall goodput of 10 rps, or equivalently 3.3 rps per GPU, which is 2.1× higher than existing systems. The gap in goodput primarily stems from the colocation of prefill and decoding – two phases with very distinct computational characteristics and latency requirements (§2.1).

First, colocation leads to strong prefill-decoding interference. A prefill step often takes much longer than a decoding step. When batched together, decoding steps in the batch are delayed by the prefill steps, significantly elongating their TPOT; similarly, the inclusion of decoding steps contributes to a non-trivial increase in TTFT, as evidenced in Figure 2. Even if we schedule them separately, issues persist as they begin to compete for resources. Decoding tasks awaiting GPU execution are subject to increased queuing delays due to ongoing prefill tasks, and vice versa. Prioritized scheduling of one phase risks failing the latency requirements of the other.

Second, the prefill and decoding computation differ in latency requirements and preference for different forms of parallelism (§3). Colocating prefill and decoding, however, couples their resource allocation and prevents implementing the different parallelism strategies better suited to meeting the specific latency requirements of each phase.

To overcome these challenges, we propose to disaggregate the prefill and decoding phases of LLM inference, assigning them to separate GPUs. Our approach has two benefits. First, operating each phase independently on different GPUs eliminates prefill-decoding interference. Second, it allows each phase to scale independently with tailored resource allocation and model parallelism strategies to meet its specific latency requirements. Although disaggregation causes communication of intermediate states between GPUs, we show that the communication overhead is insubstantial (§3.3) in modern GPU clusters, and when managed appropriately, disaggregation significantly improves per-GPU goodput.

Based on the above insights, in this work we build DistServe², a goodput-optimized LLM serving system that disaggregates the prefill and decoding phases. Given TTFT and TPOT requirements, DistServe first scales each phase independently by co-optimizing the GPU allocation and parallelism strategies of the prefill and decoding phases, assuming a single model replica is served. The optimization maximizes the per-GPU goodput and may assign different numbers of GPUs and parallelism strategies to each phase depending on their respective latency requirements. DistServe then scales this allocation to multiple instances via replication until meeting the user-required traffic rate (§4). DistServe also features an algorithm to place the prefill and decoding computation according to their allocation schemes and the cluster's bandwidth to minimize the overhead of communicating intermediate states between phases.

²https://github.com/LLMServe/DistServe

We implement DistServe as an orchestration layer on top of the LLM inference engine. We evaluate DistServe on various LLMs, varying the workloads based on three important real-world LLM applications: chatbots, programming assistants, and document summarization. Compared to state-of-the-art solutions, DistServe can serve up to 7.4× more requests or meet 12.6× tighter SLOs under various latency constraints. Our contributions are:

• Identify the problems of prefill-decoding interference and resource coupling in existing LLM serving systems and propose disaggregating the two phases.
• Design a novel placement algorithm to choose the goodput-optimal schema for prefill and decoding instances automatically.
• Conduct a comprehensive evaluation of DistServe with realistic workloads.

2 Background and Motivation

An LLM service follows a client-server architecture: the client submits a sequence of text as a request to the server; the server
hosts the LLM on GPUs, runs inference over the request, and responds (or streams) the generation back to the client. As explained in §1, due to the unique prefill-decoding process, an LLM service may impose aggressive service-level objectives (SLOs) on both TTFT and TPOT, varying with the application's needs. The serving system must meet both SLOs while minimizing the cost associated with expensive GPUs. In other words, we want the serving system to maximize the requests served per second adhering to the SLO attainment goal for each GPU provisioned – maximizing per-GPU goodput. Next, we detail the LLM inference computation (§2.1) and discuss existing optimizations for LLM serving (§2.2).

2.1 LLM Inference

Modern LLMs [37, 51] predict the next token given an input sequence. This prediction involves computing a hidden representation for each token within the sequence. An LLM can take a variable number of input tokens and compute their hidden representations in parallel, and its computation workload increases superlinearly with the number of tokens processed in parallel. Regardless of the input token count, the computation demands substantial I/O to move LLM weights and intermediate states from the GPU's HBM to SRAM. This process is consistent across varying input sizes.

The prefill step deals with a new sequence, often comprising many tokens, and processes these tokens concurrently. Unlike prefill, each decoding step only processes one new token generated by the previous step. This leads to significant computational differences between the two phases. When dealing with user prompts that are not brief, the prefill step tends to be compute-bound. For instance, for a 13B LLM, computing the prefill of a 512-token sequence makes an A100 nearly compute-bound (see §3.1). In contrast, despite processing only one new token per step, the decoding phase incurs a similar level of I/O to the prefill phase, making it constrained by the GPU's memory bandwidth.

During both phases, intermediate states, known as KV caches [32], are generated at each token position and are needed again in later decoding steps. To avoid recomputing them, they are saved in GPU memory. Because of the shared use of LLM weights and KV caches in memory, most LLM inference engines opt to colocate the prefill and decoding phases on GPUs, despite their distinct computational characteristics.

2.2 LLM Serving Optimization

In real-time online serving, multiple requests arrive and must be served within SLOs. Batching and parallelizing their computation is key to achieving low latency, high throughput, and high utilization of GPUs.

Batching. Current serving systems [9, 32, 54] utilize a batching technique known as continuous batching. This method batches the prefill of new requests with the decoding of ongoing ones. It boosts GPU utilization and maximizes the overall system throughput – tokens generated per second across all users and requests. However, as mentioned in §1 and elaborated later in §2.3, this approach leads to trade-offs between TTFT and TPOT. An advanced variant of continuous batching [9] attempts to balance TTFT and TPOT by segmenting a long prefill into chunks and attaching decoding jobs to each prefill chunk – but essentially, it trades TTFT for TPOT and cannot eliminate the interference (§2.3). In summary, batching prefill and decoding invariably leads to compromises in either TTFT or TPOT.

Model parallelism. In LLM serving, model parallelism is generally divided into intra- and inter-operator parallelism [33, 46, 59]. Both can be used to support larger models but may impact serving performance differently. Intra-operator parallelism partitions computationally intensive operators, such as matrix multiplications, across multiple GPUs, accelerating computation but causing substantial communication. It reduces the execution time³, and hence latency, particularly the TTFT of the prefill phase, but requires high-bandwidth connectivity between GPUs (e.g., NVLINK). Inter-operator parallelism organizes LLM layers into stages, each running on a GPU to form a pipeline. It moderately increases execution time due to inter-stage communication, but linearly scales the system's rate capacity with each added GPU. In this paper, we reveal an additional benefit of model parallelism: reduced queuing delay of both prefill and decoding phases, stemming from shorter execution time. We delve into this further in §3. Besides model parallelism, replicating a model instance, irrespective of its model parallelism configuration, linearly scales the system's rate capacity.

³We emphasize "execution time" instead of latency here because latency comprises both execution time and queuing delay.

These parallelism strategies create a complex optimization space that requires careful trade-offs based on the application's latency requirements.

2.3 Problems and Opportunities

Colocating and batching the prefill and decoding computation to maximize the overall system throughput, as in existing systems, is cost-effective for service providers. However, in the presence of SLOs, present approaches struggle to maintain both high service quality and low cost due to the issues discussed below.

Prefill-decoding interference. As Figure 2 shows, adding a single prefill job to a batch of decoding requests significantly slows down both processes, leading to a marked increase in TTFT and TPOT. Specifically, the decoding tasks in the batch must wait for lengthier prefill jobs to complete, thus extending TPOT; the slowdown intensifies with a longer prefill, as shown in Figure 2(b). Adding decoding jobs to a prefill also increases the time to complete the prefill task, particularly when the GPU is already at capacity (Figure 2, blue curves).

Figure 2: Batch execution time when serving a 13B LLM as batch size increases, comparing a decoding-only batch with the same batch plus one prefill job. (a) Input length = 128; (b) Input length = 1024.
One attempt to mitigate this interference is called chunked-prefill with piggyback [3, 9]. It proposes to split the long prefill into chunks and batch a prefill chunk with a few decoding jobs (a.k.a. piggybacking). This technique alleviates the slowdown of the decoding job caused by the long prefill job, but it does not eliminate it. Additionally, it introduces extra overhead for the prefill job that cannot be easily mitigated by adjusting the chunk size. First, if the chunk size is set much lower than the inflection point that saturates the GPU, then the prefill job will have a longer execution time, since it competes with the decoding jobs in the same batch and cannot solely utilize the GPU resources. Second, if we increase the chunk size to nearly saturate the GPU, the chance of piggybacking diminishes since the remaining slots for decode tokens are limited. Also, chunked-prefill causes significantly more memory access for the prefill jobs. This is because the KV cache of all previous chunks has to be loaded from HBM to SRAM repeatedly to compute each subsequent chunk. Concretely, if a prefill job is split into N equal chunks, we need to load N + (N − 1) + ... + 1 = O(N²) chunks of KV cache in total, compared to O(N) in the non-chunked case. This overhead increases as the context length grows.
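As a back-of-the-envelope sketch of this extra KV-cache traffic (a simple restatement of the sum above, not a measurement):

    def extra_kv_tokens_loaded(context_len: int, chunk_size: int) -> int:
        # Tokens of KV cache re-read from HBM during a chunked prefill: chunk i must
        # reload the KV cache of all earlier chunks, so the total grows as O(N^2).
        n = context_len // chunk_size                   # number of equal chunks (assumed exact)
        return sum(i * chunk_size for i in range(n))    # 0 + 1 + ... + (n-1) chunks re-read

    # Example: a 2048-token prompt split into 512-token chunks re-reads
    # (0 + 1 + 2 + 3) * 512 = 3072 tokens of KV cache that a non-chunked prefill never reloads.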
Ineffective scheduling. Unbatching prefill and decoding jobs and scheduling them sequentially does not mitigate the interference. Decoding jobs may experience longer queuing delays due to waiting for ongoing prefill jobs on GPUs. Moreover, batches dedicated to decoding often lead to GPU underutilization. Prioritizing tasks in either phase adversely affects the latency of the other, rendering priority scheduling ineffective.

Resource and parallelism coupling. Colocating the prefill and decoding phases on the same GPUs forces them to share their resource and parallelism settings. However, each phase has unique computational characteristics and latency requirements that call for more heterogeneous resource allocation. For example, the prefill phase tends to be compute-bound and benefits from more intra-op parallelism to reduce execution time to meet the tight SLO on TTFT. By contrast, the optimal parallelism configuration of the decoding phase depends on the running batch size. In existing systems, due to coupling, resource allocation and parallelism plans are tailored to satisfy the more demanding of TTFT and TPOT, which may not be ideal for the other. This often leads to resource over-provisioning to meet both SLOs.

Opportunities. To address these issues, we propose to disaggregate the prefill and decoding phases. We use the term instance to denote a unit of resources that manages exactly one complete copy of model weights. One instance can correspond to many GPUs when model parallelism is applied. Note that when we disaggregate the two phases onto different GPUs, each phase manages its own copy of the model weights, resulting in prefill instances and decoding instances. A prefill instance, upon receiving a request, performs only the prefill computation for this request to generate the first output token. It then sends the intermediate results (mainly KV caches) to a decoding instance, which is responsible for the subsequent decoding steps. Because decoding computation often has low GPU utilization, we may allocate multiple prefill instances per decoding instance. This allows batching more decoding jobs to achieve higher GPU utilization.

Disaggregating prefill and decoding naturally resolves the interference between the two phases and enables each to focus on its optimization target – TTFT or TPOT. Each type of instance can employ different resources and parallelism strategies to meet a variety of latency requirements. By adjusting the number of GPUs and the parallelism provided to the two types of instances, we can maximize the per-device goodput of the overall system, avoiding over-provisioning and eventually translating to reduced cost per query while adhering to service quality. Next, we develop ways to find the best resource allocation and parallelism plan for each phase.

3 Tradeoff Analysis

Disaggregation uncouples the two phases and allows a distinct analysis of the characteristics of each phase, providing valuable insights into the algorithm design. It also expands the design space: each phase now needs to be scaled and scheduled independently based on its latency requirements. In this section, we analyze the computational pattern of prefill (§3.1) and decoding instances (§3.2) post disaggregation. We aim to identify key parameters and derive guidelines for batching and parallelism in each phase. We then highlight several practical deployment considerations (§3.3). This section lays the foundation for per-GPU goodput optimization.

3.1 Analysis for Prefill Instance

After disaggregation, the prefill phase generates the first token by processing all tokens of the user prompt in parallel. Assuming a given arrival rate, we aim to fulfill the service's latency requirement on TTFT using the least resources.

Batching strategy. The prefill step is typically compute-intensive. Figure 3(a) shows how the throughput of the prefill phase changes with the input length and the batch size. For a 13B parameter LLM, processing a single sequence of 512 tokens can fully engage an A100 GPU. Once the GPU becomes compute-bound, adding more requests to the batch no longer improves GPU efficiency. Instead, it proportionally extends the total processing time for the batch, inadvertently
delaying all included requests. Hence, for prefill instances, it is necessary to profile the specific LLM and GPUs in advance to identify a critical input length threshold, denoted as Lm, beyond which the prefill phase becomes compute-bound. Batching more requests should only be considered when the input length of the scheduled request is below Lm. In practice, user prompts typically average over hundreds of tokens [8], so batch sizes for the prefill instance are generally kept small.

Parallelism plan. To study the parallelism preferences of prefill-only instances, we serve a 66B LLM on two A100 GPUs with an inter-op or intra-op parallelism strategy. To simplify the problem, we assume uniform request input lengths of 512 tokens and a Poisson arrival process. We compare the resulting average TTFT at various arrival rates in Figure 4(a): intra-op parallelism is more efficient at lower arrival rates, while inter-op parallelism gains superiority as the rate increases. Disaggregation enables the prefill phase to function analogously to an M/D/1 queue, so we can use queuing theory to verify the observation.

We start by developing notation for the single-device case without parallelism: each request's execution time, denoted as D, remains constant due to the uniform prefill length. Since one request saturates the GPU, we schedule requests via First-Come-First-Served (FCFS) without batching. Suppose the Poisson arrival rate is R and the utilization condition RD < 1 holds; the average TTFT (Avg_TTFT) can then be modeled by the M/D/1 queue [47] in closed form:

Avg_TTFT = D + RD² / (2(1 − RD)),    (1)

where the first term represents the execution time and the second corresponds to the queuing delay. Based on Eq. 1, we incorporate parallelism below.

With 2-way inter-op parallelism, we assume the request-level latency becomes Ds, and the slowest stage takes Dm to finish. We have D ≈ Ds ≈ 2 × Dm, due to negligible inter-layer activation communication [33, 59]. The average TTFT with 2-way inter-op parallelism is derived as:

Avg_TTFT_inter = Ds + RDm² / (2(1 − RDm)) = D + RD² / (4(2 − RD)).    (2)

For intra-op parallelism, we introduce a speedup coefficient K, where 1 < K < 2, reflecting the imperfect speedup caused by the high communication overheads of intra-op parallelism. With the execution time Ds = D/K, the average TTFT for 2-degree intra-op parallelism is:

Avg_TTFT_intra = D/K + RD² / (2K(K − RD)).    (3)

Comparing Eq. 2 and Eq. 3: at lower rates, where execution time (the first term) is the primary factor, intra-op parallelism's reduction in execution time makes it more efficient. As the rate increases and the queuing delay (the second term) becomes more significant, inter-op parallelism becomes advantageous, consistent with Figure 4(a).

The prefill phase's preference for parallelism is also influenced by the TTFT SLO and the speedup coefficient K. As seen in Figure 4(a), a more stringent SLO makes intra-op parallelism more advantageous, due to its ability to reduce execution time. The value of K depends on factors such as the input length, model architecture, communication bandwidth, and placement [46, 59]. As shown in Figure 4(b), a decrease in K notably reduces the efficacy of intra-op parallelism. §4 develops algorithms that optimize the resource and parallelism configurations taking these knobs into consideration.
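A minimal Python sketch of Eqs. (1)–(3) (our own illustration, not part of DistServe) makes it easy to locate the crossover rate for a given D and K:

    # M/D/1 TTFT models; all times in seconds, R in requests per second.
    def ttft_single(D, R):
        assert R * D < 1                                    # utilization condition
        return D + R * D**2 / (2 * (1 - R * D))             # Eq. (1)

    def ttft_inter(D, R):
        Dm = D / 2                                          # slowest stage under 2-way inter-op
        return D + R * Dm**2 / (2 * (1 - R * Dm))           # Eq. (2)

    def ttft_intra(D, R, K=1.6):                            # requires R * D < K
        return D / K + R * D**2 / (2 * K * (K - R * D))     # Eq. (3)

    # Sweeping R for a fixed D shows intra-op winning at low rates and inter-op
    # winning once queuing delay dominates, matching Figure 4(a).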
3.2 Analysis for Decoding Instance

Unlike the prefill instance, a decoding instance follows a distinct computational pattern: it receives the KV caches and the first output token from the prefill instance and generates subsequent tokens one at a time. For decoding instances, our optimization goal is to satisfy the application's TPOT requirement using minimal computing resources.

Batching strategy. Since a single decoding job is heavily bandwidth-bound, batching is key to avoiding low GPU utilization (and hence to high per-GPU goodput), as shown in Figure 3(b). In existing systems where the prefill and decoding phases are colocated, increasing the decoding batch size is difficult because it conflicts with meeting latency goals, particularly in scenarios with high request rates. This is because sharing GPUs causes competition between prefill and decoding jobs, leading to a trade-off between TTFT and TPOT. For example, a higher arrival rate generates more prefill jobs, demanding greater GPU time to meet TTFT requirements if prefill jobs are prioritized, which in turn adversely affects TPOT.

On the contrary, disaggregation offers a solution by enabling the allocation of multiple prefill instances to a single decoding instance. This approach allows for accumulating a larger batch size on dedicated GPUs for the decoding phase without sacrificing TPOT.

Figure 5: Decoding phase latency and throughput when serving a 13B LLM with batch size = 128 and input length = 256 under different parallelism degrees.

Parallelism plan. Post-disaggregation, the batch size for decoding may be constrained by GPU memory capacity, as it is necessary to maintain the KV caches for all active requests. Scaling the decoding instance with model parallelism, or leveraging advanced memory management techniques for LLM KV caches such as Paged-Attention [32] and GQA [10], enables further scaling of the decoding batch size to nearly compute-bound. As the decoding batch size continues to increase toward the compute-bound regime, the decoding computation begins to resemble the prefill phase. With this observation, we investigate how latency and throughput change under different parallelism degrees under large-batch conditions in Figure 5: intra-op parallelism reduces latency with diminishing returns, caused by communication and reduced utilization after partitioning. Inter-op parallelism can almost linearly scale the throughput. Hence, when the TPOT SLO is stringent, intra-op parallelism is essential to reduce TPOT to meet latency goals. Beyond this, inter-op parallelism is preferable to enhance throughput linearly.

It is worth noting that when the model can fit into the memory of a single GPU, replication is a competitive option in addition to model parallelism for both prefill and decoding instances to linearly scale the system's rate capacity. It may also reduce the queuing delay – as indicated by Eq. 1 – by substituting R with R/N, assuming requests are equally dispatched to N replicas, at the cost of maintaining additional replicas of the model weights in GPU memory.

3.3 Practical Problems

We have developed foundational principles for selecting batching and parallelism for each phase. In this section, we discuss and address several challenges encountered during the practical deployment of disaggregated prefill and decoding phases.

Variable prefill length. §3 has assumed a uniform prompt length across requests. In real deployments, depending on the LLM application, the lengths of requests are non-uniform. The non-uniformity can cause pipeline bubbles [28, 36] for prefill instances applying inter-op parallelism because the execution time of pipeline stages will vary across requests of different lengths. This results in slight deviations from the conclusions indicated by the M/D/1 queue model. To address the problem, §4 develops algorithms that search for parallelisms based on workloads, and resorts to scheduling to minimize the bubbles (§4.3).

Communication overhead. Transferring KV caches from prefill to decoding instances incurs notable overheads. For example, the KV cache of a single 512-token request on OPT-66B is approximately 1.13GB. Assuming an average arrival rate of 10 rps, we need to transfer 11.3GB of data per second – or equivalently 90Gbps of bandwidth – to render the overhead invisible. While many modern GPU clusters for LLMs are equipped with InfiniBand (e.g., 800 Gbps), in cases where cross-node bandwidth is limited, DistServe relies on the commonly available intra-node NVLINK, where the peak bandwidth between A100 GPUs is 600 GB/s, again rendering the transmission overhead negligible (see §6.3). However, this requirement imposes additional constraints on the placement of prefill and decoding instances that we take into consideration in the next section.
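The 1.13GB figure can be sanity-checked with a short calculation; the OPT-66B shape used below (64 layers, hidden size 9216, FP16 K and V per layer) is our assumption rather than a value quoted in the paper:

    # Rough KV-cache size check for a single 512-token request; model shape is assumed.
    layers, hidden, bytes_per_elem, tokens = 64, 9216, 2, 512
    kv_bytes = 2 * layers * hidden * bytes_per_elem * tokens   # 2x for K and V
    print(kv_bytes / 2**30)      # ~1.13 GiB per request
    print(kv_bytes * 10 / 1e9)   # ~12 GB/s at 10 rps, i.e., on the order of the 90 Gbps above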
Through the analysis in this section, we identify the workload pattern, placement constraints, SLO requirements, parallelism strategies, and resource allocation as key parameters that create a web of considerations in designing the disaggregated serving system. How to automatically navigate this search space to find the configuration that achieves optimal per-GPU goodput is challenging, and is addressed next.

4 Method

We built DistServe to solve the above challenges. Given the model, workload characteristics, latency requirements, and SLO attainment target, DistServe will determine (a) the parallelism strategies for prefill and decoding instances, (b) the number of each instance type to deploy, as well as (c) how to place them onto the physical cluster. We call the solution a placement. Our goal is to find a placement that maximizes the per-GPU goodput.

As explained in §3.3, a key design consideration is to manage communications between the disaggregated prefill and decoding phases, given varying cluster setups. In this section, we first present two placement algorithms: one for clusters with high-speed cross-node networks (§4.1) and the other for environments lacking such infrastructure (§4.2); the latter introduces additional constraints. We then develop online scheduling optimizations that adapt to the nuances of real-world workloads (§4.3).
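Concretely, a placement bundles the decision variables (a)–(c) above; the record below is an illustrative sketch, not DistServe's actual data structure:

    from dataclasses import dataclass

    @dataclass
    class ParallelConfig:
        inter_op: int                     # pipeline (inter-operator) parallelism degree
        intra_op: int                     # tensor (intra-operator) parallelism degree

    @dataclass
    class Placement:
        prefill_config: ParallelConfig    # (a) parallelism strategy for prefill instances
        decoding_config: ParallelConfig   # (a) parallelism strategy for decoding instances
        num_prefill: int                  # (b) number of prefill instances
        num_decoding: int                 # (b) number of decoding instances
        instance_to_node: dict            # (c) mapping from instance id to physical node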
Algorithm 1 High Node-Affinity Placement Algorithm
Input: LLM G, #node limit per-instance N, #GPU per-node M, GPU memory capacity C, workload W, traffic rate R.
Output: the placement best_plm.
  config_p, config_d ← ∅, ∅
  for intra_op ∈ {1, 2, ..., M} do
    for inter_op ∈ {1, 2, ..., N×M / intra_op} do
      if G.size / (inter_op × intra_op) < C then
        config ← (inter_op, intra_op)
        Ĝ ← parallel(G, config)
        config.goodput ← simu_prefill(Ĝ, W)
        if config_p.goodput / config_p.num_gpus < config.goodput / config.num_gpus then
          config_p ← config
        config.goodput ← simu_decode(Ĝ, W)
        if config_d.goodput / config_d.num_gpus < config.goodput / config.num_gpus then
          config_d ← config
  n, m ← ⌈R / config_p.goodput⌉, ⌈R / config_d.goodput⌉
  best_plm ← (n, config_p, m, config_d)
  return best_plm

Algorithm 2 Low Node-Affinity Placement Algorithm
Input: LLM G, #node limit per-instance N, #GPU per-node M, GPU memory capacity C, workload W, traffic rate R.
Output: the placement best_plm.
  config* ← ∅
  for inter_op ∈ {1, 2, ..., N} do
    P ← get_intra_node_configs(G, M, C, inter_op)
    for P_p ∈ P do
      for P_d ∈ P do
        if P_p.num_gpus + P_d.num_gpus ≤ M then
          config ← (inter_op, P_p, P_d)
          Ĝ_p, Ĝ_d ← parallel(G, config)
          config.goodput ← simulate(Ĝ_p, Ĝ_d, W)
          if config*.goodput / config*.num_gpus < config.goodput / config.num_gpus then
            config* ← config
  n ← ⌈R / config*.goodput⌉
  best_plm ← (n, config*)
  return best_plm
4.1 Placement for High Node-Affinity Cluster

On high node-affinity clusters equipped with InfiniBand, the overhead of transmitting KV caches across nodes is negligible, and DistServe can deploy prefill and decoding instances across any two nodes without constraints. We propose a two-level placement algorithm for such scenarios: we first optimize the parallelism configurations for prefill and decoding instances separately to attain phase-level optimal per-GPU goodput; then, we use replication to match the overall traffic rate.

However, finding the optimal parallel configuration for a single instance type, such as for the prefill instance, is still challenging, due to the lack of a simple analytical formula to calculate the SLO attainment (i.e., the percentage of requests that meet the TTFT requirement), given that the workload has diverse input and output lengths and irregular arrival patterns. Gauging the SLO via real-testbed profiling is time-prohibitive. We thus resort to building a simulator to estimate the SLO attainment, assuming prior knowledge of the workload's arrival process and input and output length distributions. Although short-term arrival patterns are impossible to predict, the workload pattern over longer timescales (e.g., hours or days) is often predictable [33, 55]. DistServe fits a distribution from the historical request traces and resamples new traces from the distribution as the input workload to the simulator to compute the SLO attainment. Next, DistServe simply enumerates the placements and finds the maximum rate that meets the SLO attainment target via binary search and simulation trials.

Algorithm 1 outlines the process. We enumerate all feasible parallel configurations, subject to the cluster capacity limit, for both prefill and decoding instances. Then, for a specific prefill phase configuration, we use simu_prefill to simulate and find its maximum goodput via binary search (and similarly use simu_decode for decoding). After determining the optimal parallel configurations for both prefill and decoding instances, we replicate them to achieve the user-required overall traffic rate according to their goodput.

The complexity of Algorithm 1 is O(NM²), with N as the node limit per instance and M representing the typical number of GPUs per node in modern clusters (e.g., 8). The search space is manageable and the solving time is under 1.3 minutes in our largest setting, as demonstrated in §6.5.
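The enumeration in Algorithm 1 can be transcribed into Python as follows; simu_prefill, simu_decode, and parallel are stand-ins for the simulator and parallelization interfaces, so their exact signatures here are assumptions:

    import math

    def high_affinity_placement(G, N, M, C, W, R):
        # Sketch of Algorithm 1: pick the per-GPU-goodput-optimal prefill and decoding
        # configs via the simulator, then replicate each to sustain the traffic rate R.
        best_p = best_d = None              # (goodput, num_gpus, (inter_op, intra_op))
        for intra_op in range(1, M + 1):
            for inter_op in range(1, N * M // intra_op + 1):
                num_gpus = inter_op * intra_op
                if G.size / num_gpus >= C:
                    continue                # model weights do not fit in per-GPU memory
                G_hat = parallel(G, (inter_op, intra_op))
                p = simu_prefill(G_hat, W)
                if best_p is None or best_p[0] / best_p[1] < p / num_gpus:
                    best_p = (p, num_gpus, (inter_op, intra_op))
                d = simu_decode(G_hat, W)
                if best_d is None or best_d[0] / best_d[1] < d / num_gpus:
                    best_d = (d, num_gpus, (inter_op, intra_op))
        n = math.ceil(R / best_p[0])        # number of prefill instances
        m = math.ceil(R / best_d[0])        # number of decoding instances
        return n, best_p[2], m, best_d[2]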
Simulator building. Algorithm 1 relies on a simulator to estimate the goodput under various SLOs and SLO attainment goals given the workload and the parallelism plan. To build an accurate simulator, we analyze the FLOPs and the number of memory accesses for the prefill and decoding phases respectively, and use a latency model to approximate the inference execution time. See details in Appendix A. The simulator aligns well with real profiling results, thanks to the high predictability of DNN workloads [23, 33], as verified in §6.4.
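As a rough illustration of how such a simulator is driven (the empirical resampling below stands in for DistServe's actual distribution fitting):

    import numpy as np

    def resample_trace(history, duration_s, rate, seed=0):
        # Synthesize a workload: Poisson arrivals at the target rate, with input/output
        # lengths resampled from historical requests; the resulting trace is replayed
        # through the latency model to estimate SLO attainment.
        rng = np.random.default_rng(seed)
        t, trace = 0.0, []
        while t < duration_s:
            t += rng.exponential(1.0 / rate)              # Poisson inter-arrival gap
            in_len, out_len = history[rng.integers(len(history))]
            trace.append((t, in_len, out_len))
        return trace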
So far, we have developed Algorithm 1 assuming that we can place the prefill and decoding instances between any two nodes (or on the same node) of the cluster, and that the KV cache transmission utilizes a high-bandwidth network. In many real clusters, GPUs inside a node have access to high-bandwidth NVLINK while GPUs distributed across nodes have limited bandwidth. We next develop an algorithm to address this constraint.

4.2 Placement for Low Node-Affinity Cluster

A straightforward solution is to always colocate prefill and decoding instances on the same node, utilizing the NVLINK that is commonly available inside a GPU node. For large models, e.g., with 175B parameters (350GB), we may be unable to host even a single pair of prefill and decoding instances
in an 8-GPU node (80GB × 8 = 640GB < 350GB × 2). We incorporate this as an additional placement constraint and co-optimize it with model parallelism, as presented in Algorithm 2.

The key insight is that KV cache transfer occurs exclusively between corresponding layers of prefill and decoding instances. Leveraging inter-op parallelism, we group layers into stages and divide each instance into segments, termed instance segments, with each segment maintaining one specific inter-op stage. By colocating prefill and decoding segments of the same stage within a single node, we force the transfer of intermediate states to occur only via NVLINK. Inside a node, we set the same parallelism and resource allocation for segments of the same instance. Given the typical limitation of GPUs per node (usually 8), we can enumerate the possible configurations inside one node and use the simulator to identify the configurations that yield the best goodput.

As outlined in Algorithm 2, we begin by enumerating inter-op parallelism degrees to get all the possible instance segments. For each segment, we get all possible intra-node parallelism configurations by calling get_intra_node_configs. Then we use simulation to find the optimal one and replicate it to satisfy the target traffic rate.

Figure 6: DistServe Runtime System Architecture.

4.3 Online scheduling

The runtime architecture of DistServe is shown in Figure 6. DistServe operates with a simple FCFS scheduling policy. All incoming requests arrive at a centralized controller, are dispatched to the prefill instance with the shortest queue for prefill processing, and are then dispatched to the least-loaded decoding instance for the decoding steps. This setup, while simple, is optimized with several key enhancements tailored to the nuances of real-world workloads.

Reducing pipeline bubbles. To mitigate the pipeline bubbles caused by non-uniform prompt lengths (§3.3), we schedule the requests in a way that balances the execution time across all batches in the pipeline. This is achieved by noting that, for both prefill and decoding instances, the number of new tokens in the batch is a reliable indicator of the batch's real execution time. For prefill instances, we profile the target model and GPU to figure out the shortest prompt length Lm needed to saturate the GPU. We schedule prefill batches with a total sequence length close to Lm, by either batching multiple requests shorter than Lm or individually scheduling requests longer than Lm. For decoding instances, we set Lm as the largest batch size.
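A minimal sketch of this batching rule (our own illustration, not DistServe's scheduler code):

    def form_prefill_batches(queue, L_m):
        # Pack FCFS requests into prefill batches whose total prompt length stays close
        # to the saturation threshold L_m; prompts longer than L_m run alone.
        batches, cur, cur_len = [], [], 0
        def flush():
            nonlocal cur, cur_len
            if cur:
                batches.append(cur)
                cur, cur_len = [], 0
        for req_id, prompt_len in queue:          # queue is in FCFS order
            if prompt_len >= L_m:                 # a long prompt saturates the GPU alone
                flush()
                batches.append([(req_id, prompt_len)])
            elif cur_len + prompt_len > L_m:      # would overshoot the saturation point
                flush()
                cur, cur_len = [(req_id, prompt_len)], prompt_len
            else:
                cur.append((req_id, prompt_len))
                cur_len += prompt_len
        flush()
        return batches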
Combat burstiness. Burstiness in workloads can cause a deluge of KV caches to transfer from prefill to decoding instances, risking memory overload on the decoding instances. To circumvent this, DistServe employs a "pull" method for KV cache transmission rather than a "push" approach – decoding instances fetch KV caches from prefill instances as needed, using the GPU memory of the prefill instances as a queuing buffer. This way, a prefill instance can continue handling other prefill jobs by simply retaining the KV cache in GPU memory after processing the prompt. Hence, each type of instance operates at its own pace without complex coordination.

Replanning. The resource and parallelism plan in DistServe is optimized for a specific workload pattern, which may become suboptimal if the workload pattern changes over time. DistServe implements periodic replanning. A workload profiler monitors key parameters such as the average input and output lengths of the requests, the average arrival rate, etc. If a significant pattern shift is detected, DistServe will trigger a rerun of the placement algorithm based on recent historical data. This process is expedient – the proposed algorithm runs in seconds (§6.5) and reloading LLM weights can be completed within minutes – far shorter than the hourly scale at which real-world workload variations tend to occur.

Preemption and fault tolerance. DistServe does not implement advanced runtime policies like preemption [26] and fault tolerance [58], which are complementary to disaggregation. Nevertheless, we discuss how they fit into DistServe. In DistServe, the FCFS policy can lead to a "convoy effect", where longer requests block shorter ones in the prefill stage. Incorporating preemptive strategies, as suggested in existing literature [53], could enhance efficiency and is feasible within our system's architecture. While not a primary focus of the current DistServe, fault tolerance is a critical aspect for consideration. In traditional colocation- and replication-based systems, a fault in one instance typically does not disrupt other replica instances. However, in DistServe, the dependency between prefill and decoding instances introduces the risk of fault propagation. For example, a fault in a single decoding instance mapped to multiple prefill instances could potentially cripple the entire service and cluster. We leave both as future work.

5 Implementation

DistServe is an end-to-end distributed serving system for LLMs with a placement algorithm module, a RESTful API frontend, an orchestration layer, and a parallel execution engine. The algorithm module, frontend, and orchestration layer are implemented in 6.5K lines of Python code. The parallel execution engine is implemented in 8.1K lines of
C++/CUDA code.

The placement algorithm module implements the algorithm and the simulator mentioned in §4, which gives the placement decision for a specific model and cluster setting. The frontend supports an OpenAI API-compatible interface where clients can specify sampling parameters like the maximum output length and temperature. The orchestration layer manages the prefill and decoding instances and is responsible for request dispatching, KV cache transmission, and result delivery. It utilizes NCCL [6] for cross-node GPU communication and asynchronous cudaMemcpy for intra-node communication, which avoids blocking the GPU computation during transmission. Each instance is powered by a parallel execution engine, which uses Ray [35] actors to implement GPU workers that execute the LLM inference and manage the KV cache in a distributed manner. It integrates many recent LLM optimizations like continuous batching [54], FlashAttention [20], and PagedAttention [32], and supports popular open-source LLMs such as OPT [56] and LLaMA [51].
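For illustration, a client request to such a frontend might look like the following; the endpoint path, port, and model name are hypothetical placeholders based on the OpenAI completions schema that the frontend mirrors, not documented DistServe defaults:

    import requests

    resp = requests.post(
        "http://localhost:8000/v1/completions",    # hypothetical frontend address
        json={
            "model": "facebook/opt-66b",          # illustrative model identifier
            "prompt": "Summarize the following article: ...",
            "max_tokens": 128,                    # maximum output length
            "temperature": 0.7,                   # sampling temperature
        },
    )
    print(resp.json())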
6 Evaluation

In this section, we evaluate DistServe under different sizes of LLMs ranging from 13B to 175B and on various application datasets including chatbot, code completion, and summarization. The evaluation shows that DistServe consistently outperforms the current state-of-the-art system across all the settings (§6.2). Specifically, DistServe can handle up to 7.4× higher rates and 12.6× more stringent SLOs while meeting the latency requirements for over 90% of requests. Additionally, we analyze the latency breakdown in DistServe to show that the communication overhead is insubstantial thanks to our bandwidth-aware placement algorithm (§6.3), and we perform ablation studies of our techniques (§6.4). Finally, we profile the execution time of our placement algorithm (§6.5).

Table 1: Workloads in evaluation and latency requirements.

Application     | Model    | Size  | TTFT   | TPOT  | Dataset
Chatbot         | OPT-13B  | 26GB  | 0.25s  | 0.1s  | ShareGPT [8]
Chatbot         | OPT-66B  | 132GB | 2.5s   | 0.15s | ShareGPT [8]
Chatbot         | OPT-175B | 350GB | 4.0s   | 0.2s  | ShareGPT [8]
Code Completion | OPT-66B  | 132GB | 0.125s | 0.2s  | HumanEval [14]
Summarization   | OPT-66B  | 132GB | 15s    | 0.15s | LongBench [13]

Figure 7: The input and output length distributions of (a) ShareGPT (input avg = 755.5, output avg = 200.3), (b) HumanEval (input avg = 171.3, output avg = 98.2), and (c) LongBench (input avg = 1738.3, output avg = 90.7).

6.1 Experiments Setup

Cluster testbed. We deploy DistServe on a cluster with 4 nodes and 32 GPUs. Each node has 8 NVIDIA SXM A100-80GB GPUs connected with NVLINK. The cross-node bandwidth is 25Gbps. Due to the limited cross-node bandwidth, we use the low node-affinity placement algorithm (§4.2) for DistServe in most of the experiments, except for the ablation study (§6.4), which uses simulation.

Model and workloads setup. Similar to prior work on LLM serving [32], we choose the OPT [56] model series, which is a representative LLM family widely used in academia and industry. Newer GPT model families are adopting memory-efficient attention mechanisms like GQA [10] and MQA [44]. DistServe will show better performance on these models because the transmission overhead is lower due to the decrease in KV cache size. We choose OPT, which uses classic MHA [52], to put enough pressure on the transmission overhead. We use FP16 precision in all experiments. For the workloads, as shown in Table 1, we choose three typical LLM applications and set the SLOs empirically based on their service targets, because, as far as we know, no established SLO settings exist for these applications. For each application, we select a suitable dataset and sample requests from it for evaluation. Since none of the datasets include timestamps, we generate request arrival times using a Poisson distribution with different request rates. Due to the space limit, we test the chatbot workload on all three OPT models and the other two workloads on OPT-66B, which matches the largest size in the recent open-source LLM series [51].

• Chatbot [1]: We use the ShareGPT dataset [8] for the chatbot application, which is a collection of user-shared conversations with ChatGPT. For OPT-13B, the TTFT SLO is set to 0.25s for responsiveness and the TPOT SLO is set to 0.1s, which is faster than normal human reading speed. For OPT-66B and OPT-175B, we slightly relax the two SLOs due to the increase in model execution latency.
• Code completion [14]: We use the HumanEval [14] dataset for the code completion task. It includes 164 programming problems with a function signature or docstring, which are used to evaluate the performance of code completion models. Since a code completion model is used as a personal real-time coding assistant, we set both SLOs to be stringent.
• Summarization [5]: This is a popular LLM task to generate a concise summary for a long article, essay, or even an academic paper. We use the LongBench [13] dataset, which contains the summarization task⁴. As shown in Figure 7, LongBench has much longer input lengths than the other two datasets, so we set a loose TTFT SLO but require a stringent TPOT.

⁴We capped the input lengths in LongBench because OPT's absolute positional embedding only supports a maximum length of 2048.

Metrics. We use SLO attainment as the major evaluation metric. Under a specific SLO attainment goal (say, 90%), we are
concerned with two things: the maximum per-GPU goodput and the minimal SLO the system can handle. We are particularly interested in an SLO attainment of 90% (indicated by the vertical lines in all curve plots), but we also vary the rate and latency requirements to observe how the SLO attainment changes. We additionally include results for an SLO attainment of 99% in the Appendix to show the system performance under a more stringent SLO attainment target.
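The metric itself is simple to state in code (a sketch of the definition above, with illustrative field names):

    def slo_attainment(requests, ttft_slo, tpot_slo, slo_scale=1.0):
        # Fraction of requests that meet both latency targets after scaling the SLOs;
        # lowering slo_scale tightens the requirement, as in the SLO Scale experiments.
        ok = sum(1 for r in requests
                 if r["ttft"] <= ttft_slo * slo_scale and r["tpot"] <= tpot_slo * slo_scale)
        return ok / len(requests)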
Baselines. We compare DistServe to two baseline systems:

• vLLM [32]: vLLM is a representative LLM serving system widely used in both academia and industry. It supports continuous batching [54] to increase throughput and paged-attention [32] to reduce memory fragmentation during KV cache allocation. However, it colocates the prefill and decoding computation to maximize the overall system throughput and struggles to meet the latency requirements cost-efficiently. Since vLLM only supports intra-op parallelism, we follow previous work [32] and set intra-op to 1, 4, and 8 for the three OPT models, respectively.
• DeepSpeed-MII [3]: DeepSpeed Model Implementations for Inference (MII) supports chunked-prefill by decomposing long prompts into smaller chunks and composing them with short prompts to exactly fill a target token budget. It mitigates but cannot eliminate the prefill-decoding interference caused by long prefill jobs. We set its intra-op the same as vLLM's for OPT-13B and OPT-66B for a fair comparison. However, DeepSpeed-MII cannot serve OPT-175B, whose vocab_size = 50272, because its underlying kernel implementation requires vocab_size/intra_op to be a multiple of 8, which intra-op = 8 does not satisfy. Setting intra-op to 4 satisfies this requirement but causes an out-of-memory issue.

Figure 8: Chatbot application: SLO attainment of DistServe, DeepSpeed-MII, and vLLM when serving (a) OPT-13B, (b) OPT-66B, and (c) OPT-175B under varying per-GPU rates and SLO Scales.

6.2 End-to-end Experiments

In this section, we compare the end-to-end performance of DistServe against the baselines on real application datasets.

Chatbot. We evaluate the performance of DistServe on the chatbot application for all three OPT models. The first row of Figure 8 illustrates that as we gradually increase the rate, more requests violate the latency requirements and the SLO attainment decreases. The vertical line shows the maximum per-GPU rate the system can handle while meeting latency requirements for over 90% of the requests.

On the ShareGPT dataset, DistServe can sustain a 2.0×–4.6× higher request rate compared to vLLM. This is because DistServe eliminates the prefill-decoding interference through disaggregation. The two phases can optimize their own objectives by allocating different resources and employing tailored parallelism strategies. Specifically, by analyzing the placement strategy⁵ chosen for the 175B model, we find the prefill instance has inter-op = 3, intra-op = 3, and the decoding instance has inter-op = 3, intra-op = 4. Under this placement, DistServe can effectively balance the load between the two instances on ShareGPT, meeting latency requirements at the lowest cost. This non-trivial placement strategy is challenging to find manually, proving the effectiveness of the algorithm. In the case of vLLM, colocating prefill and decoding greatly slows down the decoding phase, thereby significantly increasing TPOT. Due to the stringent TPOT requirements of chatbot applications, although vLLM meets the TTFT SLO for most requests, the overall SLO attainment is dragged down by a large number of requests that violate the TPOT SLO. Compared to DeepSpeed-MII, DistServe can sustain a 1.6×–7.4× higher request rate. DeepSpeed-MII shows better performance
on larger models because the prefill job is larger and chunked-prefill mitigates the interference to some extent. However, due to the reasons discussed in §2.3, chunked prefill is slower.

⁵All the placements chosen by DistServe can be found in Appendix B.

The second row of Figure 8 indicates the robustness of the two systems to changing latency requirements. We fix the rate and then linearly scale the two latency requirements in Table 1 simultaneously using a parameter called SLO Scale. As the SLO Scale decreases, the latency requirement becomes more stringent. We aim to observe the most stringent SLO Scale that the system can withstand while still achieving the attainment target. Figure 8 shows that DistServe can achieve 1.8×–3.2× more stringent SLOs than vLLM and 1.7×–1.8× more stringent SLOs than DeepSpeed-MII, thus providing more engaging service quality to the users.

Figure 9: Code completion and summarization tasks with OPT-66B on the HumanEval and LongBench datasets, respectively.

Code completion. Figure 9(a) shows the performance of DistServe on the code completion task when serving OPT-66B. DistServe can sustain a 5.7× higher request rate and a 1.4× more stringent SLO than vLLM. Compared to DeepSpeed-MII, DistServe can sustain a 1.6× higher request rate and a 1.4× more stringent SLO. As a real-time coding assistant, the code completion task demands a lower TTFT than the chatbot; this leads to both systems ultimately being constrained by the TTFT requirement. However, in comparison, by eliminating the interference of the decoding jobs and automatically increasing intra-op parallelism in prefill instances through the searching algorithm, DistServe reduces the average latency of the prefill jobs, thereby meeting the TTFT requirements of more requests.

Summarization. Figure 9(b) shows the performance of DistServe on the summarization task when serving OPT-66B. DistServe achieves a 4.3× higher request rate and a 12.6× more stringent SLO than vLLM. Compared to DeepSpeed-MII, DistServe achieves a 1.8× higher request rate and a 2.6× more stringent SLO. The requests sampled from the LongBench dataset have long input lengths, which puts significant pressure on the prefill computation. However, due to the loose TTFT requirement of the summarization task, the TPOT service quality becomes particularly important. Since vLLM colocates the prefill and decoding phases, it experiences a greater slowdown in the decoding phase with long prefill jobs and fails to meet the TPOT requirement.

The results above are all under the 90% SLO attainment target. We observe that DistServe can have better performance under a more stringent attainment target (say, 99%) and present the results in Appendix C.

Figure 10: Left: latency breakdown when serving OPT-175B on the ShareGPT dataset with DistServe. Right: the CDF of KV cache transmission time for the three OPT models.

6.3 Latency Breakdown

To understand DistServe's performance in detail, we present a latency breakdown of the requests in DistServe. We divide the processing lifecycle of a request in DistServe into five stages: prefill queuing, prefill execution, transmission, decoding queuing, and decoding execution. The total time consumed by all requests in each stage is summed up to determine their respective proportions in the system's total execution time.

Figure 10(a) shows the latency breakdown for OPT-175B on the ShareGPT dataset. We chose OPT-175B because the KV cache transmission is more demanding for larger models. In fact, even for OPT-175B, the KV cache transmission accounts for less than 0.1% of the total latency. Even examining the CDF of the absolute transmission time shown in Figure 10(b), we observe that over 95% of requests experience a delay of less than 30ms, despite our testbed having only limited cross-node bandwidth. This is due to the algorithm described in §4.2, where we require the prefill and decoding instances to maintain the same stage on one machine, enabling the use of intra-node NVLINK bandwidth for transmission and thus significantly reducing transmission delay.
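The breakdown in Figure 10(a) can be reproduced from per-request timings with a few lines (field names here are illustrative):

    def stage_breakdown(requests):
        # Sum each stage's time across requests and report its share of the total,
        # mirroring how the latency breakdown above is computed.
        stages = ["prefill_queuing", "prefill_execution", "transmission",
                  "decoding_queuing", "decoding_execution"]
        totals = {s: sum(r[s] for r in requests) for s in stages}
        grand_total = sum(totals.values())
        return {s: t / grand_total for s, t in totals.items()}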
Table 2: Comparison of the SLO attainment reported by the simulator and the real system under different rates.

Rate (req/s) | vLLM (Real System) | vLLM (Simulator) | DistServe-Low (Real System) | DistServe-Low (Simulator)
1.0          | 97.0%              | 96.8%            | 100.0%                      | 100.0%
1.5          | 65.5%              | 65.1%            | 100.0%                      | 100.0%
2.0          | 52.8%              | 51.0%            | 99.3%                       | 99.3%
2.5          | 44.9%              | 46.1%            | 87.3%                       | 88.3%
3.0          | 36.7%              | 38.3%            | 83.0%                       | 84.1%
3.5          | 27.8%              | 28.0%            | 77.3%                       | 77.0%
4.0          | 23.6%              | 24.1%            | 70.0%                       | 68.9%

Figure 12: Algorithm running time (s) of DistServe-Low and DistServe-High as the number of GPUs grows from 2 to 32.
compromised. In this case, techniques such as chunked-prefill with piggyback [3, 9] may be preferred, since they can fill each batch to the compute-bound threshold, thereby maintaining higher GPU utilization in every iteration.

Resource-constrained scenarios. Small-scale enterprises and individual researchers often lack the resources to deploy LLMs on large-scale clusters [45, 48]. In resource-constrained scenarios, such as environments with only a few or even a single GPU, the design space for DistServe is significantly limited. It struggles, or even fails, to adjust the parallel strategies and resource allocation to effectively enhance serving performance. In this case, simpler architectural choices like non-disaggregated systems [3, 32] may reduce deployment complexity and optimize operational efficiency.

Long-context scenarios. Nowadays, more and more GPT models support extremely long contexts, such as Claude-3 [11], Gemini-1.5 [22], and the Large World Model (LWM) [34], which all have a 1M context window. In such scenarios, the transmission overhead will increase as the size of the KV cache grows linearly with the prompt length. However, the prefill computation grows quadratically, so the relative duration of transmission compared to the prefill job decreases. Meanwhile, a longer context further exacerbates the disparity in computational demands between prefill and decoding jobs, leading to increased interference between them. Therefore, the disaggregation approach proposed in DistServe remains promising for long-context serving.

8 Related Work

Inference serving. There has been plenty of work on inference serving recently, ranging from general-purpose production-grade systems like TorchServe [7] and NVIDIA Triton [19] to systems optimized specifically for Transformer-based LLMs [9, 18, 21, 33, 50, 53, 54, 60]. Among them, Orca [54] introduces continuous batching to increase throughput. vLLM [32] proposes paged-attention for fine-grained KV cache management. SARATHI [9] suggests a chunked-prefill approach, splitting a prefill request into chunks and piggybacking decoding requests to improve hardware utilization. FastServe [53] implements iteration-level preemptive scheduling to mitigate the queuing delay caused by long jobs. However, they all employ a colocation approach for prefill and decoding processing, thus leading to severe interference. There are also concurrent works such as Splitwise [38], TetriInfer [27], and DéjàVu [49] which adopt a similar disaggregation idea to optimize LLM inference, further confirming the effectiveness of this method. Differently, DistServe places more emphasis on the goodput optimization scenario and takes a closer look at the aspect of network bandwidth.

Goodput-optimized systems. Optimizing goodput is a hot topic in DL applications. Pollux [39] improves scheduling performance in DL clusters by dynamically adjusting resources for jobs to increase cluster-wide goodput. Sia [29] introduces a heterogeneity-aware scheduling approach that can efficiently match cluster resources to elastic resource-adaptive jobs. Clockwork [23] and Shepherd [55] provide latency-aware scheduling and preemption to improve the serving goodput, but they only target traditional small models. AlpaServe [33] focuses on LLMs, employing model parallelism to statistically multiplex GPU execution and thus improve resource utilization. However, it only targets non-autoregressive generation. DistServe is the first work to optimize goodput for autoregressive LLM inference.

Resource disaggregation. Resource-disaggregated systems [17, 25, 43] decouple hardware resources from the traditional monolithic server infrastructure and separate them into resource pools to be managed independently. This allows for more flexible, efficient, and scalable deployment and increases resource utilization. Many applications benefit from a truly disaggregated data center with high-speed network bandwidth and heterogeneous hardware support [12, 30, 57]. DistServe shares this concept by disaggregating its system components, allowing for independent resource scaling and management.

Model parallelism for training. DistServe is orthogonal to the large body of work on model parallelism in training [28, 36, 40, 46, 59]. As described in §3.3, inference-serving workloads have unique characteristics not found in training settings. Where these systems do intersect with DistServe is in their methods for implementing model parallelism along various dimensions. DistServe can integrate new parallelism optimizations into its placement searching algorithm.

9 Conclusion

We present DistServe, a new LLM serving architecture that disaggregates the prefill and decoding computation. DistServe maximizes the per-GPU goodput – the maximum request rate that can be served adhering to the SLO attainment goal for each GPU provisioned – resulting in up to 7.4× lower cost per LLM query with guaranteed satisfaction of SLOs. Our findings affirm that as latency becomes an increasingly important metric for LLM services, prefill and decoding disaggregation is a vital strategy for delivering improved performance and service quality guarantees.

Acknowledgments. We sincerely thank our shepherd and the anonymous reviewers for their valuable feedback. This work was supported by the National Natural Science Foundation of China under grant numbers 62172008 and 62325201, and the National Natural Science Fund for the Excellent Young Scientists Fund Program (Overseas). Junda Chen is supported by a UCSD fellowship and Hao Zhang is supported by UCSD faculty startup funds. Xin Jin is the corresponding author. Yinmin Zhong, Xuanzhe Liu, and Xin Jin are also with the Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education.
References

[1] Introducing ChatGPT. https://openai.com/blog/chatgpt, 2022.

[2] Bard, an experiment by Google. https://bard.google.com/, 2023.

[3] DeepSpeed model implementations for inference (MII), 2023.

[4] Inflection tech memo. https://inflection.ai/assets/Inflection-1.pdf, 2023.

[5] LangChain use case: Summarization, 2023.

[6] NVIDIA collective communications library (NCCL), 2023.

[7] Serve, optimize and scale PyTorch models in production, 2023.

[8] ShareGPT teams. https://sharegpt.com/, 2023.

[9] Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. SARATHI: Efficient LLM inference by piggybacking decodes with chunked prefills. arXiv preprint arXiv:2308.16369, 2023.

[10] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints, 2023.

[11] Anthropic. Introducing the next generation of Claude. https://www.anthropic.com/news/claude-3-family, 2024.

[12] Andrew Audibert, Yang Chen, Dan Graur, Ana Klimovic, Jiri Simsa, and Chandramohan A. Thekkath. A case for disaggregation of ML data processing, 2022.

[13] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding, 2023.

[14] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code, 2021.

[15] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

[16] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, 2023.

[17] Compute Express Link Consortium. Compute Express Link, 2023. Accessed: 2023-12-07.

[18] NVIDIA Corporation. FasterTransformer, 2019.

[19] NVIDIA Corporation. Triton inference server: An optimized cloud and edge inferencing solution, 2019.

[20] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness, 2022.

[21] Jiarui Fang, Yang Yu, Chengduo Zhao, and Jie Zhou. TurboTransformers: An efficient GPU serving system for transformer models. In ACM PPoPP, 2021.

[22] Google. Our next-generation model: Gemini 1.5. https://deepmind.google/technologies/gemini/, 2024.

[23] Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. Serving DNNs like clockwork: Performance predictability from the bottom up. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 443–462. USENIX Association, November 2020.

[24] Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. Serving DNNs like clockwork: Performance predictability from the bottom up. In USENIX OSDI, 2020.

[25] Zhiyuan Guo, Zijian He, and Yiying Zhang. Mira: A program-behavior-guided far memory system. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP '23, pages 692–708, New York, NY, USA, 2023. Association for Computing Machinery.
[26] Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen. Microsecond-scale preemption for concurrent GPU-accelerated DNN inferences. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 539–558, Carlsbad, CA, July 2022. USENIX Association.

[27] Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, and Yizhou Shan. Inference without interference: Disaggregate LLM inference for mixed downstream workloads, 2024.

[28] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. GPipe: Efficient training of giant neural networks using pipeline parallelism, 2019.

[29] Suhas Jayaram Subramanya, Daiyaan Arfeen, Shouxu Lin, Aurick Qiao, Zhihao Jia, and Gregory R. Ganger. Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 642–657, 2023.

[30] Xin Jin, Zhihao Bai, Zhen Zhang, Yibo Zhu, Yinmin Zhong, and Xuanzhe Liu. DistMind: Efficient resource disaggregation for deep learning workloads. IEEE/ACM Transactions on Networking, pages 1–16, 2024.

[31] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.

[32] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention, 2023.

[33] Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, et al. AlpaServe: Statistical multiplexing with model parallelism for deep learning serving. arXiv, 2023.

[34] Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with RingAttention. arXiv, 2024.

[35] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. Ray: A distributed framework for emerging AI applications. In USENIX OSDI, 2018.

[36] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. PipeDream: Generalized pipeline parallelism for DNN training. In ACM SOSP, 2019.

[37] OpenAI. GPT-4 technical report, 2023.

[38] Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Aashaka Shah, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative LLM inference using phase splitting, 2023.

[39] Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R. Ganger, and Eric P. Xing. Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), pages 1–18. USENIX Association, July 2021.

[40] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models, 2020.

[41] Reuters, 2023.

[42] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.

[43] Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang. LegoOS: A disseminated, distributed OS for hardware resource disaggregation. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 69–87, 2018.

[44] Noam Shazeer. Fast transformer decoding: One write-head is all you need, 2019.

[45] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. FlexGen: High-throughput generative inference of large language models with a single GPU, 2023.

[46] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism, 2020.

[47] John F. Shortle, James M. Thompson, Donald Gross, and Carl M. Harris. Fundamentals of Queueing Theory, volume 399. John Wiley & Sons, 2018.
[48] Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. PowerInfer: Fast large language model serving with a consumer-grade GPU, 2023.

[49] Foteini Strati, Sara Mcallister, Amar Phanishayee, Jakub Tarnawski, and Ana Klimovic. DéjàVu: KV-cache streaming for fast, fault-tolerant generative LLM serving, 2024.

[50] Yiming Su, Chengcheng Wan, Utsav Sethi, Shan Lu, Madan Musuvathi, and Suman Nath. HotGPT: How to make software documentation more useful with a large language model? In Proceedings of the 19th Workshop on Hot Topics in Operating Systems, pages 87–93, 2023.

[51] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models, 2023.

[52] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neural Information Processing Systems, 2017.

[53] Bingyang Wu, Yinmin Zhong, Zili Zhang, Gang Huang, Xuanzhe Liu, and Xin Jin. Fast distributed inference serving for large language models. arXiv preprint arXiv:2305.05920, 2023.

[54] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-based generative models. In USENIX OSDI, 2022.

[55] Hong Zhang, Yupeng Tang, Anurag Khandelwal, and Ion Stoica. Shepherd: Serving DNNs in the wild, 2023.

[56] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open pre-trained transformer language models, 2022.

[57] Yiying Zhang. Make it real: An end-to-end implementation of a physically disaggregated data center. SIGOPS Oper. Syst. Rev., 57(1):1–9, June 2023.

[58] Kai Zhao, Sheng Di, Sihuan Li, Xin Liang, Yujia Zhai, Jieyang Chen, Kaiming Ouyang, Franck Cappello, and Zizhong Chen. FT-CNN: Algorithm-based fault tolerance for convolutional neural networks. IEEE Transactions on Parallel and Distributed Systems, 32(7):1677–1689, 2021.

[59] Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, and Ion Stoica. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In USENIX OSDI, 2022.

[60] Zhe Zhou, Xuechao Wei, Jiejing Zhang, and Guangyu Sun. PetS: A unified framework for parameter-efficient transformers serving. In USENIX ATC, 2022.
A Latency Model for LLM Inference

To accurately simulate the goodput of different placement strategies, we use an analytical model to predict the execution time of the prefill and decoding phases in LLM inference.

In modern LLM serving systems [18, 32, 53], memory-bound operations like Softmax and LayerNorm are usually fused with matrix multiplication kernels for efficiency. Thus the GEMMs dominate the overall latency, and our analysis primarily focuses on them.

A.1 Symbol Definition

Here are the symbols related to the architecture of the model:
• h: hidden size
• n: number of heads
• s: head size (h = n · s)
• m: FFN intermediate size

Note: If tensor parallelism is used, h, n, and m should be divided by the tensor parallelism size.

Below are the symbols that characterize the batch to be executed:
• B: batch size
• l_0, l_1, ..., l_{B−1}: input length of each request within the batch
• t: number of tokens in the batch (t = ∑_{i=0}^{B−1} l_i)
• t2: squared sum of the input lengths (t2 = ∑_{i=0}^{B−1} l_i²)
• b: block size in the attention kernel. This parameter is used in FlashAttention [20], a common kernel optimization technique adopted by current LLM serving systems.

A.2 Prefill Phase Latency Modeling

Since the attention operation uses specially optimized kernels, we first discuss the other four matrix multiplications in the prefill phase:

GEMM Name     Shape of M   Shape of N
QKV Linear    (t, h)       (h, 3h)
Attn Output   (t, h)       (h, h)
FFN Input     (t, h)       (h, m)
FFN Output    (t, m)       (m, h)

The arithmetic intensity (AI) of these operations is O(t). On an NVIDIA A100-80GB GPU, an operation is compute-bound when its AI exceeds 156. Since t can usually reach several hundred in real cases, all of these operations are compute-bound. Therefore, we can model their latency according to the total FLOPs:

T1 = C1 · (4th² + 2thm)

Next, we discuss the prefill attention operation with the FlashAttention [20] optimization. Since attention only operates among the tokens of the same request, current implementations launch attention kernels for each request in the same batch. For one attention head and a request with l tokens, the attention kernel performs a total of 2sl + 3sl · (l/b) ≈ 3sl · (l/b) memory reads and writes, alongside 2sl² + sl · (l/b) ≈ 2sl² FLOPs. So the AI is 2b/3 = 10.667 (when b = 16) or 21.333 (when b = 32), indicating that it is a memory-bound operation on the A100 GPU. Therefore, the whole attention layer latency (including all requests and all heads) can be modeled as:

T2 = C2 · n · ∑_{i=0}^{B−1} (3s · l_i² / b) = C2 · 3nst2/b = C2 · 3ht2/b

Overall, the latency of the prefill phase can be modeled as:

TPrefill = C1 · (4th² + 2thm) + C2 · 3ht2/b + C3

We use C3 to quantify other overheads like the Python runtime, system noise, and so on. We then use profiling and interpolation to determine the values of C1, C2, and C3.

A.3 Decoding Phase Latency Modeling

Similarly, we first focus on the following GEMMs in the decoding phase:

GEMM Name     Shape of M   Shape of N
QKV Linear    (B, h)       (h, 3h)
Attn Output   (B, h)       (h, h)
FFN Input     (B, h)       (h, m)
FFN Output    (B, m)       (m, h)

The AI of these operations is O(B). B is limited by the GPU memory size and by stringent latency requirements, so in existing serving scenarios these operations are memory-bound. The total number of memory reads and writes is 8Bh + 4h² + 2hm + 2Bm, and since h and m are usually significantly larger than B, we can model the latency as:

T3 = C4 · (4h² + 2hm)

As for the decoding attention operation, for one attention head and a request with l generated tokens, it performs 3sl memory reads and writes, alongside 2sl FLOPs. It is memory-bound, so we can model the latency of decoding attention as:

T4 = C5 · n · 3s · ∑_{i=0}^{B−1} l_i = C5 · 3ht

Summing up, the latency of the decoding phase is:

TDecoding = C4 · (4h² + 2hm) + C5 · 3ht

Here we do not introduce an overhead term (like C3 in the prefill phase) because 4h² + 2hm is already a constant, and the overhead can be folded into C4. Similarly, we use profiling and interpolation to determine the values of C4 and C5.
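For concreteness, the following is a minimal sketch of how this analytical model can be evaluated in code. The coefficients C1–C5 below are placeholders standing in for the values that DistServe obtains via profiling and interpolation on the target GPU, and the batch and model dimensions are made-up examples; only the formulas mirror the appendix.

```python
# Minimal sketch of the analytical latency model above. The coefficients C1..C5
# are placeholders for profiled values; the model dimensions and batch sizes are
# illustrative, not taken from the paper.

def prefill_latency(lengths, h, m, b, C1, C2, C3):
    """T_Prefill = C1*(4*t*h^2 + 2*t*h*m) + C2*(3*h*t2)/b + C3."""
    t = sum(lengths)                   # total tokens in the prefill batch
    t2 = sum(l * l for l in lengths)   # squared sum of the input lengths
    return C1 * (4 * t * h**2 + 2 * t * h * m) + C2 * (3 * h * t2) / b + C3

def decoding_latency(context_lengths, h, m, C4, C5):
    """T_Decoding = C4*(4*h^2 + 2*h*m) + C5*3*h*t, with t the summed context lengths."""
    t = sum(context_lengths)
    return C4 * (4 * h**2 + 2 * h * m) + C5 * 3 * h * t

# Assumed dimensions (already divided by the tensor-parallel degree) and
# FlashAttention block size.
h, m, b = 5120, 4 * 5120, 32

# Placeholder coefficients; real values come from profiling and interpolation.
C1, C2, C3, C4, C5 = 3e-15, 5e-13, 2e-4, 5e-13, 5e-13

print("prefill :", prefill_latency([128, 256, 512], h, m, b, C1, C2, C3))
print("decoding:", decoding_latency([640, 900, 1200, 300], h, m, C4, C5))
```

With profiled coefficients in place, such functions can be evaluated inside the placement search to estimate whether a candidate GPU allocation and parallelism configuration meets the TTFT and TPOT targets.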
[Figure 13: Chatbot application with OPT models on the ShareGPT dataset. Panels (a) OPT-13B, (b) OPT-66B, (c) OPT-175B; y-axis: SLO Attainment (%); x-axis: SLO Scale; systems compared: DistServe, DeepSpeed-MII, vLLM.]

[Figure 14: Code completion and summarization tasks with OPT-66B on the HumanEval and LongBench datasets, respectively. Panels (a) Code Completion, (b) Summarization; y-axis: SLO Attainment (%); x-axes: Per-GPU Rate (req/s) and SLO Scale.]
B DistServe Placements in End-to-end Experiments

Table 3 shows the tensor parallelism (TP) and pipeline parallelism (PP) configurations for prefill and decoding instances chosen by DistServe in the end-to-end experiments of §6.2.

Model      Dataset      Prefill (TP, PP)   Decoding (TP, PP)
OPT-13B    ShareGPT     (2, 1)             (1, 1)
OPT-66B    ShareGPT     (4, 1)             (2, 2)
OPT-66B    LongBench    (4, 1)             (2, 2)
OPT-66B    HumanEval    (4, 1)             (2, 2)
OPT-175B   ShareGPT     (3, 3)             (4, 3)

C End-to-end Results under 99% SLO Attainment

Figure 13 and Figure 14 show the end-to-end performance of DistServe and the baselines under the same setup as §6.2, except that the SLO attainment goal is changed to 99%. Under this more stringent goal, DistServe can still sustain a 3×–8× higher rate and 1.24×–6.67× more stringent SLOs than vLLM. Compared to DeepSpeed-MII, DistServe achieves a 1.32×–8× higher rate and 1.20×–1.58× more stringent SLOs.