
DiSCo: Device-Server Collaborative LLM-Based Text Streaming Services

Ting Sun, Penghan Wang, Fan Lai
University of Illinois Urbana-Champaign, United States
[email protected]

arXiv:2502.11417v1 [cs.LG] 17 Feb 2025

Abstract

The rapid rise of large language models (LLMs) in text streaming services has introduced significant cost and Quality of Experience (QoE) challenges in serving millions of daily requests, especially in meeting Time-To-First-Token (TTFT) and Time-Between-Token (TBT) requirements for real-time interactions. Our real-world measurements show that both server-based and on-device deployments struggle to meet diverse QoE demands: server deployments face high costs and last-hop issues (e.g., Internet latency and dynamics), while on-device LLM inference is constrained by resources.

We introduce DiSCo, a device-server cooperative scheduler designed to optimize users' QoE by adaptively routing requests and migrating response generation between endpoints while maintaining cost constraints. DiSCo employs cost-aware scheduling, leveraging the predictable speed of on-device LLM inference with the flexible capacity of server-based inference to dispatch requests on the fly, while introducing a token-level migration mechanism to ensure consistent token delivery during migration. Evaluations on real-world workloads—including commercial services like OpenAI GPT and DeepSeek, and open-source deployments such as LLaMA3—show that DiSCo can improve users' QoE by reducing tail TTFT (11-52%) and mean TTFT (6-78%) across different model-device configurations, while dramatically reducing serving costs by up to 84% through its migration mechanism while maintaining comparable QoE levels.

1 Introduction

Large language models (LLMs) have revolutionized various applications, with over 60% focusing on conversational interactions such as chatbots (Grand View Research, 2023). Meeting high serving demands requires scaling deployments across on-premise servers in the cloud and on-device inference, as seen in Apple Intelligence (Gunter et al., 2024) and Google's Gemini Nano (Google, 2024).

The Quality of Experience (QoE) for interactive applications is primarily evaluated by two critical metrics: Time-To-First-Token (TTFT) in the prefill stage, which quantifies the initial response latency, and Time-Between-Token (TBT) during the decode stage, which measures the consistency of token delivery speed (Databricks, 2023; Liu et al., 2024a,b).

On-server deployments lower serving costs by sharing infrastructure among many requests but often introduce unpredictable high latency due to request queueing delays (Agrawal et al., 2024), and the internet speed to end users fluctuates. While on-device deployment enables increasingly capable LLMs with sufficient accuracy, it suffers from slow processing speeds for long prompts and high energy consumption. For example, an iPhone running a 7B parameter LLM can operate for less than two hours on a full charge (Liu et al., 2024c).

This paper introduces a novel paradigm for cost-constrained device-server cooperative inference. We incorporate both server usage (e.g., monetary costs) and device energy costs via a dynamic exchange rate, which end users can adjust to balance response generation between the cloud and devices. As such, we can strategically distribute inference requests between endpoints and dynamically migrate ongoing token generation to maximize QoE. However, realizing this vision presents several fundamental challenges:

• Unified Cost Management: The total serving cost combines heterogeneous resource expenditures from both endpoints—monetary costs from server API usage and energy costs from device computation. The relative value of energy costs varies dynamically based on device context (e.g., battery level, charging status) and user preferences for server spending, making it challenging to establish a unified optimization.

• Runtime Uncertainty: The dynamic nature of networks (e.g., latency jitters) and serving loads make it challenging to accurately predict TTFT for in-flight request migration. Moreover, any scheduling mechanism must be lightweight to avoid introducing large overhead to the already latency-sensitive services.
• Migration Impact on Token Delivery: While dynamic migration between endpoints can reduce overall running costs, it risks disrupting TBT. The challenge lies in determining when and how to perform migration while minimizing the degradation of user experience and the increase in costs.

Figure 1: DiSCo acts as a middleware to optimize QoE by adaptively dispatching and migrating response generation between device and server endpoints under cost constraints.

As shown in Figure 1, we introduce DiSCo, a Device-Server Cooperative scheduler that addresses these challenges via two key innovations:

• Cost-Aware Dispatching Policies: We introduce two dispatching mechanisms targeting different cost constraints. For server cost constraints, we employ a length-threshold based dispatching that routes requests shorter than a dynamically computed threshold to devices. For device energy constraints, we implement a delay-based dispatching mechanism where devices wait for a computed interval before starting local inference. Both mechanisms adapt their thresholds based on unified cost measures that combine server monetary costs and device energy consumption.

• Token-Level Migration Framework: We enable seamless generation handoff between endpoints through a novel migration protocol that preserves token delivery consistency. Our framework employs delayed migration timing to minimize interruption, while a token buffer ensures smooth delivery during transitions. This design maintains user experience while saving resource costs across endpoints.

Through extensive evaluation using real-world traces from commercial LLM streaming API services (including GPT and DeepSeek) and on-device deployments, we demonstrate that DiSCo improves mean and tail TTFT by up to 50% without violation of TBT, significantly reducing the costs.

Overall, we make the following contributions:

• We characterize QoE challenges in device-server cooperative LLM inference through extensive real-world measurements.

• We design novel scheduling policies that optimize QoE under cost constraints.

• We develop a token-level migration framework to enable generation handoff between endpoints, preserving token delivery consistency.

• We demonstrate DiSCo's effectiveness in commercial services and open-source benchmarks.

2 Background and Motivation

2.1 LLM Token Mixture and Routing

Device-server collaborative approaches have evolved along two directions. First, systems like EdgeShard (Zhang et al., 2024) and WDMoE (Xue et al., 2024a) partition LLMs across multiple endpoints when a single device cannot host the entire model. LLMCad (Xu et al., 2023) uses on-device models to reduce server costs, while PerLLM (Yang et al., 2024) optimizes energy consumption across devices and servers under constraints. Second, routing-based approaches (Ong et al., 2024; Ding et al., 2024) balance cost and accuracy by directing simple requests to small models and complex queries to advanced ones. However, these approaches do not optimize token delivery metrics (TTFT and TBT) under cost constraints.

2.2 LLM-Based Text Streaming Applications

Over 60% of LLM-backed applications focus on streaming conversational interactions, such as chatbots, virtual assistants, and language translation. QoE in these text streaming services is often quantified by two critical metrics: time-to-first-token (TTFT) for initial responsiveness and time-between-tokens (TBT) for delivery smoothness throughout the entire interaction timeline.

Current LLM systems struggle to meet user expectations for these metrics, with TTFTs ranging from hundreds of milliseconds to over ten seconds—far exceeding the ideal latencies of tens of milliseconds for interactive applications (Mäki-Patola and Hämäläinen, 2004; Žádník et al., 2022).
Figure 2: On-device TTFT performance is more stable. (a) On-Server TTFTs. (b) On-Device TTFTs.

Token consumption patterns vary by output modality: in visual text scenarios, reading speeds differ across demographic groups, with the majority (52%) aged 25-44 reading at 4-5 tokens per second, while older groups generally read more slowly (Liu et al., 2024a; Brysbaert, 2019; Petrov et al., 2024). Audio output consumption shows more consistency, averaging 3-4 tokens per second across languages (Liu et al., 2024a; Parachuk, 2022; Barnard, Dom, 2022). Notably, conventional evaluation metrics like token generation throughput or average time-per-output-token provide incomplete insights, as they fail to capture the crucial relationship between token delivery timing and actual user consumption patterns.

2.3 Limitations of Existing Text Streaming Applications

Existing LLM serving relies on two deployment paradigms: on-server and on-device inference. With rapid hardware and software advancements, on-device LLMs have achieved accuracy levels sufficient for many applications, as evidenced by the integration of Apple Intelligence (Gunter et al., 2024) and Google's Gemini Nano (Google, 2024) into iOS and Android platforms, where they effectively handle text completion and message composition tasks. While on-device LLMs may still be inadequate for complex tasks (e.g., advanced mathematical reasoning), we focus on the growing category of applications where current on-device models already achieve satisfactory accuracy. For these applications, the challenge is not model capability, but rather the substantial monetary or energy cost demands of LLM inference.

Unfortunately, both serving paradigms face challenges. On-device inference, despite enabling faster generation owing to its dedicated resources (Song et al., 2023; Xue et al., 2024b), suffers from extended TTFT for long prompts due to limited processing speeds, and substantial energy consumption that scales linearly with response lengths (Li et al., 2024b). For instance, a fully charged iPhone running a 7B parameter LLM can operate for less than two hours (Liu et al., 2024c)—insufficient for day-long mobile use.

On the other hand, on-server deployments require request batching to amortize costs due to the high resource demands, but this introduces issues like queuing delays, resource contention from batching (Yu et al., 2022; Kwon et al., 2023; Agrawal et al., 2024), and last-hop network latency variations (Li et al., 2024a). Our measurements reveal that these factors can cause significant TTFT spikes for GPT-4-mini, from 0.3 seconds to several seconds during high-load periods.

Given these complementary limitations, we investigate the following research question: Can a cooperative paradigm be designed to combine on-server and on-device inference to improve QoE while managing both energy and monetary costs?

3 Characterizing LLM Inference

This section characterizes LLM inference performance in on-server and on-device paradigms, which informs our design.

We evaluate four commercial streaming LLM APIs: OpenAI's GPT-4o-mini (OpenAI, 2024), DeepSeek's DeepSeek-V2.5 (DeepSeek, 2024), Cohere's Command (Cohere, 2024), and Hyperbolic-hosted LLaMA-3-70b-Instruct (Hyperbolic, 2024). For on-device analysis, we deploy Qwen-2.5-7B-Instruct (Alibaba, 2024) and Llama-3.1-8B-Instruct (Grattafiori et al., 2024) on both server-grade (NVIDIA A40, 48GB) and consumer-grade (dual NVIDIA RTX 3080, denoted as 3080x2) GPUs. We sample 1,000 requests from the Alpaca dataset (Taori et al., 2023), following a Poisson distribution with a mean request arrival interval of 30 seconds.
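For reference, the arrival pattern above can be reproduced with a short NumPy sketch; the seed, placeholder prompts, and request count below are assumptions rather than the authors' exact harness.

```python
import numpy as np

# Minimal sketch (not the paper's harness): emit (arrival_time, prompt) pairs whose
# inter-arrival gaps are exponential, i.e., arrivals form a Poisson process with a
# mean interval of 30 seconds, matching the workload described above.
MEAN_INTERVAL_S = 30.0
NUM_REQUESTS = 1000

rng = np.random.default_rng(0)                       # assumed seed for reproducibility
gaps = rng.exponential(scale=MEAN_INTERVAL_S, size=NUM_REQUESTS)
arrival_times = np.cumsum(gaps)                      # absolute arrival timestamps (s)

prompts = [f"prompt-{i}" for i in range(NUM_REQUESTS)]  # placeholder for Alpaca samples
workload = list(zip(arrival_times, prompts))
print(workload[:3])
```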
TTFT characteristics. Our measurements reveal contrasting TTFT patterns between on-device and on-server inference. As shown in Figure 2, on-device inference exhibits stable TTFTs when processing identical prompts at 60-second intervals, primarily reflecting the prefill duration due to dedicated local hardware resources. In contrast, on-server inference experiences high variations and significant tail latency, attributed to network delays, request queuing, and resource contention.

We summarize the TTFT performance of 1,000 requests in Table 1. We observe that on-device TTFT scales linearly with prompt length due to hardware constraints (Li et al., 2024b), while on-server TTFT shows minimal prompt-length sensitivity through advanced resource scaling (Zhong et al., 2024; Patel et al., 2024; Hu et al., 2024).

Table 1: Pearson coefficient between prompt length and TTFT in on-server deployment is weak.
Model | Deployment | Pearson Coef.
Command | Server | 0.0142
GPT-4o-mini | Server | 0.0236
DeepSeek-V2.5 | Server | -0.0273
LLaMA-3-70b-Instruct | Server | 0.0402
LLaMA-3.1-8b-Instruct | Device | 0.8424

TBT characteristics. TBT characterizes the I/O-bound decode stage latency. Analysis of temporal samples and distributions across six setups (Figure 3) reveals higher TBT variability in on-server inference compared to on-device execution. More importantly, both deployment approaches achieve generation speeds exceeding user consumption rates (§2.2), making cooperative serving practical.

[1] On-server inference, such as in GPT, streams tokens with each packet containing multiple tokens, resulting in near-zero perceived TBTs.

Figure 3: On-device TBT performance is more stable. (Left panel: TBT Time Series Analysis; right panel: TBT Distribution Analysis.)

Opportunities and challenges. Our studies further reveal that as on-device models continue to improve—often fine-tuned for specific tasks (Gunter et al., 2024; Liu et al., 2024c)—their performance increasingly matches that of on-server models in popular applications like instruction-following and translation (detailed in §5 and Appendix D). However, deploying these models on-device introduces challenges such as long prefilling latency and startup overhead.

On the other hand, our real-world studies of conversational workloads highlight key opportunities: (i) on-server TTFT is largely unpredictable and shows minimal correlation with prompt length, whereas on-device TTFT scales nearly linearly with prompt length and is highly predictable; and (ii) both paradigms achieve token generation speeds that exceed typical user consumption rates.

Taking these findings together—particularly the predictable performance of on-device inference and the elastic scaling capabilities of server-based inference—we observe opportunities for optimization in cost-constrained device-server cooperative serving. Dynamic request migration between server and device endpoints during response generation can yield significant cost savings.

4 DiSCo Policies

DiSCo optimizes both QoE and cost through (1) dispatch control that determines where to initiate token generation, and (2) migration control that enables dynamic handoff during generation. The dispatch controller optimizes TTFT by strategically routing requests, while the migration controller maintains consistent TBT while reducing costs.

4.1 Problem Formulation

We propose a unified cost model combining both monetary bills from on-server inference and energy bills from on-device inference. Let c_s^p and c_s^d denote the per-token monetary costs for server prefill and decode phases respectively, while c_d^p and c_d^d represent the per-token energy costs for device prefill and decode phases. Converting between energy and monetary costs is done by a dynamic exchange rate λ, adjusted by users to reflect their preferences. We offer a user-friendly tunable budget ratio b ∈ [0, 1], representing the additional cost allowance beyond baseline costs. Our optimization objectives focus on: (1) minimizing both mean and tail TTFT, and (2) maintaining consistent token delivery at a specified pace (i.e., stable TBT).
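To make the unified cost model concrete, the following sketch combines server monetary costs and device energy costs under an exchange rate λ; the class name and all numeric values are illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass

# Illustrative sketch of the unified cost model above; all numbers are made up.
# Server costs are monetary ($/token); device costs are energy (J/token) and are
# converted into the same unit via the user-tunable exchange rate lambda ($/J).
@dataclass
class CostModel:
    c_s_prefill: float    # server prefill cost per token ($)
    c_s_decode: float     # server decode cost per token ($)
    c_d_prefill: float    # device prefill energy per token (J)
    c_d_decode: float     # device decode energy per token (J)
    exchange_rate: float  # lambda: dollars per joule, set by the user

    def server_cost(self, prompt_len: int, output_len: int) -> float:
        return self.c_s_prefill * prompt_len + self.c_s_decode * output_len

    def device_cost(self, prompt_len: int, output_len: int) -> float:
        energy = self.c_d_prefill * prompt_len + self.c_d_decode * output_len
        return self.exchange_rate * energy  # unified into monetary units

model = CostModel(c_s_prefill=1.5e-7, c_s_decode=6e-7,
                  c_d_prefill=0.3, c_d_decode=0.5, exchange_rate=1e-6)
print(model.server_cost(512, 256), model.device_cost(512, 256))
```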
4.2 Dispatch Controller: Cost-Aware Request Routing

Based on our analysis in §3, server-side TTFT shows weak correlation with prompt length due to various factors (network delay, request queueing, etc.). We model server TTFTs as a known distribution, obtained either from server-provided information or device-side profiling. In contrast, device-side TTFT exhibits a linear relationship with prompt length, with the coefficient determined through offline profiling.
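The offline profiling step for the device-side linear model can be as simple as a least-squares fit; the sketch below uses synthetic (prompt length, TTFT) measurements and an assumed helper name predict_device_ttft.

```python
import numpy as np

# Offline-profiling sketch: fit the linear device TTFT model T_d(l) = k*l + c from
# (prompt length, measured TTFT) pairs. The measurements here are synthetic stand-ins.
rng = np.random.default_rng(0)
lengths = rng.integers(16, 2048, size=200)                     # profiled prompt lengths (tokens)
ttfts = 0.002 * lengths + 0.15 + rng.normal(0, 0.01, 200)      # assumed device TTFTs (s)

k, c = np.polyfit(lengths, ttfts, deg=1)                       # least-squares slope and intercept

def predict_device_ttft(prompt_len: int) -> float:
    return k * prompt_len + c

print(f"k={k:.5f} s/token, c={c:.3f} s, TTFT(512 tokens) ~= {predict_device_ttft(512):.3f} s")
```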
Our key insight is that the optimization problem naturally decomposes into two scenarios based on dominant cost factors: device-constrained scenarios where energy consumption is the primary bottleneck, and server-constrained scenarios where API monetary costs dominate. This decomposition enables efficient solutions. Pseudocode for the dispatch controller is in Appendix F.

Device-Constrained Optimization. When device costs dominate (min(c_d^p, c_d^d) > max(c_s^p, c_s^d)), we need to carefully manage device resource usage under a budget constraint E[I_d(l) · l] ≤ b · E[l], where l is the prompt length and I_d(l) indicates device execution. The key challenge is balancing between two goals: leveraging device execution to bound worst-case latency while conserving energy on shorter prompts where possible.

Our solution uses a wait-time strategy: for each prompt of length l, we first try server execution and wait for time w(l) before potentially starting device execution. This conserves device energy when the server responds quickly. We determine the optimal wait time through a two-phase approach:

• Phase 1 (Tail Protection): We reserve budget portion α for worst-case scenarios by setting a maximum wait time w_tail = F^{-1}(1 − min(α, b)), where F(·) is the server TTFT distribution. This ensures we have device resources ready when server latency exceeds its (1 − min(α, b))-th percentile.

• Phase 2 (Average Case): With the remaining budget (b − α), we set length-dependent wait times (see the sketch after this list):

    w(l) = \begin{cases} 0 & \text{if } l \le l_{th} \\ \min(\beta l, w_{tail}) & \text{otherwise} \end{cases}    (1)

where l_th is a threshold below which we start device execution immediately, and β is chosen to satisfy:

    \int_{l_{th}}^{\infty} (1 - F(\beta l)) \cdot c_d^p \cdot l \cdot p(l)\, dl = (b - \alpha) \cdot E[l]    (2)

This design guarantees worst-case TTFT through w_tail while optimizing average performance by adaptively adjusting wait times based on prompt length. Whichever endpoint (server or device) generates the first token continues to the decode phase, while the other terminates.
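A Monte-Carlo reading of the two-phase policy is sketched below: w_tail comes from the empirical server-TTFT quantile (Phase 1), and β is found by bisection so that an empirical version of Eq. (2) meets the remaining budget (Phase 2). The trace, budget values, threshold, and cost constant are all assumed, and the numerical treatment of Eq. (2) is one possible interpretation rather than the authors' implementation.

```python
import numpy as np

# Monte-Carlo sketch of the two-phase wait-time policy (Eqs. 1-2) over synthetic profiles.
rng = np.random.default_rng(0)
server_ttfts = rng.lognormal(mean=-1.0, sigma=0.8, size=5000)   # assumed server TTFT samples (s)
prompt_lens = rng.lognormal(mean=5.0, sigma=0.7, size=5000)     # assumed prompt lengths (tokens)
b, alpha = 0.3, 0.1                                             # budget ratio and tail reserve
l_th = 100.0                                                    # assumed dispatch threshold
c_d_p = 1.0                                                     # device prefill cost per token (relative)

F = lambda t: np.mean(server_ttfts <= t)                        # empirical server TTFT CDF
w_tail = np.quantile(server_ttfts, 1.0 - min(alpha, b))         # Phase 1: w_tail = F^-1(1 - min(alpha, b))

def phase2_lhs(beta: float) -> float:
    # Empirical estimate of the left-hand side of Eq. (2).
    mask = prompt_lens > l_th
    vals = (1.0 - np.vectorize(F)(beta * prompt_lens[mask])) * c_d_p * prompt_lens[mask]
    return vals.sum() / len(prompt_lens)

target = (b - alpha) * prompt_lens.mean()
lo, hi = 0.0, 1.0
for _ in range(60):                                             # bisection: LHS decreases in beta
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if phase2_lhs(mid) > target else (lo, mid)
beta = (lo + hi) / 2

def wait_time(l: float) -> float:
    # Eq. (1): start device inference immediately for short prompts, otherwise wait.
    return 0.0 if l <= l_th else min(beta * l, w_tail)

print(f"w_tail={w_tail:.3f}s, beta={beta:.5f}, w(500)={wait_time(500):.3f}s")
```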
Server-Constrained Optimization. When server costs dominate (max(c_s^p, c_s^d) > min(c_d^p, c_d^d)), we need to carefully manage server resource usage under a budget constraint E[I_s(l) · l] ≤ b · E[l], where I_s(l) indicates server execution. Our analysis in §3 shows that device TTFT scales linearly with prompt length as T_d(l) = kl + c, while server TTFT has minimal length correlation. This suggests a length-based routing strategy: short prompts run on device to conserve server budget, while long prompts use both endpoints to minimize TTFT.

We determine the length threshold l_th by:

    \int_{0}^{l_{th}} l \cdot p(l)\, dl = (1 - b) \cdot E[l]    (3)

This ensures prompts shorter than l_th consume exactly (1 − b) fraction of total expected tokens through device-only execution, leaving the remaining longer prompts with sufficient server budget for concurrent execution on both endpoints.
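Eq. (3) has a direct empirical counterpart: sort the profiled prompt lengths and pick the smallest threshold whose cumulative token mass reaches (1 − b) of the total. The sketch below assumes a synthetic length distribution and a helper named length_threshold.

```python
import numpy as np

# Empirical sketch of Eq. (3): pick l_th so that prompts shorter than l_th account
# for a (1 - b) fraction of the total expected prompt tokens.
rng = np.random.default_rng(0)
prompt_lens = rng.lognormal(mean=5.0, sigma=0.7, size=10_000)  # assumed prompt lengths (tokens)
b = 0.3                                                         # server budget ratio

def length_threshold(lengths: np.ndarray, budget_ratio: float) -> float:
    lengths = np.sort(lengths)
    token_mass = np.cumsum(lengths) / lengths.sum()             # fraction of tokens below each length
    idx = np.searchsorted(token_mass, 1.0 - budget_ratio)
    return float(lengths[min(idx, len(lengths) - 1)])

l_th = length_threshold(prompt_lens, b)
print(f"l_th = {l_th:.1f} tokens")
# Requests with l <= l_th run on-device only; longer requests may use both endpoints.
```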
4.3 Migration Controller: Cost-Efficient Token Delivery

When both endpoints process a request, the constrained endpoint may win the prefill phase but incur higher decode costs. In such cases, we can migrate token generation to the other endpoint to reduce total cost while maintaining quality.

Efficient Token Transfer. When endpoints share the same vocabulary, we transmit token IDs rather than complete token representations. Additionally, we avoid transferring intermediate states (e.g., attention key-value cache) for two practical reasons: (1) endpoints often employ different model architectures optimized for their respective hardware, making state transfer incompatible, and (2) intermediate state transfer would incur significant network overhead. Migration triggers when projected cost savings exceed overhead:

    C_{migration} = \Delta c_{decode}^d \times l_{remaining}    (4)

where \Delta c_{decode}^d = |c_s^d − c_d^d| and l_remaining denote the per-token decode cost difference between endpoints, and the expected remaining sequence length, respectively.
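The trigger condition around Eq. (4) can be phrased as a small predicate; the sketch below also folds in the buffer precondition of the Buffer-Based Migration Protocol introduced next (B = r_c · t_m). All rates, costs, and the fixed overhead term are assumed values, and the overhead comparison is one plausible way to operationalize "savings exceed overhead".

```python
import math

# Minimal sketch of the migration decision around Eq. (4), in the unified cost units
# of Section 4.1; all numbers are illustrative.
def should_migrate(c_decode_src: float,        # per-token decode cost on the current endpoint
                   c_decode_dst: float,        # per-token decode cost on the other endpoint
                   l_remaining: int,           # expected remaining output tokens
                   migration_overhead: float,  # assumed fixed cost of the handoff itself
                   buffered_tokens: int,       # tokens generated but not yet consumed by the user
                   r_c: float = 4.0,           # user consumption rate (tokens/s)
                   t_m: float = 1.5            # estimated migration overhead time (s)
                   ) -> bool:
    projected_savings = abs(c_decode_src - c_decode_dst) * l_remaining   # Eq. (4)
    buffer_ready = buffered_tokens >= math.ceil(r_c * t_m)               # Eq. (5) precondition
    return c_decode_dst < c_decode_src and projected_savings > migration_overhead and buffer_ready

print(should_migrate(c_decode_src=6e-7, c_decode_dst=2e-7, l_remaining=800,
                     migration_overhead=1e-4, buffered_tokens=8))
```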
Figure 4: Token generation migration between endpoints. Row A shows the original sequence on the source endpoint (prompt, tokens generated before migration, remaining tokens), while Row B shows the sequence after migration to the target endpoint (prompt, tokens generated after migration), maintaining consistent token delivery while reducing cost.

Buffer-Based Migration Protocol. To ensure smooth token delivery during migration, we introduce a token buffer that leverages the natural gap between token generation speed (r_g tokens/s) and human consumption rate (r_c tokens/s, typically r_g > r_c). The buffer size is set to:

    B = r_c \times t_m    (5)

where t_m is the estimated migration overhead time. Migration begins only when the buffer contains enough tokens (B) to mask the migration latency.

As shown in Figure 4, this design enables seamless handoff: the source endpoint (Row A) continues generation until the target endpoint (Row B) is ready, ensuring uninterrupted token delivery to users despite the underlying endpoint transition.

5 Evaluation

Through extensive experimentation with four production-grade LLM services and state-of-the-art open-source models, we demonstrate DiSCo's exceptional performance. Our rigorous evaluation spanning diverse deployment scenarios reveals that DiSCo delivers remarkable improvements - reducing both mean TTFT (6-78%) and tail TTFT (11-52%) while achieving cost savings of up to 83.6% compared to existing approaches, all while preserving consistent token generation throughput.

5.1 Evaluation Setup

Testbeds and Workloads. Our testbed is a server with 4 NVIDIA A40 GPUs, each featuring 48GB memory. We evaluate DiSCo using both commercial LLM traces and on-device deployments. For server-side evaluation, we collect traces from four production services: OpenAI's GPT-4o-mini (OpenAI, 2024), DeepSeek-V2.5 (DeepSeek, 2024), Cohere's Command (Cohere, 2024), and Hyperbolic-hosted LLaMA-3-70b-Instruct (Hyperbolic, 2024). For on-device evaluation, we test three representative device-model configurations (Li et al., 2024b): Pixel 7 Pro with Bloom 1.1B (31.32/13.93 tokens/s for prefill/decode), Pixel 7 Pro with Bloom 560M (51.80/20.14 tokens/s), and Xiaomi 14 with Qwen 1.5 0.5B (79.90/21.47 tokens/s). These configurations span different compute-capability trade-offs in mobile environments. For end-to-end cost comparison, we quantify server costs using commercial API token pricing and device costs using FLOPs-based energy consumption. The detailed cost analysis can be found in Appendix E.

Figure 5: Mean TTFT reduction of DiSCo remains significant on DiffusionDB trace.

Baselines. We compare DiSCo with four on-server, on-device, and cooperative deployments:

• vLLM (Kwon et al., 2023): Processes all requests using remote server-based deployment.

• llama.cpp (Gerganov, 2024): Processes all requests using local device-based deployment.

• Stoch-S: A server-constrained approach that randomly routes requests to the device while capping the server budget.

• Stoch-D: A device-constrained approach that randomly routes requests to the server while capping the device budget.

For end-to-end cost comparison, we include two additional baselines: DiSCo-D w/o Migration and DiSCo-S w/o Migration.

Metrics. We evaluate the system performance using both TTFT and TBT, including their mean and tail values. They are analyzed across varying cost budgets, defined as the ratio of input tokens processed by the constrained endpoint (device or server) to the total input tokens. For each experiment, we report the mean value over 10 runs.

5.2 End-to-end Performance

DiSCo improves TTFT performance. Figure 6 and Table 2 show that DiSCo significantly outperforms baseline methods in both device- and server-constrained settings, showing improvements across mean and tail (P99) TTFT metrics for various services, including GPT, LLaMA, DeepSeek, and Command. In the GPT experiments, DiSCo demonstrates particularly notable tail latency reductions, decreasing P99 TTFT by up to 40% relative to stochastic dispatching across all device configurations, while mean TTFT is also reduced substantially, with reductions between 20-30% across diverse budget ratios. In the LLaMA setup, we observe a unique trade-off pattern. For budget ratios below 20% when the device is the constrained endpoint, DiSCo exhibits a slightly higher mean TTFT than the baseline. This outcome is intentional, as DiSCo prioritizes tail latency reduction in low-budget scenarios, yielding substantial gains in P99 TTFT—reducing tail latency by up to 50%. This prioritization enables more responsive performance under restrictive budget conditions.

DeepSeek and Command experiments demonstrate similar patterns of improvement as the previous two traces, with DiSCo consistently outperforming baseline approaches. In the DeepSeek scenario, DiSCo maintains stable latency even as the budget ratio increases, whereas the baseline systems show increasing latency variance.

Table 2: Average reduction of tail TTFT compared to stochastic dispatching across the whole cost budget range. Devices include Pixel 7Pro and Xiaomi 14, while models include Bloom-1.1B, Bloom-560M, and Qwen-1.5-0.5B. (*Tail TTFT remains constant.)
Platform | Constraint | Pixel 7Pro (B-1.1B) | Pixel 7Pro (B-560M) | Xiaomi 14 (Q-0.5B)
GPT | Server | 23.85% | 37.41% | 44.04%
GPT | Device | 26.39% | 21.48% | 16.32%
LLaMA | Server | 11.08% | 23.09% | 26.29%
LLaMA | Device | 35.67% | 29.30% | 21.29%
DeepSeek | Server | 0.00%* | 3.88% | 15.53%
DeepSeek | Device | 30.91% | 28.01% | 25.08%
Command | Server | 47.93% | 50.93% | 52.23%
Command | Device | 34.78% | 31.53% | 24.42%

DiSCo retains TBT performance while lowering the cost. Table 3 evaluates DiSCo's TBT performance across various traces under both server and device constraints. For requests involving migration, we measure two key metrics: the average number of migrations per request and the tail (P99) TBT latency. Results show that while migrations delay only a negligible number of tokens compared to typical generation lengths of hundreds or thousands of tokens, they do not impact the perceived token delivery smoothness, demonstrating DiSCo's ability to maintain consistent streaming performance even during endpoint transitions.

Table 3: Performance metrics for different models under server and device constraints, showing the number of delayed tokens during migration and TBT (Time Between Tokens) P99 statistics. The average is computed over the requests that have performed the migration.
Trace | Constraint | Mean delay_num | P99 delay_num | TBT P99
GPT | Server | 4.21 | 9.40 | 0.209
GPT | Device | 6.59 | 6.59 | 0.217
LLaMA | Server | 5.53 | 11.00 | 0.209
LLaMA | Device | 10.01 | 10.01 | 0.217
DeepSeek | Server | 8.13 | 11.00 | 0.209
DeepSeek | Device | 17.17 | 17.17 | 0.217
Command | Server | 3.25 | 8.00 | 0.209
Command | Device | 8.54 | 8.54 | 0.217

As shown in Figure 7, our token-level migration mechanism substantially reduces the end-to-end cost across all evaluated scenarios. For device-constrained cases (DiSCo-D), the migration mechanism achieves up to 72.7% cost reduction compared to the non-migration baseline, with the improvement being most significant at higher budget ratios. Similarly, in server-constrained scenarios (DiSCo-S), the cost reduction reaches 83.6%, particularly evident in DeepSeek and Command model deployments. These significant cost reductions are consistently observed across device-model pairs.

5.3 Performance Breakdown and Ablation Study

Impact of Prompt Sending Interval. To evaluate our system under realistic workload patterns, we conduct experiments using stratified sampling based on request frequency from DiffusionDB (Wang et al., 2022). Specifically, we select traces from ten users across different activity levels to capture diverse interaction patterns. We pair these real-world request intervals with prompts randomly drawn from the Alpaca dataset (Taori et al., 2023). The results shown in Figure 5 demonstrate that DiSCo's performance advantages persist across varying user activity patterns.

Quality of Generated Responses. We conduct comprehensive experiments using instruction-following tasks on multiple model configurations. We employ three LLM-based judges (GPT-4o, Gemini1.5-pro, and QWen2.5-72b) to assess response quality, and examine two representative migration scenarios: from a smaller to larger model (3B-7B) and vice versa (7B-3B). Figure 8 shows that DiSCo maintains quality scores across different sequence lengths, migration patterns, and judges. Detailed results are presented in Appendix D.
Figure 6: Mean TTFT tested using four traces. DiSCo achieves superior TTFT performance than the baselines.

Figure 7: The migration mechanism in DiSCo achieves superior end-to-end cost.

Figure 8: Response quality evaluation. Each subplot represents a distinct model pair configuration (e.g., 3B-7B indicates migration from a 3B to a 7B model). The x-axis shows the maximum sequence length processed by the first endpoint before migration, while the y-axis shows the quality scores assigned by different LLM judges. Results demonstrate consistent quality preservation across various migration scenarios.

Figure 9: DiSCo's overhead is trivial and can scale well. (a) DiSCo-D. (b) DiSCo-S.

Scalability Analysis. We conducted comprehensive performance evaluations of DiSCo-D and DiSCo-S on a MacBook Pro with M1 processor, using both synthetic datasets and a real-world GPT trace of 1,000 records, across target frequencies from 0 to 1. To generate synthetic data that accurately reflects real-world scenarios, we fitted log-normal distributions to the prompt lengths and TTFT from the real trace by following the mean and standard deviation of the logarithm. As shown in Figure 9, for DiSCo-S, the execution time showed remarkable efficiency: 0.128 ms for the real trace with 1K samples, scaling to just 0.969 ms and 9.082 ms for synthetic datasets of 10K and 100K samples respectively. DiSCo-D, while being more computationally intensive, still maintained practical performance levels: 0.486 ms, 1.741 ms, and 14.856 ms for 1K, 10K, and 100K samples, respectively.
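The log-normal fitting step admits a compact sketch: match the mean and standard deviation of the logarithm of the real trace, then sample larger synthetic datasets. The placeholder trace and seed below are assumptions.

```python
import numpy as np

# Sketch of the synthetic-trace generation described above: fit a log-normal to the
# real trace via the mean/std of its logarithm, then draw larger synthetic datasets.
# `real_prompt_lens` stands in for the 1K-record GPT trace.
rng = np.random.default_rng(0)
real_prompt_lens = rng.lognormal(mean=5.0, sigma=0.7, size=1000)   # placeholder real trace

log_vals = np.log(real_prompt_lens)
mu, sigma = log_vals.mean(), log_vals.std()                         # fitted log-normal parameters

synthetic_10k = rng.lognormal(mean=mu, sigma=sigma, size=10_000)
synthetic_100k = rng.lognormal(mean=mu, sigma=sigma, size=100_000)
print(mu, sigma, synthetic_10k.mean(), synthetic_100k.mean())
```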
6 Conclusions

This paper introduces DiSCo, a device-server cooperative scheduler that addresses QoE and cost challenges in LLM serving for real-time conversational applications for end users. DiSCo uses cost-aware scheduling and token-level migration to dynamically optimize Time-To-First-Token (TTFT) and Time-Between-Token (TBT) across device and server endpoints. Our evaluations on real-world traces from platforms like GPT and DeepSeek show that DiSCo significantly improves both TTFT and TBT while reducing costs.
7 Limitations

While DiSCo demonstrates significant improvements in LLM serving efficiency, we acknowledge several important limitations of our current work:

Model Coverage. We focus on scenarios where on-device LLMs achieve sufficient accuracy for target applications. While this covers many common use cases, DiSCo may not be suitable for applications requiring complex reasoning.

Energy Modeling. For device energy consumption, we use a linear energy model based on FLOPs. Real-world device energy consumption patterns can be more complex, varying with factors such as battery state, temperature, and concurrent workloads.

Scalability Considerations. Our current implementation and evaluation focus on single-device scenarios. Extending DiSCo to handle multi-device collaborative serving presents additional challenges in terms of coordination overhead and resource allocation that warrant further investigation.

8 Ethical Considerations

Our work focuses solely on optimizing the efficiency of LLM serving systems through device-server collaboration and does not introduce new language generation capabilities or content. All experiments were conducted using publicly available models and datasets. While our work may indirectly benefit the accessibility of LLM services by reducing costs and improving performance, we acknowledge that broader ethical considerations around LLM deployment and usage are important but outside the scope of this technical contribution.
References

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming throughput-latency tradeoff in llm inference with sarathi-serve. In Proceedings of 18th USENIX Symposium on Operating Systems Design and Implementation, 2024, Santa Clara.

Alibaba. 2024. Qwen2.5 family. Accessed on 21 Sep 2024.

Barnard, Dom. 2022. Average speaking rate and words per minute. Accessed on 21 Sep 2024.

Marc Brysbaert. 2019. How many words do we read per minute? A review and meta-analysis of reading rate. Journal of Memory and Language.

Lingjiao Chen, Matei Zaharia, and James Zou. 2023. Frugalgpt: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176.

Cohere. 2024. Command model. Accessed on 21 Sep 2024.

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359.

Databricks. 2023. LLM inference performance engineering: Best practices. https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices. [Online; accessed 27-September-2024].

DeepSeek. 2024. Deepseek-v2.5 model. Accessed on 21 Sep 2024.

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. Gpt3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35:30318–30332.

Ali Diba, Vivek Sharma, Ali Pazandeh, Hamed Pirsiavash, and Luc Van Gool. 2017. Weakly supervised cascaded convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 914–922.

Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V. S. Lakshmanan, and Ahmed Hassan Awadallah. 2024. Hybrid llm: Cost-efficient and quality-aware query routing.

Georgi Gerganov. 2024. llama.cpp. Accessed on 21 Sep 2024.

Google. 2024. Gemini nano. Accessed on 21 Sep 2024.

Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2022. The flores-101 evaluation benchmark for low-resource and multilingual machine translation. Transactions of the Association for Computational Linguistics, 10:522–538.

Grand View Research. 2023. Large language model (llm) market size, share & trends analysis report by component, by application, by enterprise size, by end-use, by region, and segment forecasts, 2023–2030. Grand View Research.

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. 2024. The llama 3 herd of models.

Tom Gunter, Zirui Wang, Chong Wang, Ruoming Pang, Andy Narayanan, Aonan Zhang, Bowen Zhang, Chen Chen, Chung-Cheng Chiu, David Qiu, et al. 2024. Apple intelligence foundation language models. arXiv preprint arXiv:2407.21075.

Ashit Gupta, Anirudh Deodhar, Tathagata Mukherjee, and Venkataramana Runkana. 2022. Semi-supervised cascaded clustering for classification of noisy label data. arXiv preprint arXiv:2205.02209.

Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, et al. 2024. Memserve: Context caching for disaggregated llm serving with elastic memory pool. arXiv preprint arXiv:2406.17565.

Hyperbolic. 2024. Llama3-70b-instruct by Hyperbolic. Accessed on 21 Sep 2024.

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.

Hanchen Li, Yuhan Liu, Yihua Cheng, Siddhant Ray, Kuntai Du, and Junchen Jiang. 2024a. Eloquent: A more robust transmission scheme for llm token streaming. In NAIC.

Xiang Li, Zhenyan Lu, Dongqi Cai, Xiao Ma, and Mengwei Xu. 2024b. Large language models on mobile devices: Measurements, analysis, and insights. In Proceedings of the Workshop on Edge and Mobile Foundation Models, EdgeFM '24.

Bin Lin, Tao Peng, Chen Zhang, Minmin Sun, Lanbo Li, Hanyu Zhao, Wencong Xiao, Qi Xu, Xiafei Qiu, Shen Li, et al. 2024a. Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache. arXiv preprint arXiv:2401.02669.

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024b. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems, 6:87–100.

Hao Liu, Matei Zaharia, and Pieter Abbeel. 2023. Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889.

Jiachen Liu, Zhiyu Wu, Jae-Won Chung, Fan Lai, Myungjin Lee, and Mosharaf Chowdhury. 2024a. Andes: Defining and enhancing quality-of-experience in llm-based text streaming services. arXiv preprint arXiv:2404.16283.

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, et al. 2024b. Cachegen: Kv cache compression and streaming for fast large language model serving. In Proceedings of the ACM SIGCOMM 2024 Conference, pages 38–56.

Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, et al. 2024c. Mobilellm: Optimizing sub-billion parameter language models for on-device use cases. arXiv preprint arXiv:2402.14905.

Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. 2024d. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems, 36.

Teemu Mäki-Patola and Perttu Hämäläinen. 2004. Latency tolerance for gesture controlled continuous sound instrument without tactile feedback. In International Computer Music Conference. Citeseer.

MLC team. 2023. MLC-LLM.

Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, and Ion Stoica. 2024. Routellm: Learning to route llms with preference data.

OpenAI. 2024. Gpt-4o mini model. Accessed on 21 Sep 2024.

Tara Parachuk. 2022. Speaking rates comparison table. Accessed on 21 Sep 2024.

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Inigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient generative llm inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA).

Aleksandar Petrov, Emanuele La Malfa, Philip H.S. Torr, and Adel Bibi. 2024. Language model tokenizers introduce unfairness between languages. In Proceedings of the 37th International Conference on Neural Information Processing Systems.

Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2023. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5:606–624.

Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2024. Mooncake: Kimi's kvcache-centric architecture for llm serving. arXiv preprint arXiv:2407.00079.

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053.

Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. 2023. Powerinfer: Fast large language model serving with a consumer-grade gpu. arXiv preprint arXiv:2312.12456.

Ting Sun, Penghan Wang, and Fan Lai. 2025. Hygen: Efficient llm serving via elastic online-offline request co-location.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.

NLLB Team, Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, et al. 2022. No language left behind: Scaling human-centered machine translation. https://arxiv.org/abs/2207.04672.

Jakub Žádník, Markku Mäkitalo, Jarno Vanne, and Pekka Jääskeläinen. 2022. Image and video coding techniques for ultra-low latency. ACM Computing Surveys.

Zijie J. Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, and Duen Horng Chau. 2022. DiffusionDB: A large-scale prompt gallery dataset for text-to-image generative models. arXiv:2210.14896 [cs].

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR.

Daliang Xu, Wangsong Yin, Xin Jin, Ying Zhang, Shiyun Wei, Mengwei Xu, and Xuanzhe Liu. 2023. Llmcad: Fast and scalable on-device large language model inference. arXiv preprint arXiv:2309.04255.

Nan Xue, Yaping Sun, Zhiyong Chen, Meixia Tao, Xiaodong Xu, Liang Qian, Shuguang Cui, and Ping Zhang. 2024a. Wdmoe: Wireless distributed large language models with mixture of experts. arXiv preprint arXiv:2405.03131.

Zhenliang Xue, Yixin Song, Zeyu Mi, Le Chen, Yubin Xia, and Haibo Chen. 2024b. Powerinfer-2: Fast large language model inference on a smartphone. arXiv preprint arXiv:2406.06282.

Zheming Yang, Yuanhao Yang, Chang Zhao, Qi Guo, Wenkai He, and Wen Ji. 2024. Perllm: Personalized inference scheduling with edge-cloud collaboration for diverse llm services. arXiv preprint arXiv:2405.14636.

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze. 2025. Flashinfer: Efficient and customizable attention engine for llm inference serving. arXiv preprint arXiv:2501.01005.

Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A distributed serving system for transformer-based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22).

Mingjin Zhang, Jiannong Cao, Xiaoming Shen, and Zeyang Cui. 2024. Edgeshard: Efficient llm inference via collaborative edge computing. arXiv preprint arXiv:2405.14371.

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24).
A Additional Related Work
B Cold Start Evaluation
General LLM Inference. LLMs generate text responses auto-regressively, producing one token at a time based on preceding tokens. The process consists of two stages that can potentially be executed on different endpoints: (i) Prefill stage: The model processes the input text (prompt), calculates and stores intermediate model states, i.e., the key and value cache (KV cache) of tokens, to generate the first token. A token represents a word or part of a word the model can interpret. Once the first token is generated, it is appended to the end of the prompt and the generation process moves on to the (ii) Decode stage: The model processes the updated prompt (including previously generated tokens) to generate the next token. The decode stage continues until a stopping condition is met (e.g., reaching an end-of-sequence token or the maximum generation length).
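To make the two stages concrete, the following minimal sketch separates an explicit prefill pass from the token-by-token decode loop using the Hugging Face transformers API. It is an illustration rather than the DiSCo implementation: the checkpoint name, greedy decoding, and the 256-token cap are all assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: any small causal LM works here; this checkpoint is only an example.
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt_ids = tokenizer("How to use GitHub?", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill stage: one forward pass over the whole prompt builds the KV cache
    # and produces the logits for the first output token (TTFT ends here).
    out = model(prompt_ids, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated = [next_token]

    # Decode stage: one token per step, reusing the cached keys and values;
    # each iteration corresponds to one TBT interval.
    for _ in range(255):
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)
        if next_token.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(torch.cat(generated, dim=-1)[0], skip_special_tokens=True))

Because the decode loop only ever feeds the newest token back in, its per-step cost stays flat, while the prefill cost grows with prompt length.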
On-Server LLM Serving. Existing work has focused on GPU kernel optimization (Dao et al., 2022; Ye et al., 2025), KV-cache management (Lin et al., 2024a; Liu et al., 2024d; Qin et al., 2024), model parallelism (Shoeybi et al., 2019; Pope et al., 2023; Liu et al., 2023), quantization (Xiao et al., 2023; Lin et al., 2024b; Dettmers et al., 2022), and scheduling (Yu et al., 2022; Kwon et al., 2023; Agrawal et al., 2024; Sun et al., 2025). For example, Orca (Yu et al., 2022) introduced continuous batching to improve serving throughput, while vLLM (Kwon et al., 2023) developed PagedAttention to reduce LLM memory fragmentation. Sarathi (Agrawal et al., 2024) implemented chunked prefill to mitigate inter-request interference within batches. Andes (Liu et al., 2024a) addresses QoE for individual requests from the server side but lacks awareness of network jitter and device potential. These server-side advancements complement the DiSCo design.
On-Device LLMs. Google's Gemini Nano (Google, 2024) and Apple's Apple Intelligence (Gunter et al., 2024) have been integrated into Android OS and iOS devices, respectively. MLC-LLM (MLC team, 2023) and llama.cpp (Gerganov, 2024) efficiently deploy various LLMs on devices. PowerInfer (Song et al., 2023) and PowerInfer-2 (Xue et al., 2024b) optimize on-device LLM inference speed by leveraging sparsity in model activations. DiSCo acts as a middle layer to schedule and migrate response generation between servers and devices.
B Cold Start Evaluation

This section presents cold start performance measurements for the Qwen-2.5 model series across different hardware configurations. The experiments were conducted on two platforms: Windows 10 with NVIDIA RTX 3060 12GB and Linux with NVIDIA A40 48GB. A fixed prompt "How to use GitHub?" was used throughout all experiments. We measured two critical metrics: model loading time and TTFT for Qwen-2.5 models ranging from 0.5B to 7B parameters, all using FP16 precision. The experimental setup consisted of 10 measurement runs, with 2 additional warmup runs to ensure measurement stability. It is worth noting that such warmups can potentially mask the true gap between model loading and prompt prefill time due to various optimizations, including OS page cache. To maintain authentic cold start conditions, we explicitly cleared the CUDA cache and performed garbage collection before each run.

The results revealed several significant patterns. On the RTX 3060, the loading time exhibits an approximately linear increase with model size, ranging from 1.29s for the 0.5B model to 4.45s for the 3B model. While TTFT follows a similar trend, the processing time is substantially lower, ranging from 0.051s to 0.145s. On the A40 GPU, despite observing longer loading times, TTFT is significantly reduced across all models, maintaining a remarkably consistent value regardless of model size. These findings indicate that while model loading remains more resource-intensive on our Linux setup, the inference performance benefits substantially from the A40's superior computational capabilities.

Metric          0.5B    1.5B    3B      7B
Windows 10 (NVIDIA RTX 3060 12GB)
Load Time (s)   1.29    2.48    4.45    -
TTFT (s)        0.051   0.105   0.145   -
Linux (NVIDIA A40 48GB)
Load Time (s)   1.53    3.12    5.72    13.43
TTFT (s)        0.025   0.026   0.033   0.033

Table 4: Model loading time during cold start can significantly slow down TTFT. Average Qwen-2.5 model performance over 10 runs. The 7B model exceeds the memory capacity of the RTX 3060 and thus cannot be evaluated.
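The cold-start protocol above can be reproduced with a measurement loop of roughly the following shape. This is a simplified sketch rather than the exact harness: the checkpoint name mirrors the Qwen-2.5 setup, and per-model sweeps, error handling, and logging are omitted.

import gc
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"   # assumption: repeat for the 1.5B/3B/7B variants
PROMPT = "How to use GitHub?"
WARMUPS, RUNS = 2, 10

def cold_start_run(device="cuda"):
    # Re-create cold-start conditions: release the previous model instance and
    # drop cached CUDA allocations before timing anything.
    gc.collect()
    torch.cuda.empty_cache()

    t0 = time.perf_counter()
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).to(device)
    load_time = time.perf_counter() - t0

    inputs = tokenizer(PROMPT, return_tensors="pt").to(device)
    t1 = time.perf_counter()
    model.generate(**inputs, max_new_tokens=1)   # first token only, i.e., TTFT
    torch.cuda.synchronize()
    ttft = time.perf_counter() - t1

    del model
    return load_time, ttft

measurements = [cold_start_run() for _ in range(WARMUPS + RUNS)][WARMUPS:]
load_times, ttfts = zip(*measurements)
print(f"load: {sum(load_times) / RUNS:.2f}s   TTFT: {sum(ttfts) / RUNS:.3f}s")

Keeping the timer for model loading separate from the single-token generate call is what exposes the gap reported in Table 4.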
C Prediction-based Model Selection

This section provides a comparative analysis of several TTFT prediction methods. For selecting the endpoint with a lower TTFT for each request, TTFT prediction is imperative. For on-device inference, TTFT prediction is straightforward, as TTFT exhibits a linear relationship with prompt length. Conversely, on-server inference TTFT is characterized by high variability, rendering prediction challenging. Moreover, the prediction method itself must be computationally efficient, as its overhead also contributes to end-to-end TTFT.

Table 5 presents a comparative analysis of four common lightweight time-series-based prediction methods applied to traces collected from three prevalent LLM services. Our correlation analysis (Table 1) revealed no significant correlation between prompt length and TTFT; thus, prompt length is omitted as a feature in these prediction methods. We demonstrate that none of these methods offers sufficient accuracy for TTFT prediction.

Model                      MAPE (%)   MAE (s)
Command
  Moving Average           39.40      0.0899
  ExponentialSmoothing     53.51      0.1047
  Random Forest            39.33      0.0966
  XGBoost                  35.43      0.0905
DeepSeek-V2.5
  Moving Average           27.80      0.3959
  ExponentialSmoothing     27.39      0.3771
  Random Forest            32.97      0.4745
  XGBoost                  27.51      0.4001
GPT-4o-mini
  Moving Average           24.55      0.0995
  ExponentialSmoothing     20.88      0.0844
  Random Forest            28.68      0.1128
  XGBoost                  24.83      0.0997
LLaMA-3-70b-Instruct
  Moving Average           42.18      0.3312
  ExponentialSmoothing     40.27      0.3154
  Random Forest            49.67      0.3875
  XGBoost                  43.94      0.3451

Table 5: Comparative analysis of Moving Average, Exponential Smoothing, Random Forest, and XGBoost prediction models across Command, DeepSeek, GPT, and LLaMA model traces. Metrics include Mean Absolute Percentage Error (MAPE) and Mean Absolute Error (MAE).
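For reference, the simplest of the four baselines in Table 5 can be evaluated with a few lines of Python. The sketch below assumes a chronological list of observed server TTFTs and a window size chosen by the reader; it reports the same MAPE and MAE metrics as the table.

def moving_average_forecast(ttfts, window=8):
    # Predict each TTFT as the mean of the previous `window` observations.
    preds = []
    for i in range(1, len(ttfts)):
        history = ttfts[max(0, i - window):i]
        preds.append(sum(history) / len(history))
    return preds   # predictions for ttfts[1:]

def mape_and_mae(actual, predicted):
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    mape = 100.0 * sum(e / a for e, a in zip(errors, actual)) / len(errors)
    mae = sum(errors) / len(errors)
    return mape, mae

# Hypothetical per-request server TTFTs in seconds; real traces come from the measured services.
trace = [0.21, 0.25, 0.19, 0.60, 0.23, 0.22, 0.95, 0.24]
predictions = moving_average_forecast(trace)
mape, mae = mape_and_mae(trace[1:], predictions)
print(f"MAPE {mape:.2f}%   MAE {mae:.4f}s")

The exponential smoothing, random forest, and XGBoost baselines plug into the same evaluation loop; only the forecasting function changes.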
D Response Quality

This section examines the quality of responses generated by DiSCo, with a particular focus on quality preservation during endpoint transitions. We first establish bounds on generation quality, then present our evaluation methodology, and finally demonstrate through extensive experiments that DiSCo maintains consistent quality across different model configurations and tasks.
D.1 Quality Bounds

A critical aspect of DiSCo is maintaining generation quality during endpoint transitions. We employ a systematic approach to quality preservation (Diba et al., 2017; Gupta et al., 2022; Chen et al., 2023). Specifically, for endpoints A and B with quality metrics Q_A and Q_B (measured by LLM scores or ROUGE scores), we find that any migrated sequence M with quality Q_M satisfies:

min(Q_A, Q_B) ≤ Q_M ≤ max(Q_A, Q_B)    (6)

This bound ensures that migration does not degrade quality beyond the capabilities of individual models, yet still achieves better performance than the on-device counterpart.

D.2 Evaluation Methodology

Evaluation Framework We establish a comprehensive assessment framework encompassing both automated metrics and LLM-based evaluation. Our framework evaluates two distinct tasks:

• Instruction Following: We evaluate 500 data items from the Alpaca dataset (Taori et al., 2023) using our structured prompt template, with quality assessment performed by multiple LLM judges: Gemini1.5-pro, GPT-4o, and QWen2.5-72b-instruct.

• Translation Quality: We assess Chinese-to-English translation on 500 data items from the Flores_zho_Hans-eng_Latn dataset (Team et al., 2022; Goyal et al., 2022) using the ROUGE-1 metric.

These two tasks are popular on end-user devices. Understandably, for complex tasks such as advanced math reasoning, we notice DiSCo can lead to accuracy drops compared to the on-server model due to the limited capability of the on-device endpoints.
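For the translation task, the per-item scoring and the bound in Equation (6) can be checked as follows. The rouge-score package and the example strings are illustrative assumptions; the actual evaluation uses the 500 Flores items and outputs from the device-only, server-only, and migrated (DiSCo) configurations.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def rouge1(reference, hypothesis):
    return scorer.score(reference, hypothesis)["rouge1"].fmeasure

# Hypothetical outputs for one data item.
reference = "The committee approved the new budget on Friday."
device_only = "The committee approved new budget Friday."
server_only = "The committee approved the new budget on Friday."
migrated = "The committee approved the new budget Friday."

q_a, q_b, q_m = (rouge1(reference, h) for h in (device_only, server_only, migrated))

# Equation (6): the migrated sequence should stay within the quality range of the two endpoints.
assert min(q_a, q_b) - 1e-6 <= q_m <= max(q_a, q_b) + 1e-6, "quality bound violated"
print(f"device {q_a:.3f}   server {q_b:.3f}   migrated {q_m:.3f}")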
Experimental Setup We configure our experiments with:

• A fixed maximum generation length of 256 tokens

• First endpoint's maximum generation length varied through [0, 4, 16, 64, 256] tokens

• Four model combinations: 0.5B-7B, 3B-7B, 7B-0.5B, and 7B-3B (prefix and suffix denote the model sizes of the first and second endpoints, respectively)

The generation transitions to the second endpoint when the first endpoint reaches its length limit without producing an end-of-generation token, creating natural boundary conditions for analysis.

For instruction-following tasks, we employ the following structured evaluation template:
JUDGE_PROMPT = """Strictly evaluate the
quality of the following answer on a scale
of 1-10 (1 being the worst, 10 being the
best). First briefly point out the problem
of the answer, then give a total rating in
the following format.

Question: {question}

Answer: {answer}

Evaluation: (your rationale for the rating,
as a brief text)

Total rating: (your rating, as a number
between 1 and 10)
"""

D.3 Results and Analysis

D.3.1 Quality Metrics

Our comprehensive evaluation reveals several key findings:

• Bounded Quality: The combined sequence quality consistently remains bounded between individual model performance levels.

• Translation Performance: ROUGE-1 scores maintain stability between 0.23 and 0.26.

• Instruction Following: Scores show consistent ranges from 4 to 6.

Figure 10: Quality evaluation results of DiSCo. The top figure shows translation quality evaluation using ROUGE-1 scores, demonstrating that DiSCo consistently achieves higher quality than the on-device baseline. The bottom figure presents evaluation scores from different LLM judges on instruction-following capabilities, where each subplot represents a different model pair comparison with varied first-endpoint model's maximum sequence length. The consistent patterns across different LLM judges demonstrate the robustness of our evaluation framework.

E Experiment Settings for End-to-end Cost

For on-device LLMs, we quantify cost using FLOPs (floating-point operations). For on-server LLM services, we use their respective pricing rates at the time of experimentation. We set the energy-to-monetary conversion ratio (energy_to_money) to 0.3 $ per million FLOPs for server-constrained experiments and 5 $ per million FLOPs for device-constrained experiments. To establish a comprehensive cost model that enables direct comparison between device and server computation costs, we analyze both the computational complexity of on-device models through detailed FLOPs calculations (Section E.1) and the pricing structures of commercial LLM services (Section E.2). The generation length limit is set to 128.
E.1 FLOPs of On-Device LLMs

To accurately quantify the computational cost per token in both prefill and decode stages, we conduct a detailed FLOPs analysis using three representative models: BLOOM-1.1B, BLOOM-560M, and Qwen1.5-0.5B. All models share a 24-layer architecture but differ in other parameters: BLOOM-1.1B (d_model = 1024, 16 heads, FFN dim = 4096), BLOOM-560M (d_model = 512, 8 heads, FFN dim = 2048), and Qwen1.5-0.5B (d_model = 768, 12 heads, FFN dim = 2048).

Per-token FLOPs computation. The total FLOPs for processing each token consist of five components:

FLOPs_total = FLOPs_attn + FLOPs_ffn + FLOPs_ln + FLOPs_emb + FLOPs_out    (7)

For a sequence of length L, the attention computation differs between stages. In prefill:

FLOPs_attn = n_layers · (3 d_model^2 + L^2 d_model / n_heads + L d_model + d_model^2)    (8)

While in decode, KV caching eliminates the quadratic term:

FLOPs_attn = n_layers · (3 d_model^2 + L d_model / n_heads + L d_model + d_model^2)    (9)

Table 6 presents the total FLOPs across different sequence lengths. The decode phase maintains constant FLOPs regardless of sequence length due to KV caching, while prefill phase FLOPs increase with sequence length. A breakdown of computational cost by component (Table 7) reveals that embedding and output projection operations account for the majority of FLOPs, particularly in models with large vocabularies.

Length    BLOOM-1.1B   BLOOM-560M   Qwen-0.5B
Prefill Phase
L = 32    0.85         0.45         0.39
L = 64    0.93         0.50         0.45
L = 128   1.25         0.65         0.69
Decode Phase
L = 32    0.82         0.42         0.37
L = 64    0.82         0.42         0.37
L = 128   0.82         0.42         0.37

Table 6: Prefill and Decode FLOPs (billions)

Component   BLOOM-1.1B   BLOOM-560M   Qwen-0.5B
Embedding   31.24        25.00        31.51
Attention   13.01        10.00        16.56
FFN         24.48        20.00        20.38
LayerNorm   0.02         0.02         0.04
Output      31.24        25.00        31.51

Table 7: Component Ratios at L=128 (%)

E.2 LLM Service Pricing

This section provides further details on the pricing of LLM services. Table 8 presents the pricing models for several commercial Large Language Models (LLMs) as of October 28, 2024. The pricing structure follows a dual-rate model, differentiating between input (prompt) and output (generation) tokens. These rates represent the public pricing tiers available to general users, excluding any enterprise-specific arrangements or volume-based discounts.

Model               Vendor       Input price   Output price
DeepSeek-V2.5       DeepSeek     0.14          0.28
GPT-4o-mini         OpenAI       0.15          0.60
LLaMa-3.1-70b       Hyperbolic   0.40          0.40
LLaMa-3.1-70b       Amazon       0.99          0.99
Command             Cohere       1.25          2.00
GPT-4o              OpenAI       2.50          10.0
Claude-3.5-Sonnet   Anthropic    3.00          15.0
o1-preview          OpenAI       15.0          60.0

Table 8: LLM service pricing (USD per 1M tokens). Input prices refer to tokens in the prompt, while output prices apply to generated tokens.

F Pseudocode for Cost-Aware Adaptive Request Scheduling

The request scheduling algorithm consists of three key components. Algorithm 1 defines the input parameters and determines whether the scenario is device-constrained or server-constrained based on the relative costs. For device-constrained scenarios, Algorithm 2 implements a wait-time strategy to protect tail latency while conserving device energy when possible. For server-constrained scenarios, Algorithm 3 employs a length-based routing approach to optimize TTFT while maintaining the server budget constraint. These algorithms work together to achieve the dual objectives of minimizing latency and managing costs.
Algorithm 1 Variable Definitions and Constraints
Require:
1: p(l): Length distribution
2: F(t): TTFT CDF of server
3: b ∈ [0, 1]: Budget ratio
4: c_d^p, c_d^d: Device prefill/decode costs
5: c_s^p, c_s^d: Server prefill/decode costs
6: α ∈ (0, 1): Tail ratio
Ensure: Policy type based on cost constraints
7: if min(c_d^p, c_d^d) > max(c_s^p, c_s^d) then Device-constrained
8: else Server-constrained

Algorithm 2 Device-constrained Scheduling
Require: Variables from Algorithm 1
1: // Phase 1: Set maximum wait time for tail protection
2: w_tail ← F^{-1}(1 − min(α, b))
3: // Initialize wait times for all prompt lengths
4: W ← {l : w_tail for all l}
5: if b ≤ α then
6:     return W   {Use max wait time for all lengths}
7: end if
8: // Phase 2: Optimize wait times with remaining budget
9: available_budget ← b − α
10: for l ∈ sort(support(p(l))) do
11:     length_cost ← p(l) · l · (1 − α)
12:     if available_budget ≥ length_cost then
13:         W[l] ← 0   {Start device immediately}
14:         available_budget ← available_budget − length_cost
15:     else
16:         // Find optimal wait time that meets budget
17:         Find w* ∈ [0, w_tail] where:
18:             F(w*) · length_cost + (b − available_budget) = b
19:         W[l] ← w*
20:         break
21:     end if
22: end for
23: return W   {Map from prompt lengths to wait times}

Algorithm 3 Server-constrained Scheduling
Require: Variables from Algorithm 1
1: // Find length threshold to split execution modes
2: Compute l_th where: ∫_0^{l_th} l · p(l) dl = (1 − b) · ∫_0^∞ l · p(l) dl
3: // Initialize execution policy map
4: P ← ∅
5: for l ∈ support(p(l)) do
6:     if l < l_th then
7:         P[l] ← (1, 0)   {(I_d, I_s): Device only}
8:     else
9:         P[l] ← (1, 1)   {(I_d, I_s): Concurrent execution}
10:    end if
11: end for
12: return P   {Map from lengths to execution indicators}

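As a companion to these listings, the following Python sketch implements the same decision logic over a discretized prompt-length distribution and an empirical server-TTFT sample. The variable names, the discretization, and the use of empirical quantiles for F and F^{-1} are simplifying assumptions for illustration, not the paper's implementation.

import numpy as np

def policy_type(c_d_p, c_d_d, c_s_p, c_s_d):
    # Algorithm 1: if even the cheaper device cost exceeds the pricier
    # server cost, the device budget is the binding constraint.
    return "device-constrained" if min(c_d_p, c_d_d) > max(c_s_p, c_s_d) else "server-constrained"

def device_constrained_waits(lengths, probs, ttft_samples, b, alpha):
    # Algorithm 2: choose a per-length wait time before starting on-device generation.
    samples = np.sort(np.asarray(ttft_samples, dtype=float))
    w_tail = float(np.quantile(samples, 1.0 - min(alpha, b)))   # F^{-1}(1 - min(alpha, b))
    waits = {l: w_tail for l in lengths}
    if b <= alpha:
        return waits                        # budget only covers tail protection
    available = b - alpha
    for l in sorted(lengths):
        cost = probs[l] * l * (1.0 - alpha)
        if available >= cost:
            waits[l] = 0.0                  # start the device immediately
            available -= cost
        else:
            # F(w*) * cost = available  =>  w* = F^{-1}(available / cost)
            waits[l] = float(np.quantile(samples, available / cost))
            break
    return waits

def server_constrained_policy(lengths, probs, b):
    # Algorithm 3: prompts below l_th run on the device only; the threshold is set
    # so the server handles roughly a b fraction of the expected token mass.
    total = sum(probs[l] * l for l in lengths)
    cum, l_th = 0.0, max(lengths)
    for l in sorted(lengths):
        cum += probs[l] * l
        if cum >= (1.0 - b) * total:
            l_th = l
            break
    return {l: (1, 0) if l < l_th else (1, 1) for l in lengths}   # (I_d, I_s)

# Toy usage. Per the pseudocode, p(l) * l is compared directly against the budget ratio,
# so the hypothetical lengths below are kept small to stay on that normalized scale.
lengths = [1, 2, 4]
probs = {1: 0.5, 2: 0.3, 4: 0.2}
ttfts = [0.18, 0.22, 0.25, 0.31, 0.40, 0.85, 1.30]
print(policy_type(5.0, 5.0, 0.3, 0.3))
print(device_constrained_waits(lengths, probs, ttfts, b=0.9, alpha=0.1))
print(server_constrained_policy(lengths, probs, b=0.5))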