
MoE-Infinity: Offloading-Efficient MoE Model Serving

Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, Mahesh Marina
University of Edinburgh
United Kingdom

Abstract

This paper presents MoE-Infinity, an offloading-efficient serving system for sparse mixture-of-experts (MoE) models. To optimize offloading, MoE-Infinity achieves novel request-level tracing for expert activation, capturing MoE's sparse execution patterns such as selective activation, group activation, and skewed reuse. Leveraging the request-level trace, MoE-Infinity performs effective expert prefetching and expert caching, achieving high efficiency in transferring model parameters from host memory to GPU memory. Experimental results demonstrate that MoE-Infinity achieves low latency comparable to expensive full-GPU deployments, which require up to 4X more GPU resources than MoE-Infinity. Compared to offloading-supporting LLM serving systems such as DeepSpeed-Inference, Llama.cpp, Mixtral Offloading, and BrainStorm, MoE-Infinity exhibits superior latency performance, providing 2-20X improvements when serving various MoE models for a large collection of LLM tasks. MoE-Infinity's source code is publicly available at https://github.com/TorchMoE/MoE-Infinity
1 Introduction

Mixture-of-Experts (MoE) models have become leading methods in delivering state-of-the-art (SOTA) AI services, with recent examples including Mixtral [18], Arctic [30], Qwen [32], Switch Transformer [9], Meta's NLLB [2, 4], DBRX [33], and Grok-1 [35]. These models process user prompts (collections of tokens) through MoE layers. Here, routers distribute the tokens to various smaller neural networks termed 'experts'. MoEs, often utilized as Large Language Models (LLMs), process input tokens during the prefill phase and generate tokens through multiple iterations during the decoding phase.

A significant challenge for serving MoEs is their substantial GPU resource requirements. MoEs possess up to trillions of parameters; for instance, the Switch Transformer contains up to 1.5 trillion parameters distributed across 61,440 experts. Other MoE models such as Mixtral, DBRX, Grok-1, and Arctic have hundreds of billions of parameters across various expert layers, requiring up to 1 TB of GPU memory. Storing all these experts necessitates extensive GPU resources, which cannot be afforded by users with limited GPU resources [21].

A promising strategy to reduce the GPU cost involves parameter offloading, meaning keeping experts in host memory and fetching them into GPUs during inference. Offloading has gained traction for several reasons: (i) While existing MoE quantization techniques can reduce data precision to BF16 or Int8 [10, 11, 20, 36], dropping below these precisions significantly degrades model quality [3, 12], motivating users to adopt offloading as a quality-preserving, complementary memory-saving technique. (ii) MoEs typically selectively activate experts when processing tokens, meaning less host-GPU bandwidth is needed when transferring experts. (iii) Recent hardware development also indicates the increasing value of offloading, exemplified by rapidly growing PCIe bandwidth and the recent fast chip-to-chip links (900GB/s) connecting on-chip DRAM and HBM in the Grace-Hopper Superchip.

Recently, LLM serving systems have begun integrating offloading capabilities. Systems like vLLM [22] and TensorRT-LLM [23] support offloading for the kv-cache, but since they target GPU-affluent environments, they cannot offload MoE model parameters. Other systems like Mixtral Offloading [8], Llama.cpp [24] and DeepSpeed-Inference [1] consider users with limited GPUs and support parameter offloading. Yet, these systems often experience significant latency overheads when offloading is enabled. This latency issue primarily arises for several reasons: (i) they incur excessive traffic in transferring experts from host memory to GPU memory, (ii) their GPU cache exhibits poor performance, causing repetitive fetching of experts, and (iii) even when prefetching is enabled, the timing of launching the prefetching is late, significantly blocking GPUs.

In this paper, we explore how to optimize offloading efficiency when serving MoE models. Our exploration is motivated by the following observations: (i) In many serving scenarios, MoE models exhibit low activation ratios. For instance, the SOTA Snowflake Arctic model activates 1.5-14.5% of experts when processing requests with batch sizes from 1 to 32. Even with the less sparse Mixtral-8x7B, the activation ratio is still below 25% when processing a request. This low activation ratio indicates the potential for a substantial reduction of traffic between the host and GPU. (ii) MoE models often activate a specialized group of experts in an iteration when processing a prompt, primarily due to how these experts are pre-trained to exhibit complementary capabilities. With traces that reflect this group activation pattern, we could predict which experts are soon to activate and perform timely prefetching. And finally, (iii) we observe that a skewed set of experts is heavily reused when processing a prompt over multiple iterations. This skewed reuse pattern means that a small cache could be sufficient to keep reused experts.
To leverage the above observations, we need to address several open concerns: (i) New methods are needed for tracing MoE inference. Existing methods, such as those adopted by BrainStorm [5] and DeepUM [19], are designed for tracing dynamic neural networks. With MoEs, they use model-level counting, meaning that all experts are counted for their usage over the entire lifetime of the serving system. After serving many requests, their model-level expert counts become uniform (detailed in Section 3.1) and cannot effectively guide prefetching and caching. Our new insight is: in MoE models, the skewed reuse and group activation patterns only emerge at the individual request level, not the model level. (ii) New methods are needed for prefetching and caching experts. These methods not only need to leverage the new request-level trace that records the sparse activation patterns of MoE models, but also need to consolidate prefetching and caching decisions during the complex decoding phase.

We design MoE-Infinity, a new offloading-efficient MoE model serving system. Our design leads to several contributions:

(1) Request-level tracing of expert activation. MoE-Infinity features a new request-level tracing method for MoE execution, supported by two novel data structures, the Expert Activation Matrix (EAM) and EAM Collection (EAMC). The design intuition here is that for MoE models, it is more effective to trace the distribution of the experts activated when processing a request. To identify the group activation pattern, we can maintain an EAM within each iteration when processing a request (named the iteration EAM). To show the skewed reuse pattern, we can accumulate iteration EAMs to form a request's EAM. The other new intuition here is that the requests' EAMs can be effectively stored in an EAMC online. This EAMC only needs a limited number of slots to keep different requests' EAMs, sufficient to recover the sparse patterns associated with different MoE models.

(2) Activation-aware prefetching and caching. We implement activation-aware prefetching and caching in MoE-Infinity to enhance offloading efficiency. Our prefetching strategy is informed by two design intuitions: (i) comparing the current iteration's EAM with prior EAMs in the EAMC enables accurate prediction of expert activation, facilitating timely prefetching; (ii) prefetching experts across multiple layers concurrently, while prioritizing those closest to the layer currently executed. These two insights significantly improve the timeliness and bandwidth cost of prefetching operations. Additionally, these prefetching operations can be enhanced by a new expert caching strategy which (i) uses accumulated iterations' EAMs to form the current request's EAM, effectively guiding the choice of experts cached in GPUs over multiple iterations, and (ii) recognizes that experts in initial MoE layers typically benefit less from prefetching compared to those in later layers. This realization allows us to prioritize initial-layer experts in our caching strategy, maximizing the synergistic benefits of both caching and prefetching.
Figure 1. MoE model example.

We have thoroughly evaluated MoE-Infinity against baseline approaches. Our evaluation covers various MoE models (including Snowflake Arctic [30], Mixtral [18], Google Switch Transformer [9] and Meta NLLB-MoE [4]) and over 290 LLM tasks contributed by three datasets (FLAN [34], BIGBench [31], and MMLU [14]). Benefiting from the significantly improved prefetching and caching performance, MoE-Infinity can achieve 2-20X improvement in latency performance compared to an array of offloading-supported LLM serving systems, including DeepSpeed-Inference [1], Llama.cpp [24], Mixtral Offloading [8], and BrainStorm [5]. Compared to the expensive full-GPU serving solutions used by users with affluent GPU resources, MoE-Infinity can achieve almost 64-95% of the latency and throughput performance while saving GPU costs by up to 8X, making MoE-Infinity a competitive high-performance choice for users with limited GPU resources.

2 Background and Motivation

2.1 Serving MoE Models

We describe the MoE model architecture with an example shown in Figure 1. Here, we omit self-attention modules and focus on the MoE layers. Each MoE layer consists of a router and a group of experts, which are Feed Forward Networks (FFNs). The router assigns each token to specific experts. MoE models can vary in configuration. For example, Mixtral-8x7B has 8 experts per layer, with each expert managing 0.15B parameters (340MB), and the router selects 2 experts per token. Switch-128x0.2B, while similar in total model size, features 128 experts per layer, each expert handling 33MB, with the router selecting 1 expert per token. The MoE model example processes three input prompts (referred to as requests). To process these prompts, the model first enters a prefill phase. Then, it moves into a decoding phase, which iteratively generates outputs. This process is illustrated in Figure 1, where the three input prompts trigger 2, 1, and 1 iterations, respectively.
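To make the routing behaviour described above concrete, the sketch below shows a minimal top-k MoE layer in PyTorch. It is only an illustration of the general mechanism: the class name SimpleMoELayer and all dimensions are our own choices, not part of MoE-Infinity or of any of the cited models.

```python
# Minimal sketch of an MoE layer with a top-k router (illustrative only;
# names and dimensions are ours, not MoE-Infinity's or Mixtral's).
import torch
import torch.nn as nn


class SimpleMoELayer(nn.Module):
    def __init__(self, hidden_dim: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, num_experts)  # token-to-expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, 4 * hidden_dim),
                          nn.GELU(),
                          nn.Linear(4 * hidden_dim, hidden_dim))  # one FFN per expert
            for _ in range(num_experts)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [num_tokens, hidden_dim]
        scores = torch.softmax(self.router(tokens), dim=-1)
        weights, expert_ids = scores.topk(self.top_k, dim=-1)  # selective activation
        output = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for expert in expert_ids[:, k].unique():
                routed = expert_ids[:, k] == expert             # tokens routed to this expert
                expert_out = self.experts[int(expert)](tokens[routed])
                output[routed] += weights[routed, k].unsqueeze(-1) * expert_out
        return output


# A Mixtral-8x7B-like configuration: 8 experts per layer, top-2 routing.
layer = SimpleMoELayer(hidden_dim=32, num_experts=8, top_k=2)
print(layer(torch.randn(5, 32)).shape)  # torch.Size([5, 32])
```

With 128 experts per layer and top-1 routing (as in Switch-128x0.2B), the same loop touches only a small fraction of the expert weights per token, which is exactly the sparsity that offloading exploits.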
When serving an MoE model, practitioners often need to balance latency, throughput, memory cost, and model quality. Latency can be measured by time-per-output-token (TPOT), that is, the time to complete a forward iteration during the decoding phase [22, 37]. Throughput, indicating the generation speed, is measured by tokens-per-second. The memory cost is the minimum GPU memory required for model inference. Model quality is often defined by different serving scenarios, e.g., language understanding [14], logical understanding [31].
Table 1. Expert activation ratio (%) over batch size (bs).

Model bs=1 bs=2 bs=4 bs=8 bs=16 bs=32
Switch-128x0.2B 1.2 1.8 2.7 3.9 5.6 7.8
NLLB-128x0.4B 3.6 5.3 7.6 10.6 14.7 19.9
Arctic-128x4B 1.5 2.7 4.5 6.9 10.2 14.5
Mixtral-8x7B 25.0 42.9 66.0 86.4 96.7 99.3

Figure 2. Expert Activation and Sequence Counts in Prefilling: (a) NLLB-128x0.4B, (b) Switch-128x0.2B. The left y-axis shows the expert activation ratio by prompt length (x-axis), while the right y-axis counts prompts by length. The horizontal red line marks the average prompt length activation ratio, using BIGBench, FLAN, and MMLU.

2.2 Offloading in Current LLM Serving

MoE models often incur substantial memory costs. For instance, Switch Transformers [9] requires up to 6TB of memory to store all parameters, necessitating at least 75 A100-80GB GPUs, while Grok-1 requires 620 GB of memory, that is, at least 8 A100-80GB GPUs.

To reduce memory costs, the designers of LLM serving systems have begun to incorporate offloading capabilities. On the one hand, vLLM and TensorRT-LLM support offloading the key-value cache to host memory via the PagedAttention mechanism [22]. However, these systems do not support the offloading of experts in MoE models. FlexGen [29] supports parameter offloading but it is specific to dense transformer models and cannot support MoE models. On the other hand, Llama.cpp [24], Mixtral Offloading [8], HuggingFace Accelerate [17] and DeepSpeed-Inference [1] have supported the offloading of MoE models, making them popular among users with limited GPU resources.

Enabling offloading, however, often means substantial latency overhead today. This overhead is primarily attributed to waiting for the experts to be fetched from host memory into the GPUs. Often, we observe this extra latency ranges from 4-20X depending on the size of the models and the configurations of the GPU servers. Consider the case of using an offloading-enabled NVIDIA A5000 GPU to serve a Mixtral-8x7B model for tasks such as MMLU, BIGBench, and FLAN. The TPOT exceeds 6 seconds when using PCIe 4.0 to connect the GPU, which is over 30 times slower than the 175ms achieved by using 8 A5000 GPUs to serve the model without offloading.

2.3 Issues with Existing Offloading Approaches

To mitigate the latency overheads, existing offloading approaches exhibit different issues:

(1) Excessive fetching traffic. Most offloading-supported LLM serving systems (e.g., DeepSpeed-Inference, HuggingFace Accelerate, and Llama.cpp) are originally designed for dense transformers (Llama2 and Llama3). Even though MoE models selectively activate experts for different tokens, these systems cannot predict this selective activation; they instead prefetch/fetch all parameters of an MoE layer, causing excessive traffic bottlenecked by PCIe connections [16, 26, 27].

(2) Slow on-demand prefetching. Recently, Mixtral Offloading identified this issue and proposed prefetching only the activated experts after the router determines which experts are activated. Although this reduces traffic, Mixtral Offloading launches the prefetching operation late, still blocking GPUs significantly.

(3) Poor cache performance. We could also cache experts in GPUs to reduce expert transfers. Existing caching strategies are primarily LRU or based on model-level counts, as used in SOTA dynamic neural network systems including BrainStorm and DeepUM. As we later show in Section 5, both of these caching strategies perform poorly with MoE models.

3 Exploiting Expert Activation Awareness

3.1 Key Observations

We have the following key observations that motivate our design for improving offloading:

(1) Low activation ratios imply low fetching traffic. In serving scenarios, LLM serving systems often exhibit low expert activation ratios with MoE models. Our findings (Table 1) reveal that models with 128 experts per layer activate fewer than 5% of experts for small batches and rarely more than 20% for 32-request batches. In contrast, configurations like Mixtral, with eight experts per layer and top-2 activation, show 25% activation for single requests and less than 55% for batches of four.

Selective activation during the prefilling phase is evident in the BIGBench and FLAN datasets, with 60% of prompts under 200 tokens, leading to low activation ratios in models like Switch and NLLB (Figure 2). NLLB averages 26% expert activation, increasing to 47% for 756-token prompts (Figure 2a). Switch shows a lower average of 21% (Figure 2b). Short prompts are more common, impacting activation trends. Unlike the others, Mixtral, with eight experts per layer, often activates all experts during prefilling, underscoring its higher cost in deployment.
Figure 3. Example of group expert activation across MoE Layers 0-2. The edge denotes the conditional activation count between a pair of experts. The higher the count, the wider the edge. We show the top hitter experts in each layer.

Figure 4. Expert reuse statistics (reuse count per expert index) over 20 decoding iterations for two sample sequences and merged over 1000 sequences. Sampled from the last layer of Mixtral-8x7B (top) and NLLB-128x1B (bottom).
The above data indicate low expert activation ratios during decoding. A strategy of fetching only active experts for inference—unlike current practices of fetching all potentially active experts—could cut data fetching traffic significantly. This approach might reduce traffic by up to 98.5% in models like Arctic, NLLB, and Switch during decoding. Even for MoE models with higher activation ratios, traffic reduction could vary from 75% to 34% for batch sizes of 1 to 4, exemplified by serving Mixtral-8x7B.
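The reduction figures quoted above follow directly from Table 1: when only activated experts are fetched, transfer volume shrinks by roughly (100% minus the activation ratio). A quick check against the Table 1 numbers:

```python
# Expert-transfer reduction when fetching only activated experts, estimated
# as 100% minus the activation ratio from Table 1 (per batch size).
activation_ratio = {                     # (%) at batch sizes 1 and 4
    "Arctic-128x4B": {1: 1.5, 4: 4.5},
    "Mixtral-8x7B": {1: 25.0, 4: 66.0},
}

for model, ratios in activation_ratio.items():
    for batch_size, ratio in ratios.items():
        reduction = 100.0 - ratio
        print(f"{model} (bs={batch_size}): ~{reduction:.1f}% less expert traffic")

# Arctic-128x4B (bs=1): ~98.5% less expert traffic
# Arctic-128x4B (bs=4): ~95.5% less expert traffic
# Mixtral-8x7B (bs=1): ~75.0% less expert traffic
# Mixtral-8x7B (bs=4): ~34.0% less expert traffic
```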
(2) Group activation can facilitate prefetching. In serving MoE models, we observe that a group of experts across multiple layers is often activated together. This observation is essentially attributed to the training method of MoE models [18], where a specialized group of experts is trained to process tokens, ensuring that fewer parameters are activated compared to their dense counterparts.

Figure 3 demonstrates group activation patterns in Switch. The pattern is also observed in NLLB, Arctic, and Mixtral. For instance, activating Expert 81 in Layer 0 (E[0,81]) is linked to higher activations of E[1,116], E[1,72], and E[1,81] in Layer 1. This pattern, indicating a cross-layer activation group like {E[0,81], E[1,115], E[2,21]}, suggests interconnected expert activations across layers. Different groups, e.g., {E[0,81], E[1,72], E[2,72]}, co-exist and show varied token traffic. For instance, {E[0,81], E[1,72], E[2,62]} experiences higher token traffic compared to {E[0,81], E[1,81], E[2,72]}.

Leveraging group activation patterns in MoE models can significantly enhance prefetching in MoE offloading systems. With experts holding up to hundreds of megabytes and GPU inference times mirroring PCIe 4.0/5.0 data fetch times (both approximately 1 millisecond), overlapping prefetching and inference becomes feasible. By tracking active experts and employing historical group activation, an MoE system can predict and prefetch the experts likely needed for subsequent layers, making prefetching accurate and timely.

(3) Skewed expert reuse can enhance caching. In autoregressive LLM tasks, MoE models exhibit skewed expert reuse across iterations, as our analysis of real-world data in Figure 4 shows. Sampling two LLM requests, we observed distinct reuse patterns; for instance, in Sequence 1, in Mixtral's last layer, Expert 2 was reused over 15 times and Expert 4, 8 times—far more than others. In Sequence 2, Expert 0 had the highest reuse count, followed by Expert 5, both considerably exceeding others. Similarly, in NLLB, fewer than 5% of experts are activated, with a select few, like Expert 126, being heavily reused.

The skewed reuse of experts becomes more uniform across multiple requests. After analyzing over 1000 sequences with Mixtral, expert reuse counts even out. NLLB shows a similar trend, albeit less pronounced than in single-request scenarios. This uniformity suggests potential for improved cache performance by leveraging the skewed reuse patterns in each sequence. By prioritizing heavily reused experts, we can optimize the cache footprint and replacement strategies to enhance cache hit ratios, utilizing HBM (High Bandwidth Memory) left idle on GPUs after the inference runtime is initialized.

3.2 Open Concerns

Based on the design intuitions above, we aim to enable expert activation awareness in MoE serving systems, focusing on tracing the experts' selective activation, group activation, and skewed reuse during inference. Leveraging this trace, the serving system can effectively prefetch and cache experts in GPUs, maximizing efficiency in supporting offloading. To realize such a system design, two open concerns must be addressed:

(1) New methods for tracing expert activation. We need methods to trace the expert activation. These methods must trace various information, including that related to (i) the selective activation of experts, (ii) the group activation of experts on multiple layers, and (iii) the skewed reuse of experts at the request level.

Current tracing approaches, used in BrainStorm and DeepUM, cannot effectively provide the above information. They adopt a conceptually model-level expert tracing approach where the experts' usages are accumulated across all requests since serving the model. They thus suffer from the uniform distribution of expert activation emerging at the model level, falling short in effectively guiding prefetching and caching decisions based on their historical traces.
(2) New methods for expert prefetching and caching. New methods for prefetching and caching experts are needed. When operating an MoE model, the activation information, including expert selection, group activation, and reuse, is continuously updated. This necessitates the inference runtime to dynamically adjust its prefetching and caching priorities. For instance, an expert initially predicted for activation and thus prefetched might later be deprioritized based on new activation data. Additionally, the runtime should consider the MoE model's architectural characteristics. For example, activation predictions may vary in accuracy; layers closer to the active layer typically offer more precise predictions than those farther away. This factor should be considered when setting prefetching priorities. Prefetching and caching are interdependent; prefetching may not benefit the initial layers as much, so caching should prioritize likely activated experts in these layers. These considerations are also crucial in the cache replacement strategy.

4 Request-Level Expert Activation Tracing

4.1 Expert Activation Matrix

We introduce the Expert Activation Matrix (EAM), a novel data structure tailored for tracing expert activations in MoE models. The EAM aims to: (i) efficiently trace expert activations for analyzing activation ratios, group activations, and skewed reuses, and (ii) maintain low space complexity for scalability in large MoE models.

Our approach with the EAM focuses on tracing the expert activations per inference request (i.e., request-level tracing), rather than tracing activations across requests (i.e., model-level tracing) as seen in systems like DeepUM and BrainStorm. This method enables us to capture group activation patterns at the iteration level and accumulate data at the request level, crucial for identifying skewed reuse and selective activation patterns that are not apparent with model-level tracing.

The design of the EAM realizes the above intuition. There are two types of EAM in our system, namely the iteration-level EAM and the request-level EAM:

(1) Iteration-level EAM. An iteration-level EAM is defined as follows: for a model with L MoE layers and E experts per layer, given n tokens per iteration, an EAM M is an L × E matrix where M[i][j] is the number of tokens routed to expert e_{i,j}, i.e., the expert with index j at layer i. A row of the EAM, i.e., M[i], represents the expert activation of layer i, and we have M[i][j] ∈ {0, ..., n} for all i, j and Σ_j M[i][j] = n for all i.

(2) Request-level EAM. A request-level EAM accumulates the counts of the per-iteration EAMs. Formally, given the total number of tokens across all iterations as r, we have Σ_j M[i][j] = r for all i. Request-level EAMs are traced for prefilling and decoding separately. In prefilling, r is the number of tokens in a prompt, while in decoding, r is the number of output tokens.
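Both EAM types can be maintained with two small count matrices. The sketch below is a minimal illustration of the bookkeeping described above; the class EAMTracker and its methods are our own naming, not MoE-Infinity's API.

```python
# Minimal sketch of iteration-level and request-level EAM bookkeeping
# (illustrative; class and method names are ours, not MoE-Infinity's API).
import numpy as np


class EAMTracker:
    def __init__(self, num_layers: int, num_experts: int):
        self.iteration_eam = np.zeros((num_layers, num_experts), dtype=np.int64)
        self.request_eam = np.zeros((num_layers, num_experts), dtype=np.int64)

    def record_routing(self, layer: int, expert_ids) -> None:
        # Called after the router of `layer` dispatches its tokens; `expert_ids`
        # holds one expert index per (token, k) routing decision.
        for expert in expert_ids:
            self.iteration_eam[layer, expert] += 1

    def end_iteration(self) -> None:
        # Accumulate the iteration-level EAM into the request-level EAM and
        # reset it for the next decoding iteration.
        self.request_eam += self.iteration_eam
        self.iteration_eam[:] = 0


# Example: 3 MoE layers, 8 experts per layer, one iteration with 3 tokens.
tracker = EAMTracker(num_layers=3, num_experts=8)
tracker.record_routing(layer=0, expert_ids=[1, 1, 5])
tracker.record_routing(layer=1, expert_ids=[2, 2, 2])
tracker.end_iteration()
print(tracker.request_eam[0])   # [0 2 0 0 0 1 0 0]
```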
We illustrate an EAM by revisiting the example MoE in Figure 1. We now apply the EAM to track each request's activation pattern. The EAMs for P1, P2 and P3 are shown in Figure 5. As we can see, these EAMs can accurately reflect the skewed reuse patterns at the request level.

Figure 5. EAMC replacement example.

4.2 Expert Activation Matrix Collection

We have designed a novel data structure termed the Expert Activation Matrix Collection (EAMC), which acts as a trace keeping historical request-level EAMs online. Once the system has processed an incoming request, it compares the request-level EAM with those recently stored in the EAMC. A matching prior EAM can then facilitate more effective prefetching and caching decisions by the serving system. To determine if two EAMs match, we use the following method: each EAM is flattened into a vector, and the cosine distance between these vectors is calculated. Within the EAMC, the most closely matching prior EAM to a current EAM is the one with the smallest cosine distance.

We design the EAMC to have a fixed capacity, thereby limiting both the memory costs and the time required to find a matching EAM. When the EAMC reaches its capacity, an entry within the collection must be replaced. Our replacement strategy is guided by two main objectives: first, to record the most recent EAM, thereby quickly adapting to changes in workload; second, to maintain diversity within the recorded EAMs. Consequently, we opt to replace the EAM that is most similar to the incoming one. To implement this, we compare the new EAM against all existing ones in the EAMC, replacing the one that shows the shortest cosine distance. We illustrate the EAMC replacement process in Figure 5. Here, consider that the EAMC capacity is 3. Upon a new prompt P4 finishing its trace, we compute the cosine distance between EAM4 and all EAMs in the EAMC. The distances show that EAM4 is most similar to EAM3, and we thus evict EAM3 and accommodate EAM4.
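The matching and replacement policy described above can be sketched as follows. This is our illustration of the stated policy (fixed capacity, cosine-distance matching, evict the most similar entry), not MoE-Infinity's actual implementation; it assumes Python 3.10+ for the type hints.

```python
# Sketch of the EAMC: fixed capacity, cosine-distance matching, and a
# replacement policy that evicts the stored EAM most similar to the newcomer.
# Illustrative only; not MoE-Infinity's actual implementation.
import numpy as np


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    a, b = a.ravel().astype(float), b.ravel().astype(float)   # flatten each EAM
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 1.0 - float(a @ b) / denom if denom > 0 else 1.0


class EAMC:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.eams: list[np.ndarray] = []

    def match(self, eam: np.ndarray) -> np.ndarray | None:
        # Return the stored EAM with the smallest cosine distance to `eam`.
        if not self.eams:
            return None
        return min(self.eams, key=lambda past: cosine_distance(past, eam))

    def add(self, eam: np.ndarray) -> None:
        # When full, replace the stored EAM most similar to the new one,
        # keeping the collection both fresh and diverse.
        if len(self.eams) < self.capacity:
            self.eams.append(eam)
            return
        victim = min(range(len(self.eams)),
                     key=lambda i: cosine_distance(self.eams[i], eam))
        self.eams[victim] = eam
```

Applied to the Figure 5 example with a capacity of 3, adding EAM4 evicts EAM3 because EAM3 has the smallest cosine distance to EAM4.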
The capacity of an EAMC must be appropriately small, as a large capacity can impede its practicality. MoE models benefit from a modest EAMC capacity for two main reasons: (i) After pre-training, MoE routers are optimized to create specialized expert groups for token processing, limiting the number of groups to ensure efficient token dispatching and high accuracy, a characteristic underscored by leading research [18, 32]. (ii) Our evaluations in Section 7.5 indicate that a modest EAMC capacity, from hundreds to thousands, suffices for various LLM tasks and adapts well to task shifts, with the added advantage of negligible matching costs compared to model decoding latency.
Potentially, we could enhance our EAMC design by employing clustering algorithms to identify a representative subset of EAMs for inclusion in the EAMC. The high computational complexity of the clustering algorithms makes this enhancement difficult to deploy. Consider the case of serving Arctic-128x4B for the FLAN dataset (which includes 66 LLM tasks). The clustering algorithm needs to handle over 1 million EAMs, each forming a 4480-dimensional vector (the flattened EAM). To our knowledge, no existing clustering libraries (e.g., FAISS [6]) can efficiently handle this workload. Hence, we adhered to the above simple but effective design for the EAMC and left its enhancement with clustering algorithms for future work.

5 Activation-Aware Prefetching & Caching

In this section, we describe how MoE-Infinity leverages the expert activation trace (stored in the EAMC) to achieve effective prefetching and caching in a multi-GPU server.

5.1 Expert Prefetching Strategy

We aim to design an expert prefetching strategy that meets two key objectives: (i) The prefetching operation for an expert should commence early enough that by the time the GPU requires this expert, it is already fully available in the GPU prefetching buffer, thereby minimizing the duration the GPU is blocked. (ii) The prefetching operations must also minimize the consumption of valuable PCIe bandwidth. This means that prefetching should be limited to experts that will actually be used, avoiding unnecessary prefetching of unused experts as much as possible.

Intuitions for enhancing prefetching. To optimize prefetching bandwidth, our intuition is to use the iteration-level EAM. By matching the iteration-level EAM with historical ones in the EAMC, we identify similar past EAMs that help predict which experts might activate in subsequent layers. This allows us to prefetch only those experts, thereby saving bandwidth.

Further, to enhance the timeliness of prefetching, our intuitions include: (i) We extend prefetching to multiple layers, termed multi-layer prefetching, enabling timely readiness of experts that are likely to activate even several layers ahead. This approach is particularly beneficial for emerging MoE models with different top-K activation. (ii) We prioritize experts based on their proximity to the current layer, giving higher prefetching priority to nearer experts due to their sooner usage and the greater impact of their absence on GPU utilization. (iii) Experts that fail to be prefetched are assigned the highest prefetching priority to jump over prefetched experts in the PCIe queue, further minimizing GPU blocking time.

Figure 6. Prefetching examples.

Prefetching examples. We demonstrate how integrating our insights can enhance prefetching effectiveness over existing strategies, as depicted in Figure 6. Consider a scenario where an MoE model completes the first layer and proceeds to the second. Once a token is dispatched to E[2,1], the dependency-based strategy (implemented in DeepSpeed-Inference) begins prefetching E[3,1] and E[3,2] in the immediate next layer (shown in (a)). However, when the output of E[2,1] is later routed to E[3,2], the parameters for E[3,2] are not ready in the GPU's buffer, causing the GPU to be blocked. This blocking issue occurs again for E[4,2].

The model-level tracing strategy (shown in (b)), used in BrainStorm and DeepUM, could leverage model-level counts to tell that E[3,2] is used more frequently than E[3,1]; it thus prioritizes E[3,2] and lets it be ready sooner. Once it knows only E[3,2] is needed, the prefetching of the unused E[3,1] can be canceled to save bandwidth. Despite this improvement, the model-level counts often yield a uniform distribution across experts, failing to provide clear differentiation. This leads to situations where both E[4,1] and E[4,2] are prefetched without appropriate prioritization, still causing GPU blockage for E[4,2]. Note that the issue of uniform counts grows with the number of experts per layer, ultimately resulting in poor prefetching effectiveness, detailed in Section 7.2.

The strategy (shown in (c)) that combines all our above design intuitions can correctly predict that E[3,2] and E[4,2] are more likely to be activated than E[3,1] and E[4,1] through EAM-based activation prediction and multi-layer prefetching. It further knows that E[3,2] will be used sooner than E[4,2] (through prioritizing layers based on proximity). Hence, this strategy achieves the best prefetching performance in this example.
Figure 7. Example of computing prefetching priorities.

Prefetching process. We describe the process that implements our prefetching strategy. The prefetching process initializes the GPU with a fixed-size buffer where each slot corresponds to an expert. After learning the routing decisions in each MoE layer, an I/O thread computes the latest prefetching priorities for all experts in the subsequent layers. Based on the new priorities, it adjusts its current prefetching operations. Additionally, the I/O thread avoids contention on PCIe bandwidth: when multiple experts are needed for prefetching, they flow through the PCIe connection sequentially.

We define the prefetching priority computation and illustrate its process in Figure 7, which revisits the MoE model from Figure 6. After R2 finishes dispatching the token to E[2,1], we need to adjust the order of prefetching. For this, MoE-Infinity utilizes the iteration-level EAM that traces the numbers of tokens passing through different experts in the current iteration. This iteration-level EAM is matched with prior EAMs in the EAMC (step 1 in Figure 7). Several matched EAMs might be returned. In such a case, we aggregate them and compute an activation probability for each expert that may possibly activate (step 2). In this aggregation step, formally, the cells of the matched EAMs are summed up and normalized on each row. To ensure that future experts in proximity to the current layer are prioritized, the layer proximity step (step 3) adjusts the value in each cell through the formula (1 − (i − l)/L), where l is the current layer ID and i is the future layer ID. This design fixes the issue in Figure 6 where E[3,2] needs to be prefetched before E[4,2] even though it shows a lower activation probability based purely on EAM analysis.
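The three steps can be condensed into a short function; the sketch below paraphrases the description above (match, aggregate and row-normalize, then scale by layer proximity) with illustrative names, and takes the matched EAMs as already retrieved from the EAMC.

```python
# Sketch of the prefetching-priority computation described above:
# (1) the iteration-level EAM has been matched against the EAMC, (2) the
# matched EAMs are aggregated and each row normalized into activation
# probabilities, (3) probabilities are scaled by layer proximity
# (1 - (i - l)/L). Illustrative only; names are ours.
import numpy as np


def prefetch_priorities(matched_eams: list[np.ndarray],
                        current_layer: int) -> np.ndarray:
    summed = np.sum(matched_eams, axis=0).astype(float)       # L x E cell-wise sum
    row_totals = summed.sum(axis=1, keepdims=True)
    probs = np.divide(summed, row_totals,
                      out=np.zeros_like(summed), where=row_totals > 0)

    num_layers = probs.shape[0]
    priorities = np.zeros_like(probs)
    for i in range(current_layer + 1, num_layers):            # future layers only
        proximity = 1.0 - (i - current_layer) / num_layers
        priorities[i] = probs[i] * proximity
    return priorities                                         # higher = prefetch sooner


# Toy example: two matched EAMs, 5 layers x 2 experts, currently at layer 2.
eam_a = np.array([[1, 2], [0, 3], [0, 3], [0, 3], [1, 2]])
eam_b = np.array([[2, 1], [0, 3], [1, 2], [0, 3], [0, 3]])
prio = prefetch_priorities([eam_a, eam_b], current_layer=2)
print(prio[3], prio[4])   # layer 3 outranks layer 4 for the same expert
```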
The I/O thread also handles contention for the slots in the GPU buffer. When the buffer is full, it compares against the current top-K most important experts to prefetch (where K is the total number of slots in the buffer). For those already in the buffer, we skip their prefetching. For those not in the buffer, we start their prefetching into the slots that exhibit lower priorities.

5.2 Enhancing Expert Prefetching with Caching

We aim to enable effective caching for the experts that have been loaded into the GPUs. Thus, if they are reused in the next iteration, they can be immediately accessed within the GPUs, reducing the need for repetitive fetching. To facilitate this, we have enhanced the GPU buffer with extra caching capabilities.

Figure 8. Example of integrating caching with prefetching.

To consolidate caching and prefetching operations, the buffer implements the following mechanisms: (i) it provides slots to hold the experts being prefetched (or fetched on-demand) into the GPU; (ii) it assigns idle slots (those not currently used for fetching) to cache previously used experts; (iii) when the cache is full and an additional slot is needed for prefetching, we examine all cached experts and replace the one that shows the lowest cache priority (i.e., given by the request-level EAM). In this cache replacement process, we need to protect experts that are being prefetched from eviction. The primary reason is that prefetched experts need to be used sooner by the GPU: they are more likely to be used than the cached ones, which have already been used in this iteration and are cached to benefit future reuses. For this, experts being prefetched are given maximal cache priorities until they are used or determined not to be used.

Intuitions for combining caching with prefetching. When integrating caching with prefetching, we want to maximize the buffer hit ratio (i.e., the buffer is regarded as hit if the expert is cached locally or has been fully prefetched when activated). For this, we have two novel intuitions. First, at the end of each iteration, the iteration-level EAM is added to an accumulated request-level EAM, which tracks the frequency of expert usage since the beginning of the current request. We use the counts of this request-level EAM to reset an expert's priority in the buffer after it has been used by the GPU or determined not to be called in this iteration (i.e., a mis-prefetching case). Our second intuition addresses the initial layers of MoE models, which typically benefit less from prefetching due to the less confident prediction of the group activation pattern at the start. By assigning higher caching priorities to experts in these initial layers, we not only counteract potential prefetching failures but also exploit the layer-by-layer execution property of MoE models: the subsequent layers are executed later and are more likely to benefit from prefetching, and thus need caching less.
Examples of integrating caching with prefetching. Figure 8 demonstrates how integrating caching with prefetching enhances caching effectiveness. In the second iteration, the cache strategy influences the initial layers. Augmenting dependency-based prefetching with an LRU cache, as in DeepSpeed-Inference, prefetching E[2,1] evicts E[3,1], leading to a buffer miss when tokens route to E[2,2] (see (a) left). For model-level tracing with LRU, as in DeepUM and BrainStorm, uniform activation means E[1,1] could route to E[2,1] or E[2,2], resulting in a buffer miss (see (b) left). Our method, using a request-level EAM [[1, 2], [0, 3], [0, 3]], keeps E[2,2] from eviction, ensuring a buffer hit and better latency (see (c) left).

At layer 3, prefetching's role intensifies. Dependency-based prefetching misses E[3,2], causing a buffer miss (see (a) right). Model-level tracing identifies the layer 3 expert but risks future misses by evicting E[2,2] (see (b) right). Our strategy accurately prefetches and retains E[2,2], preventing its eviction and optimizing cache prioritization (see (c) right).

Priority update mechanism. We introduce a dynamic priority update mechanism for experts in the buffer, enhancing our caching and prefetching strategy. Upon prefetching, an expert's priority is initially set to the maximum. The expert's priority is then reset to its request-level EAM count in three scenarios: (i) when the expert finishes its inference; (ii) when an already-prefetched expert turns out to be unused after token dispatching; and (iii) when a prefetched expert is not among the top-K per GPU.
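The buffer behaviour just described can be sketched as follows: prefetched experts are pinned at maximal priority, demoted to their request-level EAM count once used or found to be mis-prefetched, and the lowest-priority unprotected expert is evicted when a slot is needed. This is our illustration, not the system's code.

```python
# Sketch of a GPU expert buffer that consolidates caching with prefetching.
# Experts being prefetched are protected at maximal priority; once used (or
# determined unused), they are demoted to their request-level EAM count;
# eviction removes the lowest-priority unprotected expert. Illustrative only.
MAX_PRIORITY = float("inf")


class ExpertBuffer:
    def __init__(self, num_slots: int):
        self.num_slots = num_slots
        self.priority = {}       # expert id -> cache priority
        self.protected = set()   # experts with an in-flight prefetch

    def begin_prefetch(self, expert: int) -> bool:
        if expert in self.priority:
            return True                                  # already resident, skip the fetch
        if len(self.priority) >= self.num_slots and not self._evict_one():
            return False                                 # every slot is protected; retry later
        self.priority[expert] = MAX_PRIORITY             # protect until used or mis-prefetched
        self.protected.add(expert)
        return True

    def settle(self, expert: int, request_eam_count: int) -> None:
        # Called when the expert finished inference or turned out to be
        # mis-prefetched: demote it to its request-level reuse count.
        if expert in self.priority:
            self.priority[expert] = request_eam_count
            self.protected.discard(expert)

    def _evict_one(self) -> bool:
        candidates = [e for e in self.priority if e not in self.protected]
        if not candidates:
            return False
        victim = min(candidates, key=lambda e: self.priority[e])
        del self.priority[victim]
        return True


# Usage: with two slots, settling expert 7 makes it evictable for expert 9.
buf = ExpertBuffer(num_slots=2)
buf.begin_prefetch(7); buf.begin_prefetch(3)
buf.settle(7, request_eam_count=5)       # expert 7 now evictable with priority 5
buf.begin_prefetch(9)                    # evicts expert 7 (lowest unprotected priority)
```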
5.3 Implementation Details

Support for multiple GPUs. We implement expert parallelism to support the use of multiple GPUs on a server. Concretely, we use a hashing function to assign the experts to different GPUs based on their IDs. All experts are kept in the host DRAM. While executing the MoE layers layer by layer, we use this hashing function to know which GPU is going to accommodate an expert needed for prefetching or execution. When the GPUs are spread across multiple NUMA nodes, we pre-partition the experts based on NUMA nodes, ensuring that these experts are only assigned to the GPUs in the designated NUMA node.
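A minimal version of the ID-based placement described here is sketched below. The helper name and the modulo-based hashing are our own assumptions; MoE-Infinity's actual hash function and NUMA partitioning may differ.

```python
# Sketch of hash-based expert placement across GPUs, restricted to the GPUs
# of one NUMA node. Hypothetical helper; MoE-Infinity's real hashing and
# NUMA partitioning may differ.
def expert_to_gpu(layer: int, expert: int, gpus_per_node: list[list[int]],
                  num_experts_per_layer: int) -> int:
    expert_id = layer * num_experts_per_layer + expert    # global expert ID
    node = expert_id % len(gpus_per_node)                 # pre-partition by NUMA node
    node_gpus = gpus_per_node[node]
    return node_gpus[expert_id % len(node_gpus)]          # hash within the node's GPUs


# Example: 2 NUMA nodes with 4 GPUs each, 128 experts per layer.
topology = [[0, 1, 2, 3], [4, 5, 6, 7]]
print(expert_to_gpu(layer=3, expert=81, gpus_per_node=topology,
                    num_experts_per_layer=128))           # -> 5
```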
For each GPU, we create an independent I/O thread to manage the prefetching and caching. This thread uses pinned memory and DMA operations to optimize data transfers between the GPU and host DRAM, and a single thread is sufficient to saturate the bandwidth provided by PCIe 4.0 (32GB/s). For higher PCIe versions, we support creating multiple such threads per GPU.

For now, most open-source MoE models can fit into the host memory (up to 1TB) of a commodity multi-GPU server. We leave multi-server support for future work.

Memory management. Given an MoE checkpoint, we keep its dense parts within the GPUs and turn on offloading for its experts. This design is sufficient since the experts' parameters comprise 90-99% of the total parameters. For initializing the kv-cache, we reserve the amount of GPU memory corresponding to the maximal output length we observed in the open LLM datasets.

Inference runtime integration. We have integrated the above prefetching and caching mechanisms into PyTorch and support numerous kernel optimizations, such as FlashAttention. Our current inference runtime supports checkpoints in PyTorch and HuggingFace formats.

Failure recovery. MoE-Infinity can checkpoint its EAMC together with the MoE checkpoints. Once recovered from a failure, it reloads the EAMC to efficiently resume its prefetching and caching performance.

6 Practical Concerns

Trace memory cost. The total number of traces needed to cover the expert activation patterns is finite and of polynomial complexity with respect to the number of experts. We formulate the EAMC construction as a sphere covering problem over the cosine distance, with each EAM as a vector in the space. The sphere covering problem gives, for a given similarity lower bound, an upper bound on the EAMC size; the higher the lower bound, the better we can use the EAMC to guide prefetching and caching. Theorems [7, 25] guarantee a lower bound of 75% trace similarity using 2LE EAMs and a lower bound of 98% trace similarity using (1/2)LE·ln(LE) EAMs. We observe that E of SOTA MoE models ranges from 8 to 128 and L ranges from 24 to 64 [9, 18], leading to at most 40K EAMs with 160MB memory.
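The EAM-count bounds quoted above can be checked numerically. The snippet below evaluates 2·L·E and (1/2)·L·E·ln(L·E) at the largest configuration mentioned (L=64, E=128), which lands near the quoted 40K figure; the 160MB total additionally depends on how each EAM entry is stored, which we do not model here.

```python
# Evaluate the EAMC size bounds quoted above: 2*L*E EAMs for a 75% trace
# similarity lower bound, and 0.5*L*E*ln(L*E) EAMs for a 98% lower bound.
import math

L, E = 64, 128                      # largest L and E cited for SOTA MoE models
bound_75 = 2 * L * E
bound_98 = 0.5 * L * E * math.log(L * E)

print(f"75% similarity bound: {bound_75} EAMs")        # 16384 EAMs
print(f"98% similarity bound: {bound_98:,.0f} EAMs")   # 36,909 EAMs, i.e. roughly 40K
```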
Trace query cost. Searching for the most similar EAM is essentially a matrix multiplication on the CPU [6]. We measured the cost to be 21us per query under 1K EAMs and 226us for 10K EAMs. The frequency of the query is at most once per MoE layer for each (batched) input. Both the memory and computation overhead are less than 1% of the model inference latency (typically >120ms per token).

7 Evaluation

7.1 Experiments Setup

Hardware. We show our experimental results on a commodity 8-GPU server which has eight NVIDIA RTX A5000 GPUs and 1TB of DRAM host memory. These GPUs are connected by pair-wise NVLink, and each GPU is connected to the host memory through a dedicated PCIe 4.0 connection (32GB/s). We also evaluated MoE-Infinity on an 8-A100 server and an 8-H100 server. Due to the page limit, we only include the 8-A5000 server results since this is the typical hardware configuration owned by many of our system's users who have limited GPU resources.

Models. We include popular open-sourced MoE models in our evaluations, including Google Switch Transformers [9] in sizes of 30-100 GB depending on the configuration, Meta NLLB-MoE [4] (220GB), Mixtral-8x7B [18] (120GB), and Snowflake-Arctic [30] (900GB). For these models, we report their results with the following configurations if not mentioned otherwise: Switch-128x0.2B (denoted as Switch), NLLB-128x0.4B (denoted as NLLB), Arctic-128x4B (denoted as Arctic), and Mixtral-8x7B (denoted as Mixtral).
We initially considered including more MoE models in our evaluation. Databricks-DBRX and XAI Grok-1 show model accuracy lower than Snowflake-Arctic, and they share similar model architectures; we thus include the result for Arctic only. At the time of submitting this paper, DeepSeek-MoE and Qwen-MoE had just been released and none of our baseline systems could support them or achieve performance competitive with MoE-Infinity, so we omit their results.

Datasets. We used a large variety of LLM tasks (290 tasks in total) contributed by three datasets to evaluate the performance and robustness of MoE-Infinity. These datasets include BIGBench [31] (166 tasks), FLAN [34] (66 tasks), and MMLU [14] (58 tasks). More specifically, these LLM tasks include reasoning, contextual question answering, free response, translation and many more.

To emulate a real-world LLM serving workload, we implement inference clients which send their requests with intervals following the distribution modelled after the Azure Trace [28]. Unless mentioned otherwise, these clients uniformly select prompts from all LLM tasks in the three datasets by default, emulating a generic chatbot service. This workload approach follows recent leading LLM serving studies [13, 22].
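For readers who want to reproduce a comparable load pattern, a simplified client is sketched below. It draws prompts uniformly from the pooled tasks, as described above, but substitutes exponential inter-arrival gaps for the Azure-trace-derived intervals [28], which it does not reproduce; the function names are ours.

```python
# Simplified stand-in for the inference clients described above: prompts are
# drawn uniformly from the pooled LLM tasks and sent with randomized gaps.
# The real clients replay intervals modelled after the Azure trace [28]; the
# exponential gaps here are only a placeholder, and all names are ours.
import random
import time


def run_client(prompts, send_request, duration_s: float = 60.0, rps: float = 0.5):
    deadline = time.time() + duration_s
    while time.time() < deadline:
        prompt = random.choice(prompts)          # uniform pick across all tasks
        send_request(prompt)
        time.sleep(random.expovariate(rps))      # mean inter-arrival time = 1/rps


# Example with a dummy request sink:
tasks = ["translate: ...", "summarize: ...", "answer: ..."]
run_client(tasks, send_request=lambda prompt: None, duration_s=1.0, rps=2.0)
```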
Baselines. We evaluate MoE-Infinity against many SOTA baseline systems: (i) DeepSpeed-Inference, configured for optimized LLM inference (FastGen [15]). DeepSpeed-Inference is the only mainstream LLM serving library that not only offers leading performance (comparable to vLLM and TensorRT-LLM) but also supports efficient offloading when GPU resources are limited. (ii) Llama.cpp [24], a high-performance inference engine optimized for environments with restricted GPU availability. By default, Llama.cpp stores all model parameters in CPUs and offloads computations to GPUs using highly optimized memory copy and caching kernels. (iii) Mixtral-Offloading [8], specialized in offloading-efficient MoE model inference, which implements optimized parameter prefetching and caching strategies tailored for these models. (iv) BrainStorm [5], a leading dynamic neural network inference engine that implements model-level tracing to optimize operator scheduling, caching and prefetching. BrainStorm is not open-sourced and its design does not natively support MoE-based LLM tasks; we thus extended and implemented BrainStorm's model-level analysis in MoE-Infinity.

We initially considered including vLLM [22] and TensorRT-LLM [23]. However, these systems target GPU-sufficient environments and do not support parameter offloading, and we thus exclude them. DeepUM [19] could be another baseline since it achieves model-level analysis at the memory page level to facilitate prefetching. It is, however, not open-sourced. Also, its design as an operating system kernel extension makes it challenging to fully re-implement. Further, it shows suboptimal model-level analysis compared to the more recent BrainStorm. Considering all these, we also excluded DeepUM.

We also experimented with HuggingFace Text Generation Inference (TGI), but its results are significantly slower than the baseline systems above, and we omit its results.

7.2 MoE-Infinity Prefetching Strategy

We first assess the prefetching strategy in MoE-Infinity. We consider the following baseline strategies: (i) the dependency-based prefetching strategy, used in DeepSpeed-Inference, which constructs a computational graph for an MoE model. Since all experts in the next layer exhibit a computational dependency on the currently executed expert, all next-layer experts are potentially activated and thus prefetched; (ii) the model-tracing-based prefetching strategy, used in BrainStorm, which counts operator usages across the entire lifetime of the serving system and prefetches the most used operators in the next MoE layer; and (iii) the on-demand prefetching strategy, used in Mixtral-Offloading and Llama.cpp, which initiates prefetching of experts only after the router in the current layer has dispatched its tokens, achieving optimal bandwidth usage but delaying the prefetching onset.

To assess prefetching performance independently, we disable the caching of the experts in all systems (i.e., experts are released from the buffer immediately after being used). Here we measure two important metrics that reflect prefetching performance: GPU blocking time and prefetching bandwidth usage.

Table 2. Normalized GPU blocking time.

Model Dependency Model-tracing On-demand Ours
Switch 230% 51% 100% 12%
NLLB 147% 71% 100% 31%
Arctic 195% 63% 100% 43%
Mixtral 147% 91% 100% 37%

Table 3. Normalized bandwidth usage.

Model Dependency Model-tracing On-demand Ours
Switch 239% 204% 100% 168%
NLLB 148% 127% 100% 126%
Arctic 183% 186% 100% 187%
Mixtral 143% 101% 100% 143%

GPU blocking time. An effective prefetching strategy should begin prefetching early to ensure experts are available when needed, minimizing GPU blocking time. The evaluation of prefetching strategies is presented in Table 2. On-demand fetching latencies for experts are 1ms, 2.5ms, 7ms, and 10ms for Switch to Mixtral, respectively, mirroring single-expert inference latencies. Dependency-based prefetching performs worse than on-demand due to I/O contention from excessive traffic. MoE-Infinity achieves a latency reduction of 20x on Switch and 4x on the other models through sequential and selective prefetching. Model-tracing-based prefetching is effective for skewed expert access patterns but less so for the uniform patterns in Mixtral. MoE-Infinity, utilizing the EAM for prediction, achieves a 1.5-2.5x latency reduction by identifying per-request patterns.
Bandwidth usage. A good prefetching strategy should also minimize its bandwidth usage, i.e., minimize the number of prefetched experts that are not actually used by the GPUs. We report bandwidth usage in Table 3. On-demand fetching represents the minimal bandwidth usage as it is exact. The dependency and model-tracing strategies can use up to 2x the bandwidth in Switch and Arctic, as they perform incorrect prefetching and later fall back to on-demand fetching. Both the model-tracing strategy and our strategy can cancel prefetches to reduce traffic, which is prominent in the case of Mixtral. MoE-Infinity uses slightly more traffic; however, 56% of the excess can be overlapped with inference, resulting in less overhead than baselines which fall back to on-demand fetching. Among all traffic, 68% for Arctic and 81% for NLLB is hit in the GPU.

Table 4. Performance breakdown of prefetching designs.

Designs Bandwidth usage Blocking time
Vanilla prefetching 100% 100%
+ Request-level tracing 56% 55%
+ Multi-layer 73% 39%
+ Layer-proximity 68% 32%

Performance breakdown. We also want to understand how the multiple new design intuitions (detailed in Section 5.1) independently contribute to the improved GPU blocking time and bandwidth usage. For this, we provide a performance breakdown shown in Table 4. Vanilla provides fixed-size dependency prefetching. Request-level tracing tends to be the biggest factor in improving bandwidth usage and GPU blocking time, improving them by 43% and 45%, respectively. Turning on multi-layer prefetching increases prefetching traffic, as expected; however, it greatly reduces GPU blocking time, by 16%. Finally, the layer-proximity design saves bandwidth usage by 5%, as we can more actively cancel unnecessary prefetching of experts far from the current layer. By benefiting closer experts, it reduces blocking time by 7%.

7.3 Integrated Prefetching and Caching Strategy

Next, we enhance the prefetching strategies by incorporating caching. In addition to MoE-Infinity, we consider additional baseline strategies: (i) model-tracing-based caching, used in BrainStorm, which prioritizes caching based on model-level expert counts, paired with model-tracing-based prefetching; (ii) Least-Recently-Used (LRU), adopted by DeepSpeed-Inference, Mixtral-Offloading, and Llama.cpp, which follows standard caching without tracing data, combined with on-demand or dependency-based prefetching; and (iii) an ideal caching strategy, a theoretical model assuming perfect GPU buffer access prediction, serving as a benchmark for optimal caching efficiency.

Table 5. Performance of integrated caching and prefetching.

Model #Buffer Slots LRU Model-tracing Ours Ideal
Switch 953 31% 39% 52% 63%
NLLB 52 18% 25% 46% 51%
Arctic 76 9% 12% 28% 32%
Mixtral 39 2% 5% 17% 21%

Buffer hit rate. We evaluate the integrated prefetching and caching strategies using a key metric: buffer hit rate. This metric reflects how well the caching and prefetching strategies collaborate to allow a GPU to directly access experts in the buffer. Results are reported in Table 5. Mixtral and Arctic are harder cases for both prefetching and caching: the numbers of slots are smaller, while reuse patterns are more skewed than NLLB. When slots are sufficient, as in Switch, caching contributes significantly to the buffer hit ratio, comprising 37%. For Mixtral, prefetching and caching contribute to the buffer hit cases with 8% and 9%, respectively. MoE-Infinity achieves a 2.5X improvement with Mixtral, higher than the 1.2x in Switch. This is because Mixtral exhibits more iterations per request, allowing MoE-Infinity to showcase its caching performance. MoE-Infinity is off from the ideal by 4-9%. This is because a representative request-level EAM needs a few iterations to build up. For the first few iterations, MoE-Infinity relies on prefetching to benefit expert transfers, whereas caching needs time to warm up its performance. We leave how to enable better caching for the initial iterations to future work.

7.4 Entire MoE-Infinity in Action

We now assess the performance and benefits of putting the entire MoE-Infinity in action for serving MoE models. In addition to the baseline systems mentioned above, we consider one more baseline, named GPU-Rich, which represents users with affluent GPU resources who can deploy the entire MoE models to a larger number of GPUs. This GPU-Rich baseline shows the best latency and throughput performance and informs us of the performance cost of MoE-Infinity.

End-to-end performance. We report the end-to-end performance of MoE-Infinity and the baseline systems. Here latency is reported as the time-per-output-token (decoding latency). We create a varying inference workload for different MoE models, and the intensity of the workload is controlled by the Requests-Per-Second (RPS) of the clients. For both MoE-Infinity and the baseline systems, we implement auto-batching of requests based on the maximal latency and maximal batch size. These configurations are consistent with those reported in the vLLM and Orca papers [37]. In this set of experiments, we use a single GPU on our server to create the most challenging limited-resource scenario.
Figure 9. End-to-end performance (latency vs. request rate): (a) Switch-128x0.2B, (b) NLLB-128x0.4B, (c) Mixtral-8x7B, (d) Arctic-128x4B. For Switch and Mixtral, the sub-figures detail the sub-second latency.

this set of experiments, we use a single GPU on our server to create the most challenging limited-resource scenario.

Figure 9 reports the end-to-end performance. For Mixtral-8x7B (the worst case for us due to its small number of experts per layer and a relatively high activation ratio), MoE-Infinity can still achieve latency down to 336ms and sustain it until the RPS exceeds 0.2 (see Figure 9(c)). At the same time, the GPU-Rich user consumes 8 GPUs, achieving a latency of 175ms, and can sustain it until the RPS exceeds 0.4. This indicates that MoE-Infinity can save GPU resources by 4X while only halving the sustainable RPS (matching the GPU-Rich's RPS would take two single-GPU MoE-Infinity replicas, i.e., 2 GPUs instead of 8), which is already useful for many GPU-limited users. Among the other offloading-supported baselines, MoE-Infinity's latency is close to that of BrainStorm and Llama.cpp, but it achieves over 7-8X higher RPS. DeepSpeed-Inference and Mixtral-Offloading show even worse latency, mainly due to their poor prefetching and caching performance.

For MoE models with more experts per layer and low selective activation ratios, the performance gains of MoE-Infinity become more significant. For Switch and NLLB (see Figure 9(a) and (b)), MoE-Infinity achieves 223ms and 313ms latency, respectively, both comparable to the GPU-Rich. MoE-Infinity can sustain this latency up to around 80% of the RPS achieved by the GPU-Rich. This means a significant GPU saving: to achieve similar latency and RPS, MoE-Infinity requires a single GPU while the GPU-Rich requires 8 GPUs for NLLB and 4 GPUs for Switch. Other offloading-supported systems, however, cannot provide such a promise. Benefiting from correlation analysis and accurate on-demand prefetching, BrainStorm and Mixtral-Offloading can achieve low latency, but their performance quickly deteriorates with increasing RPS, falling almost 20X behind MoE-Infinity. DeepSpeed-Inference suffers from inaccurate prefetching and caching; as a result, it shows the worst latency and RPS performance. Llama.cpp requires custom kernels for different MoE models and cannot run NLLB, Switch, and Arctic.

For the biggest MoE model, Arctic, which has 900GB of parameters (see Figure 9(d)), the GPU-Rich cannot deploy it across all the available GPUs, facing an out-of-memory issue even if budget is not a concern. In such a case, MoE-Infinity becomes the only serving system that can offer competitive inference performance with a single GPU, vastly surpassing other baseline systems. Even the GPU-Rich starts to consider MoE-Infinity (through MoE-Infinity's multi-GPU support) to address its out-of-memory issue.

Figure 10. Tail latency: CDF of latency (s) for (a) Switch-128x0.2B and (b) NLLB-128x0.4B, comparing MoE-Infinity, Mixtral-Offloading, BrainStorm*, and GPU-Rich.

Tail latency. We also examine whether offloading affects tail latency, a metric that matters in serving scenarios. For this, we report the latency CDF for Switch (RPS=1.5) and NLLB (RPS=0.6) with MoE-Infinity in Figure 10. At these two RPS settings, MoE-Infinity has saturated the PCIe bandwidth and the GPU. Even so, the gap in tail latency between MoE-Infinity and the GPU-Rich remains small, similar to the average latency reported above. This is attributed to MoE-Infinity's design in effectively handling the contention for buffer slots in the GPUs as well as the contention on the PCIe connection.
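As a concrete reference for how Figure 10 is read, the sketch below computes an empirical latency CDF and the usual tail percentiles from per-request latency logs. It is an illustration only; the logging harness and the example numbers are our own assumptions, not part of MoE-Infinity.

import numpy as np

def latency_cdf(latencies_s):
    # Empirical CDF: sorted latencies and the fraction of requests at or below each value.
    xs = np.sort(np.asarray(latencies_s, dtype=float))
    ys = np.arange(1, len(xs) + 1) / len(xs)
    return xs, ys

def tail_report(latencies_s):
    # Average and tail latency (p50/p95/p99), reported in milliseconds.
    ms = np.asarray(latencies_s, dtype=float) * 1000.0
    return {"mean_ms": float(ms.mean()),
            "p50_ms": float(np.percentile(ms, 50)),
            "p95_ms": float(np.percentile(ms, 95)),
            "p99_ms": float(np.percentile(ms, 99))}

# Example: per-request end-to-end latencies (seconds) logged at a fixed request rate.
print(tail_report([0.31, 0.33, 0.35, 0.34, 0.52, 0.36, 0.33, 0.61]))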
Figure 11. GPU memory vs. host memory: latency (s) vs. number of GPUs (1-8) for (a) Switch-128x0.2B and (b) Mixtral-8x7B, comparing DeepSpeed-Inference and MoE-Infinity against the no-offload latency.

GPU memory vs. host memory. Users of MoE-Infinity often wonder what the best ratio of GPU memory to host memory is. We answer this question by scaling the deployment of MoE-Infinity from 1 GPU to 8 GPUs, increasing the provision of GPU memory and compute resources while keeping the size of host memory (1TB) constant. Figure 11 reports the results. Here we only include DeepSpeed-Inference since it is the only baseline library that can effectively scale its performance in a multi-GPU deployment.
Benefiting from the relatively small size of the model, MoE-Infinity already achieves the best latency with a single GPU, whereas DeepSpeed-Inference needs 4 GPUs to achieve its best possible latency, 4X more than MoE-Infinity.

For the bigger NLLB-MoE, MoE-Infinity requires more PCIe bandwidth and GPU memory. Even in this case, it can still achieve below-1-second latency with a single GPU, although its best latency requires 4 GPUs. At the same time, DeepSpeed-Inference requires all 8 GPUs to bring its latency below 1 second.

Cost saving through sharing servers. We demonstrate how MoE-Infinity enables cost saving when serving MoE models for the GPU poor. By considering multiple MoE models with architectures similar to NLLB, MoE-Infinity allows these models to share a single 8-GPU server. This is achieved by deploying multiple MoE-Infinity instances on the server and evenly partitioning the GPUs and host memory among them. Previously, users had to invest in expensive, dedicated 8-GPU servers; without such resources, they faced memory shortages or were forced to resort to slower offloading techniques that compromised latency and throughput. Table 6 reports the results. Instead of using 4 AWS A10G instances with 8x24GB of GPU memory each, we deploy each model with 2 GPUs and enable offloading with MoE-Infinity. Such a deployment saves 4X in cost while incurring only a 130ms latency and 4 RPS degradation. For the Switch model, we observed similar degrees of cost saving through sharing servers; we omit these results due to the page limit.
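A minimal sketch of such a shared deployment is given below: each serving process is pinned to a disjoint GPU pair via CUDA_VISIBLE_DEVICES and given an even share of the 1TB of host memory. The entry point serve_moe.py, its flags, and the model names are hypothetical placeholders for illustration, not MoE-Infinity's actual CLI.

import os
import subprocess

# Four NLLB-like models share one 8-GPU, 1TB-host-memory server.
MODELS = ["model-a", "model-b", "model-c", "model-d"]   # placeholder model names
HOST_MEM_GB_PER_INSTANCE = 1024 // len(MODELS)          # 1TB split evenly across instances

procs = []
for i, model in enumerate(MODELS):
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": f"{2 * i},{2 * i + 1}"}  # disjoint GPU pair
    procs.append(subprocess.Popen(
        ["python", "serve_moe.py",                       # hypothetical serving entry point
         "--model", model,
         "--offload-host-mem-gb", str(HOST_MEM_GB_PER_INSTANCE)],  # hypothetical flag
        env=env))

for p in procs:
    p.wait()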
Table 6. Effective sharing of servers to save GPU cost

System          Config.             RPS     Latency (s)   Cost ($/hr)
MoE-Infinity    4 models / server   9       0.379         16
GPU-Rich        1 model / server    13.2    0.249         64
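The table reduces to a per-request cost comparison with a few lines of arithmetic; the inputs below are taken directly from Table 6, while the derived dollars-per-1K-requests figures are our own calculation.

def cost_per_1k_requests(cost_per_hour, rps):
    # Dollar cost to serve 1,000 requests at a sustained request rate.
    requests_per_hour = rps * 3600
    return cost_per_hour / requests_per_hour * 1000

# Values taken from Table 6.
moe_infinity = cost_per_1k_requests(cost_per_hour=16, rps=9)     # ~$0.49 per 1K requests
gpu_rich     = cost_per_1k_requests(cost_per_hour=64, rps=13.2)  # ~$1.35 per 1K requests
print(f"MoE-Infinity: ${moe_infinity:.2f}/1K req, GPU-Rich: ${gpu_rich:.2f}/1K req "
      f"({gpu_rich / moe_infinity:.1f}x cheaper with server sharing)")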
Prefill performance. So far, we have focused on decoding performance, consistent with what is reported by other papers on LLM serving. For offloading to be applicable in serving scenarios, users might also question whether it significantly slows down prefilling. To address this, we report prefill performance in terms of token throughput, the typical metric for this throughput-oriented phase. Figure 12 presents the results. As we can see, among offloading-supported systems, MoE-Infinity achieves the best performance, benefiting from its better prefetching and caching designs. For Mixtral, the worst case for MoE-Infinity, the prefill throughput is 1187 tokens per second at a batch size of 64. For NLLB, the prefill throughput reaches 2333 tokens per second at the same batch size. The results show that even with high activation ratios, MoE-Infinity can still effectively overlap prefetching and inference, and its caching strategy still shows benefits over existing GPU caching strategies (as in Llama.cpp and Mixtral-Offloading).

Figure 12. Prefill performance: throughput (tokens/s) vs. batch size (1-64) for (a) NLLB-128x0.4B and (b) Mixtral-8x7B, comparing DeepSpeed-Inference, Ollama, MoE-Infinity, Mixtral-Offloading, BrainStorm*, and GPU-Rich.
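For reference, the sketch below shows the standard way such prefill throughput is measured: time one forward pass over the full prompt batch and divide the number of prompt tokens by the elapsed time. The model and input handles are generic placeholders rather than MoE-Infinity APIs.

import time
import torch

def prefill_throughput(model, input_ids):
    # Tokens processed per second during a single prefill (prompt) pass.
    # input_ids: LongTensor of shape (batch_size, prompt_len), already on the GPU.
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model(input_ids)                 # prefill: one forward pass over the full prompt
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    total_tokens = input_ids.numel()     # batch_size * prompt_len
    return total_tokens / elapsed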
7.5 MoE-Infinity Activation Tracer

Finally, we evaluate the optimal parameters and the robustness of the MoE-Infinity activation tracer.

EAMC Capacity. Users of MoE-Infinity may wonder how to determine the best EAMC capacity. To explore this, we adjusted the EAMC capacity while serving various MoE models. Figure 13 presents the results. We observed that increasing the capacity from 1 to 120 allows all MoE models to achieve their lowest average latency. Our findings highlight two key points: (i) a suitably small EAMC capacity (3% of the total number of requests), even amidst a challenging mixed LLM inference workload involving 290 tasks from three datasets, is adequate for capturing most MoE activation patterns, thus efficiently supporting MoE-Infinity's prefetching and caching mechanisms, and (ii) the effectiveness of the EAMC capacity consistently manifests across different MoE models, indicating that the EAMC design generalizes.

Figure 13. EAMC capacity: latency (s) vs. EAMC capacity (up to 120) for NLLB, Switch, and Mixtral.
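To illustrate why a small capacity can suffice, the sketch below models the EAMC as a capacity-bounded collection of per-request expert activation matrices with a simple similarity match and least-recently-matched replacement. This is a simplified assumption for illustration only; it is not MoE-Infinity's actual data structure, similarity measure, or replacement policy.

import numpy as np

class EAMC:
    """Illustrative capacity-bounded Expert Activation Matrix Collection.

    Stores one activation matrix (layers x experts) per traced request and, given
    a new request's partial trace, returns the closest stored matrix to guide
    prefetching/caching. Capacity bounds both memory and lookup cost.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.matrices = []     # stored activation matrices
        self.last_hit = []     # recency of each entry being matched

    def add(self, matrix, step):
        if len(self.matrices) < self.capacity:
            self.matrices.append(matrix)
            self.last_hit.append(step)
        else:
            victim = int(np.argmin(self.last_hit))   # evict least-recently matched entry
            self.matrices[victim] = matrix
            self.last_hit[victim] = step

    def closest(self, partial, step):
        # Return the stored matrix most similar (cosine) to a partial activation trace.
        if not self.matrices:
            return None
        scores = [float((m * partial).sum() /
                        (np.linalg.norm(m) * np.linalg.norm(partial) + 1e-9))
                  for m in self.matrices]
        best = int(np.argmax(scores))
        self.last_hit[best] = step
        return self.matrices[best]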
Table 7. Handling workload changes. Numbers in each cell give the minimum, mean, and maximum numbers of requests required to recover low latency after a workload change.

Workload Setup      NLLB      Mixtral   Arctic
MMLU tasks          0-14-43   0-9-49    28-43-45
BIGBench tasks      1-15-49   0-11-49   0-29-42
MMLU → BIGBench     0-6-17    0-27-42   2-28-41
BIGBench → MMLU     3-11-24   0-35-46   5-27-45
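The per-cell numbers in Table 7 can be derived from a per-request latency trace recorded around each workload shift. The helper below is an illustrative sketch of that measurement; the recovery criterion (latency back within 1.1x of the pre-shift baseline) is our own assumption, not the paper's exact definition.

def requests_to_recover(latencies_ms, shift_idx, baseline_ms, tolerance=1.1):
    # Number of post-shift requests until latency returns to within
    # `tolerance` x the pre-shift baseline (illustrative definition).
    for i, lat in enumerate(latencies_ms[shift_idx:]):
        if lat <= baseline_ms * tolerance:
            return i
    return len(latencies_ms) - shift_idx   # never recovered within this trace

def recovery_stats(per_trial_counts):
    # Minimum, mean, and maximum recovery counts across repeated trials,
    # matching the cell format of Table 7.
    return (min(per_trial_counts),
            sum(per_trial_counts) / len(per_trial_counts),
            max(per_trial_counts))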
Robustness with workload changes. To address concerns about handling workload changes, we tested MoE-Infinity's tracer under task shifts and measured the minimum, average, and maximum number of requests needed to restore low latency. The responsiveness to workload changes is shown in Table 7. In the first experimental group, we randomly shifted between LLM tasks within the same dataset. Experiments showed that within the same dataset, models returned to optimal latency after around 50 requests. Since each task has 1,000 input sequences on average, recovering from a task shift needs at most 5% of the requests. When switching between datasets (e.g., MMLU to BIGBench), models adapted faster, averaging 30 requests for latency recovery. Since each dataset samples 50K inputs, recovering from a dataset shift needs less than 0.1% of the requests on average. This quicker adaptation is due to the reuse of activation patterns across similar tasks shared by these datasets, as highlighted in our trace study.

8 Conclusions
The design of MoE-Infinity challenges the conventional view that offloading is too costly for latency-sensitive environments such as those serving MoE-based LLMs. Our comprehensive studies, including extensive MoE trace analysis, design discussions, and experiments, reveal significant opportunities to optimize offloading efficiency. MoE-Infinity can reduce GPU costs by up to 4X while achieving nearly 93% of the latency and throughput performance of systems using expensive GPUs alone. This makes MoE-Infinity a viable, high-performance option for many organizations with limited GPU resources. We plan to develop MoE-Infinity as an open-source project and anticipate it will benefit the LLM research community.
research community. inabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, and Yuxiong
He. DeepSpeed-FastGen: High-throughput text generation for LLMs
References via MII and DeepSpeed-Inference. 2024.
[16] Chien-Chin Huang, Gu Jin, and Jinyang Li. SwapAdvisor: Pushing
[1] Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, deep learning beyond the GPU memory limit via smart swapping. In
Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia ASPLOS, pages 1341–1355. ACM, 2020.
Zhang, Jeff Rasley, and Yuxiong He. DeepSpeed-Inference: Enabling [17] HuggingFace. Text generation inference. https://github.com/hugging
efficient inference of transformer models at unprecedented scale. In face/text-generation-inference.
SC, pages 46:1–46:15. IEEE, 2022. [18] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bam-
[2] Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle ford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand,
Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ra- Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard
makanth Pasunuru, Giridharan Anantharaman, Xian Li, Shuohui Chen,
Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut
Halil Akin, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mix-
Koura, Brian O’Horo, Jeffrey Wang, Luke Zettlemoyer, Mona T. Diab, tral of experts. CoRR, abs/2401.04088, 2024.
Zornitsa Kozareva, and Veselin Stoyanov. Efficient large scale language [19] Jaehoon Jung, Jinpyo Kim, and Jaejin Lee. DeepUM: Tensor migration
modeling with mixtures of experts. In EMNLP, pages 11699–11732. and prefetching in unified memory. In ASPLOS (2), pages 207–221.
Association for Computational Linguistics, 2022. ACM, 2023.
[3] Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, [20] Young Jin Kim, Raffy Fahim, and Hany Hassan Awadalla. Mixture of
Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, quantized experts (MoQE): Complementary effect of low-bit quantiza-
and Thomas Wolf. Open llm leaderboard. https://huggingface.co/spa tion and robustness. CoRR, abs/2310.02410, 2023.
ces/HuggingFaceH4/open_llm_leaderboard, 2023. [21] Robert Knight. Does ChatGPT really cost $3m a day to run? https:
[4] Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Ken- //metanews.com/does-chatgpt-really-cost-3m-a-day-to-run/, 2022
neth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel (accessed 2024-04-22).
Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, [22] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin
Al Youngblood, Bapi Akula, Loïc Barrault, Gabriel Mejia Gonzalez, Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica.
Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Efficient memory management for large language model serving with
Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, pagedattention. In SOSP, pages 611–626. ACM, 2023.
Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia [23] NVIDIA. TensorRT-LLM. https://github.com/NVIDIA/TensorRT-LLM.
Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre [24] Ollama. Ollama. https://github.com/ollama/ollama.
Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, [25] Robert Alexander Rankin. On the closest packing of spheres in n
and Jeff Wang. No language left behind: Scaling human-centered dimensions. Annals of Mathematics, pages 1062–1081, 1947.
machine translation. CoRR, abs/2207.04672, 2022. [26] Jie Ren, Jiaolin Luo, Kai Wu, Minjia Zhang, Hyeran Jeon, and Dong Li.
[5] Weihao Cui, Zhenhua Han, Lingji Ouyang, Yichuan Wang, Ningxin Sentinel: Efficient tensor migration and allocation on heterogeneous
Zheng, Lingxiao Ma, Yuqing Yang, Fan Yang, Jilong Xue, Lili Qiu,
13
memory systems for deep learning. In HPCA, pages 598–611. IEEE, Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Ani-
2021. mesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash
[27] Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan,
Stephen W. Keckler. vDNN: Virtualized deep neural networks for Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat,
scalable, memory-efficient neural network design. In MICRO, pages Aykut Erdem, Ayla Karakas, and et al. Beyond the imitation game:
18:1–18:13. IEEE Computer Society, 2016. Quantifying and extrapolating the capabilities of language models.
[28] Mohammad Shahrad, Rodrigo Fonseca, Iñigo Goiri, Gohar Chaudhry, CoRR, abs/2206.04615, 2022.
Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, Mark [32] Qwen Team. Qwen1.5-moe: Matching 7b model performance with
Russinovich, and Ricardo Bianchini. Serverless in the wild: Char- 1/3 activated parameters", February 2024.
acterizing and optimizing the serverless workload at a large cloud [33] The Mosaic Research Team. Introducing DBRX: A new state-of-the-art
provider. In USENIX Annual Technical Conference, pages 205–218. open LLM. https://www.databricks.com/blog/introducing-dbrx-new-
USENIX Association, 2020. state-art-open-llm.
[29] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max [34] Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei
Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned
Ce Zhang. Flexgen: High-throughput generative inference of large lan- language models are zero-shot learners. In ICLR. OpenReview.net,
guage models with a single GPU. In ICML, volume 202 of Proceedings 2022.
of Machine Learning Research, pages 31094–31116. PMLR, 2023. [35] XAI. Open release of Grok-1. https://x.ai/blog/grok-os, 2024. Accessed:
[30] Snowflake AI Research. Snowflake Arctic: The best LLM for enterprise 2024-06-04.
AI — efficiently intelligent, truly open. https://www.snowflake.com/bl [36] Guangxuan Xiao, Ji Lin, Mickaël Seznec, Hao Wu, Julien Demouth,
og/arctic-open-efficient-foundation-language-models-snowflake/. and Song Han. SmoothQuant: Accurate and efficient post-training
[31] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md quantization for large language models. In ICML, volume 202 of
Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam San- Proceedings of Machine Learning Research, pages 38087–38099. PMLR,
toro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor 2023.
Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, [37] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and
Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Par- Byung-Gon Chun. Orca: A distributed serving system for transformer-
rish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, based generative models. In OSDI, pages 521–538. USENIX Association,
Ameet Rahane, Anantharaman S. Iyer, Anders Andreassen, Andrea 2022.
Santilli, Andreas Stuhlmüller, Andrew M. Dai, Andrew La, Andrew K.
