A Survey On Mixture of Experts

WEILIN CAI, The Hong Kong University of Science and Technology (Guangzhou), China
JUYONG JIANG, The Hong Kong University of Science and Technology (Guangzhou), China
FAN WANG, The Hong Kong University of Science and Technology (Guangzhou), China
JING TANG, The Hong Kong University of Science and Technology (Guangzhou), China
SUNGHUN KIM, The Hong Kong University of Science and Technology (Guangzhou), China
JIAYI HUANG†, The Hong Kong University of Science and Technology (Guangzhou), China
Large language models (LLMs) have achieved unprecedented advancements across diverse fields, ranging
from natural language processing to computer vision and beyond. The prowess of LLMs is underpinned by
their substantial model size, extensive and diverse datasets, and the vast computational power harnessed
during training, all of which contribute to the emergent abilities of LLMs (e.g., in-context learning) that
are not present in small models. Within this context, the mixture of experts (MoE) has emerged as an
effective method for substantially scaling up model capacity with minimal computation overhead, gaining
significant attention from academia and industry. Despite its growing prevalence, there is still no systematic and
comprehensive review of the literature on MoE. This survey seeks to bridge that gap, serving as an essential
resource for researchers delving into the intricacies of MoE. We first briefly introduce the structure of the
MoE layer, followed by proposing a new taxonomy of MoE. Next, we overview the core designs for various
MoE models including both algorithmic and systemic aspects, alongside collections of available open-source
implementations, hyperparameter configurations and empirical evaluations. Furthermore, we delineate the
multifaceted applications of MoE in practice, and outline some potential directions for future research. To
facilitate ongoing updates and the sharing of cutting-edge developments in MoE research, we have established
a resource repository accessible at https://github.com/withinmiaov/A-Survey-on-Mixture-of-Experts.
CCS Concepts: • General and reference → Surveys and overviews; • Computing methodologies →
Artificial intelligence.
Additional Key Words and Phrases: Large Language Models, Mixture of Experts, Gating Functions
∗ Equal Contribution.
† Corresponding Author.
Authors’ addresses: Weilin Cai, [email protected], The Hong Kong University of Science and Technol-
ogy (Guangzhou), Guangzhou, China; Juyong Jiang, [email protected], The Hong Kong University of
Science and Technology (Guangzhou), Guangzhou, China; Fan Wang, [email protected], The Hong Kong
University of Science and Technology (Guangzhou), Guangzhou, China; Jing Tang, [email protected], The Hong
Kong University of Science and Technology (Guangzhou), Guangzhou, China; Sunghun Kim, [email protected],
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China; Jiayi Huang, [email protected],
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2024 Association for Computing Machinery.
0360-0300/2024/10-ART1 $15.00
https://doi.org/XXXXXXX.XXXXXXX
1 INTRODUCTION
In the current landscape of artificial general intelligence (AGI), the transformative impact of
transformer-based large language models (LLMs) has permeated diverse fields such as natural
language processing [1, 11, 24, 76, 116, 157, 162], computer vision [98, 128], and multimodality
[99, 181, 190, 194]. Building upon the foundational transformer architecture, LLMs demonstrate
extraordinary capabilities, which are attributed to their sheer size, the breadth of data they are
trained on, and the significant computational resources invested in their development [79, 161, 177].
Recognizing a scaling law [64, 79] that underpins their evolution, it is imperative to identify and
implement efficient methodologies for the sustainable scaling of LLMs.
The concept of mixture of experts (MoE), initially introduced in [71, 78], has undergone extensive
exploration and advancement as evidenced by subsequent studies [3, 28, 37, 46, 125, 132, 153]. The
emergence of sparsely-gated MoE [135], particularly within the integration of transformer-based
large language models [86], has brought new vitality to this three-decade-old technology. The
MoE framework is based on a simple yet powerful idea: different parts of a model, known as
experts, specialize in different tasks or aspects of the data. With this paradigm, only pertinent
experts are engaged for a given input, keeping the computational cost in check while still benefiting
from a large pool of specialized knowledge. This scalable and flexible innovation has offered an
effective approach for adhering to the scaling law, allowing for increased model capacity without
a corresponding surge in computational demands. As depicted in Figure 1, MoE has maintained
a robust trajectory of growth, particularly notable in 2024 with the advent of Mixtral-8x7B [74]
and a variety of subsequent industrial-scale LLMs such as Grok-1 [169], DBRX [34], Arctic [152],
DeepSeek-V2 [36], etc.
Despite the increasing popularity and application of MoE models in various domains, the literature
has yet to see a survey that thoroughly examines and categorizes the advancements in this area.
The most recent review of MoE we could find was presented in September 2022 [48], predating the
pivotal “ChatGPT moment”, and it therefore omits the significant advancements that have recently emerged
alongside the escalating academic and industrial interest in this domain. This gap in the literature
not only hinders the progress of MoE research but also limits the dissemination of knowledge
on this topic to a broader audience. Our survey aims to address this deficit by providing a clear
and comprehensive overview of MoE with a novel taxonomy that segments recent progress into
algorithm, system and application.
Under this taxonomy, we first delve into MoE algorithmic advancements, particularly the preva-
lent substitution of feed-forward network (FFN) layers with MoE layers in transformer-based LLMs
[36, 44, 49, 74, 86, 172, 197]. As each MoE layer integrates multiple FFNs—each designated as an
expert—and employs a gating function to activate a select subset of these experts, we explore the
design choices of gating function and expert network, alongside collections of available open-source
implementations, hyperparameter configurations and empirical evaluations. Furthermore, to under-
score the flexibility and versatility of MoE, we extend our analysis beyond the standard integration
of MoE into model backbone, and discuss an array of novel MoE-related designs, such as soft
MoE with token or expert merging [105, 118, 164, 178, 189], mixture of parameter-efficient experts
(MoPEs) [43, 53, 100, 160, 168, 178], training and inference schemes with model transition between
dense and sparse [16, 82, 145, 149, 170, 184], and various derivatives [5, 19, 23, 124, 146, 171].
With the gradual convergence of model architecture design in industrial products, system design
has emerged as a pivotal factor in enhancing the quality of LLM services. Given the close association
of MoE models with machine learning system design, we provide a comprehensive overview of
MoE system design, including computation, communication and storage enhancements tailored
to address the unique challenges posed by the sparse and dynamic nature of its computational
Fig. 1. A chronological overview of several representative Mixture of Experts (MoE) models in recent years.
The timeline is primarily structured according to the release dates of the models. MoE models located above
the arrow are open-source, while those below the arrow are proprietary and closed-source. MoE models from
various domains are marked with distinct colors: Natural Language Processing (NLP) in green , Computer
Vision in yellow , Multimodal in pink , and Recommender Systems (RecSys) in cyan .
workload. Additionally, we overview the applications of MoE across various domains, including
natural language processing, computer vision, recommender system, and multimodal contexts.
The remainder of this survey is organized as follows. Section 2 provides a foundational un-
derstanding of MoE, contrasting sparse and dense activation of experts. Section 3 introduces our
proposed taxonomy for categorizing MoE advancements. Sections 4, 5, and 6 delve into the algo-
rithmic designs, computing system support, and various applications of MoE models, following
the structure outlined in our taxonomy in Figure 3. Finally, in Section 7, we highlight the critical
challenges and opportunities for bridging the research-practicality gap, culminating in Section 8
with our conclusions.
Fig. 2. An illustration of an MoE layer in Transformer-based models. For each input 𝑋, the linear-softmax
gating selects either all experts, namely (a) Dense MoE, or the top-𝑘 experts, namely (b) Sparse MoE, to
perform conditional computation. The expert layer returns the outputs of the selected experts multiplied by
their gate values (the softmax of the gating function output).
where 𝑔(x; Θ) represents the gating value prior to the softmax operation.
To explain, TopK(·, 𝑘) function retains only the top-𝑘 entries of a vector at their original values,
while setting all other entries to −∞. Following the softmax operation, those entries assigned −∞
become approximately zero. The hyper-parameter 𝑘 is selected based on the specific application,
with common choices being 𝑘 = 1 [26, 49] or 𝑘 = 2 [44, 74, 86, 121, 154, 197]. The addition of a
noise term R noise is a prevalent strategy for the training of a sparsely-gated MoE layer, fostering
exploration among experts and enhancing the stability of MoE training [49, 135].
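To make this concrete, the following is a minimal sketch of the noisy TopK-then-softmax gating described above; the tensor shapes, the Gaussian form of the noise term R noise, and the function name are illustrative assumptions rather than any particular published implementation.

```python
import torch
import torch.nn.functional as F

def noisy_topk_gating(x, w_gate, k, noise_std=1.0, training=True):
    """Sparse gate G(x; Theta): softmax over the top-k entries of the gating logits.

    x:      [num_tokens, d_model] token representations
    w_gate: [d_model, num_experts] trainable gating parameters (Theta)
    Returns gate values of shape [num_tokens, num_experts] with at most k
    non-zero entries per token, plus the indices of the selected experts.
    """
    logits = x @ w_gate                                        # g(x; Theta), pre-softmax
    if training and noise_std > 0:
        logits = logits + noise_std * torch.randn_like(logits) # noise term R_noise
    topk_vals, topk_idx = logits.topk(k, dim=-1)
    masked = torch.full_like(logits, float("-inf"))            # non-top-k entries -> -inf
    masked.scatter_(-1, topk_idx, topk_vals)                   # keep top-k at original values
    gates = F.softmax(masked, dim=-1)                          # -inf entries become ~0
    return gates, topk_idx
```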
Although the sparse gate G(x; Θ) substantially expands the model’s parameter space without
a corresponding increase in computational cost, it can lead to a load balancing issue. Such an
issue refers to the uneven distribution of workload across experts, with some being frequently
utilized and others seldom or never engaged. To address this, each MoE layer incorporates an
auxiliary loss function that promotes an even distribution of tokens across experts within each
batch, as described in many studies [30, 44, 49, 74, 86, 94, 154]. To formulate this concept, consider
a batch of queries B = {x1, x2, . . . , x𝑇 }, comprising 𝑇 tokens, and 𝑁 experts indexed from 𝑖 = 1 to
𝑁 . Following [49, 86], the auxiliary load balancing loss for the batch is defined as
\[
\mathcal{L}_{\text{load-balancing}} = N \sum_{i=1}^{N} \mathcal{D}_i \mathcal{P}_i, \tag{2.5}
\]
\[
\mathcal{D}_i = \frac{1}{T} \sum_{x \in \mathcal{B}} \mathbb{1}\{\operatorname{argmax}\, G(x; \Theta) = i\}, \tag{2.6}
\]
\[
\mathcal{P}_i = \frac{1}{T} \sum_{x \in \mathcal{B}} G(x; \Theta)_i, \tag{2.7}
\]
where D𝑖 represents the proportion of tokens distributed to expert 𝑖, while P𝑖 denotes the proportion
of the gating probability assigned to expert 𝑖. To ensure an even distribution of the batch of tokens
across the $N$ experts, the load-balancing loss function $\mathcal{L}_{\text{load-balancing}}$ should be minimized. The
optimal condition, i.e., $\min(\mathcal{L}_{\text{load-balancing}}) = N \sum_{i=1}^{N} \mathcal{D}_i \mathcal{P}_i = N \sum_{i=1}^{N} \frac{1}{N} \cdot \frac{1}{N} = 1$, is achieved when
each expert receives an equal share of dispatched tokens, $\mathcal{D}_i = \frac{1}{N}$, and an equal proportion of
the gating probability, $\mathcal{P}_i = \frac{1}{N}$. The balance is thus maintained across all experts, ensuring that
the workload is uniformly distributed at all times. Throughout the subsequent sections, unless
explicitly stated otherwise, the term “MoE” will refer to “sparse MoE”.
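The load-balancing objective in Eqs. (2.5)–(2.7) can be computed directly from a batch of gating probabilities. Below is a minimal sketch assuming the gate values are available as a dense [T, N] tensor; the helper name and this dense representation are illustrative rather than taken from any specific codebase.

```python
import torch

def load_balancing_loss(gates):
    """Eqs. (2.5)-(2.7): `gates` is a [T, N] tensor of gating probabilities
    G(x; Theta) for T tokens and N experts (zero outside the selected top-k)."""
    T, N = gates.shape
    # D_i: fraction of tokens whose argmax expert is i (Eq. 2.6, non-differentiable)
    top1 = gates.argmax(dim=-1)                        # [T]
    D = torch.bincount(top1, minlength=N).float() / T  # [N]
    # P_i: mean gating probability assigned to expert i (Eq. 2.7, differentiable)
    P = gates.mean(dim=0)                              # [N]
    # Eq. (2.5): minimized (value 1) when D_i = P_i = 1/N for every expert
    return N * torch.sum(D * P)
```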
For example, the Mixtral 8x7B [74], introduced by Mistral AI, shares its foundational architecture
with the earlier Mistral 7B [73], but with a notable difference: each layer comprises eight feed-forward
networks (FFNs), i.e., experts. Despite utilizing only 13 billion active parameters, the
Mixtral-8x7B demonstrates superior or equivalent performance to the Llama-2-70B [155] and GPT-
3.5 [113] across various benchmarks. Similarly, the DeepSeek LLM [10], developed by DeepSeek, has
been extended with an MoE variant known as DeepSeekMoE [30]. The DeepSeekMoE 16B, while
requiring approximately 40% less computation, attains performance on par with the Llama 2 7B
[155]. The Qwen team has also contributed to this innovative field by developing the Qwen1.5-MoE
[151], a smaller MoE model with only 2.7B active parameters that rivals the performance of leading
7B parameter models such as the Mistral 7B [73] and the Qwen1.5-7B [150].
To assist researchers in navigating the rapidly evolving landscape of LLMs equipped with MoE
architectures, we have developed a taxonomy that categorizes these models from three perspectives:
algorithm design, system design, and application. Figure 3 showcases our taxonomy alongside
several representative studies. In the following sections, we will provide a comprehensive and
in-depth analysis of each category within our taxonomy.
Table 1. Overview of diverse auxiliary loss functions and their typical coefficient configurations. The originator
introducing each auxiliary loss is highlighted as a bolded reference, followed by references that adopt the
same approach. Studies that have modified the original formulation are indicated with underlined references.
experts into k groups and then applies top-1 gating in each group. Their experimental results
show the training and downstream perplexity of a 16-layer model in order of best to worst: expert
prototyping with 4 top-1 gating, 1 top-4 gating, 1 top-16 gating, 1 top-1 gating.
Auxiliary Loss for Token-Choice Gating. Token-choice gating algorithms frequently incor-
porate an auxiliary loss during training to promote equitable token distribution across experts.
Table 1 shows prevalent auxiliary loss functions leveraged in the field. Shazeer et al. [135] quantify
the importance of an expert in relation to a training batch via the batchwise sum of the gate
values for that expert. They define an additional loss 𝐿𝑖𝑚𝑝𝑜𝑟𝑡𝑎𝑛𝑐𝑒 , added to the overall loss function
for the model. This loss, which is equal to the square of the coefficient of variation of the set
of importance values and multiplied by a hand-tuned scaling factor 𝑤𝑖𝑚𝑝𝑜𝑟𝑡𝑎𝑛𝑐𝑒 , encourages all
experts to have equal importance. Although 𝐿𝑖𝑚𝑝𝑜𝑟𝑡𝑎𝑛𝑐𝑒 promotes balance in importance, it does not
guarantee an even distribution of training examples among experts, which can lead to execution
inefficiencies on distributed computing environments. To address this, they introduce a second loss
𝐿𝑙𝑜𝑎𝑑 to ensure balanced loads. Building on this foundation, GShard [86] defines a new differentiable
auxiliary loss 𝐿𝑎𝑢𝑥 using a differentiable approximation (the dot-product of mean gates and mean
gating decisions per expert), as detailed in Section 2.2. Switch Transformers [49] and many other
subsequent studies [34, 44, 74, 94] have embraced this 𝐿𝑎𝑢𝑥 design, and enhancements [30, 36, 154]
have been made to cater to diverse requirements. Nevertheless, ST-MoE [197] identified limitations
with 𝐿𝑎𝑢𝑥 , particularly at larger scales, leading to unreliable training outcomes. To mitigate this, it
introduces the integration of router z-loss 𝐿𝑧 , improving training stability without quality degrada-
tion by penalizing large logits entering the gating network. Since this loss encourages absolute
magnitude of values to be smaller, roundoff errors are reduced, which can be quite impactful for
exponential functions such as the gating. Additionally, Mod-Squad [21] posits the difficulty of
training multi-task models under such an expert-balancing loss, which may inadvertently force
experts to set parameters on conflicting tasks or hinder the potential synergies from parameter
sharing across complementary tasks. Instead, it proposes to maximize the mutual information
(MI) between experts and tasks to build task-expert alignment. Differently, ModuleFormer [140]
proposes to maximize the Mutual Information between experts and tokens. Furthermore, DS-MoE
[117] extends the application of 𝐿𝑀𝐼 , calibrating different weightings 𝑤 𝑀𝐼 , in Mixture-of-Attention
(MoA, as illustrated in Figure 5 (a)) and FFN MoE modules of different size models.
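For concreteness, the sketch below spells out two of the auxiliary losses discussed above: the importance loss of Shazeer et al. [135] (the squared coefficient of variation of the batchwise gate sums) and the router z-loss of ST-MoE [197]. The default coefficients and function names are illustrative assumptions, not values prescribed by those works.

```python
import torch

def importance_loss(gates, w_importance=0.01):
    """Shazeer et al. [135]: squared coefficient of variation of the batchwise
    importance, i.e., the per-expert sum of gate values over the batch."""
    importance = gates.sum(dim=0)                                 # [N] batchwise gate sums
    cv_squared = importance.var() / (importance.mean() ** 2 + 1e-10)
    return w_importance * cv_squared                              # hand-tuned scaling factor

def router_z_loss(logits, w_z=1e-3):
    """ST-MoE [197]: penalize large logits entering the gating softmax, which
    reduces roundoff error in the exponentials and stabilizes training."""
    z = torch.logsumexp(logits, dim=-1)                           # [T] log-sum-exp per token
    return w_z * (z ** 2).mean()
```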
Expert Capacity for Token-Choice Gating. In conjunction with load balancing via auxiliary
loss, GShard [86] incorporates an expert capacity limit, defining a threshold for the number of tokens
an expert can process. This can lead to token overflow, where excess tokens are not processed by
the designated expert. GShard also proposes a random routing mechanism that selects a secondary
expert with a probability proportional to its weight, under the intuition that the contribution of a
secondary expert can be negligible, given that the output is a weighted average and the secondary
weight is typically small. For the task of image classification with Vision Transformer (ViT) models,
Riquelme et al. [128] enhance the top-𝑘 gating strategy with Batch Prioritized Routing (BPR), which
assigns priority based on higher gating scores rather than the sequence order of tokens. Zoph et
al. [197] have demonstrated the efficacy of BPR in the context of MoE language models. Kim et al.
[80] suggest randomizing token prioritization within sequences to mitigate routing bias towards
early-positioned tokens. OpenMoE [172] provides a comprehensive analysis of gating mechanisms,
highlighting the “Drop-towards-the-End” phenomenon whereby tokens later in a sequence are at
greater risk of being dropped due to experts reaching their maximum capacity limits, an issue that is
exacerbated in instruction-tuning datasets. Moreover, OpenMoE identifies a tendency within MoE
systems to route tokens based on token-level semantic similarities, leading to “Context-independent
Specialization”. Additionally, this token ID routing specialization is established early in pre-training
and remains largely fixed, resulting in a consistent pattern of token processing by the same experts
throughout training, a phenomenon referred to as “Early Routing Learning”.
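The following sketch illustrates how an expert capacity limit produces token overflow under top-1 routing; the capacity formula and the first-come-first-served ordering are our own simplifying assumptions. Processing tokens in sequence order reproduces the “Drop-towards-the-End” bias noted above, whereas Batch Prioritized Routing would instead order tokens by their gating scores.

```python
import torch

def assign_with_capacity(top1_idx, gate_scores, num_experts, capacity_factor=1.25):
    """Capacity-limited top-1 assignment (illustrative).

    top1_idx:    [T] index of the chosen expert per token
    gate_scores: [T] gate value of the chosen expert
    Tokens arriving after an expert is full overflow and are dropped.
    """
    T = top1_idx.shape[0]
    capacity = int(capacity_factor * T / num_experts)
    kept = torch.zeros(T, dtype=torch.bool)
    load = torch.zeros(num_experts, dtype=torch.long)
    # Sequence order; BPR would use gate_scores.argsort(descending=True) instead.
    for t in range(T):
        e = int(top1_idx[t])
        if load[e] < capacity:
            load[e] += 1
            kept[t] = True          # tokens with kept[t] == False are dropped
    return kept, load
```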
Other Advancements on Token-Choice Gating. Despite the implementation of gating heuris-
tics and auxiliary expert-balancing loss functions aimed at achieving a balanced workload distri-
bution among experts, the issue of load imbalance persists as a prevalent challenge within MoE
architectures. To solve it, the Balanced Assignment of Sparse Experts (BASE) layer, as conceptual-
ized by Lewis et al. [87] and illustrated in Figure 4 (b), re-envisions the token-to-expert allocation
process by casting it as a linear assignment problem, aiming to maximize the token-expert affinities
under the constraints that each expert is assigned an equal quantity of tokens. Subsequently, Clark
et al. [26] introduce a variant of the BASE layer, termed S-BASE, using an optimal transport formu-
lation. Additionally, they devise a reinforcement learning based gating algorithm employing top-1
routing, with the reward function defined as the negative cross-entropy of the predicted token.
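The core of the BASE-layer formulation can be illustrated with an off-the-shelf linear assignment solver, as sketched below; this is a simplification of Lewis et al. [87], who use a scalable approximate algorithm, and the assumption that the token count T is divisible by the expert count N is ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def base_layer_assignment(affinity):
    """Balanced token-to-expert allocation in the spirit of BASE layers [87].

    affinity: [T, N] token-expert affinity scores, with T divisible by N.
    Each expert receives exactly T // N tokens, chosen to maximize the total
    affinity of the assignment.
    """
    T, N = affinity.shape
    capacity = T // N
    # Replicate each expert column `capacity` times so the problem is square,
    # then maximize total affinity (the solver minimizes, hence the negation).
    cost = -np.repeat(affinity, capacity, axis=1)      # [T, N * capacity]
    _, cols = linear_sum_assignment(cost)
    return cols // capacity                            # [T] expert index per token
```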
In addressing the discrete optimization challenge of the gating function, which can lead to convergence
and statistical performance issues when training with gradient-based methods, Hazimeh et al. [58]
introduce DSelect-k, a smooth version of the top-𝑘 gating algorithm whose enhanced smoothness
properties yield improvements over conventional top-𝑘 gating. Kudugunta et al. [84] diverge from
the prevalent token-level gating strategies by
introducing a sentence-level gating mechanism. This approach involves generating a sentence rep-
resentation by averaging the tokens within a sequence and subsequently routing it to an expert. Chi
et al. [22] observe that prevailing gating mechanisms tend to push hidden representations to cluster
around expert centroids, implying a trend toward representation collapse, which in turn harms
model performance. To counteract this issue, they project hidden vectors into a lower-dimensional
space before gating and implement L2 normalization for both token representations and expert
embeddings, thus calculating gating scores within a low-dimensional hypersphere. Skywork-MoE
[154] proposes two innovative techniques: gating logit normalization, which improves expert
diversification, and adaptive auxiliary loss coefficients, which allow layer-specific adjustment of
the auxiliary loss strength. Yuan 2.0-M32 [166] proposes a new router network, the Attention Router
(as illustrated in Figure 4 (e)), which implements a more efficient selection of experts and yields an
enhancement in model accuracy over the classical linear router network.
Non-trainable Token-Choice Gating. The dynamic training of gating functions within MoE
models is standard practice; however, some research has ventured into the realm of non-trainable
token-choice gating mechanisms. The most significant benefit of non-trainable token-choice gating
is that no additional gating network parameters are required and the full load balancing can
be achieved through specific gating mechanisms. The Hash Layer [129] utilizes a random fixed
gating approach by hashing the input token, achieving competitive results without the necessity of
training the gating network. The load balancing is facilitated by the selection of hash functions
Fig. 4. The illustration of various gating functions employed in MoE models, including (a) sparse MoE with
top-1 gating [49], (b) BASE layers [87], (c) the combination of grouped domain mapping and random gating
[127], (d) expert-choice gating [193], (e) attention router [166], and (f) soft MoE with expert merging [105].
prior to training, which can equitably distribute token batches. Zuo et al. [198] introduce THOR,
an algorithm that randomly allocates two experts to each input during training and inference, with
a consistency-regularized loss promoting consistent predictions between them. Gururangan et al. [56] propose
the DEMix model, which explicitly assigns distinct experts to discrete pre-training domains, with
domain matching being employed to select experts corresponding to the training inputs. Given the
potential suboptimality of domain categorization and its limited scope in encompassing test-time
domains, a single domain expert selection could undermine the model’s generalizability. To address
this, DEMix adopts a parameter-free probabilistic method that dynamically estimates the domain-
weighted mixture at inference. Kudugunta et al. [84] explore task-level gating incorporating prior
knowledge tags, and similarly, the M2M-100 model [47] utilizes explicit language-specific sublayers,
deterministically routing input tokens based on their language. Building upon the aforementioned
non-trainable gating strategies—random gating and domain mapping—PanGu-Σ [127] presents the
Random Routed Experts (RRE) mechanism. As illustrated in Figure 4 (c), this approach initially
routes tokens to a domain-specific expert group, followed by a random selection within that group.
In contrast to explicit language-specific expert selection, NLLB [29] leverages trainable gating
to manage multilingual machine translation tasks, outperforming the M2M-100 approach [47].
Addressing task interference in generalist models, Zhu et al. [195] introduce the Conditional MoE,
which augments MoE with trainable gating by integrating conditional information at various levels,
such as token-level, context-level, modality-level, task-level, and predefined token attributes. Ye
et al. [175] further investigate the incorporation of trainable gating at task-level MoE. Addition-
ally, STABLEMOE [31] identifies a challenge with existing learning-to-route MoE methods: the routing
fluctuation problem, whereby the expert assignment of a given input may keep changing throughout training.
4.1.2 Dense. In Section 2.1, we discuss the enduring relevance of dense MoE, which activates all
the experts for each input. This dense paradigm continues to inform current innovations in
MoE training and inference methodologies, as elaborated in Section 4.4.1. While sparse activation
of experts, as a trade-off, may yield efficiency gains at the expense of some performance when
compared to a densely activated MoE with an equivalent number of total parameters [30, 117, 140], it
represents a strategic adjustment to balance computational demands with model capability. Notably,
dense activation performs well in the context of LoRA-MoE fine-tuning, where the computational
overhead of LoRA experts is comparatively low. This approach enables the effective and flexible
integration of multiple LoRAs across a variety of downstream tasks. It preserves the generative
capabilities of the original pre-trained model and maintains the unique characteristics of individual
LoRAs for each task [43, 167].
4.1.3 Soft. Deciding the allocation of appropriate experts to each input token poses the fundamental
discrete optimization challenge for sparse MoE. This often necessitates heuristic auxiliary losses
to ensure balanced expert engagement and to minimize unassigned tokens. These issues become
more pronounced in scenarios involving out-of-distribution data, such as small inference batches,
novel inputs, or during transfer learning. Similar to dense MoE, the soft MoE approach maintains
full differentiability by leveraging all the experts for processing each input, thus avoiding issues
inherent to discrete expert selection. We distinguish soft MoE from dense MoE to highlight the
characteristic that mitigates computational demands through the gating-weighted merging of input
tokens or experts.
Token Merging. Puigcerver et al. [118] proposed the Soft MoE, which eschews the conventional
sparse and discrete gating mechanism in favor of a soft assignment strategy that merges tokens. This
method computes several weighted averages of all tokens, with weights depending on both tokens
and experts, and processes each aggregate with its respective expert. Their experimental results in
image classification demonstrate that soft MoE enhances the stability of gating function training
and inherently maintains balance. HOMOE [33] follows the design of Soft MoE and combines it
with a Hopfield network to address the challenges of Compositional Zero-Shot Learning tasks.
Yet, merging input tokens complicates its application in auto-regressive decoders, as future tokens
required for averaging are inaccessible during inference.
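A minimal sketch of the Soft MoE dispatch-and-combine computation with a single slot per expert, following the description above; the parameter shapes, the per-expert callables, and the function name are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def soft_moe_layer(x, phi, experts):
    """Soft MoE [118] sketch with one slot per expert.

    x:       [T, d] input tokens
    phi:     [d, E] learnable slot parameters (one slot per expert)
    experts: list of E callables, each mapping a [d] slot to a [d] output
    """
    logits = x @ phi                                    # [T, E] token-slot affinities
    dispatch = F.softmax(logits, dim=0)                 # weights normalized over tokens
    combine = F.softmax(logits, dim=1)                  # weights normalized over slots
    slots = dispatch.t() @ x                            # [E, d] weighted averages of all tokens
    slot_out = torch.stack([f(s) for f, s in zip(experts, slots)])  # [E, d]
    return combine @ slot_out                           # [T, d] each token mixes all slot outputs
```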
Expert Merging. In contrast to the merging of input tokens, Muqeeth et al. [105] introduced
the Soft Merging of Experts with Adaptive Routing (SMEAR) framework, which circumvents dis-
crete gating by merging all the experts’ parameters through a weighted average, as illustrated in
Figure 4 (f). They argue that conventional sparse MoE models often fail to match the performance
of their parameter-matched dense counterparts or those utilizing non-learned heuristic gating
functions, potentially due to flawed gradient estimation methods for training modules with non-
differentiable, discrete gating decisions. By processing the input tokens through a single merged
expert, SMEAR does not incur a significant increase in computational costs and enables standard
gradient-based training. Empirical evaluations on T5-GLUE and ResNet-DomainNet benchmarks re-
veal that SMEAR-equipped models surpass those with metadata-based [56, 84] or gradient-estimated
learning gating strategies. On ResNet-DomainNet, SMEAR achieved a 1.5% higher average accu-
racy than Soft MoE [118] with a single “slot” per expert, at the expense of a nearly 10% reduction in
throughput. Subsequent contributions by Zhong et al. [189] argue that SMEAR’s demonstrated
advantages are confined to downstream fine-tuning on classification tasks. They present Lory, an
innovative approach for scaling such expert merging architectures to auto-regressive language
model pre-training. Lory [189] introduces a causal segment routing strategy, conducting expert
merging at the segment level while maintaining the auto-regressive nature of language models.
Furthermore, it employs similarity-based data batching to direct expert specialization in particular
domains or topics. Lory’s empirical validation on LLaMA models showcases significant improve-
ments over parameter-matched dense models in terms of perplexity (by 13.9%) and on diverse
downstream tasks (by 1.5%-11.1%), highlighting the potential of fully-differentiable MoE architec-
tures for language model pre-training and encouraging further investigation in this area. In addition,
expert merging methods have demonstrated efficacy in parameter-efficient fine-tuning (PEFT) MoE
contexts. Zadouri et al. [178] substantiate that soft merging of experts significantly outperforms
sparse gating mechanisms (top-1, top-2) when fine-tuning T5 models [120] under the MoV-10 setting
of 10 (IA)3 vector experts. Wu et al. [164] propose Omni-SMoLA, an architecture leveraging the soft
method to mix multimodal low-rank experts, improving the generalist performance across a broad
range of generative vision-language tasks.
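The expert-merging idea behind SMEAR can be sketched as below for single-matrix experts: all experts' parameters are averaged with gating weights, and a single merged module processes the input, keeping the whole computation differentiable. Real experts are full FFN modules, and the names and shapes here are hypothetical.

```python
import torch
import torch.nn.functional as F

def merged_expert_forward(x, expert_weights, expert_biases, gate_logits):
    """Expert merging in the spirit of SMEAR [105] (single linear experts).

    x:              [T, d_in] inputs routed as one group (e.g., per example)
    expert_weights: [E, d_in, d_out] per-expert weights
    expert_biases:  [E, d_out] per-expert biases
    gate_logits:    [E] gating logits for this group
    """
    g = F.softmax(gate_logits, dim=-1)                     # [E] merging weights
    W = torch.einsum("e,eio->io", g, expert_weights)       # weighted-average weight matrix
    b = torch.einsum("e,eo->o", g, expert_biases)          # weighted-average bias
    return x @ W + b                                       # one pass through the merged expert
```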
4.2 Experts
In this section, we delineate the architecture of expert networks within MoE framework, following
our discussion on the gating function that orchestrates the activation of these experts.
4.2.1 Network Types. Since the initial integration of MoE into transformer architectures [49, 86,
197], MoE has served as a substitute for Feed-Forward Network (FFN) modules within these models.
Typically, each expert within a MoE layer replicates the architecture of the FFN it replaces. This
paradigm, wherein FFNs are utilized as experts, remains predominant, and subsequent refinements
will be expounded upon in Sections 4.2.2 to 4.2.4.
Feed-Forward Network. As discussed in existing work [145], the predilection for leveraging
MoE in the context of FFNs is rooted in the hypothesis that self-attention layers exhibit lower
sparsity and less domain specificity than FFN layers. Pan et al. [117] provide empirical support
for this, revealing marked sparsity in FFN layers compared to self-attention layers, through their
analysis of downstream Wikitext tasks using their pre-trained DS-MoE models. Their results
indicate a mere 20% active expert engagement in FFN layers, in contrast to the 80% observed within
self-attention layers. In earlier investigation of FFN computational patterns, Zhang et al. [184]
observe that most inputs only activate a small proportion of neurons of FFNs, thus corroborating
the inherent sparsity of FFNs.
Fig. 5. The illustration of Mixture of Attention Heads [182] (a) and Shared Expert [121] (b) architectures.
Attention. While the focus of MoE research has predominantly been on FFN layers within the
Transformer architecture, Zhang et al. [182] introduce the Mixture of Attention Heads (MoA), an
innovative architecture that combines multi-head attention layers with MoE to further enhance
performance and restrain computational cost. As delineated in Figure 5 (a), MoA employs two sets of
experts, one for query projection and one for output projection, selected with the same expert indices
through a common gating network. To reduce computational complexity, MoA shares the key ($W_k$)
and value ($W_v$) projection weights across attention experts, with experts differentiated only by their
respective query ($q_t W_i^q$) and output ($o_{i,t} W_i^o$) projection weights, allowing for shared
pre-computation of the key ($K W_k$) and value ($V W_v$) sequences. Subsequent work such as DS-MoE
[117], JetMoE [139], and ModuleFormer [140] follows the design of MoA and further refines the
combination of MoE and attention layer.
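A simplified, single-head sketch of the MoA computation pattern described above: the shared key and value projections are computed once, while expert-specific query and output projections are selected and weighted by a common gate. The top-k loop, shapes, and names are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def moa_forward(x, w_q, w_o, w_k, w_v, w_gate, k=2):
    """Mixture of Attention Heads sketch (single head, simplified).

    x:   [T, d]          w_k, w_v: [d, d_head]   shared across experts
    w_q: [E, d, d_head]  w_o: [E, d_head, d]     expert-specific projections
    """
    K, V = x @ w_k, x @ w_v                                  # shared pre-computation
    probs = F.softmax(x @ w_gate, dim=-1)                    # [T, E] common gating network
    gates, idx = probs.topk(k, dim=-1)                       # [T, k]
    out = torch.zeros_like(x)
    for j in range(k):                                       # accumulate selected attention experts
        e = idx[:, j]                                        # [T] expert id per token
        Q = torch.einsum("td,tdh->th", x, w_q[e])            # per-token expert query projection
        attn = F.softmax(Q @ K.t() / K.shape[-1] ** 0.5, dim=-1)
        head = attn @ V                                      # [T, d_head]
        out = out + gates[:, j:j + 1] * torch.einsum("th,thd->td", head, w_o[e])
    return out
```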
Others. In addition to the aforementioned expert network types, researchers have explored
the use of Convolutional Neural Network (CNN) as expert [20, 25, 54, 159, 183]. Moreover, recent
endeavors that integrate Parameter-Efficient Fine-Tuning (PEFT) techniques with MoE, such as
employing Low-Rank Adaptation (LoRA) [66] as expert, have shown promising results, which are
discussed in Section 4.3.
4.2.2 Hyperparameters. The scale of sparse MoE models is governed by several critical hyperpa-
rameters that extend beyond those of dense transformer models. These include (1) the count of
experts per MoE layer, (2) the size of each expert, and (3) the placement frequency of MoE layers
throughout the model. The selection of these hyperparameters is crucial, as it profoundly influences
model performance and computational efficiency across various tasks. Optimal hyperparameter
choices are thus contingent upon the specific application requirements and the constraints of the
computational infrastructure. Our subsequent analysis, informed by the exemplified models listed
in Table 2, explores these hyperparameter decisions in depth. Meanwhile, we enumerate some
recent open-source models, detailing their number of parameters and benchmark performance in
Table 3.
Expert Count. Initial investigations employing thousands of experts per layer yielded impressive
gains in pre-training and translation quality [49, 86, 135]. Nonetheless, the quality of sparse MoE
models is disproportionately reduced under domain shift [6] or when fine-tuning on diverse task
distributions [49]. GLaM [44] adopts a configuration of 64 experts, guided by their findings that
a 64-expert setup with top-2 gating strikes an optimal balance between execution efficiency and
performance across zero-shot, one-shot, and few-shot scenarios. Reflecting this trend, more recent
sparse MoE models [34, 74, 94, 151, 154, 166, 172, 197] commonly utilize no more than 64 experts.
Additionally, DeepSpeed-MoE [121] adopts a Pyramid-MoE approach, positioning MoE layers with
a larger expert count towards the network’s end.
Table 2. Comparative configurations of MoE with FFN experts in selected models. Model differentiation in
each reference is achieved by using the model size, indicated either by total or activated/total parameter
count. Both activated and total expert counts encompass the count of shared experts when utilized. 𝑑𝑚𝑜𝑑𝑒𝑙 is
the hidden size, 𝑑 𝑓 𝑓 𝑛 is the intermediate size of FFNs, 𝑑𝑒𝑥𝑝𝑒𝑟𝑡 is the intermediate size of FFN experts, #L is
the number of layers, #H and 𝑑ℎ𝑒𝑎𝑑 are the number of attention heads and attention head dimensions.
| Reference | Model | Expert Count (Activ./Total) | $d_{model}$ | $d_{ffn}$ | $d_{expert}$ | #L | #H | $d_{head}$ | Placement Frequency | Activation Function | Shared Expert Count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GShard [86] (2020) | 600B | 2/2048 | 1024 | 8192 | $d_{ffn}$ | 36 | 16 | 128 | 1/2 | ReLU | 0 |
| | 200B | 2/2048 | 1024 | 8192 | $d_{ffn}$ | 12 | 16 | 128 | 1/2 | ReLU | 0 |
| | 150B | 2/512 | 1024 | 8192 | $d_{ffn}$ | 36 | 16 | 128 | 1/2 | ReLU | 0 |
| | 37B | 2/128 | 1024 | 8192 | $d_{ffn}$ | 36 | 16 | 128 | 1/2 | ReLU | 0 |
| Switch [49] (2021) | 7B | 1/128 | 768 | 2048 | $d_{ffn}$ | 12 | 12 | 64 | 1/2 | GEGLU | 0 |
| | 26B | 1/128 | 1024 | 2816 | $d_{ffn}$ | 24 | 16 | 64 | 1/2 | GEGLU | 0 |
| | 395B | 1/64 | 4096 | 10240 | $d_{ffn}$ | 24 | 64 | 64 | 1/2 | GEGLU | 0 |
| | 1571B | 1/2048 | 2080 | 6144 | $d_{ffn}$ | 15 | 32 | 64 | 1 | ReLU | 0 |
| GLaM [44] (2021) | 0.1B/1.9B | 2/64 | 768 | 3072 | $d_{ffn}$ | 12 | 12 | 64 | 1/2 | GEGLU | 0 |
| | 1.7B/27B | 2/64 | 2048 | 8192 | $d_{ffn}$ | 24 | 16 | 128 | 1/2 | GEGLU | 0 |
| | 8B/143B | 2/64 | 4096 | 16384 | $d_{ffn}$ | 32 | 32 | 128 | 1/2 | GEGLU | 0 |
| | 64B/1.2T | 2/64 | 8192 | 32768 | $d_{ffn}$ | 64 | 128 | 128 | 1/2 | GEGLU | 0 |
| DeepSpeed-MoE [121] (2022) | 350M/13B | 2/128 | 1024 | $4 d_{model}$ | $d_{ffn}$ | 24 | 16 | 64 | 1/2 | GeLU | 0 |
| | 1.3B/52B | 2/128 | 2048 | $4 d_{model}$ | $d_{ffn}$ | 24 | 16 | 128 | 1/2 | GeLU | 0 |
| | PR-350M/4B | 2/32–2/64 | 1024 | $4 d_{model}$ | $d_{ffn}$ | 24 | 16 | 64 | 1/2, 10L-32E, 2L-64E | GeLU | 1 |
| | PR-1.3B/31B | 2/64–2/128 | 2048 | $4 d_{model}$ | $d_{ffn}$ | 24 | 16 | 128 | 1/2, 10L-64E, 2L-128E | GeLU | 1 |
| ST-MoE [197] (2022) | 0.8B/4.1B | 2/32 | 1024 | 2816 | $d_{ffn}$ | 27 | 16 | 64 | 1/4, add extra FFN | GEGLU | 0 |
| | 32B/269B | 2/64 | 5120 | 20480 | $d_{ffn}$ | 27 | 64 | 128 | 1/4, add extra FFN | GEGLU | 0 |
| Mixtral [74] (2023) | 13B/47B | 2/8 | 4096 | 14336 | $d_{ffn}$ | 32 | 32 | 128 | 1 | SwiGLU | 0 |
| | 39B/141B | 2/8 | 6144 | 16384 | $d_{ffn}$ | 56 | 48 | 128 | 1 | SwiGLU | 0 |
| LLAMA-MoE [149] (2023) | 3.0B/6.7B | 2/16 | 4096 | 11008 | 688 | 32 | 32 | 128 | 1 | SwiGLU | 0 |
| | 3.5B/6.7B | 4/16 | 4096 | 11008 | 688 | 32 | 32 | 128 | 1 | SwiGLU | 0 |
| | 3.5B/6.7B | 2/8 | 4096 | 11008 | 1376 | 32 | 32 | 128 | 1 | SwiGLU | 0 |
| DeepSeekMoE [30] (2024) | 0.24B/1.89B | 8/64 | 1280 | - | $\frac{1}{4} d_{ffn}$ | 9 | 10 | 128 | 1 | SwiGLU | 1 |
| | 2.8B/16.4B | 8/66 | 2048 | 10944 | 1408 | 28 | 16 | 128 | 1, except 1st layer | SwiGLU | 2 |
| | 22B/145B | 16/132 | 4096 | - | $\frac{1}{8} d_{ffn}$ | 62 | 32 | 128 | 1, except 1st layer | SwiGLU | 4 |
| OpenMoE [172] (2024) | 339M/650M | 2/16 | 768 | 3072 | $d_{ffn}$ | 12 | 12 | 64 | 1/4 | SwiGLU | 1 |
| | 2.6B/8.7B | 2/32 | 2048 | 8192 | $d_{ffn}$ | 24 | 24 | 128 | 1/6 | SwiGLU | 1 |
| | 6.8B/34B | 2/32 | 3072 | 12288 | $d_{ffn}$ | 32 | 24 | 128 | 1/4 | SwiGLU | 1 |
| Qwen1.5-MoE [151] (2024) | 2.7B/14.3B | 8/64 | 2048 | 5632 | 1408 | 24 | 16 | 128 | 1 | SwiGLU | 4 |
| DBRX [34] (2024) | 36B/132B | 4/16 | 6144 | 10752 | $d_{ffn}$ | 40 | 48 | 128 | 1 | SwiGLU | 0 |
| Jamba [94] (2024) | 12B/52B | 2/16 | 4096 | 14336 | $d_{ffn}$ | 32 | 32 | 128 | 1/2, 1:7 Attention:Mamba | SwiGLU | 0 |
| Skywork-MoE [154] (2024) | 22B/146B | 2/16 | 4608 | 12288 | $d_{ffn}$ | 52 | 36 | 128 | 1 | SwiGLU | 0 |
| Yuan 2.0-M32 [166] (2024) | 3.7B/40B | 2/32 | 2048 | 8192 | $d_{ffn}$ | 24 | 16 | 256 | 1 | SwiGLU | 0 |
Expert Size. To scale the model effectively, GLaM [44] prioritizes the expansion of the inter-
mediate hidden dimension per expert while standardizing the expert count at 64, a strategy that
often requires the implementation of tensor parallelism across multiple accelerators to maintain
computational efficiency [44, 49, 121]. From this period forward, MoE models [34, 74, 154, 197]
typically featured larger expert dimensions. Differently, DeepSeekMoE [30, 36] introduces the
concept of fine-grained expert segmentation by subdividing the intermediate hidden dimension
of FFN expert, while preserving the overall parameter count. Specifically, DeepSeekMoE-145B
employs a reduced intermediate hidden dimension at one-eighth that of its dense FFN counterpart,
increasing both the number of experts (from 16 to 128) and the number of active experts (from top-2
to top-16) by a factor of eight. They believe that this strategy not only refines the decomposition
of knowledge across experts, facilitating more precise learning, but also enhances the flexibility
Table 3. A collection of recent open-source models detailing activated and total parameter counts, alongside
performance benchmarks such as MMLU [61] (5-shot), GSM8K [27] (5-shot), MATH [62] (4-shot), and
HumanEval [14] (0-shot), unless specified otherwise.
| Name | Time | Affiliation | Activ. Params | Total Params | MMLU | GSM8K | MATH | HumanEval | Link |
|---|---|---|---|---|---|---|---|---|---|
| Mixtral-8x7B-v0.1 | 2023.12 | Mistral | 13B | 47B | 70.6 | 58.4, 74.4 (8-shot) | 28.4 | 40.2 | https://huggingface.co/mistralai/Mixtral-8x7B-v0.1 |
| DeepSeekMoE-16B-Base | 2024.1 | DeepSeek | 3B | 16B | 45.0 | 18.8 (8-shot) | 4.3 | 26.8 | https://huggingface.co/deepseek-ai/deepseek-moe-16b-base |
| Grok-1 | 2024.3 | xAI | 86B | 314B | 73.0 | 62.9 | 23.9 | 63.2 | https://github.com/xai-org/grok-1 |
| Qwen1.5-MoE-A2.7B | 2024.3 | Alibaba | 3B | 14B | 62.5 | 61.5 (8-shot) | - | 34.2 | https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B |
| DBRX Instruct | 2024.3 | Databricks | 36B | 132B | 73.7 | 72.8 | - | 70.1 | https://huggingface.co/databricks/dbrx-instruct |
| Jamba-v0.1 | 2024.3 | AI21 Labs | 12B | 52B | 67.4 | 59.9 (3-shot) | - | 29.3 | https://huggingface.co/ai21labs/Jamba-v0.1 |
| Mixtral-8x22B-v0.1 | 2024.4 | Mistral | 39B | 141B | 77.8 | 78.6, 88.4 (8-shot) | 41.8 | 45.1 | https://huggingface.co/mistralai/Mixtral-8x22B-v0.1 |
| Arctic Instruct | 2024.4 | Snowflake | 17B | 480B | 67.3 | 74.2 | - | - | https://huggingface.co/Snowflake/snowflake-arctic-instruct |
| DeepSeek-V2 | 2024.5 | DeepSeek | 21B | 236B | 78.5 | 79.2 (8-shot) | 43.6 | 48.8 | https://huggingface.co/deepseek-ai/DeepSeek-V2 |
| DeepSeek-V2-Chat (RL) | 2024.5 | DeepSeek | 21B | 236B | 77.8 | 92.2 (8-shot) | 53.9 | 81.1 | https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat |
| Yuan 2.0-M32 | 2024.5 | IEIT | 4B | 40B | 72.2 | 92.7 (8-shot) | 55.9 (8-shot) | 74.4 | https://huggingface.co/IEITYuan/Yuan2-M32 |
| Skywork-MoE-Base | 2024.6 | Kunlun | 22B | 146B | 77.4 | 76.1 | 31.9 | 43.9 | https://huggingface.co/Skywork/Skywork-MoE-Base |
of expert activation combinations, allowing for more specialized and targeted knowledge capture.
Qwen1.5-MoE [151] and DBRX [34] adopt a similar fine-grained expert segmentation strategy.
Results from LLAMA-MoE [149], which allocates dense FFN parameters across non-overlapping
experts to maintain a consistent parameter count, indicate that activating 4 out of 16 experts with
$d_{expert} = 688$ marginally outperforms activating 2 out of 8 experts with $d_{expert} = 1376$.
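To make the parameter bookkeeping behind fine-grained segmentation explicit (a rough sketch that counts only the FFN expert parameters, which scale linearly with $d_{expert}$, and ignores gating parameters), the DeepSeekMoE-145B configuration keeps both totals unchanged relative to a 16-expert, top-2 layer with experts of width $d_{ffn}$:
\[
\underbrace{128 \times \tfrac{d_{ffn}}{8}}_{\text{total expert width}} = 16\, d_{ffn}, \qquad \underbrace{16 \times \tfrac{d_{ffn}}{8}}_{\text{activated expert width}} = 2\, d_{ffn}.
\]
Only the granularity of the routing decision changes, which is precisely what enables the larger space of expert activation combinations described above.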
Frequency of MoE Layers. Sparse MoE models typically evolve from dense architectures by
interspersing MoE layers in place of the dense FFN layers at regular intervals. Although a higher
frequency of MoE layers can enlarge the model size, it also introduces greater system overhead. In
practice, most MoE models feature alternate FFN replacement (1/2) with MoE layers [6, 44, 86, 121].
Nevertheless, variations exist, with some models incorporating MoE layers every fourth layer (1/4)
[172, 197] or in every layer (1/1) [30, 49]. Following the introduction of Mixtral 8x7B [74], the trend
seems to shift towards placing MoE in every layer of the model, with a common choice of only 8 or
16 experts mirroring the dimensionality of a dense FFN [30, 34, 151, 154].
Research into the optimal configuration of MoE layers has been extensive. V-MoE [128] employs
MoE in the last few even-numbered Transformer layers, noting that, despite using fewer MoE
layers, the impact on performance is minimal while computational speed is significantly enhanced.
DeepSeekMoE-16B/-145B [30] replaces all FFNs with MoE layers, excluding the first, due to the
observed slower convergence of load balance status in the first layer. MoE-LLaVA [95], a recently
popular open Large Vision-Language Model (LVLM), demonstrates that alternating MoE placement
yields superior model quality and execution efficiency on multimodal tasks, compared to every-
layer MoE placement or "First-Half" and "Second-Half" configurations. ST-MoE [197] found that
adding a dense FFN adjacent to each MoE layer can improve model quality. Brainformers [192]
introduce a nonuniform architecture that integrates MoE layers, dense FFNs, attention mechanisms,
and a variety of layer normalizations and activation functions without strict sequential layering,
trading architectural regularity for the flexibility of sub-layer composition. Jamba [94] integrates
the architecture of Mamba [55] by adopting a 1:7 ratio of attention-to-Mamba layers.
4.2.3 Activation Function. Building upon dense Transformer architectures, sparse MoE models
have adopted a progression of activation functions paralleling those in leading dense large language
models, including BERT [38], T5 [120], GPT [11], LLAMA [155] and so on. The evolution of
activation functions has seen a shift from ReLU [52] to more advanced options such as GeLU [63],
GeGLU [133], and SwiGLU [133]. This trend extends to other components of MoE models, which
now frequently incorporate Root Mean Square Layer Normalization (RMSNorm) [180], Grouped
Query Attention (GQA) [2], and Rotary Position Embeddings (RoPE) [144].
4.2.4 Shared Expert. DeepSpeed-MoE [121] innovatively introduces the Residual-MoE architecture,
wherein each token is processed by a fixed expert and another selected through gating, achieving
the engagement of two experts per layer without increasing the communication cost beyond that of top-1
gating. This approach considers the gating-selected MoE expert as an error-correcting adjunct to the
fixed dense FFN. A conceptually similar approach, Conditional MoE Routing (CMR), is employed
in NLLB [29], which also combines the outputs of dense FFN and MoE layers. This paradigm
of integrating fixed FFN with sparse MoE, often referred to as shared expert and illustrated in
Figure 5 (b), has gained traction in recent language models such as DeepSeekMoE [30], OpenMoE
[172], Qwen1.5-MoE [151], and MoCLE [53], indicating its ascension to a mainstream configuration.
Instead of using a single shared expert, DeepSeekMoE [30] and Qwen1.5-MoE [151] employ multiple
shared experts, due to their fine-grained expert segmentation design.
However, the shared expert configuration, while effective in NLP tasks, has not demonstrated the
same level of enhancement in vision tasks. Empirical evidence from ScMoE [12] indicates that
pairing one shared expert with one gating-selected expert yields only comparable performance to
standard top-1 MoE in SwinV2-MoE models. Additionally, based on the design of shared expert,
ScMoE decouples the MoE process to separately handle the representations from preceding layers
and integrate them with the outputs processed by the shared expert of the current layer, thus
improving efficiency by facilitating overlap in communication and computation. A comparable
method to enhance overlapping is employed in the Dense-MoE hybrid transformer architecture,
as delineated in Snowflake Arctic [152], which bears resemblance to the Lora MoE framework
discussed in Section 4.3.3 and illustrated in Figure 6 (d).
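A minimal sketch of the shared-expert pattern shown in Figure 5 (b): every token passes through a fixed dense FFN, and the gated routed experts add a correction on top. For clarity the sketch evaluates every routed expert densely and weights the results by the sparse gates; real systems dispatch each token only to its selected experts. Names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def shared_expert_moe(x, shared_ffn, routed_experts, w_gate, k=1):
    """Shared-expert MoE layer sketch (cf. Residual-MoE in DeepSpeed-MoE [121]).

    x:              [T, d] tokens
    shared_ffn:     callable applied to every token (the fixed expert)
    routed_experts: list of E callables, the gating-selected experts
    w_gate:         [d, E] gating parameters over the routed experts
    """
    gates = F.softmax(x @ w_gate, dim=-1)                            # [T, E]
    topk_vals, topk_idx = gates.topk(k, dim=-1)
    sparse_gates = torch.zeros_like(gates).scatter(-1, topk_idx, topk_vals)
    expert_out = torch.stack([f(x) for f in routed_experts], dim=1)  # [T, E, d] (dense for clarity)
    routed = (sparse_gates.unsqueeze(-1) * expert_out).sum(dim=1)    # [T, d]
    return shared_ffn(x) + routed                                    # fixed expert + gated correction
```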
Fig. 6. The illustration of the taxonomy of MoPEs based on placement within the Transformer model architecture.
(a) exemplifies the integration of MoPE with the Key and Value modules of the attention mechanism. (b)
represents the application of MoPE to the FFN. (c) refers to MoPE integration at the level of the Transformer
block, where separate sets of experts are allocated to the attention and FFN modules, each regulated by its
own gating mechanism. (d) illustrates a layer-wise integration of MoPE, in which each Transformer layer is
regarded as a unified entity with a gating network orchestrating the interplay among experts.
placement within the Transformer model architecture. We will then review recent MoPE research,
summarizing the methodologies and contributions of each study.
4.3.1 Feed-Forward Network. Following the conventional MoE structure, a series of studies in-
troduce the MoPE framework to the FFN layer of every Transformer block. During the training
process, the focus is on optimizing the parameter-efficient experts and the gating mechanism,
leaving the rest of the pre-trained model intact. As illustrated in Figure 6(b), the forward process
under the MoPE framework integrated with FFN can be expressed as:
\[
\text{FFN}^{MoE}(\mathbf{x}') = \text{FFN}(\mathbf{x}') + \sum_{i=1}^{n} \mathbf{x}' \Delta \mathbf{W}_i^{ffn} \cdot G^{ffn}(\mathbf{x}')_i, \tag{4.1}
\]
\[
\mathbf{x}' = \text{LayerNorm}(\text{SA}(\mathbf{x}) + \mathbf{x}), \tag{4.2}
\]
where $\Delta \mathbf{W}^{ffn}$ and $G^{ffn}(\mathbf{x})$ denote the parameter-efficient expert and the gating function applied to the
FFN layer, respectively.
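A minimal sketch of Eq. (4.1) with LoRA experts, where each parameter-efficient expert is a low-rank update $\Delta W_i = A_i B_i$ added on top of a frozen pre-trained FFN; the dense evaluation of all experts, the top-k gate, and the names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mope_ffn_forward(x, frozen_ffn, lora_A, lora_B, w_gate, k=2):
    """Eq. (4.1): FFN_MoE(x') = FFN(x') + sum_i x' dW_i^ffn * G^ffn(x')_i,
    with parameter-efficient experts dW_i = A_i B_i (LoRA).

    x:          [T, d] post-attention hidden states (x' in Eq. 4.2)
    frozen_ffn: pretrained FFN module, kept frozen during fine-tuning
    lora_A:     [E, d, r]    lora_B: [E, r, d]   trainable low-rank factors
    w_gate:     [d, E]       trainable gating parameters
    """
    gates = F.softmax(x @ w_gate, dim=-1)                             # [T, E]
    topk_vals, topk_idx = gates.topk(k, dim=-1)
    sparse_gates = torch.zeros_like(gates).scatter(-1, topk_idx, topk_vals)
    # x' dW_i for every expert (dense for clarity; real systems dispatch sparsely)
    delta = torch.einsum("ti,eir,ero->teo", x, lora_A, lora_B)        # [T, E, d]
    routed = (sparse_gates.unsqueeze(-1) * delta).sum(dim=1)          # [T, d]
    return frozen_ffn(x) + routed
```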
One of the pioneering works in this domain, LoRAMoE [43], efficiently applies the MoPE
structure to FFN. LoRAMoE integrates a few plug-in LoRA experts into the FFN layer, employing
a gating mechanism to orchestrate the experts’ contributions. Realizing the diversity in data
distributions, LoRAMoE separates the experts into two distinct groups: one focuses on learning
various downstream tasks, and the other is dedicated to aligning pretrained world knowledge with
human instructions. To ensure that each group of experts maintains its focus, LoRAMoE defines
a localized balancing constraint loss, which preserves the importance of each expert within its
group while allowing different groups to concentrate on their respective tasks. This design enables
LoRAMoE to effectively resolve the knowledge forgetting issue and enhance model performance
on downstream tasks. In a similar vein, AdaMix [160] injects a set of Adapter [65] experts after the
FFN layer in each Transformer block. Adapter tuning is a PEFT method that integrates a pair of
feed-forward up and down projection matrices into the Transformer block. During fine-tuning,
only the incremental Adapter blocks are updated, with the rest of the model unchanged. AdaMix
utilizes a stochastic routing policy that randomly selects the projection matrices during training,
maintaining computational costs equivalent to a single adapter. To minimize service costs during
inference, AdaMix averages the outputs of all experts.
Taking a different approach, MixDA [39] includes two training stages to leverage domain-specific
knowledge while preserving learned information. During the first stage, MixDA only fine-tunes
the domain-adapters that work parallel to the FFN to acquire domain-specific knowledge and keep
the world knowledge simultaneously. In the second stage, MixDA introduces a gating network and
task-adapters on top of the FFN layer for tailoring the model to specific downstream tasks. This
strategy allows for a more nuanced adaptation to the task at hand. LLaVA-MoLE [15] extends the
application of MoPE to multimodal tasks. It creates a set of LoRA experts for the FFN layer to handle
inputs from different domains, enhancing the model’s versatility. LLaVA-MoLE adopts a top-1
routing strategy, activating the most relevant expert based on the router’s output distribution, thus
maintaining computational costs close to a standard FFN with LoRA. This framework is effective
in addressing data conflicts and consistently surpasses plain-LoRA baselines across diverse data
configurations.
Contrasting with the MoPE implementations we have discussed, MixLoRA [88] creates a LoRA-
MoE framework that closely aligns with the conventional MoE models. Rather than just plugging
in multiple lightweight experts, MixLoRA fuses LoRA experts with the shared FFN layer. By
leveraging the base weights from a single FFN of the base model, MixLoRA streamlines the creation
of the MoPE architecture. Furthermore, MixLoRA implements a high-throughput framework that
significantly reduces token computation latency and memory usage during both training and
inference, optimizing performance and efficiency.
4.3.2 Attention. A branch of research has been exploring the application of the MoPE framework
with the attention mechanism. These studies typically involve augmenting the attention mechanism
by incorporating a gating network and a set of parallel experts. The MoPE framework can be applied
to the Query, Key, Value, and Output projection modules, individually or in various combinations,
within the attention mechanism. During the fine-tuning process, only the parameters of the activated
experts and the gating network are updated, while the remaining parameters of the model are
kept frozen. For example, as shown in Figure 6(a), the integration of MoPE with the Key and Value
module of the attention mechanism can be formalized as follows:
\[
\text{SA}^{MoE}(\mathbf{x}) = \text{Softmax}\left(\frac{\mathbf{Q}\left(\mathbf{K}^{T} + \sum_{i=1}^{n} \mathbf{x} \Delta \mathbf{W}_i^{k} \cdot G^{k}(\mathbf{x})_i\right)}{\sqrt{d_{head}}}\right) \left(\mathbf{V} + \sum_{i=1}^{n} \mathbf{x} \Delta \mathbf{W}_i^{v} \cdot G^{v}(\mathbf{x})_i\right), \tag{4.3}
\]
where Q, K, V represent the Query, Key and Value modules, respectively. ΔW𝑘 and 𝐺 𝑘 (x) denote
the parameter-efficient expert and its corresponding gating function for the Key module. Similarly,
ΔW𝑣 and 𝐺 𝑣 (x) indicate the expert and the gating function for the Value module. Here, 𝑛 is the
number of experts, and 𝑑ℎ𝑒𝑎𝑑 is the head dimension in the Multi-head Attention mechanism.
Recent studies have demonstrated the effectiveness of extending MoE to the attention layer
[139, 140, 182]. Additionally, a new line of research has focused on the fusion of MoPE with
the attention mechanism to enhance the model’s efficiency and adaptability. For instance, MoELoRA
[100] applies MoE to the attention mechanism in a resource-efficient manner by leveraging a PEFT
method. MoELoRA adopts LoRA [66] to construct the experts. LoRA introduces two low-rank
matrices to receive incremental updates associated with the task-specific fine-tuning. Only the
LoRA matrices are updated while the base model is kept untouched during fine-tuning. Specifically,
MoELoRA sets multiple LoRA experts to the Query and Value matrices of the attention mechanism,
and utilizes a gating network to activate the top 𝑘 experts related to the specific tasks during both
training and inference phases. To alleviate routing randomness, MoELoRA employs a contrastive
learning loss to control the training of experts. The contrastive learning loss is designed to accentuate
the differences in output distributions between experts, thereby encouraging them to capture diverse
features relevant to the downstream tasks. MoELoRA offers a solution for flexibly combining various
computational modules tailored to downstream tasks.
Another framework, MoCLE [53], aims to resolve task conflicts that arise from the diversity of
training tasks of different sources and formats. MoCLE utilizes a clustering model to categorize
different tasks and then leverages a router to direct the clustered input to LoRA experts inserted
into the Query and Value modules of the attention mechanism. These LoRA experts contain a group
of multiple task experts and a universal expert. Each task expert is dedicated to a particular task
to reduce task conflicts, while the universal expert, trained on all tasks, helps to maintain model
generalization. SiRA [196] introduces several lightweight LoRA adapters as experts, along with a
top 𝑘 gating mechanism. To mitigate load imbalance and over-fitting issues, SiRA incorporates
a capacity constraint that limits the number of tokens each expert can process. Additionally, it
employs an auxiliary loss to promote load balancing and an expert dropout mechanism to equalize
the gating distribution. SiRA provides an efficient and fine-grained approach to improving the
quality of LoRA.
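Since several of the above methods, such as SiRA, rely on an auxiliary loss to balance the load across LoRA experts, the sketch below shows one widely used formulation of such a loss in the style popularized by sparse MoE training (e.g., Switch Transformers [49]); SiRA's exact loss, capacity constraint, and expert dropout may differ.

```python
# A common auxiliary load-balancing loss for token-level top-k routing.
# The formulation follows the widely used fraction-of-tokens x routing-probability
# product; specific methods may use variants of this loss.
import torch
import torch.nn.functional as F

def load_balance_loss(router_logits: torch.Tensor, top_k: int = 1) -> torch.Tensor:
    """router_logits: (num_tokens, num_experts) pre-softmax gate scores."""
    num_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)              # soft routing probabilities
    top = probs.topk(top_k, dim=-1).indices               # hard expert assignments
    # Fraction of token-slots dispatched to each expert (hard counts).
    dispatch = F.one_hot(top, num_experts).float().sum(dim=(0, 1)) / top.numel()
    # Mean routing probability assigned to each expert (soft counts).
    importance = probs.mean(dim=0)
    # Minimized when both distributions are uniform over the experts.
    return num_experts * torch.sum(dispatch * importance)
```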
4.3.3 Transformer Block. The integration of MoPE with the Transformer architecture has received
substantial attention in recent research. This approach involves creating two groups of experts: one
for the attention mechanism, and another for the FFN within the Transformer block. Each group is
regulated by its gating mechanism to control the activation of the experts. As exhibited in Figure
6(c), the forward process under the MoPE framework integrated with the Transformer block can
be denoted as:
$$y = \mathrm{LayerNorm}(\mathbf{x}' + \mathrm{FFN}^{MoE}(\mathbf{x}')), \tag{4.4}$$
$$\mathbf{x}' = \mathrm{LayerNorm}(\mathrm{SA}^{MoE}(\mathbf{x}) + \mathbf{x}). \tag{4.5}$$
MoV [178] is one of the notable attempts that combine MoPE with the Transformer block to pursue
parameter efficiency. Utilizing the PEFT method (IA)^3 [96], MoV introduces tunable vectors that re-scale the Key and Value modules in the attention mechanism, as well as the activations within the FFN. By substituting conventional experts with (IA)^3 vectors and updating only these lightweight experts
and their corresponding gating during fine-tuning, MoV significantly reduces the computational
burden associated with gradient calculations and lessens the memory footprint required for model
storage. Similarly, MoLORA [178] applies multiple LoRA experts to the attention and FFN blocks, outperforming the standard LoRA approach. UniPELT [103] proposes a hybrid framework that
integrates three representative PEFT methods as experts, namely Adapter [65], Prefix-tuning [91],
and LoRA [66]. Prefix-tuning is a method that freezes the base model and optimizes the continuous
task-specific vectors prepended to the input of the attention. Within the UniPELT framework,
LoRA matrices are applied to the weight matrices of Query and Key in the attention mechanism,
Prefix vectors are added to the Key and Value modules, and the Adapter block is inserted after the
FFN layer. UniPELT leverages different gating mechanisms to dynamically activate the experts,
efficiently identifying the approach that best suits the given task.
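As a concrete illustration of the MoV idea of vector-valued experts, the sketch below mixes (IA)^3-style rescaling vectors with a soft gate; the module name, the softmax gate, and the ones-initialization are illustrative assumptions, not details of the original implementation.

```python
# Illustrative sketch of the MoV idea [178]: each expert is an (IA)^3-style
# rescaling vector, and a soft gate mixes the vectors before rescaling an
# activation (e.g., the Key/Value projections or the FFN hidden activation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoVRescale(nn.Module):
    def __init__(self, d_model: int, n_experts: int = 8):
        super().__init__()
        # Each expert is a learned rescaling vector, initialized to ones so the
        # frozen base model is unchanged at the start of fine-tuning.
        self.expert_vectors = nn.Parameter(torch.ones(n_experts, d_model))
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, h: torch.Tensor) -> torch.Tensor:    # h: (B, T, d_model)
        weights = F.softmax(self.gate(h), dim=-1)           # (B, T, n_experts)
        scale = weights @ self.expert_vectors                # mixture of expert vectors
        return h * scale                                     # element-wise rescaling
```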
Further broadening the scope of the LoRA-MoE framework, Omni-SMoLA [164] extends the MoPE
with three sets of LoRA experts, each tailored to handle text tokens, visual tokens, and multimodal
tokens, respectively. The specialization enables the architecture to enhance performance across
various vision-and-language tasks. In the context of MoPE research, the number of experts emerges
as a critical hyperparameter influencing downstream task performance [168, 178]. Additionally, the
use of too many experts can lead to redundancy [18]. MoLA [51] is one of the pioneering works that
explores the expert allocation issue. It proposes a LoRA-MoE framework with a Layer-wise Expert
Allocation, which enables the flexible employment of varying numbers of experts across different
layers. The expert allocation strategy proposed by MoLA further improves the effectiveness of the
LoRA-MoE framework. In the specialized field of medical applications, MOELoRA [97] tackles the
challenges of task variety and high adaptation cost. It integrates LoRA experts and task-motivated
gate functions into the attention and FFN of each layer. The gating utilizes task identity to modulate
expert contributions, creating unique parameter sets tailored to individual tasks. The design of
MOELoRA combines the strengths of both MOE and LoRA, strengthening LLM’s capability in
medical domains.
4.3.4 Every Layer. There has been considerable interest in incorporating MoPE into fundamental
components such as the attention, FFN, and Transformer block. Existing work often approaches the
attention mechanism and FFN independently, employing distinct gating mechanisms to modulate
them separately. Rather than treating these elements as isolated components, a new line of research considers
the Transformer layer as an integrated whole. This shift in perspective allows for the application
of the MoPE framework to the entire Transformer layer, capturing the combined dynamics of the
attention and FFN within a unified approach. As illustrated in Figure 6(d), the forward process
under the MoPE framework integrated with every layer is as follows:
$$y = \mathrm{LayerNorm}(\mathbf{x}' + \mathrm{FFN}(\mathbf{x}')) + \sum_{i=1}^{n} \mathbf{x}\,\Delta\mathbf{W}^{layer}_{i} \cdot G^{layer}(\mathbf{x})_{i}, \tag{4.6}$$
$$\mathbf{x}' = \mathrm{LayerNorm}(\mathrm{SA}(\mathbf{x}) + \mathbf{x}), \tag{4.7}$$
where ΔW^layer and G^layer(x) are the parameter-efficient expert and the gating function applied to the
entire layer, respectively.
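A minimal sketch of Eqs. (4.6)-(4.7) is given below, assuming frozen attention and FFN sublayers and low-rank experts; the class name, dimensions, and soft (rather than top-k) gating are illustrative choices.

```python
# Minimal sketch of Eqs. (4.6)-(4.7): one gate and one set of parameter-efficient
# experts applied to the whole Transformer layer. `attn` and `ffn` stand for
# frozen pre-trained sublayers; the low-rank experts are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerWiseMoPE(nn.Module):
    def __init__(self, attn: nn.Module, ffn: nn.Module, d_model: int,
                 n_experts: int = 4, rank: int = 8):
        super().__init__()
        self.attn, self.ffn = attn, ffn                       # frozen base sublayers
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, rank, bias=False),
                          nn.Linear(rank, d_model, bias=False))
            for _ in range(n_experts))
        self.gate = nn.Linear(d_model, n_experts)             # one gate for the whole layer

    def forward(self, x):                                     # x: (B, T, d_model)
        x_prime = self.norm1(self.attn(x) + x)                # Eq. (4.7)
        weights = F.softmax(self.gate(x), dim=-1)             # G^layer on the layer input
        delta = sum(w.unsqueeze(-1) * expert(x)
                    for w, expert in zip(weights.unbind(-1), self.experts))
        return self.norm2(x_prime + self.ffn(x_prime)) + delta   # Eq. (4.6)
```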
In this context, the approach presented by MoLE [168] provides innovative insights. MoLE identifies that different layers within trained LoRAs exhibit unique features. In response to this finding, MoLE seeks to enhance the composition effect of trained LoRAs by dynamically adjusting the
layer-specific weights according to the desired objective. This is achieved by plugging a set of
trained LoRAs alongside a gating function into each layer. MoLE treats each layer of trained LoRAs
as an individual expert and only trains the gating to learn the optimal composition weights for a
specified domain. This dynamic linear composition strategy significantly extends the versatility of
LoRA, enabling its application across a broader spectrum of practical scenarios.
Fig. 7. Schematic representation of training and inference schemes related to MoE. It provides an abstracted
view of model transitions, without focusing on specific model states during training or inference. Subfigure (a)
depicts the original scheme without architectural transformation. Subfigure (b) depicts the merging of distinct
expert models, exemplified by BTX [145]. Subfigure (c) depicts the transition from a dense model to a sparse
model. Subfigure (d) depicts the inverse process, where a sparse model is converted to a dense model.
initialized gating module, and further training the model’s feed-forward layers under sparsity
conditions—specifically, by updating the weights locally within each compute node rather than
averaging gradients across nodes.
Nie et al. [110] present EvoMoE, an efficient end-to-end MoE training framework. EvoMoE decou-
ples the joint learning of experts and the sparse gate, emphasizing the acquisition of foundational
knowledge through a single expert at the inception of training. Subsequently, it spawns multiple
diverse experts and advances the diversification of experts by training with the novel Dense-to-
Sparse gate (DTS-Gate). The DTS-Gate initially operates as a dense gate activating all experts, then progressively and adaptively becomes sparser, routing tokens to a reduced number of experts. A similar
strategy is employed in the development of the MoE-LLaVA [95] large vision-language model,
which commences with a dense model, subsequently multiplies the feedforward network (FFN) to
create expert initializations, and proceeds to train exclusively the MoE layers, while keeping the
remaining model components static.
Komatsuzaki et al. [82] highlight the efficiency of sparse models in terms of quality and com-
putational cost, yet acknowledge their significant data requirements and the expense of training
from scratch at scale. To address this, they introduce a scheme termed "sparse upcycling," which
leverages pre-existing training investments by initializing a sparsely activated MoE model from
a pre-trained dense checkpoint. This involves transferring all parameters—and optionally their
associated optimizer states—from the original checkpoint, with the exception of the MoE gating
network parameters, which are not present in the dense model. Notably, the new MoE layers are
populated with identical copies of the original dense model’s FFN layers, and the gating mechanism
weights are initialized randomly. A critical obstacle in model upcycling is the initial performance
decrease resulting from structural modifications to a trained network. To mitigate this performance
regression during upcycling, the researchers propose normalizing the gate combination weights of each token to sum to 1. This approach is grounded in the notion that, in the dense model, each token
was processed by a singular "expert" FFN. While this normalization proved beneficial for upcycled
vision models, it was found to be detrimental to the performance of upcycled language models.
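The core initialization step of sparse upcycling can be sketched as follows; the function name, attribute layout, and the random initialization scale of the gate are our own assumptions, and details such as optimizer-state transfer are omitted.

```python
# Hedged sketch of sparse upcycling [82]: each expert in a new MoE layer is
# initialized as an identical copy of the dense checkpoint's FFN, while the
# gating network (absent from the dense model) is initialized randomly.
import copy
import torch.nn as nn

def upcycle_ffn_to_moe(dense_ffn: nn.Module, d_model: int, n_experts: int) -> nn.ModuleDict:
    # Identical copies of the pre-trained FFN become the experts.
    experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(n_experts))
    # The gate has no dense counterpart, so its weights are initialized randomly.
    gate = nn.Linear(d_model, n_experts)
    nn.init.normal_(gate.weight, std=0.02)
    nn.init.zeros_(gate.bias)
    return nn.ModuleDict({"experts": experts, "gate": gate})
```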
Building upon the sparse upcycling technique [82], the Skywork-MoE model [154] leverages
the architecture of its pre-developed Skywork-13B model [163], adopting its dense checkpoints as the initial states. Their empirical evidence indicates that the decision
between sparse upcycling and training from scratch should be informed by both the performance of
available dense checkpoints and the MoE-specific training resources, as models trained from scratch
consistently surpass their upcycled counterparts in performance. The study observes a decline in
average expert similarity throughout the training of upcycled MoEs, suggesting a diversification of
experts emerges during the process. Importantly, the Skywork-MoE analysis reveals that models
with greater expert similarity tend to underperform, establishing expert similarity as a potential
diagnostic tool during MoE training for upcycled models. Conversely, the expert similarity in
models trained from scratch remains minimal, implying that non-uniform expert initialization
promotes diversification.
Pan et al. [117] posit that the parameter inefficiency observed in MoE models stems from
conventional sparse training methodologies, where only a select group of experts is engaged and
refined for each input token. To counteract this, they introduce a hybrid framework for MoE models,
denoted as DS-MoE, which integrates dense training (activating all experts) with sparse inference
(sparse expert activation) to achieve higher computation and parameter efficiency. Notably, DS-MoE
maintains activation for all self-attention experts (MoA [182]) during inference but selectively
activates FFN experts, reflecting the observation that self-attention layers manifest considerably
less sparsity compared to FFN layers.
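The dense-training/sparse-inference idea can be sketched as below; for clarity this simplified re-implementation still evaluates every expert at inference and merely zeroes out the unselected outputs, whereas a real system would compute only the selected experts.

```python
# Sketch of the DS-MoE idea [117]: activate every expert during training (dense),
# but keep only the top-k gate weights at inference (sparse). This is a
# simplified illustration, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseTrainSparseInferMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts))
        self.gate = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                                     # x: (B, T, d_model)
        weights = F.softmax(self.gate(x), dim=-1)             # (B, T, E)
        if not self.training:                                 # sparse inference:
            topv, topi = weights.topk(self.top_k, dim=-1)     # keep only the top-k weights
            weights = torch.zeros_like(weights).scatter(-1, topi, topv)
        # Dense formulation: every expert runs; zeroed weights drop an expert's output.
        outputs = torch.stack([e(x) for e in self.experts], dim=-1)   # (B, T, d, E)
        return (outputs * weights.unsqueeze(-2)).sum(dim=-1)
```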
unified MoE model. Specifically, it consolidates the FFNs from all ELMs into a singular MoE module
at each layer, with a gating network determining the appropriate FFN expert for each token. Other
components, such as the self-attention layers from ELMs, are merged by averaging their weights.
The resulting model then undergoes MoE-finetuning on all the combined data to enable the gate to
effectively mix the FFN experts.
Wang et al. [158] point out that while the emergence of Foundation Models has made it easier to obtain
expert models tailored to specific tasks, the heterogeneity of data at test time necessitates more
than a single expert. Accordingly, they explore the Fusion of Experts (FoE) challenge, which aims
to integrate outputs from expert models that provide diverse but complementary insights into the
data distribution, formulating it as an instance of supervised learning.
4.5 Derivatives
Building upon the principles of algorithm design highlighted earlier, numerous studies have drawn
inspiration from the Mixture of Experts (MoE) framework, proposing a range of MoE variants.
We categorize these innovative models as derivatives of the MoE. For instance, Xue et al. [171]
introduced WideNet, an approach that increases model width by substituting the feed-forward
network (FFN) with an MoE layer while maintaining shared trainable parameters across Transformer
layers, except for the normalization layers. Subsequently, Tan et al. [146] presented the Sparse
Universal Transformer (SUT), an efficient enhancement of the Universal Transformer, which
is characterized by parameter-sharing across its layers. SUT incorporates a Sparse Mixture of
Experts and a novel stick-breaking-based dynamic halting mechanism, thus reducing computational
complexity without compromising parameter efficiency or generalization capabilities. Moreover,
traditional MoE models often employ discrete matching between experts and tokens [6, 44, 49, 86, 135, 193, 197], a practice associated with training instability and uneven expert utilization. Addressing these challenges, Antoniak et al. [5] propose the Mixture of Tokens (MoT), which blends tokens from different examples before presenting them to the experts. Thus, MoT
enables the model to benefit from a wider array of token-expert combinations.
Recently, the MoE’s principle of assigning specialized knowledge to individual experts has been
adapted to parameter-efficient fine-tuning (PEFT). Choi et al. [23] propose the sparse mixture-of-
prompts (SMoP), a method that utilizes a gating mechanism to train multiple short soft prompts,
each adept at processing distinct subsets of data. This addresses the inefficiencies encountered with
long soft prompts during prompt tuning. The MoE framework has also been integrated into lifelong
learning (LLL), which seeks to facilitate continuous learning from an ongoing stream of data. The
Lifelong-MoE model [19] dynamically expands model capacity by adding experts with regularized
pretraining, effectively mitigating the issue of catastrophic forgetting [81] typically associated with
straightforward fine-tuning. In a recent development, the MoE concept of conditional computation
has been further refined to optimize resource allocation in transformer-based language models
(LMs). The Mixture-of-Depths (MoD) [124] employs a binary gating network to decide whether
a token should be processed by a given Transformer layer. As a result, MoD transformers can
dynamically allocate computational resources (FLOPs) to specific sequence positions, achieving a
lower overall FLOP footprint compared to vanilla or MoE-based transformers.
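A simplified sketch of MoD-style routing is shown below, assuming a fixed per-sequence capacity and a sigmoid-scaled residual update so the router receives gradients; the original capacity schedule, training recipe, and causal-inference details are omitted, and all names are illustrative.

```python
# Simplified sketch of Mixture-of-Depths routing [124]: a scalar router selects,
# per sequence, which tokens are processed by the block; the rest pass through.
# `block` is the block's residual branch (e.g., attention + FFN without the skip).
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    def __init__(self, block: nn.Module, d_model: int, capacity: float = 0.5):
        super().__init__()
        self.block = block
        self.router = nn.Linear(d_model, 1)     # scalar routing score per token
        self.capacity = capacity                # fraction of tokens that get computed

    def forward(self, x):                       # x: (B, T, d_model)
        B, T, d = x.shape
        k = max(1, int(self.capacity * T))
        scores = self.router(x).squeeze(-1)     # (B, T)
        topi = scores.topk(k, dim=-1).indices   # tokens selected for computation
        idx = topi.unsqueeze(-1).expand(-1, -1, d)
        selected = torch.gather(x, 1, idx)      # only selected tokens enter the block
        gate = torch.sigmoid(torch.gather(scores, 1, topi)).unsqueeze(-1)
        updated = selected + gate * self.block(selected)   # gated residual update
        return x.scatter(1, idx, updated)       # unselected tokens are unchanged
```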
In summary, the evolution of MoE derivatives reveals a trend in which models either adopt the conditional-computation aspect of the gating mechanism or combine the MoE structure with other tasks by assigning specialized knowledge to individual experts, as in the aforementioned prompt tuning [23] and lifelong learning [19] with MoE, demonstrating the versatility and adaptability of the MoE architecture across different domains.
Fig. 8. Schematic depiction of diverse parallel strategies for MoE. For clarity and conciseness, this illustration
omits some All-to-All, All-Reduce, Point-to-Point communication within parallelism, and Normalization,
Encode, Decode, Gate in subfigures (b), (c), and (d).
Table 4. Comparative overview of the open-source MoE system frameworks, arranged chronologically by
reference publication date from newest to oldest. We give the count of GitHub stars as of June 2024.
| Reference | Affiliation | Computation | Communication | Storage | Link | Star |
|---|---|---|---|---|---|---|
| OpenMoE [172] | Colossal-AI | ✓ | ✓ | | https://github.com/hpcaitech/ColossalAI | 38K |
| ScatterMoE [147] | Mila Quebec | ✓ | | | https://github.com/shawntan/scattermoe | 140 |
| Megablocks [50] | Stanford University | ✓ | | | https://github.com/stanford-futuredata/megablocks | 1.1K |
| Tutel [69] | Microsoft | ✓ | ✓ | | https://github.com/microsoft/tutel | 672 |
| SE-MoE [136] | Baidu | ✓ | ✓ | ✓ | https://github.com/PaddlePaddle/Paddle | 21K |
| HetuMoE [112] | Peking University | ✓ | ✓ | | https://github.com/PKU-DAIR/Hetu | 236 |
| Deepspeed-MoE [121] | Microsoft | ✓ | ✓ | | https://github.com/microsoft/DeepSpeed | 33K |
| FastMoE [59] | Tsinghua University | ✓ | ✓ | | https://github.com/laekov/fastmoe | 1.4K |
| Fairseq [6, 115] | Meta | | | | https://github.com/facebookresearch/fairseq/tree/moe | 29K |
| Mesh-TensorFlow [134] | Google | | | | https://github.com/tensorflow/mesh | 1.6K |
In the subsequent discussion, we delineate the challenges introduced by MoE models from com-
putation, communication, and storage aspects, concurrently reviewing existing research addressing
these issues. Table 4 shows an overview of the open-source MoE frameworks.
5.1 Computation
Although MoE is designed to scale model parameters efficiently without increasing computational
demand, it encounters challenges pertaining to computational efficiency. One concern is the
imbalance of computational load across distributed devices employing expert parallelism, which
incurs significant synchronization overhead as the system awaits the processing completion of the
most heavily loaded expert. Such issues are typically addressed through algorithmic strategies, such
as optimized gating mechanisms and expert capacity adjustments, as discussed in the preceding
section. Besides, solutions like SE-MoE [136], Tutel [69], FlexMoE [111] and SmartMoE [179] have
introduced dynamic expert placement strategies to distribute the workload as equally as possible
among devices. Additionally, FasterMoE [60] has implemented a novel dynamic shadowed expert
strategy, replicating experts on multiple devices to mitigate severe load imbalance. These model
placement related strategies impact both computation and communication efficiency.
Another concern is that MoE introduces additional computational overhead through operations such as gate routing, input encoding, and output decoding. Unlike expert computations, which mirror operations in dense models and benefit from extensive optimization on prevalent hardware such as GPUs, these MoE operations are characterized by redundant computation and memory movement, resulting in low efficiency on computing devices. Therefore, recent studies like DeepSpeed-MoE [121], FastMoE [59], HetuMoE [112] and Tutel [69] have focused on the development of tailored GPU kernels to enhance the efficiency of MoE operations.
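For intuition, the gate-routing, encode (group tokens by expert), and decode (restore token order) steps that these kernels accelerate can be expressed in plain PyTorch as follows; this top-1, loop-based version is for exposition only and is far from an optimized kernel.

```python
# Illustrative "encode -> per-expert compute -> decode" pattern of an MoE layer,
# written in plain PyTorch for clarity. Top-1 routing is assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

def moe_forward(x: torch.Tensor, gate: nn.Linear, experts: nn.ModuleList) -> torch.Tensor:
    tokens = x.reshape(-1, x.size(-1))                 # (N, d): flatten batch and sequence
    probs = F.softmax(gate(tokens), dim=-1)
    weight, expert_id = probs.max(dim=-1)              # top-1 routing decision per token
    order = torch.argsort(expert_id)                   # "encode": group tokens by expert
    grouped = tokens[order]
    counts = torch.bincount(expert_id, minlength=len(experts))
    outputs, start = torch.empty_like(grouped), 0
    for e, expert in enumerate(experts):               # one dense GEMM per expert
        end = start + counts[e].item()
        if end > start:
            outputs[start:end] = expert(grouped[start:end])
        start = end
    decoded = torch.empty_like(outputs)                # "decode": restore original order
    decoded[order] = outputs
    return (weight.unsqueeze(-1) * decoded).reshape_as(x)
```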
In contexts where multiple experts are deployed on a single GPU device, MegaBlocks [50]
reformulates MoE computation in terms of block-sparse operations, developing specialized block-
sparse GPU kernels that efficiently handle the dynamic workloads without dropping tokens. Zheng
et al. [187] propose PIT, a deep-learning compiler tailored for dynamic sparsity of MoE, which can
find feasible PIT rules for all the operators within a model and generate optimized GPU kernels for
them. PIT employs a novel tiling mechanism, utilizing the Permutation Invariant Transformation
(PIT), a mathematically proven property, to transform multiple sparsely located micro-tiles into
a GPU-efficient dense tile without changing the computation results, thus achieving both high
GPU utilization and low coverage waste. Despite these advancements, Tan et al. [147] highlight
remaining optimization potential within current MoE frameworks such as MegaBlocks and PIT,
which commence with an initial scatter-to-group data copy that increases memory footprint and
requires a translation of the MoE problem into the sparse matrix format. Although this translation
5.2 Communication
In expert parallelism, the quadruple invocation of All-to-All communication during both the for-
ward and backward propagation phases within each MoE layer causes a significant overhead, even
emerging as the primary constraint on efficiency. The All-to-All communication paradigm encom-
passes both intra-node (via PCIe, pre-4th-generation NVLink) and inter-node (Ethernet, Infiniband,
4th-generation NVLink) communication channels. The efficiency of such communication is con-
tingent upon a multitude of factors, including the heterogeneity of channel bandwidths, network
topology, and the collective communication algorithms. Moreover, load imbalances intrinsic to
MoE may exacerbate these inefficiencies by inducing synchronization delays.
To optimize the use of high intra-node bandwidth and low inter-node bandwidth, DeepSpeed-
MoE [121] and HetuMoE [112] have introduced a hierarchical All-to-All communication strategy
that enhances intra-node processing and reduces inter-node data exchanges. Besides, FasterMoE
[60], TA-MoE [13] and SE-MoE [136] have introduced topology-aware routing strategies aimed
at mitigating cross-node expert selection, thereby reducing inter-node communication burdens.
Additionally, ExFlow [174] exploits expert affinity, anticipating expert allocation across layers to
maximize the retention of token processing within local GPU confines. The strategic allocation
of experts to minimize network traffic and leverage high-bandwidth connections is a prevalent
approach in distributed MoE systems [121, 142, 154], and it is often integrated with the placement
design of non-expert modules to optimize overall system performance.
Given the concurrent feature of communication and computation, pipelining [67, 107, 119]
is commonly employed to overlap their execution, thereby reducing the total time cost. This
technique, which is integrated in systems such as Tutel [69], FasterMoE [60], and MPipeMoE [185],
orchestrates overlapping between All-to-All communication and expert computation. Notably,
Lancet [75] underscores the inherent constraints of these pipelining methods, particularly the
bounded duration for which expert computation and communication can overlap. To address this
limitation, Lancet partitions non-MoE computations and integrates them into the pipeline during
forward pass, and strategically schedules gradient weight computations to augment overlap in the
backward pass. With the same objective of extending the overlap duration, ScMoE [12] restructures
the MoE architecture to simultaneously process representations from preceding layers while
engaging with current-layer representations. This decoupling of communication dependencies
facilitates substantial, and in certain cases, complete overlapping between communication and
computation. Snowflake Arctic [152] employs a similar design, utilizing a Dense-MoE hybrid
transformer architecture to effectively overlap communication with computation.
5.3 Storage
The ever-increasing parameters in MoE models exacerbate the constraints posed by memory
capacity in compute devices, a challenge already pronounced in dense models. While expert
parallelism offers a mitigation strategy through the distribution of experts across multiple devices,
individual devices may still struggle to accommodate numerous experts, particularly in inference
contexts where device capacity—such as that of edge devices (PCs, smartphones, IoTs)—is inherently
more restricted.
Considering the hierarchical storage pyramid, solutions like SE-MoE [136], Pre-gated MoE
[70], and EdgeMoE [176] selectively retain only essential non-expert parameters and the active
expert parameters within the GPU’s High-Bandwidth Memory (HBM), offloading inactive expert
parameters to CPU memory or SSDs. These patterns incur additional overhead from data transfer
across the storage hierarchy, thus they integrate expert selection forecasting and expert parameter
prefetching techniques to overlap parameter access with computation.
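A hedged sketch of this offload-and-prefetch pattern is given below, using a side CUDA stream to overlap host-to-device copies with computation; the class, its interface, and the eviction policy are illustrative assumptions rather than the design of any specific system, and the expert-selection forecast is left as an external input.

```python
# Illustrative offloading pattern: CPU-resident expert weights are prefetched to
# the GPU on a separate CUDA stream so that the copies overlap with computation.
# Names and the caching policy are assumptions for illustration only.
import copy
import torch

class ExpertOffloader:
    def __init__(self, cpu_experts, device: str = "cuda"):
        self.cpu_experts = cpu_experts           # CPU master copies (ideally pinned memory)
        self.device = device
        self.copy_stream = torch.cuda.Stream()   # side stream for host-to-device copies
        self.cache = {}                          # expert_id -> GPU-resident replica

    def prefetch(self, expert_ids):
        # Issue asynchronous copies for the experts predicted to be needed next.
        with torch.cuda.stream(self.copy_stream):
            for e in expert_ids:
                if e not in self.cache:
                    replica = copy.deepcopy(self.cpu_experts[e])  # keep the CPU master intact
                    self.cache[e] = replica.to(self.device, non_blocking=True)

    def get(self, expert_id):
        # Ensure outstanding prefetch copies are visible before the expert is used.
        torch.cuda.current_stream().wait_stream(self.copy_stream)
        if expert_id not in self.cache:          # miss: synchronous fallback copy
            self.cache[expert_id] = copy.deepcopy(self.cpu_experts[expert_id]).to(self.device)
        return self.cache[expert_id]

    def evict(self, expert_ids):
        for e in expert_ids:                     # free GPU memory held by cold experts
            self.cache.pop(e, None)
```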
In addition, MPipeMoE [185] introduces a strategy to reduce the memory overhead associated
with activations and temporary buffers. This is achieved by sharing buffers among various partitions
of tensors, while leveraging recomputation/communication and CPU offloading to recover the
requisite activations in the backward pass.
segments each input image into 𝑛 patches (or tokens) and allocates 𝑙 patches (𝑙 ≪ 𝑛) to each expert
for processing through prioritized routing to enhance efficiency.
Recommender System. Recommender systems are quintessential in various large-scale ap-
plications where they are required to balance and optimize multiple objectives simultaneously
[188]. A prime example is in the domain of movie recommendations, where the aim is not only
to suggest movies that align with users’ immediate preferences but also to ensure subsequent user satisfaction with the selected movies [101]. The effectiveness of multi-task models hinges on the
intricate interplay between task-specific goals and the relationships between tasks. Consequently,
understanding the trade-offs inherent in these relationships is crucial. Mixture-of-experts (MoE)
models with gating mechanisms have emerged as a popular paradigm for tackling the complex-
ities of multi-task learning in recommender systems. Ma et al. [101] introduce the multi-gate
mixture-of-experts (MMOE) approach, which capitalizes on the concept of shared expert submod-
els across all tasks, guided by a gating network tailored to each individual task. Addressing the
“seesaw phenomenon,” where the improvement of one task’s performance can detrimentally affect another, is another challenge in multi-task learning. To counteract this, Tang et al. [148] propose
the Progressive Layered Extraction (PLE) model for personalized recommendations. PLE distinctly
segregates shared and task-specific components and employs a progressive routing mechanism
to incrementally extract and refine the semantic knowledge, thereby enhancing the efficacy of
joint representation learning and the routing of information across tasks. Recently, in the pursuit
of capturing both the long-term and short-term user preferences that are particularly salient in
sequential recommendation scenarios, a novel method named AdaMCT [77] has been proposed.
AdaMCT utilizes layer-aware adaptive mixture units to dynamically blend CNN and Transformer
experts, thereby tailoring the recommendations to individual user patterns.
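For reference, the multi-gate structure of MMoE [101] can be sketched as follows, with shared experts, one softmax gate per task, and per-task tower heads; the layer sizes and single-layer towers are illustrative simplifications.

```python
# Minimal sketch of the multi-gate mixture-of-experts (MMoE) structure [101]:
# all tasks share the same expert networks, while each task has its own gate
# and tower head. Dimensions and network depths are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMoE(nn.Module):
    def __init__(self, d_in: int, d_expert: int = 64, n_experts: int = 4, n_tasks: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_in, d_expert), nn.ReLU()) for _ in range(n_experts))
        self.gates = nn.ModuleList(nn.Linear(d_in, n_experts) for _ in range(n_tasks))
        self.towers = nn.ModuleList(nn.Linear(d_expert, 1) for _ in range(n_tasks))

    def forward(self, x):                                    # x: (B, d_in)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, d_expert)
        outputs = []
        for gate, tower in zip(self.gates, self.towers):     # one gate and tower per task
            w = F.softmax(gate(x), dim=-1).unsqueeze(-1)     # (B, E, 1) task-specific gate
            outputs.append(tower((w * expert_out).sum(dim=1)))
        return outputs                                       # one prediction per task
```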
Multimodal Applications. Multimodal models are designed to process and integrate various
data types within a single neural network framework [109]. These models often simultaneously
encompass two primary data modalities: images and text [7, 156, 191]. The Mixture of Experts (MoE)
architecture has gained considerable traction as the foundation of multimodal models due to its
capacity for expert layers to learn distinct modality partitioning [106]. One notable implementation
of this approach is the LIMoE model [106], a sparse mixture of expert models tailored for multimodal
learning. LIMoE is trained on both images and text data, employing contrastive loss and an entropy-
based regularization technique to address load balancing challenges inherent in MoE systems.
Subsequently, Shen et al. [138] and Lin et al. [95] have further investigated the potential of MoE
for scaling vision-language models, offering valuable insights that contribute to the development
of more efficient and effective multimodal learning systems. Furthermore, to address the specific
issue of task conflicts in instruction tuning of Large Vision-Language Models (LVLMs), MoCLE
[53] integrates MoE with LoRA [66] experts and a distinct universal expert to activate task-specific
model parameters based on clusters of instructions. In parallel, to mitigate data conflicts, LLaVA-
MoLE [15] deploys a set of LoRA experts, specifically for the MLP layer, combined with a top-1
gating mechanism to refine instruction tuning in Multimodal Large Language Models (MLLMs).
While the MLLMs employing MoE architectures have demonstrated impressive performances,
they generally involve a limited number of experts and modalities [92]. To address this limitation,
Li et al. [92] introduce the pioneering Uni-MoE, a unified MLLM with MoE architecture capable
of managing an extensive range of modalities. They introduce a progressive training strategy to
bolster expert collaboration and generalization across modalities, and they utilize LoRA [66], a
lightweight fine-tuning methodology, to minimize computational demands.
develops a specialized niche [30]. Thus, there is a pressing need for further research into hardware
optimization techniques that more adeptly accommodate sparse computations. Such advancements
would not only preserve the model’s capacity but could also significantly enhance the performance
and efficiency of MoE models.
Generalization and Robustness. MoE models have demonstrated increased computational
efficiency during pre-training phases. However, there is a notable propensity for sparse MoE
architectures to overfit to specific tasks or datasets, which undermines their ability to generalize
effectively [42, 49, 137, 197]. To enhance the generalization and robustness of MoE models when
encountering unseen data and diverse input variations, various strategies have been explored.
These include regularization techniques such as dropout [49] and token dropping [197], as well as
multi-task instruction tuning [42, 137]. Looking ahead, there is potential for further advancements in addressing this challenge. Future endeavors could explore innovative regularization methods, refined multi-
task learning frameworks, or the incorporation of meta-learning concepts that bolster the MoE
models’ robustness and extend their generalization capabilities across an even broader spectrum of
downstream tasks.
Interpretability and Transparency. The inherent complexity of MoE models, coupled with
their dynamic gating of inputs to specialized experts, poses significant challenges to interpretability.
This becomes particularly problematic in contexts where comprehending the rationale behind
the model’s decisions is essential. Enhancing the interpretability of MoE models is therefore
critical, not only to facilitate a clearer understanding of their decision-making processes but also to
address underlying challenges such as load balancing [44, 49, 86] and the mitigation of knowledge
redundancy [30, 151]. In light of these considerations, there is a pressing need for future studies
focused on the development of methods and tools that can effectively visualize and explain the
behavior of individual experts within MoE models, as well as the nature of their interactions. Such
advancements would significantly improve our grasp of MoE models and bolster their ongoing
development, ensuring their gating decisions are transparent and trustworthy.
Optimal Expert Architecture. The design of MoE architectures, encompassing the selection
of network types and the quantity of experts, significantly influences the efficacy of multi-task
learning across various domains. A plethora of network architectures has been adopted as experts,
including LSTM [135], CNN [25, 183], FFNs (MLPs) [49, 86, 117, 197], Attention [140, 182], and LoRA
[43, 88, 100]. Among these, FFNs as experts remain the most prevalent. Despite their considerable
achievements, the exploration of hybrid network types within experts (motivated by the distinct feature-processing capabilities of different network architectures), as well as the development
of innovative expert architectures, remains nascent areas of research. Furthermore, the strategic
allocation of a varying number of experts across different layers of the model presents an area
ripe for investigation. This is due to two primary considerations: 1) different layers of the model
capture semantic information at varying levels of granularity; 2) an excessive number of experts
can complicate the training process and augment computational costs, while an insufficient number
of experts might lead to knowledge redundancy and diminish the specialization capabilities of the
experts. To navigate these challenges, the development of automated architecture search methods
specifically designed for MoE models is imperative [192]. Such approaches could systematically
identify optimal configurations, balancing the trade-offs between computational efficiency and the
specialization of experts.
Integration with Existing Frameworks. Ensuring seamless integration of MoE models into
existing large language models (LLMs) is crucial for their broad adoption. It is particularly vital
to enable adaptation of LLMs to MoE architecture without necessitating training from scratch,
as it can significantly reduce resource consumption. Recent studies [15, 43, 51, 88, 100, 168, 178]
have demonstrated the efficacy of combining Parameter-efficient Fine-tuning (PEFT) techniques
with MoE frameworks, offering a promising method for incorporating MoE into established LLMs.
However, these methods may compromise model performance or complicate the existing parallel
strategies used for pretraining and inference [57]. Advancing the development of modular and
plug-and-play MoE components is essential. Additionally, optimizing these components for training
and deployment across diverse computing environments and hardware platforms will expand their
applicability. Such advancements are expected to enhance the versatility and efficiency of MoE
models, making them more accessible for a wide range of applications and platforms.
By addressing these challenges, we can unlock the full potential of MoE models, paving the way
for more efficient and powerful machine learning systems, particularly large language models
(LLMs), that are capable of handling the ever-growing complexity and diversity of real-world tasks.
8 CONCLUSION
In this survey, we present a systematic and comprehensive review of the literature on MoE models,
serving as a valuable compendium for researchers exploring the landscape of MoE technologies. We
introduce a new taxonomy for MoE models and provide an in-depth analysis that encompasses three
distinct vantage points: algorithm design, system design, and practical applications, complemented
by a curated collection of open-source implementations, detailed hyperparameter configurations,
and thorough empirical assessments. Moreover, we highlight the critical challenges faced in the
field and outline the most promising avenues for future investigation. To support the continuous
dissemination of knowledge and advancements, we have established a dedicated resource repository
to facilitate ongoing updates and the sharing of cutting-edge developments in MoE research. We
hope this survey can serve as an essential reference for researchers seeking to rapidly acquaint themselves with MoE models, and that it will actively contribute to the vibrant progression of this field.
REFERENCES
[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida,
Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774
(2023).
[2] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. 2023.
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. In Proceedings of the
2023 Conference on Empirical Methods in Natural Language Processing. 4895–4901.
[3] Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. 2017. Expert gate: Lifelong learning with a network of
experts. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3366–3375.
[4] Amjad Almahairi, Nicolas Ballas, Tim Cooijmans, Yin Zheng, Hugo Larochelle, and Aaron Courville. 2016. Dynamic
capacity networks. In International Conference on Machine Learning. PMLR, 2549–2558.
[5] Szymon Antoniak, Sebastian Jaszczur, Michał Krutul, Maciej Pióro, Jakub Krajewski, Jan Ludziejewski, Tomasz
Odrzygóźdź, and Marek Cygan. 2023. Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation. arXiv
preprint arXiv:2310.15961 (2023).
[6] Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du,
Srinivasan Iyer, Ramakanth Pasunuru, et al. 2021. Efficient large scale language modeling with mixtures of experts.
arXiv preprint arXiv:2112.10684 (2021).
[7] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2018. Multimodal machine learning: A survey and
taxonomy. IEEE transactions on pattern analysis and machine intelligence 41, 2 (2018), 423–443.
[8] Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. 2015. Conditional computation in neural
networks for faster models. arXiv preprint arXiv:1511.06297 (2015).
[9] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. 2013. Estimating or propagating gradients through stochastic
neurons for conditional computation. arXiv preprint arXiv:1308.3432 (2013).
[10] Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi
Du, Zhe Fu, et al. 2024. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint
arXiv:2401.02954 (2024).
[11] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural
[33] Do Huu Dat, Po Yuan Mao, Tien Hoang Nguyen, Wray Buntine, and Mohammed Bennamoun. 2023. HOMOE: A
Memory-Based and Composition-Aware Framework for Zero-Shot Learning with Hopfield Network and Soft Mixture
of Experts. arXiv preprint arXiv:2311.14747 (2023).
[34] Databricks. 2024. Introducing DBRX: A New State-of-the-Art Open LLM. https://www.databricks.com/blog/
introducing-dbrx-new-state-art-open-llm
[35] Andrew Davis and Itamar Arel. 2013. Low-rank approximations for conditional feedforward computation in deep
neural networks. arXiv preprint arXiv:1312.4461 (2013).
[36] DeepSeek-AI. 2024. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.
arXiv:2405.04434 [cs.CL]
[37] Marc Deisenroth and Jun Wei Ng. 2015. Distributed gaussian processes. In International conference on machine
learning. PMLR, 1481–1490.
[38] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[39] Shizhe Diao, Tianyang Xu, Ruijia Xu, Jiawei Wang, and Tong Zhang. 2023. Mixture-of-Domain-Adapters: Decoupling
and Injecting Domain Knowledge to Pre-trained Language Models’ Memories. In The 61st Annual Meeting Of The
Association For Computational Linguistics.
[40] Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min
Chan, Weize Chen, et al. 2022. Delta tuning: A comprehensive study of parameter efficient methods for pre-trained
language models. arXiv preprint arXiv:2203.06904 (2022).
[41] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner,
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words:
Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
[42] Shihan Dou, Enyu Zhou, Yan Liu, Songyang Gao, Jun Zhao, Wei Shen, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Xiaoran
Fan, et al. 2023. The Art of Balancing: Revolutionizing Mixture of Experts for Maintaining World Knowledge in
Language Model Alignment. arXiv preprint arXiv:2312.09979 (2023).
[43] Shihan Dou, Enyu Zhou, Yan Liu, Songyang Gao, Jun Zhao, Wei Shen, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Xiaoran
Fan, et al. 2023. Loramoe: Revolutionizing mixture of experts for maintaining world knowledge in language model
alignment. arXiv preprint arXiv:2312.09979 (2023).
[44] Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou,
Adams Wei Yu, Orhan Firat, et al. 2022. Glam: Efficient scaling of language models with mixture-of-experts. In
International Conference on Machine Learning. PMLR, 5547–5569.
[45] Dheeru Dua, Shruti Bhosale, Vedanuj Goswami, James Cross, Mike Lewis, and Angela Fan. 2022. Tricks for Training
Sparse Translation Models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies. 3340–3345.
[46] David Eigen, Marc’Aurelio Ranzato, and Ilya Sutskever. 2013. Learning factored representations in a deep mixture of
experts. arXiv preprint arXiv:1312.4314 (2013).
[47] Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur
Celebi, Guillaume Wenzek, Vishrav Chaudhary, et al. 2021. Beyond english-centric multilingual machine translation.
Journal of Machine Learning Research 22, 107 (2021), 1–48.
[48] William Fedus, Jeff Dean, and Barret Zoph. 2022. A review of sparse expert models in deep learning. arXiv preprint
arXiv:2209.01667 (2022).
[49] William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with
simple and efficient sparsity. Journal of Machine Learning Research 23, 120 (2022), 1–39.
[50] Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia. 2023. Megablocks: Efficient sparse training with
mixture-of-experts. Proceedings of Machine Learning and Systems 5 (2023).
[51] Chongyang Gao, Kezhen Chen, Jinmeng Rao, Baochen Sun, Ruibo Liu, Daiyi Peng, Yawen Zhang, Xiaoyuan Guo, Jie
Yang, and VS Subrahmanian. 2024. Higher Layers Need More LoRA Experts. arXiv preprint arXiv:2402.08562 (2024).
[52] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In Proceedings of the
fourteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings,
315–323.
[53] Yunhao Gou, Zhili Liu, Kai Chen, Lanqing Hong, Hang Xu, Aoxue Li, Dit-Yan Yeung, James T Kwok, and Yu Zhang. 2023.
Mixture of cluster-conditional lora experts for vision-language instruction tuning. arXiv preprint arXiv:2312.12379
(2023).
[54] Sam Gross, Marc’Aurelio Ranzato, and Arthur Szlam. 2017. Hard mixtures of experts for large scale weakly supervised
vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6865–6873.
[55] Albert Gu and Tri Dao. 2023. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint
arXiv:2312.00752 (2023).
[56] Suchin Gururangan, Mike Lewis, Ari Holtzman, Noah A Smith, and Luke Zettlemoyer. 2022. DEMix Layers: Dis-
entangling Domains for Modular Language Modeling. In Proceedings of the 2022 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies. 5557–5576.
[57] Zeyu Han, Chao Gao, Jinyang Liu, Sai Qian Zhang, et al. 2024. Parameter-efficient fine-tuning for large models: A
comprehensive survey. arXiv preprint arXiv:2403.14608 (2024).
[58] Hussein Hazimeh, Zhe Zhao, Aakanksha Chowdhery, Maheswaran Sathiamoorthy, Yihua Chen, Rahul Mazumder,
Lichan Hong, and Ed Chi. 2021. Dselect-k: Differentiable selection in the mixture of experts with applications to
multi-task learning. Advances in Neural Information Processing Systems 34 (2021), 29335–29347.
[59] Jiaao He, Jiezhong Qiu, Aohan Zeng, Zhilin Yang, Jidong Zhai, and Jie Tang. 2021. Fastmoe: A fast mixture-of-expert
training system. arXiv preprint arXiv:2103.13262 (2021).
[60] Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. 2022. Fastermoe:
modeling and optimizing training of large-scale dynamic pre-trained models. In Proceedings of the 27th ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming. 120–134.
[61] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020.
Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300 (2020).
[62] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob
Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874
(2021).
[63] Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415 (2016).
[64] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las
Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language
models. arXiv preprint arXiv:2203.15556 (2022).
[65] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo,
Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International Conference on
Machine Learning. PMLR, 2790–2799.
[66] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2021. LoRA:
Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations.
[67] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam,
Quoc V Le, Yonghui Wu, et al. 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism.
Advances in neural information processing systems 32 (2019).
[68] Yongqi Huang, Peng Ye, Xiaoshui Huang, Sheng Li, Tao Chen, and Wanli Ouyang. 2023. Experts weights averaging:
A new general training scheme for vision transformers. arXiv preprint arXiv:2308.06093 (2023).
[69] Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat
Ram, et al. 2023. Tutel: Adaptive mixture-of-experts at scale. Proceedings of Machine Learning and Systems 5 (2023).
[70] Ranggi Hwang, Jianyu Wei, Shijie Cao, Changho Hwang, Xiaohu Tang, Ting Cao, Mao Yang, and Minsoo Rhu. 2023.
Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference. arXiv preprint
arXiv:2308.12066 (2023).
[71] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts.
Neural computation 3, 1 (1991), 79–87.
[72] Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Leon Song, Samyam Rajbhandari, and Yuxiong
He. 2023. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer
models. arXiv preprint arXiv:2309.14509 (2023).
[73] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las
Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint
arXiv:2310.06825 (2023).
[74] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh
Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. arXiv preprint
arXiv:2401.04088 (2024).
[75] Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, and Yida Wang. 2024. Lancet: Accelerating Mixture-
of-Experts Training via Whole Graph Computation-Communication Overlapping. arXiv preprint arXiv:2404.19429
(2024).
[76] Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language Models for
Code Generation. arXiv preprint arXiv:2406.00515 (2024).
[77] Juyong Jiang, Peiyan Zhang, Yingtao Luo, Chaozhuo Li, Jae Boum Kim, Kai Zhang, Senzhang Wang, Xing Xie, and
Sunghun Kim. 2023. AdaMCT: adaptive mixture of CNN-transformer for sequential recommendation. In Proceedings
of the 32nd ACM International Conference on Information and Knowledge Management. 976–986.
[78] Michael I Jordan and Robert A Jacobs. 1994. Hierarchical mixtures of experts and the EM algorithm. Neural computation
6, 2 (1994), 181–214.
[79] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford,
Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).
[80] Young Jin Kim, Ammar Ahmad Awan, Alexandre Muzio, Andres Felipe Cruz Salinas, Liyang Lu, Amr Hendy, Samyam
Rajbhandari, Yuxiong He, and Hany Hassan Awadalla. 2021. Scalable and efficient moe training for multitask
multilingual models. arXiv preprint arXiv:2109.10465 (2021).
[81] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan,
John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural
networks. Proceedings of the national academy of sciences 114, 13 (2017), 3521–3526.
[82] Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay,
Mostafa Dehghani, and Neil Houlsby. 2022. Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints.
In The Eleventh International Conference on Learning Representations.
[83] Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi,
and Bryan Catanzaro. 2023. Reducing activation recomputation in large transformer models. Proceedings of Machine
Learning and Systems 5 (2023).
[84] Sneha Kudugunta, Yanping Huang, Ankur Bapna, Maxim Krikun, Dmitry Lepikhin, Minh-Thang Luong, and Orhan
Firat. 2021. Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference. In Findings of the Association
for Computational Linguistics: EMNLP 2021. 3577–3599.
[85] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang,
and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In
Proceedings of the 29th Symposium on Operating Systems Principles. 611–626.
[86] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam
Shazeer, and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding.
arXiv preprint arXiv:2006.16668 (2020).
[87] Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. 2021. Base layers: Simplifying
training of large, sparse models. In International Conference on Machine Learning. PMLR, 6265–6274.
[88] Dengchun Li, Yingzi Ma, Naizheng Wang, Zhiyuan Cheng, Lei Duan, Jie Zuo, Cal Yang, and Mingjie Tang. 2024.
MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA based Mixture of Experts. arXiv preprint
arXiv:2404.15159 (2024).
[89] Margaret Li, Suchin Gururangan, Tim Dettmers, Mike Lewis, Tim Althoff, Noah A Smith, and Luke Zettlemoyer. 2022.
Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models. In First Workshop on Interpolation
Regularizers and Beyond at NeurIPS 2022.
[90] Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. 2021. Sequence parallelism: Long sequence
training from system perspective. arXiv preprint arXiv:2105.13120 (2021).
[91] Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings
of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference
on Natural Language Processing (Volume 1: Long Papers). 4582–4597.
[92] Yunxin Li, Shenyuan Jiang, Baotian Hu, Longyue Wang, Wanqi Zhong, Wenhan Luo, Lin Ma, and Min Zhang. 2024.
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts. arXiv preprint arXiv:2405.11273 (2024).
[93] Vladislav Lialin, Vijeta Deshpande, and Anna Rumshisky. 2023. Scaling down to scale up: A guide to parameter-efficient
fine-tuning. arXiv preprint arXiv:2303.15647 (2023).
[94] Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom,
Yonatan Belinkov, Shai Shalev-Shwartz, et al. 2024. Jamba: A hybrid transformer-mamba language model. arXiv
preprint arXiv:2403.19887 (2024).
[95] Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, and Li Yuan. 2024. Moe-llava:
Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947 (2024).
[96] Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel.
2022. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural
Information Processing Systems 35 (2022), 1950–1965.
[97] Qidong Liu, Xian Wu, Xiangyu Zhao, Yuanshao Zhu, Derong Xu, Feng Tian, and Yefeng Zheng. 2023. Moelora: An
moe-based parameter efficient fine-tuning method for multi-task medical applications. arXiv preprint arXiv:2310.18339
(2023).
[98] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin
transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international
conference on computer vision. 10012–10022.
[99] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic represen-
tations for vision-and-language tasks. Advances in neural information processing systems 32 (2019).
[100] Tongxu Luo, Jiahe Lei, Fangyu Lei, Weihao Liu, Shizhu He, Jun Zhao, and Kang Liu. 2024. Moelora: Contrastive
learning guided mixture of experts on parameter-efficient fine-tuning for large language models. arXiv preprint
arXiv:2402.12851 (2024).
[101] Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. 2018. Modeling task relationships in multi-task
learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD international conference on
knowledge discovery & data mining. 1930–1939.
[102] Zixuan Ma, Jiaao He, Jiezhong Qiu, Huanqi Cao, Yuanwei Wang, Zhenbo Sun, Liyan Zheng, Haojie Wang, Shizhi
Tang, Tianyu Zheng, et al. 2022. BaGuaLu: targeting brain scale pretrained models with over 37 million cores. In
Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 192–204.
[103] Yuning Mao, Lambert Mathias, Rui Hou, Amjad Almahairi, Hao Ma, Jiawei Han, Scott Yih, and Madian Khabsa. 2022.
UniPELT: A Unified Framework for Parameter-Efficient Language Model Tuning. In Proceedings of the 60th Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 6253–6264.
[104] Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah,
Xianzhi Du, Futang Peng, Floris Weers, et al. 2024. Mm1: Methods, analysis & insights from multimodal llm pre-
training. arXiv preprint arXiv:2403.09611 (2024).
[105] Mohammed Muqeeth, Haokun Liu, and Colin Raffel. 2023. Soft merging of experts with adaptive routing. arXiv
preprint arXiv:2306.03745 (2023).
[106] Basil Mustafa, Carlos Riquelme, Joan Puigcerver, Rodolphe Jenatton, and Neil Houlsby. 2022. Multimodal contrastive
learning with limoe: the language-image mixture of experts. Advances in Neural Information Processing Systems 35
(2022), 9564–9576.
[107] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B
Gibbons, and Matei Zaharia. 2019. PipeDream: generalized pipeline parallelism for DNN training. In Proceedings of
the 27th ACM symposium on operating systems principles. 1–15.
[108] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri
Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. 2021. Efficient large-scale language model
training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance
Computing, Networking, Storage and Analysis. 1–15.
[109] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. 2011. Multimodal deep
learning. In Proceedings of the 28th international conference on machine learning (ICML-11). 689–696.
[110] Xiaonan Nie, Xupeng Miao, Shijie Cao, Lingxiao Ma, Qibin Liu, Jilong Xue, Youshan Miao, Yi Liu, Zhi Yang, and Bin
Cui. 2021. Evomoe: An evolutional mixture-of-experts training framework via dense-to-sparse gate. arXiv preprint
arXiv:2112.14397 (2021).
[111] Xiaonan Nie, Xupeng Miao, Zilong Wang, Zichao Yang, Jilong Xue, Lingxiao Ma, Gang Cao, and Bin Cui. 2023.
Flexmoe: Scaling large-scale sparse pre-trained model training via dynamic device placement. Proceedings of the ACM
on Management of Data 1, 1 (2023), 1–19.
[112] Xiaonan Nie, Pinxue Zhao, Xupeng Miao, Tong Zhao, and Bin Cui. 2022. HetuMoE: An efficient trillion-scale
mixture-of-expert distributed training system. arXiv preprint arXiv:2203.14685 (2022).
[113] OpenAI. 2022. Chatgpt: Optimizing language models for dialogue. https://openai.com/blog/chatgpt.
[114] Oleksiy Ostapenko, Lucas Caccia, Zhan Su, Nicolas Le Roux, Laurent Charlin, and Alessandro Sordoni. 2023. A Case
Study of Instruction Tuning with Mixture of Parameter-Efficient Experts. In NeurIPS 2023 Workshop on Instruction
Tuning and Instruction Following.
[115] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli.
2019. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038 (2019).
[116] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini
Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback.
Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
[117] Bowen Pan, Yikang Shen, Haokun Liu, Mayank Mishra, Gaoyuan Zhang, Aude Oliva, Colin Raffel, and Rameswar
Panda. 2024. Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models. arXiv
preprint arXiv:2404.05567 (2024).
[118] Joan Puigcerver, Carlos Riquelme Ruiz, Basil Mustafa, and Neil Houlsby. 2023. From Sparse to Soft Mixtures of
Experts. In The Twelfth International Conference on Learning Representations.
[119] Penghui Qi, Xinyi Wan, Guangxing Huang, and Min Lin. 2023. Zero Bubble Pipeline Parallelism. In The Twelfth
International Conference on Learning Representations.
[120] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and
Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, 140 (2020), 1–67.
[143] Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun
Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. 2022. Using deepspeed and megatron to train
megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022).
[144] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. Roformer: Enhanced transformer
with rotary position embedding. Neurocomputing 568 (2024), 127063.
[145] Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel
Li, Wen-tau Yih, Jason Weston, et al. 2024. Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM.
arXiv preprint arXiv:2403.07816 (2024).
[146] Shawn Tan, Yikang Shen, Zhenfang Chen, Aaron Courville, and Chuang Gan. 2023. Sparse Universal Transformer. In
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 169–179.
[147] Shawn Tan, Yikang Shen, Rameswar Panda, and Aaron Courville. 2024. Scattered Mixture-of-Experts Implementation.
arXiv preprint arXiv:2403.08245 (2024).
[148] Hongyan Tang, Junning Liu, Ming Zhao, and Xudong Gong. 2020. Progressive layered extraction (ple): A novel
multi-task learning (mtl) model for personalized recommendations. In Proceedings of the 14th ACM Conference on
Recommender Systems. 269–278.
[149] LLaMA-MoE Team. 2023. LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training.
https://github.com/pjlab-sys4nlp/llama-moe
[150] Qwen Team. 2024. Introducing Qwen1.5. https://qwenlm.github.io/blog/qwen1.5/
[151] Qwen Team. 2024. Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters. https://qwenlm.github.io/blog/qwen-moe/
[152] Snowflake AI Research Team. 2024. Snowflake Arctic: The Best LLM for Enterprise AI — Efficiently Intelligent, Truly
Open. https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/
[153] Lucas Theis and Matthias Bethge. 2015. Generative image modeling using spatial lstms. Advances in neural information
processing systems 28 (2015).
[154] Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, Weiwei Lu, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Liang Zeng, Xiaokun Wang, Yutuan Ma, Rui Hu, Shuicheng Yan, Han Fang, and Yahui Zhou. 2024. Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models.
[155] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov,
Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models.
arXiv preprint arXiv:2307.09288 (2023).
[156] Shagun Uppal, Sarthak Bhagat, Devamanyu Hazarika, Navonil Majumder, Soujanya Poria, Roger Zimmermann, and
Amir Zadeh. 2022. Multimodal research in vision and language: A review of current and emerging trends. Information
Fusion 77 (2022), 149–171.
[157] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
[158] Hongyi Wang, Felipe Maia Polo, Yuekai Sun, Souvik Kundu, Eric Xing, and Mikhail Yurochkin. 2023. Fusing Models
with Complementary Expertise. In The Twelfth International Conference on Learning Representations.
[159] Xin Wang, Fisher Yu, Lisa Dunlap, Yi-An Ma, Ruth Wang, Azalia Mirhoseini, Trevor Darrell, and Joseph E Gonzalez.
2020. Deep mixture of experts via shallow embedding. In Uncertainty in artificial intelligence. PMLR, 552–562.
[160] Yaqing Wang, Sahaj Agarwal, Subhabrata Mukherjee, Xiaodong Liu, Jing Gao, Ahmed Hassan Awadallah, and
Jianfeng Gao. 2022. AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning. In Proceedings of the
2022 Conference on Empirical Methods in Natural Language Processing, Yoav Goldberg, Zornitsa Kozareva, and Yue
Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 5744–5760. https:
//doi.org/10.18653/v1/2022.emnlp-main.388
[161] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma,
Denny Zhou, Donald Metzler, et al. 2022. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682
(2022).
[162] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022.
Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing
Systems 35 (2022), 24824–24837.
[163] Tianwen Wei, Liang Zhao, Lichang Zhang, Bo Zhu, Lijie Wang, Haihua Yang, Biye Li, Cheng Cheng, Weiwei Lü, Rui
Hu, et al. 2023. Skywork: A more open bilingual foundation model. arXiv preprint arXiv:2310.19341 (2023).
[164] Jialin Wu, Xia Hu, Yaqing Wang, Bo Pang, and Radu Soricut. 2023. Omni-SMoLA: Boosting Generalist Multimodal
Models with Soft Mixture of Low-rank Experts. arXiv preprint arXiv:2312.00968 (2023).
[165] Lemeng Wu, Mengchen Liu, Yinpeng Chen, Dongdong Chen, Xiyang Dai, and Lu Yuan. 2022. Residual mixture of
experts. arXiv preprint arXiv:2204.09636 (2022).
[166] Shaohua Wu, Jiangang Luo, Xi Chen, Lingjun Li, Xudong Zhao, Tong Yu, Chao Wang, Yue Wang, Fei Wang, Weixu
Qiao, et al. 2024. Yuan 2.0-M32: Mixture of Experts with Attention Router. arXiv preprint arXiv:2405.17976 (2024).
[167] Xun Wu, Shaohan Huang, and Furu Wei. 2023. MoLE: Mixture of LoRA Experts. In The Twelfth International Conference
on Learning Representations.
[168] Xun Wu, Shaohan Huang, and Furu Wei. 2024. Mixture of LoRA Experts. In The Twelfth International Conference on
Learning Representations. https://openreview.net/forum?id=uWvKBCYh4S
[169] xAI. 2024. Grok-1. https://github.com/xai-org/grok-1
[170] Fuzhao Xue, Xiaoxin He, Xiaozhe Ren, Yuxuan Lou, and Yang You. 2022. One student knows all experts know: From
sparse to dense. arXiv preprint arXiv:2201.10890 (2022).
[171] Fuzhao Xue, Ziji Shi, Futao Wei, Yuxuan Lou, Yong Liu, and Yang You. 2022. Go wider instead of deeper. In Proceedings
of the AAAI Conference on Artificial Intelligence, Vol. 36. 8779–8787.
[172] Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, and Yang You. 2024. Openmoe: An
early effort on open mixture-of-experts language models. arXiv preprint arXiv:2402.01739 (2024).
[173] An Yang, Junyang Lin, Rui Men, Chang Zhou, Le Jiang, Xianyan Jia, Ang Wang, Jie Zhang, Jiamang Wang, Yong Li,
et al. 2021. M6-t: Exploring sparse expert models and beyond. arXiv preprint arXiv:2105.15082 (2021).
[174] Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, et al. 2024. Exploiting Inter-Layer Expert Affinity for
Accelerating Mixture-of-Experts Model Inference. arXiv preprint arXiv:2401.08383 (2024).
[175] Qinyuan Ye, Juan Zha, and Xiang Ren. 2022. Eliciting and Understanding Cross-task Skills with Task-level Mixture-
of-Experts. In Findings of the Association for Computational Linguistics: EMNLP 2022. 2567–2592.
[176] Rongjie Yi, Liwei Guo, Shiyun Wei, Ao Zhou, Shangguang Wang, and Mengwei Xu. 2023. Edgemoe: Fast on-device
inference of moe-based large language models. arXiv preprint arXiv:2308.14352 (2023).
[177] Kang Min Yoo, Jaegeun Han, Sookyo In, Heewon Jeon, Jisu Jeong, Jaewook Kang, Hyunwook Kim, Kyung-Min Kim,
Munhyong Kim, Sungju Kim, et al. 2024. HyperCLOVA X Technical Report. arXiv preprint arXiv:2404.01954 (2024).
[178] Ted Zadouri, Ahmet Üstün, Arash Ahmadian, Beyza Ermiş, Acyr Locatelli, and Sara Hooker. 2023. Pushing mixture of
experts to the limit: Extremely parameter efficient moe for instruction tuning. arXiv preprint arXiv:2309.05444 (2023).
[179] Mingshu Zhai, Jiaao He, Zixuan Ma, Zan Zong, Runqing Zhang, and Jidong Zhai. 2023. SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization. In 2023 USENIX Annual Technical Conference (USENIX ATC 23). 961–975.
[180] Biao Zhang and Rico Sennrich. 2019. Root mean square layer normalization. Advances in Neural Information Processing
Systems 32 (2019).
[181] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao.
2021. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition. 5579–5588.
[182] Xiaofeng Zhang, Yikang Shen, Zeyu Huang, Jie Zhou, Wenge Rong, and Zhang Xiong. 2022. Mixture of Attention
Heads: Selecting Attention Heads Per Token. In Proceedings of the 2022 Conference on Empirical Methods in Natural
Language Processing. 4150–4162.
[183] Yihua Zhang, Ruisi Cai, Tianlong Chen, Guanhua Zhang, Huan Zhang, Pin-Yu Chen, Shiyu Chang, Zhangyang Wang,
and Sijia Liu. 2023. Robust Mixture-of-Expert Training for Convolutional Neural Networks. In Proceedings of the
IEEE/CVF International Conference on Computer Vision. 90–101.
[184] Zhengyan Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. 2022. MoEfication: Transformer
Feed-forward Layers are Mixtures of Experts. In Findings of the Association for Computational Linguistics: ACL 2022.
877–890.
[185] Zheng Zhang, Yaqi Xia, Hulin Wang, Donglin Yang, Chuang Hu, Xiaobo Zhou, and Dazhao Cheng. 2024. MPMoE:
Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism. IEEE Transactions on Parallel and
Distributed Systems (2024).
[186] Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. 2022. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 559–578.
[187] Ningxin Zheng, Huiqiang Jiang, Quanlu Zhang, Zhenhua Han, Lingxiao Ma, Yuqing Yang, Fan Yang, Chengruidong
Zhang, Lili Qiu, Mao Yang, et al. 2023. Pit: Optimization of dynamic sparse deep learning models via permutation
invariant transformation. In Proceedings of the 29th Symposium on Operating Systems Principles. 331–347.
[188] Yong Zheng and David Xuejun Wang. 2022. A survey of recommender systems with multi-objective optimization.
Neurocomputing 474 (2022), 141–153.
[189] Zexuan Zhong, Mengzhou Xia, Danqi Chen, and Mike Lewis. 2024. Lory: Fully Differentiable Mixture-of-Experts for
Autoregressive Language Model Pre-training. arXiv preprint arXiv:2405.03133 (2024).
[190] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. 2022. Learning to prompt for vision-language
models. International Journal of Computer Vision 130, 9 (2022), 2337–2348.
[191] Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, and Jianfeng Gao. 2020. Unified vision-language
pre-training for image captioning and vqa. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34.
13041–13049.
[192] Yanqi Zhou, Nan Du, Yanping Huang, Daiyi Peng, Chang Lan, Da Huang, Siamak Shakeri, David So, Andrew M Dai,
Yifeng Lu, et al. 2023. Brainformers: Trading simplicity for efficiency. In International Conference on Machine Learning.
PMLR, 42531–42542.
[193] Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon,
et al. 2022. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems 35
(2022), 7103–7114.
[194] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. Minigpt-4: Enhancing vision-language
understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023).
[195] Jinguo Zhu, Xizhou Zhu, Wenhai Wang, Xiaohua Wang, Hongsheng Li, Xiaogang Wang, and Jifeng Dai. 2022. Uni-
perceiver-moe: Learning sparse generalist models with conditional moes. Advances in Neural Information Processing
Systems 35 (2022), 2664–2678.
[196] Yun Zhu, Nevan Wichers, Chu-Cheng Lin, Xinyi Wang, Tianlong Chen, Lei Shu, Han Lu, Canoee Liu, Liangchen Luo,
Jindong Chen, et al. 2023. Sira: Sparse mixture of low rank adaptation. arXiv preprint arXiv:2311.09179 (2023).
[197] Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus.
2022. St-moe: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906 (2022).
[198] Simiao Zuo, Xiaodong Liu, Jian Jiao, Young Jin Kim, Hany Hassan, Ruofei Zhang, Jianfeng Gao, and Tuo Zhao.
2021. Taming Sparsely Activated Transformer with Stochastic Experts. In International Conference on Learning
Representations.
[199] Simiao Zuo, Qingru Zhang, Chen Liang, Pengcheng He, Tuo Zhao, and Weizhu Chen. 2022. MoEBERT: from BERT to
Mixture-of-Experts via Importance-Guided Adaptation. In Proceedings of the 2022 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies. 1610–1623.