A Survey on Mixture of Experts


WEILIN CAI∗ , The Hong Kong University of Science and Technology (Guangzhou), China
JUYONG JIANG∗ , The Hong Kong University of Science and Technology (Guangzhou), China
FAN WANG∗ , The Hong Kong University of Science and Technology (Guangzhou), China

JING TANG, The Hong Kong University of Science and Technology (Guangzhou), China
SUNGHUN KIM, The Hong Kong University of Science and Technology (Guangzhou), China
JIAYI HUANG† , The Hong Kong University of Science and Technology (Guangzhou), China
Large language models (LLMs) have achieved unprecedented advancements across diverse fields, ranging
from natural language processing to computer vision and beyond. The prowess of LLMs is underpinned by
their substantial model size, extensive and diverse datasets, and the vast computational power harnessed
during training, all of which contribute to the emergent abilities of LLMs (e.g., in-context learning) that
are not present in small models. Within this context, the mixture of experts (MoE) has emerged as an
effective method for substantially scaling up model capacity with minimal computation overhead, gaining
significant attention from academia and industry. Despite its growing prevalence, the literature still lacks a
systematic and comprehensive review of MoE. This survey seeks to bridge that gap, serving as an essential
resource for researchers delving into the intricacies of MoE. We first briefly introduce the structure of the
MoE layer, then propose a new taxonomy of MoE. Next, we overview the core designs for various
MoE models including both algorithmic and systemic aspects, alongside collections of available open-source
implementations, hyperparameter configurations and empirical evaluations. Furthermore, we delineate the
multifaceted applications of MoE in practice, and outline some potential directions for future research. To
facilitate ongoing updates and the sharing of cutting-edge developments in MoE research, we have established
a resource repository accessible at https://github.com/withinmiaov/A-Survey-on-Mixture-of-Experts.

CCS Concepts: • General and reference → Surveys and overviews; • Computing methodologies →
Artificial intelligence.

Additional Key Words and Phrases: Large Language Models, Mixture of Experts, Gating Functions

ACM Reference Format:


Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. 2024. A Survey on Mixture of
Experts. ACM Comput. Surv. X, Y, Article 1 (October 2024), 41 pages. https://doi.org/XXXXXXX.XXXXXXX

∗ Equal Contribution.
† Corresponding Author.

Authors’ addresses: Weilin Cai, [email protected], The Hong Kong University of Science and Technol-
ogy (Guangzhou), Guangzhou, China; Juyong Jiang, [email protected], The Hong Kong University of
Science and Technology (Guangzhou), Guangzhou, China; Fan Wang, [email protected], The Hong Kong
University of Science and Technology (Guangzhou), Guangzhou, China; Jing Tang, [email protected], The Hong
Kong University of Science and Technology (Guangzhou), Guangzhou, China; Sunghun Kim, [email protected],
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China; Jiayi Huang, [email protected],
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2024 Association for Computing Machinery.
0360-0300/2024/10-ART1 $15.00
https://doi.org/XXXXXXX.XXXXXXX


1 INTRODUCTION
In the current landscape of artificial general intelligence (AGI), the transformative impact of
transformer-based large language models (LLMs) has permeated diverse fields such as natural
language processing [1, 11, 24, 76, 116, 157, 162], computer vision [98, 128], and multimodality
[99, 181, 190, 194]. Building upon the foundational transformer architecture, LLMs demonstrate
extraordinary capabilities, which are attributed to their sheer size, the breadth of data they are
trained on, and the significant computational resources invested in their development [79, 161, 177].
Recognizing a scaling law [64, 79] that underpins their evolution, it is imperative to identify and
implement efficient methodologies for the sustainable scaling of LLMs.
The concept of mixture of experts (MoE), initially introduced in [71, 78], has undergone extensive
exploration and advancement as evidenced by subsequent studies [3, 28, 37, 46, 125, 132, 153]. The
emergence of sparsely-gated MoE [135], particularly within the integration of transformer-based
large language models [86], has brought new vitality to this three-decade-old technology. The
MoE framework is based on a simple yet powerful idea: different parts of a model, known as
experts, specialize in different tasks or aspects of the data. With this paradigm, only pertinent
experts are engaged for a given input, keeping the computational cost in check while still benefiting
from a large pool of specialized knowledge. This scalable and flexible innovation has offered an
effective approach for adhering to the scaling law, allowing for increased model capacity without
a corresponding surge in computational demands. As depicted in Figure 1, MoE has maintained
a robust trajectory of growth, particularly notable in 2024 with the advent of Mixtral-8x7B [74]
and a variety of subsequent industrial-scale LLMs such as Grok-1 [169], DBRX [34], Arctic [152],
DeepSeek-V2 [36], etc.
Despite the increasing popularity and application of MoE models in various domains, the literature
has yet to see a survey that thoroughly examines and categorizes the advancements in this area.
The most recent review of MoE we could find was presented in September 2022 [48]; predating the
pivotal “ChatGPT moment”, it omits the significant advancements that have recently emerged
alongside the escalating academic and industrial interest in this domain. This gap in the literature
not only hinders the progress of MoE research but also limits the dissemination of knowledge
on this topic to a broader audience. Our survey aims to address this deficit by providing a clear
and comprehensive overview of MoE with a novel taxonomy that segments recent progress into
algorithm, system and application.
Under this taxonomy, we first delve into MoE algorithmic advancements, particularly the preva-
lent substitution of feed-forward network (FFN) layers with MoE layers in transformer-based LLMs
[36, 44, 49, 74, 86, 172, 197]. As each MoE layer integrates multiple FFNs—each designated as an
expert—and employs a gating function to activate a select subset of these experts, we explore the
design choices of gating function and expert network, alongside collections of available open-source
implementations, hyperparameter configurations and empirical evaluations. Furthermore, to under-
score the flexibility and versatility of MoE, we extend our analysis beyond the standard integration
of MoE into model backbone, and discuss an array of novel MoE-related designs, such as soft
MoE with token or expert merging [105, 118, 164, 178, 189], mixture of parameter-efficient experts
(MoPEs) [43, 53, 100, 160, 168, 178], training and inference schemes with model transition between
dense and sparse [16, 82, 145, 149, 170, 184], and various derivatives [5, 19, 23, 124, 146, 171].
With the gradual convergence of model architecture design in industrial products, system design
has emerged as a pivotal factor in enhancing the quality of LLM services. Given the close association
of MoE models with machine learning system design, we provide a comprehensive overview of
MoE system design, including computation, communication and storage enhancements tailored
to address the unique challenges posed by the sparse and dynamic nature of its computational


[Figure 1 graphic omitted: a timeline of representative MoE models released between 2017 and 2024, including GShard, Switch, GLaM, V-MoE, ST-MoE, DeepSpeed-MoE, Mixtral-8x7B, DeepSeekMoE, Grok-1, DBRX, Jamba, Qwen1.5-MoE, Arctic, DeepSeek-V2, Skywork-MoE, and Yuan 2.0-M32.]

Fig. 1. A chronological overview of several representative Mixture of Experts (MoE) models in recent years.
The timeline is primarily structured according to the release dates of the models. MoE models located above
the arrow are open-source, while those below the arrow are proprietary and closed-source. MoE models from
various domains are marked with distinct colors: Natural Language Processing (NLP) in green, Computer
Vision in yellow, Multimodal in pink, and Recommender Systems (RecSys) in cyan.

workload. Additionally, we overview the applications of MoE across various domains, including
natural language processing, computer vision, recommender system, and multimodal contexts.
The remainder of this survey is organized as follows. Section 2 provides a foundational un-
derstanding of MoE, contrasting sparse and dense activation of experts. Section 3 introduces our
proposed taxonomy for categorizing MoE advancements. Sections 4, 5, and 6 delve into the algo-
rithmic designs, computing system support, and various applications of MoE models, following
the structure outlined in our taxonomy in Figure 3. Finally, in Section 7, we highlight the critical
challenges and opportunities for bridging the research-practicality gap, culminating in Section 8
with our conclusions.

2 BACKGROUND ON MIXTURE OF EXPERTS


In transformer-based large language models (LLMs), each mixture of experts (MoE) layer typically
consists of a set of 𝑁 “expert networks” {𝑓1, . . . , 𝑓𝑁 }, alongside a “gating network” G. The role of the
gating network, which often takes the form of a linear network with a softmax activation function,
is to direct the input to the appropriate expert networks [49, 135]. The MoE layer is typically placed
as a substitute for the feed-forward network (FFN) within each Transformer block, following
the self-attention (SA) sub-layer. This placement is crucial because FFNs become increasingly
computationally demanding as the model scales up. For instance, in the 540B-parameter PaLM [24]
model, approximately 90% of the parameters reside in its FFN layers.


[Figure 2 graphic omitted: panels (a) Dense MoE and (b) Sparse MoE, each showing a gate dispatching input X to FFN experts whose weighted outputs are combined into Y.]

Fig. 2. An illustration of an MoE layer in Transformer-based models. For each input 𝑋, the linear-softmax
gate selects either all experts, namely (a) Dense MoE, or only the top-𝑘 experts, namely (b) Sparse MoE, to
perform conditional computation. The expert layer returns the output of each selected expert multiplied by
its gate value (the softmax of the gating function output).

Formally, each expert network 𝑓𝑖, usually a linear-ReLU-linear network, is parameterized by W𝑖,
accepting the same input x and generating an output 𝑓𝑖(x; W𝑖). In parallel, the gating network G,
parameterized by Θ and typically consisting of a linear-ReLU-linear-softmax network, yields the
output G(x; Θ). Based on the design of the gating function, MoE layers can be broadly classified into
the following two categories.

2.1 Dense MoE


The dense mixture of experts layer activates all expert networks {𝑓1, . . . , 𝑓𝑁 } during each iteration.
This strategy has been extensively employed in a range of early works, including those by [3, 28, 37,
46, 71, 78, 101, 125, 132, 153]. More recently, the dense MoE concept has been revisited by studies such
as EvoMoE [110], MoLE [167], LoRAMoE [43], and DS-MoE [117]. The structure of the dense MoE
layer is depicted in Figure 2(a). Consequently, the output of the dense MoE layer can be formulated
as
\mathcal{F}^{\mathrm{MoE}}_{\mathrm{dense}}(\mathbf{x};\, \Theta, \{\mathbf{W}_i\}_{i=1}^{N}) = \sum_{i=1}^{N} G(\mathbf{x};\Theta)_i\, f_i(\mathbf{x};\mathbf{W}_i),   (2.1)

G(\mathbf{x};\Theta)_i = \mathrm{softmax}(g(\mathbf{x};\Theta))_i = \frac{\exp(g(\mathbf{x};\Theta)_i)}{\sum_{j=1}^{N}\exp(g(\mathbf{x};\Theta)_j)},   (2.2)

where 𝑔(x; Θ) represents the gating value prior to the softmax operation.
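
To make the dense formulation concrete, the following is a minimal PyTorch sketch of Eqs. (2.1)-(2.2); the layer sizes and the single-linear gate are illustrative assumptions rather than the configuration of any cited model.

```python
# A minimal sketch of a dense MoE layer implementing Eqs. (2.1)-(2.2).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseMoE(nn.Module):
    def __init__(self, d_model: int, d_expert: int, num_experts: int):
        super().__init__()
        # Each expert f_i is a linear-ReLU-linear FFN parameterized by W_i.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_expert), nn.ReLU(), nn.Linear(d_expert, d_model))
            for _ in range(num_experts)
        ])
        # The gating network g(x; Theta): here a single linear layer producing one logit per expert.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)                            # G(x; Theta), Eq. (2.2)
        expert_outputs = torch.stack([f(x) for f in self.experts], dim=1)   # (tokens, N, d_model)
        # Weighted sum over all N experts, Eq. (2.1).
        return torch.einsum("tn,tnd->td", scores, expert_outputs)

# Example: route a batch of 4 tokens through 8 experts.
layer = DenseMoE(d_model=16, d_expert=32, num_experts=8)
out = layer(torch.randn(4, 16))   # -> shape (4, 16)
```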

2.2 Sparse MoE


While a dense mixture of experts typically yields higher prediction accuracy [128], it also incurs a
significant increase in computational overhead. To address this, Shazeer et al. [135] introduced the
sparsely-gated MoE layer, which is designed to activate only a selected subset of experts during
each forward pass. This strategy achieves sparsity by computing a weighted sum of the outputs
from only the top-𝑘 experts, rather than aggregating the outputs from all experts. The structure of
the sparse MoE layer is illustrated in Figure 2(b). Building on the framework established by [135],
Equation (2.2) can be modified to reflect the sparsely-gated mechanism as follows:

G(\mathbf{x};\Theta)_i = \mathrm{softmax}(\mathrm{TopK}(g(\mathbf{x};\Theta) + \mathbf{R}_{\mathrm{noise}},\, k))_i,   (2.3)

\mathrm{TopK}(g(\mathbf{x};\Theta), k)_i =
\begin{cases}
  g(\mathbf{x};\Theta)_i, & \text{if } g(\mathbf{x};\Theta)_i \text{ is in the top-}k \text{ elements of } g(\mathbf{x};\Theta), \\
  -\infty, & \text{otherwise.}
\end{cases}   (2.4)

Here, the TopK(·, 𝑘) function retains only the top-𝑘 entries of a vector at their original values,
while setting all other entries to −∞. Following the softmax operation, the entries assigned −∞
become zero. The hyper-parameter 𝑘 is selected based on the specific application, with common
choices being 𝑘 = 1 [26, 49] or 𝑘 = 2 [44, 74, 86, 121, 154, 197]. The addition of a noise term
R_noise is a prevalent strategy for training a sparsely-gated MoE layer, fostering exploration among
experts and enhancing the stability of MoE training [49, 135].
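
The following is a minimal PyTorch sketch of the sparsely-gated top-k routing of Eqs. (2.3)-(2.4), assuming Gaussian gate noise; production systems additionally apply expert-capacity limits and auxiliary losses, which are omitted here.

```python
# A minimal sketch of Top-K gating, Eqs. (2.3)-(2.4).
import torch
import torch.nn.functional as F

def topk_gating(logits: torch.Tensor, k: int, noise_std: float = 1.0, training: bool = True):
    """logits: (num_tokens, num_experts) raw gate values g(x; Theta)."""
    if training and noise_std > 0:
        logits = logits + noise_std * torch.randn_like(logits)      # R_noise in Eq. (2.3)
    topk_vals, topk_idx = logits.topk(k, dim=-1)
    # TopK(.) keeps the top-k entries and sets the rest to -inf, Eq. (2.4).
    masked = torch.full_like(logits, float("-inf")).scatter(-1, topk_idx, topk_vals)
    gates = F.softmax(masked, dim=-1)      # entries outside the top-k become exactly 0
    return gates, topk_idx

# Example: 4 tokens, 8 experts, top-2 routing.
gates, idx = topk_gating(torch.randn(4, 8), k=2)
print(gates.sum(dim=-1))   # each row sums to 1; only 2 non-zero entries per token
```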
Although the sparse gate G(x; Θ) substantially expands the model’s parameter space without
a corresponding increase in computational cost, it can lead to a load balancing issue. Such an
issue refers to the uneven distribution of workload across experts, with some being frequently
utilized and others seldom or never engaged. To address this, each MoE layer incorporates an
auxiliary loss function that promotes an even distribution of tokens across experts within each
batch, as described in many studies [30, 44, 49, 74, 86, 94, 154]. To formulate this concept, consider
a batch of queries B = {x_1, x_2, . . . , x_𝑇 }, comprising 𝑇 tokens, and 𝑁 experts indexed from 𝑖 = 1 to
𝑁. Following [49, 86], the auxiliary load balancing loss for the batch is defined as

\mathcal{L}_{\text{load-balancing}} = N \sum_{i=1}^{N} \mathcal{D}_i \mathcal{P}_i,   (2.5)

\mathcal{D}_i = \frac{1}{T} \sum_{\mathbf{x} \in \mathcal{B}} \mathbb{1}\{\operatorname{argmax}\, G(\mathbf{x};\Theta) = i\},   (2.6)

\mathcal{P}_i = \frac{1}{T} \sum_{\mathbf{x} \in \mathcal{B}} G(\mathbf{x};\Theta)_i,   (2.7)

where D𝑖 represents the proportion of tokens distributed to expert 𝑖, while P𝑖 denotes the proportion
of the gating probability assigned to expert 𝑖. To ensure an even distribution of the batch of tokens
across the 𝑁 experts, the load-balancing loss function Lload-balancing should be minimized. The
optimal condition, i.e., \min(\mathcal{L}_{\text{load-balancing}}) = N \sum_{i=1}^{N} \frac{1}{N} \cdot \frac{1}{N} = 1, is achieved when
each expert receives an equal share of the dispatched tokens, \mathcal{D}_i = \frac{1}{N}, and an equal proportion of
the gating probability, \mathcal{P}_i = \frac{1}{N}. The balance is thus maintained across all experts, ensuring that
the workload is uniformly distributed at all times. Throughout the subsequent sections, unless
explicitly stated otherwise, the term “MoE” will refer to “sparse MoE”.
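
A minimal sketch of the auxiliary load-balancing loss of Eqs. (2.5)-(2.7) is given below; tensor shapes are illustrative, and the 0.01 coefficient mentioned in the comment follows the common configurations summarized later in Table 1.

```python
# A minimal sketch of the auxiliary load-balancing loss, Eqs. (2.5)-(2.7).
import torch
import torch.nn.functional as F

def load_balancing_loss(gate_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """gate_logits: (T, N) raw gating logits for a batch of T tokens."""
    probs = F.softmax(gate_logits, dim=-1)                # G(x; Theta) per token
    # D_i: fraction of tokens whose argmax gate is expert i, Eq. (2.6).
    top1 = probs.argmax(dim=-1)
    D = F.one_hot(top1, num_experts).float().mean(dim=0)
    # P_i: mean gating probability assigned to expert i, Eq. (2.7).
    P = probs.mean(dim=0)
    # L = N * sum_i D_i * P_i, Eq. (2.5); equals 1 under a perfectly uniform distribution.
    return num_experts * torch.sum(D * P)

loss = load_balancing_loss(torch.randn(64, 8), num_experts=8)
# Typically added to the task loss with a small coefficient, e.g. 0.01 * loss.
```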

3 TAXONOMY OF MIXTURE OF EXPERTS


To effectively scale model parameters without a corresponding increase in computational demand,
the mixture of experts (MoE) architecture has emerged as a viable solution. MoE leverages a
collection of specialized models and a gating mechanism to dynamically select the appropriate
“expert networks” for processing a given input. This enables the model to allocate computational
resources on an as-needed basis, a concept known as conditional computation. The incorporation
of MoE architectures into large language models (LLMs) is now a prevalent practice, allowing these
models to achieve significant parameter scale-ups and consequent enhancements in capabilities
[48, 49, 74, 86, 154].


[Figure 3 graphic omitted: a taxonomy tree of Mixture of Experts with representative studies. Its top-level branches are Algorithm (gating function: dense, sparse, and soft; experts: network types, hyperparameters, activation function, and shared expert; mixture of parameter-efficient experts; training and inference schemes; derivatives), System (computation, communication, and storage), and Application (NLP, computer vision, recommender systems, and multimodal).]

Fig. 3. Taxonomy of Mixture of Experts (MoE).


For example, the Mixtral 8x7B [74], introduced by Mistral AI, shares its foundational architecture
with the earlier Mistral 7B [73], but with a notable difference: each layer comprises eight feed-
forward networks (FFN) (i.e., experts). Despite utilizing only 13 billion active parameters, the
Mixtral-8x7B demonstrates superior or equivalent performance to the Llama-2-70B [155] and GPT-
3.5 [113] across various benchmarks. Similarly, the DeepSeek LLM [10], developed by DeepSeek, has
been extended with an MoE variant known as DeepSeekMoE [30]. The DeepSeekMoE 16B, while
requiring approximately 40% less computation, attains performance on par with the Llama 2 7B
[155]. The Qwen team has also contributed to this innovative field by developing the Qwen1.5-MoE
[151], a smaller MoE model with only 2.7B active parameters that rivals the performance of leading
7B parameter models such as the Mistral 7B [73] and the Qwen1.5-7B [150].
To assist researchers in navigating the rapidly evolving landscape of LLMs equipped with MoE
architectures, we have developed a taxonomy that categorizes these models from three perspectives:
algorithm design, system design, and application. Figure 3 showcases our taxonomy alongside
several representative studies. In the following sections, we will provide a comprehensive and
in-depth analysis of each category within our taxonomy.

4 ALGORITHM DESIGN OF MIXTURE OF EXPERTS


4.1 Gating Function
The gating function (also known as the routing function or router), which stands as a fundamental
component of all the MoE architectures, orchestrates the engagement of expert computations and
the combination of their respective outputs. Based on how each input is processed, we categorize
the gating mechanism into three distinct types: sparse, which activates a subset of experts; dense, which activates all experts;
and soft, which encompasses fully-differentiable approaches including input token merging and
expert merging.
4.1.1 Sparse. The sparse gating functions activate a selected subset of experts for processing each
individual input token, which can be considered as a form of conditional computation [4, 9, 35].
Gating functions have been studied extensively; they may be trained by various forms of
reinforcement learning or back-propagation, and may make binary or sparse-and-continuous, stochastic
or deterministic gating decisions [8, 26, 46, 130, 131]. Shazeer et al. [135] pioneered a differentiable
heuristic with auxiliary load balancing losses, in which the outputs from expert computations are
weighted by their selection probabilities. This introduces a differentiable aspect to the gating process,
thereby facilitating the derivation of gradients that can guide the gating function’s optimization.
This paradigm has subsequently become predominant in the realm of MoE research. Due to its
selection of experts for each input token, this method can be recognized as a gating function with
token choice.
Token-Choice Gating. Shazeer et al. [135] posited the necessity of gating inputs to the top-𝑘
experts, with 𝑘 > 1, to enhance the efficacy of MoE. The rationale behind this approach is that by
simultaneously consulting multiple experts for a given input, the network can effectively weigh
and integrate their respective contributions, thereby improving performance. To accommodate the
scalability to thousands of experts within a MoE layer, they employ a two-level hierarchical MoE to
reduce the branching factor in the context of a large expert count. Subsequent research has largely
affirmed that increasing the value of k enhances performance, which has led to the widespread
adoption of this top-𝑘 strategy with 𝑘 > 1. Notwithstanding, the Switch Transformer model [49]
has shown that a top-1 gating strategy (as illustrated in Figure 4 (a)) can also yield competitive
results, a finding that has been substantiated and adopted by later studies [26]. Furthermore, M6-t
[173] proposed a novel variation of the top-1 gating called expert prototyping, which organizes


Table 1. Overview of diverse auxiliary loss functions and their typical coefficient configurations. The originator
introducing each auxiliary loss is highlighted as a bolded reference, followed by references that adopt the
same approach. Studies that have modified the original formulation are indicated with an underlined reference.

Reference | Auxiliary Loss | Coefficient
Shazeer et al. [135], V-MoE [128] | 𝐿_importance + 𝐿_load | 𝑤_importance = 0.1, 𝑤_load = 0.1
GShard [86], Switch-T [49], GLaM [44], Mixtral-8x7B [74], DBRX [34], Jamba [94], DeepSeekMoE [30], DeepSeek-V2 [36], Skywork-MoE [154] | 𝐿_aux | 𝑤_aux = 0.01
ST-MoE [197], OpenMoE [172], MoA [182], JetMoE [139] | 𝐿_aux + 𝐿_z | 𝑤_aux = 0.01, 𝑤_z = 0.001
Mod-Squad [21], ModuleFormer [140], DS-MoE [117] | 𝐿_MI | 𝑤_MI = 0.001

experts into 𝑘 groups and then applies top-1 gating within each group. Their experiments on a
16-layer model rank the training and downstream perplexity, from best to worst, as: expert
prototyping with four top-1 gates, a single top-4 gate, a single top-16 gate, and a single top-1 gate.
Auxiliary Loss for Token-Choice Gating. Token-choice gating algorithms frequently incor-
porate an auxiliary loss during training to promote equitable token distribution across experts.
Table 1 shows prevalent auxiliary loss functions leveraged in the field. Shazeer et al. [135] quantify
the importance of an expert in relation to a training batch via the batchwise sum of the gate
values for that expert. They define an additional loss 𝐿𝑖𝑚𝑝𝑜𝑟𝑡𝑎𝑛𝑐𝑒 , added to the overall loss function
for the model. This loss, which is equal to the square of the coefficient of variation of the set
of importance values and multiplied by a hand-tuned scaling factor 𝑤𝑖𝑚𝑝𝑜𝑟𝑡𝑎𝑛𝑐𝑒 , encourages all
experts to have equal importance. Although 𝐿𝑖𝑚𝑝𝑜𝑟𝑡𝑎𝑛𝑐𝑒 promotes balance in importance, it does not
guarantee an even distribution of training examples among experts, which can lead to execution
inefficiencies on distributed computing environments. To address this, they introduce a second loss
𝐿𝑙𝑜𝑎𝑑 to ensure balanced loads. Building on this foundation, GShard [86] define a new differentiable
auxiliary loss 𝐿𝑎𝑢𝑥 using a differentiable approximation (the dot-product of mean gates and mean
gating decisions per expert), as detailed in Section 2.2. Switch Transformers [49] and many other
subsequent studies [34, 44, 74, 94] have embraced this 𝐿𝑎𝑢𝑥 design, and enhancements [30, 36, 154]
have been made to cater to diverse requirements. Nevertheless, ST-MoE [197] identified limitations
with 𝐿𝑎𝑢𝑥 , particularly at larger scales, leading to unreliable training outcomes. To mitigate this, it
introduces the integration of router z-loss 𝐿𝑧 , improving training stability without quality degrada-
tion by penalizing large logits entering the gating network. Since this loss encourages absolute
magnitude of values to be smaller, roundoff errors are reduced, which can be quite impactful for
exponential functions such as the gating. Additionally, Mod-Squad [21] posits the difficulty of
training multi-task models under such an expert-balancing loss, which may inadvertently force
experts to set parameters on conflicting tasks or hinder the potential synergies from parameter
sharing across complementary tasks. Instead, it proposes to maximize the mutual information
(MI) between experts and tasks to build task-expert alignment. Differently, ModuleFormer [140]
proposes to maximize the Mutual Information between experts and tokens. Furthermore, DS-MoE
[117] extends the application of 𝐿𝑀𝐼 , calibrating different weightings 𝑤 𝑀𝐼 , in Mixture-of-Attention
(MoA, as illustrated in Figure 5 (a)) and FFN MoE modules of different size models.
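
To illustrate two of the loss families discussed above, the sketch below computes an importance-style loss (the squared coefficient of variation of per-expert importance, following Shazeer et al.) and a router z-loss (the mean squared log-sum-exp of the gating logits, following ST-MoE); shapes and coefficients are illustrative assumptions.

```python
# Minimal sketches of the importance loss and the router z-loss.
import torch
import torch.nn.functional as F

def importance_loss(gate_probs: torch.Tensor) -> torch.Tensor:
    """gate_probs: (T, N) post-softmax gate values for a batch of T tokens."""
    importance = gate_probs.sum(dim=0)                 # batchwise sum of gate values per expert
    cv = importance.std() / importance.mean()          # coefficient of variation
    return cv ** 2                                     # scaled by w_importance (e.g. 0.1) outside

def router_z_loss(gate_logits: torch.Tensor) -> torch.Tensor:
    """gate_logits: (T, N) pre-softmax gating logits."""
    # Penalizes the squared log-sum-exp of the logits, discouraging large magnitudes.
    z = torch.logsumexp(gate_logits, dim=-1)
    return (z ** 2).mean()                             # scaled by w_z (e.g. 0.001) outside

logits = torch.randn(64, 8)
aux = 0.1 * importance_loss(F.softmax(logits, dim=-1)) + 0.001 * router_z_loss(logits)
```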
Expert Capacity for Token-Choice Gating. In conjunction with load balancing via auxiliary
loss, GShard [86] incorporates an expert capacity limit, defining a threshold for the number of tokens
an expert can process. This can lead to token overflow, where excess tokens are not processed by
the designated expert. GShard also proposes a random routing mechanism that selects a secondary
expert with a probability proportional to its weight, under the intuition that the contribution of a
secondary expert can be negligible, given that the output is a weighted average and the secondary
weight is typically small. For the task of image classification with Vision Transformer (ViT) models,
Riquelme et al. [128] enhance the top-𝑘 gating strategy with Batch Prioritized Routing (BPR), which
assigns priority based on higher gating scores rather than the sequence order of tokens. Zoph et
al. [197] have demonstrated the efficacy of BPR in the context of MoE language models. Kim et al.
[80] suggest randomizing token prioritization within sequences to mitigate routing bias towards
early-positioned tokens. OpenMoE [172] provides a comprehensive analysis of gating mechanisms,
highlighting the “Drop-towards-the-End” phenomenon whereby tokens later in a sequence are at
greater risk of being dropped due to experts reaching their maximum capacity limits, an issue that is
exacerbated in instruction-tuning datasets. Moreover, OpenMoE identifies a tendency within MoE
systems to route tokens based on token-level semantic similarities, leading to “Context-independent
Specialization”. Additionally, this token ID routing specialization is established early in pre-training
and remains largely fixed, resulting in a consistent pattern of token processing by the same experts
throughout training, a phenomenon referred to as “Early Routing Learning”.
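
The sketch below illustrates the expert-capacity mechanism described at the start of this paragraph: a top-1 assignment is truncated once an expert reaches its capacity, so later tokens overflow, mirroring the “Drop-towards-the-End” effect; the capacity factor and sequence-order priority are illustrative assumptions.

```python
# A minimal sketch of expert capacity with token dropping.
import torch

def apply_capacity(expert_idx: torch.Tensor, num_experts: int, capacity_factor: float = 1.25):
    """expert_idx: (T,) top-1 expert assignment per token (in sequence order)."""
    T = expert_idx.numel()
    capacity = int(capacity_factor * T / num_experts)
    kept = torch.zeros(T, dtype=torch.bool)
    counts = torch.zeros(num_experts, dtype=torch.long)
    for t in range(T):                   # earlier tokens get priority ("Drop-towards-the-End")
        e = expert_idx[t].item()
        if counts[e] < capacity:
            counts[e] += 1
            kept[t] = True
    return kept                          # mask of tokens actually processed by their expert

kept = apply_capacity(torch.randint(0, 8, (64,)), num_experts=8)
# Batch Prioritized Routing would instead sort tokens by gating score before this loop.
```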
Other Advancements on Token-Choice Gating. Despite the implementation of gating heuris-
tics and auxiliary expert-balancing loss functions aimed at achieving a balanced workload distri-
bution among experts, the issue of load imbalance persists as a prevalent challenge within MoE
architectures. To solve it, the Balanced Assignment of Sparse Experts (BASE) layer, as conceptual-
ized by Lewis et al. [87] and illustrated in Figure 4 (b), re-envisions the token-to-expert allocation
process by casting it as a linear assignment problem, aiming to maximize the token-expert affinities
under the constraints that each expert is assigned an equal quantity of tokens. Subsequently, Clark
et al. [26] introduce a variant of the BASE layer, termed S-BASE, using an optimal transport formu-
lation. Additionally, they devise a reinforcement learning based gating algorithm employing top-1
routing, with the reward function defined as the negative cross-entropy of the predicted token.
In addressing the discrete optimization challenge of gating function that can lead to convergence
and statistical performance issues when training with gradient-based methods, Hazimeh et al. [58]
introduce DSelect-k, a smooth version of the top-𝑘 gating algorithm whose enhanced smoothness
properties yield improvements over conventional top-𝑘 gating. Kudugunta et al. [84] diverge from the prevalent token-level gating strategies by
introducing a sentence-level gating mechanism. This approach involves generating a sentence rep-
resentation by averaging the tokens within a sequence and subsequently routing it to an expert. Chi
et al. [22] observe that prevailing gating mechanisms tend to push hidden representations to cluster
around expert centroids, implying a trend toward representation collapse, which in turn harms
model performance. To counteract this issue, they project hidden vectors into a lower-dimensional
space before gating and implement L2 normalization for both token representations and expert
embeddings, thus calculating gating scores within a low-dimensional hypersphere. Skywork-MoE
[154] proposes two innovative techniques: gating logit normalization, which improves expert
diversification, and adaptive auxiliary loss coefficients, which provides layer-specific adjustment of
auxiliary loss coefficients. Yuan 2.0-M32 [166] proposes a new router network, Attention Router
(as illustrated in Figure 4 (e)), which implements a more efficient selection of experts and yields an
enhancement in model accuracy over classical linear router network.
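
As an illustration of the BASE formulation mentioned above, the sketch below solves the balanced token-to-expert assignment with SciPy's Hungarian solver rather than the auction algorithm used in the original work, and assumes the token count is divisible by the number of experts.

```python
# A minimal sketch of BASE-style balanced assignment via linear assignment.
import numpy as np
from scipy.optimize import linear_sum_assignment

def base_assign(affinity: np.ndarray) -> np.ndarray:
    """affinity: (T, N) token-expert affinity scores; returns (T,) expert ids, T/N per expert."""
    T, N = affinity.shape
    slots_per_expert = T // N
    # Duplicate each expert column once per slot so the assignment is one-to-one and balanced.
    cost = -np.repeat(affinity, slots_per_expert, axis=1)    # negate to maximize affinity
    rows, cols = linear_sum_assignment(cost)
    assignment = np.empty(T, dtype=np.int64)
    assignment[rows] = cols // slots_per_expert              # map slot index back to expert id
    return assignment

assignment = base_assign(np.random.randn(32, 4))
print(np.bincount(assignment))    # exactly 8 tokens per expert
```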
Non-trainable Token-Choice Gating. The dynamic training of gating functions within MoE
models is standard practice; however, some research has ventured into the realm of non-trainable
token-choice gating mechanisms. The most significant benefit of non-trainable token-choice gating
is that no additional gating network parameters are required and the full load balancing can
be achieved through specific gating mechanisms. The Hash Layer [129] utilizes a random fixed
gating approach by hashing the input token, achieving competitive results without the necessity of
training the gating network. The load balancing is facilitated by the selection of hash functions


[Figure 4 graphic omitted: schematic panels (a)–(f) of the gating functions listed in the caption below.]

Fig. 4. The illustration of various gating functions employed in MoE models, including (a) sparse MoE with
top-1 gating [49], (b) BASE layers [87], (c) the combination of grouped domain mapping and random gating
[127], (d) expert-choice gating [193], (e) attention router [166], and (f) soft MoE with expert merging [105].

prior to training, which can equitably distribute token batches. Zuo et al. [198] introduce THOR,
an algorithm that randomly allocates two experts to each input during training and inference with
a consistency regularized loss promoting consistent predictions. Gururangan et al. [56] propose
the DEMix model, which explicitly assigns distinct experts to discrete pre-training domains, with
domain matching being employed to select experts corresponding to the training inputs. Given the
potential suboptimality of domain categorization and its limited scope in encompassing test-time
domains, a single domain expert selection could undermine the model’s generalizability. To address
this, DEMix adopts a parameter-free probabilistic method that dynamically estimates the domain-
weighted mixture at inference. Kudugunta et al. [84] explore task-level gating incorporating prior
knowledge tags, and similarly, M2M-100 model [47] utilizes explicit language-specific sublayers with
deterministically routing input tokens based on their language. Building upon the aforementioned
non-trainable gating strategies—random gating and domain mapping—PanGu-Σ [127] presents the
Random Routed Experts (RRE) mechanism. As illustrated in Figure 4 (c), this approach initially
routes tokens to a domain-specific expert group, followed by a random selection within that group.
In contrast to explicit language-specific expert selection, NLLB [29] leverages trainable gating
to manage multilingual machine translation tasks, outperforming the M2M-100 approach [47].
Addressing task interference in generalist models, Zhu et al. [195] introduce the Conditional MoE,
which augments MoE with trainable gating by integrating conditional information at various levels,
such as token-level, context-level, modality-level, task-level, and predefined token attributes. Ye
et al. [175] further investigate the incorporation of trainable gating at task-level MoE. Addition-
ally, STABLEMOE [31] identifies a challenge with existing learning-to-route MoE methods: the

phenomenon of gating fluctuation. To counter this, STABLEMOE employs a two-stage training
process. The first stage focuses on acquiring a balanced and cohesive gating strategy, which is then
distilled into a lightweight gate function, decoupled from the backbone model. Subsequently, the
second stage leverages the distilled gate for token-to-expert assignments and freezes it to ensure a
stable gating strategy throughout further training.
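
A minimal sketch of non-trainable hash routing in the spirit of the Hash Layer discussed above is shown below; the fixed random token-id-to-expert table is an illustrative stand-in for the hash functions chosen in the original work.

```python
# A minimal sketch of non-trainable, hash-style routing: no gating parameters are learned.
import torch

class HashRouter:
    def __init__(self, vocab_size: int, num_experts: int, seed: int = 0):
        g = torch.Generator().manual_seed(seed)
        # A fixed random token-id -> expert lookup table, chosen once before training.
        self.table = torch.randint(0, num_experts, (vocab_size,), generator=g)

    def __call__(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.table[token_ids]      # (T,) expert id per token

router = HashRouter(vocab_size=32000, num_experts=8)
experts = router(torch.randint(0, 32000, (16,)))
# Each token's expert output is used directly (gate weight 1.0), since routing is deterministic.
```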
Expert-Choice Gating. Zhou et al. [193] propose an inversion of the conventional token-choice
gating paradigm, wherein each expert selects the top-𝑘 tokens they will process, as illustrated in
Figure 4 (d). This approach circumvents the necessity for auxiliary load balancing losses during
training, ensuring a uniform distribution of tokens across experts. However, this method may result
in uneven token coverage, with some tokens potentially being processed by multiple experts or
not at all. Despite this, the technique demonstrates strong empirical performance and offers an
adaptive computational interpretation where the model can implicitly apply more computation
to certain tokens. The effectiveness of expert-choice gating is further validated by Zhou et al. in
their subsequent Brainformers study [192]. Additionally, Komatsuzaki et al. [82] integrate the
expert-choice gating strategy within the Vision Transformer and adapt it for the encoder in T5
models, while maintaining token-choice gating for the T5 decoder.
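
The sketch below illustrates expert-choice gating: each expert, rather than each token, selects its top-k candidates, so per-expert load is uniform by construction; the capacity of 16 tokens per expert is an illustrative assumption.

```python
# A minimal sketch of expert-choice gating.
import torch
import torch.nn.functional as F

def expert_choice(gate_logits: torch.Tensor, tokens_per_expert: int):
    """gate_logits: (T, N). Returns (N, k) token indices and (N, k) combine weights."""
    scores = F.softmax(gate_logits, dim=-1)            # token-expert affinities
    # Transpose so each expert (row) picks its top-k tokens (columns).
    weights, token_idx = scores.t().topk(tokens_per_expert, dim=-1)
    return token_idx, weights

token_idx, weights = expert_choice(torch.randn(64, 8), tokens_per_expert=16)
# Note: some tokens may be chosen by several experts, others by none, as discussed above.
```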

4.1.2 Dense. In Section 2.1, we discuss the enduring relevance of dense MoE, which activates all
the experts for each input process. This dense paradigm continues to inform current innovations in
MoE training and inference methodologies, as elaborated in Section 4.4.1. While sparse activation
of experts, as a trade-off, may yield efficiency gains at the expense of some performance when
compared to a densely activated MoE with an equivalent number of total parameters [30, 117, 140], it
represents a strategic adjustment to balance computational demands with model capability. Notably,
dense activation performs well in the context of LoRA-MoE fine-tuning, where the computational
overhead of LoRA experts is comparatively low. This approach enables the effective and flexible
integration of multiple LoRAs across a variety of downstream tasks. It preserves the generative
capabilities of the original pre-trained model and maintains the unique characteristics of individual
LoRAs for each task [43, 167].

4.1.3 Soft. Deciding the allocation of appropriate experts to each input token poses the fundamental
discrete optimization challenge for sparse MoE. This often necessitates heuristic auxiliary losses
to ensure balanced expert engagement and to minimize unassigned tokens. These issues become
more pronounced in scenarios involving out-of-distribution data, such as small inference batches,
novel inputs, or during transfer learning. Similar to dense MoE, the soft MoE approach maintains
full differentiability by leveraging all the experts for processing each input, thus avoiding issues
inherent to discrete expert selection. We distinguish soft MoE from dense MoE to highlight the
characteristic that mitigates computational demands through the gating-weighted merging of input
tokens or experts.
Token Merging. Puigcerver et al. [118] proposed the Soft MoE, which eschews the conventional
sparse and discrete gating mechanism in favor of a soft assignment strategy that merges tokens. This
method computes several weighted averages of all tokens, with weights depending on both tokens
and experts, and processes each aggregate with its respective expert. Their experimental results in
image classification demonstrate that soft MoE enhances the stability of gating function training
and inherently maintains balance. HOMOE [33] follows the design of Soft MoE and combines it
with a Hopfield network to address the challenges of Compositional Zero-Shot Learning tasks.
Yet, merging input tokens complicates its application in auto-regressive decoders, as future tokens
required for averaging are inaccessible during inference.
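
A minimal sketch of Soft MoE-style token merging with one slot per expert is shown below: slots are gating-weighted averages of all tokens (dispatch), each expert processes its slot, and slot outputs are mixed back per token (combine); the simple linear experts and shapes are illustrative assumptions.

```python
# A minimal sketch of Soft MoE token merging with one slot per expert.
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.slot_embed = nn.Parameter(torch.randn(d_model, num_experts))   # one slot per expert
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, d_model); logits: (T, num_slots)
        logits = x @ self.slot_embed
        dispatch = logits.softmax(dim=0)      # over tokens: how much each token feeds a slot
        combine = logits.softmax(dim=1)       # over slots: how much each slot feeds a token
        slots = dispatch.t() @ x              # (num_slots, d_model) weighted token averages
        slot_out = torch.stack([f(s) for f, s in zip(self.experts, slots)])  # each expert on its slot
        return combine @ slot_out             # (T, d_model)

y = SoftMoE(d_model=16, num_experts=4)(torch.randn(10, 16))
```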


Expert Merging. In contrast to the merging of input tokens, Muqeeth et al. [105] introduced
the Soft Merging of Experts with Adaptive Routing (SMEAR) framework, which circumvents dis-
crete gating by merging all the experts’ parameters through a weighted average, as illustrated in
Figure 4 (f). They argue that conventional sparse MoE models often fail to match the performance
of their parameter-matched dense counterparts or those utilizing non-learned heuristic gating
functions, potentially due to flawed gradient estimation methods for training modules with non-
differentiable, discrete gating decisions. By processing the input tokens through a single merged
expert, SMEAR does not incur a significant increase in computational costs and enables standard
gradient-based training. Empirical evaluations on T5-GLUE and ResNet-DomainNet benchmarks re-
veal that SMEAR-equipped models surpass those with metadata-based [56, 84] or gradient-estimated
learning gating strategies. On ResNet-DomainNet, SMEAR achieved a 1.5% higher average accu-
racy than Soft MoE [118] with single “slot” per expert, at the expense of a near 10% reduction in
throughput. Subsequent contributions by Zhong et al. [189] argue that SMEAR’s demonstrated
advantages are confined to downstream fine-tuning on classification tasks. They present Lory, an
innovative approach for scaling such expert merging architectures to auto-regressive language
model pre-training. Lory [189] introduces a causal segment routing strategy, conducting expert
merging at the segment level while maintaining the auto-regressive nature of language models.
Furthermore, it employs similarity-based data batching to direct expert specialization in particular
domains or topics. Lory’s empirical validation on LLaMA models showcases significant improve-
ments over parameter-matched dense models in terms of perplexity (by 13.9%) and on diverse
downstream tasks (by 1.5%-11.1%), highlighting the potential of fully-differentiable MoE architec-
tures for language model pre-training and encouraging further investigation in this area. In addition,
expert merging methods have demonstrated efficacy in parameter-efficient fine-tuning (PEFT) MoE
contexts. Zadouri et al. [178] substantiate that soft merging of experts significantly outperforms
sparse gating mechanisms (top-1, top-2) in the T5 models [120] fine-tuning with the MoV-10 setting
of 10 (IA)3 vector expert. Wu et al. [164] propose Omni-SMoLA, an architecture leveraging the soft
method to mix multimodal low-rank experts, improving the generalist performance across a broad
range of generative vision-language tasks.
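
The sketch below illustrates expert merging in the spirit of SMEAR: instead of routing the input to discrete experts, expert parameters are averaged under the gating distribution and the input passes once through the merged expert; the single-linear experts and sequence-pooled gate are illustrative assumptions.

```python
# A minimal sketch of SMEAR-style expert merging.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MergedExpertLayer(nn.Module):
    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        # Per-expert parameters of a single linear expert, stored as stacked tensors.
        self.weight = nn.Parameter(torch.randn(num_experts, d_model, d_model) * 0.02)
        self.bias = nn.Parameter(torch.zeros(num_experts, d_model))
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # One gating distribution per input (here a mean-pooled sequence representation): (N,)
        probs = F.softmax(self.gate(x.mean(dim=0)), dim=-1)
        merged_w = torch.einsum("n,nio->io", probs, self.weight)   # weighted average of expert weights
        merged_b = probs @ self.bias
        return x @ merged_w + merged_b       # a single forward pass through the merged expert

y = MergedExpertLayer(d_model=16, num_experts=4)(torch.randn(10, 16))
```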

4.2 Experts
In this section, we delineate the architecture of expert networks within MoE framework, following
our discussion on the gating function that orchestrates the activation of these experts.

4.2.1 Network Types. Since the initial integration of MoE into transformer architectures [49, 86,
197], MoE has served as a substitute for Feed-Forward Network (FFN) modules within these models.
Typically, each expert within a MoE layer replicates the architecture of the FFN it replaces. This
paradigm, wherein FFNs are utilized as experts, remains predominant, and subsequent refinements
will be expounded upon in Sections 4.2.2 to 4.2.4.
Feed-Forward Network. As discussed in existing work [145], the predilection for leveraging
MoE in the context of FFNs is rooted in the hypothesis that self-attention layers exhibit lower
sparsity and less domain specificity than FFN layers. Pan et al. [117] provide empirical support
for this, revealing marked sparsity in FFN layers compared to self-attention layers, through their
analysis of downstream Wikitext tasks using their pre-trained DS-MoE models. Their results
indicate a mere 20% active expert engagement in FFN layers, in contrast to the 80% observed within
self-attention layers. In an earlier investigation of FFN computational patterns, Zhang et al. [184]
observe that most inputs activate only a small proportion of FFN neurons, thus corroborating
the inherent sparsity of FFNs.


[Figure 5 graphic omitted: panel (a) shows the Mixture of Attention Heads with shared key/value projections and gated query/output experts; panel (b) shows a shared FFN expert alongside gated FFN experts.]

Fig. 5. The illustration of Mixture of Attention Heads [182] (a) and Shared Expert [121] (b) architectures.

Attention. While the focus of MoE research has predominantly been on FFN layers within the
Transformer architecture, Zhang et al. [182] introduce the Mixture of Attention Heads (MoA), an
innovative architecture that combines multi-head attention layers with MoE to further enhance
performance and restrain computational cost. As delineated in Figure 5 (a), MoA employs two sets of
experts, one for query projection and one for output projection, whose expert indices are selected
jointly through a common gating network. To reduce computational complexity, MoA shares
the key (W_k) and value (W_v) projection weights across attention experts, with experts differentiated
only by their respective query (q_t W_i^q) and output (o_{i,t} W_i^o) projection weights, allowing for shared
pre-computation of the key (K W_k) and value (V W_v) sequences. Subsequent work such as DS-MoE
[117], JetMoE [139], and ModuleFormer [140] follows the design of MoA and further refines the
combination of MoE and attention layer.
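
A simplified single-head sketch of the MoA idea is given below: key and value projections are shared and pre-computed once, while each token's top-1 expert supplies its own query and output projections; top-1 selection and the dense loop over experts are simplifications for clarity rather than the original multi-head formulation.

```python
# A simplified single-head sketch of Mixture of Attention Heads (MoA).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoASketch(nn.Module):
    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.wk = nn.Linear(d_model, d_model, bias=False)   # shared key projection W_k
        self.wv = nn.Linear(d_model, d_model, bias=False)   # shared value projection W_v
        self.wq = nn.ModuleList([nn.Linear(d_model, d_model, bias=False) for _ in range(num_experts)])
        self.wo = nn.ModuleList([nn.Linear(d_model, d_model, bias=False) for _ in range(num_experts)])
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, d_model); K and V are pre-computed once and shared by all experts.
        k, v = self.wk(x), self.wv(x)
        gates = F.softmax(self.gate(x), dim=-1)              # (T, N)
        top1 = gates.argmax(dim=-1)                          # top-1 expert per token
        out = torch.zeros_like(x)
        for i, (wq, wo) in enumerate(zip(self.wq, self.wo)):
            mask = top1 == i
            if mask.any():
                q = wq(x[mask])                              # expert-specific query projection
                attn = F.softmax(q @ k.t() / k.shape[-1] ** 0.5, dim=-1)
                out[mask] = wo(attn @ v) * gates[mask, i:i+1]  # expert-specific output projection
        return out

y = MoASketch(d_model=16, num_experts=4)(torch.randn(10, 16))
```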
Others. In addition to the aforementioned expert network types, researchers have explored
the use of Convolutional Neural Network (CNN) as expert [20, 25, 54, 159, 183]. Moreover, recent
endeavors that integrate Parameter-Efficient Fine-Tuning (PEFT) techniques with MoE, such as
employing Low-Rank Adaptation (LoRA) [66] as expert, have shown promising results, which are
discussed in Section 4.3.
4.2.2 Hyperparameters. The scale of sparse MoE models is governed by several critical hyperpa-
rameters that extend beyond those of dense transformer models. These include (1) the count of
experts per MoE layer, (2) the size of each expert, and (3) the placement frequency of MoE layers
throughout the model. The selection of these hyperparameters is crucial, as it profoundly influences
model performance and computational efficiency across various tasks. Optimal hyperparameter
choices are thus contingent upon the specific application requirements and the constraints of the
computational infrastructure. Our subsequent analysis, informed by the exemplified models listed
in Table 2, explores these hyperparameter decisions in depth. Meanwhile, we enumerate some
recent open-source models, detailing their number of parameters and benchmark performance in
Table 3.
Expert Count. Initial investigations employing thousands of experts per layer yielded impressive
gains in pre-training and translation quality [49, 86, 135]. Nonetheless, the quality of sparse MoE
models is disproportionately reduced under domain shift [6] or when fine-tuning on diverse task
distributions [49]. GLaM [44] adopts a configuration of 64 experts, guided by their findings that
a 64-expert setup with top-2 gating strikes an optimal balance between execution efficiency and
performance across zero-shot, one-shot, and few-shot scenarios. Reflecting this trend, more recent
sparse MoE models [34, 74, 94, 151, 154, 166, 172, 197] commonly utilize no more than 64 experts.
Additionally, DeepSpeed-MoE [121] adopts a Pyramid-MoE approach, positioning MoE layers with
a larger expert count towards the network’s end.


Table 2. Comparative configurations of MoE with FFN experts in selected models. Models within each reference
are differentiated by size, indicated either by the total or the activated/total parameter count. Both activated
and total expert counts include shared experts when utilized. d_model is the hidden size, d_ffn is the intermediate
size of FFNs, d_expert is the intermediate size of FFN experts, #L is the number of layers, and #H and d_head are
the number of attention heads and the attention head dimension.
Reference | Models | Expert Count (Activ./Total) | d_model | d_ffn | d_expert | #L | #H | d_head | Placement Frequency | Activation Function | Shared Expert Count
GShard [86] (2020) | 600B | 2/2048 | 1024 | 8192 | d_ffn | 36 | 16 | 128 | 1/2 | ReLU | 0
GShard [86] (2020) | 200B | 2/2048 | 1024 | 8192 | d_ffn | 12 | 16 | 128 | 1/2 | ReLU | 0
GShard [86] (2020) | 150B | 2/512 | 1024 | 8192 | d_ffn | 36 | 16 | 128 | 1/2 | ReLU | 0
GShard [86] (2020) | 37B | 2/128 | 1024 | 8192 | d_ffn | 36 | 16 | 128 | 1/2 | ReLU | 0
Switch [49] (2021) | 7B | 1/128 | 768 | 2048 | d_ffn | 12 | 12 | 64 | 1/2 | GEGLU | 0
Switch [49] (2021) | 26B | 1/128 | 1024 | 2816 | d_ffn | 24 | 16 | 64 | 1/2 | GEGLU | 0
Switch [49] (2021) | 395B | 1/64 | 4096 | 10240 | d_ffn | 24 | 64 | 64 | 1/2 | GEGLU | 0
Switch [49] (2021) | 1571B | 1/2048 | 2080 | 6144 | d_ffn | 15 | 32 | 64 | 1 | ReLU | 0
GLaM [44] (2021) | 0.1B/1.9B | 2/64 | 768 | 3072 | d_ffn | 12 | 12 | 64 | 1/2 | GEGLU | 0
GLaM [44] (2021) | 1.7B/27B | 2/64 | 2048 | 8192 | d_ffn | 24 | 16 | 128 | 1/2 | GEGLU | 0
GLaM [44] (2021) | 8B/143B | 2/64 | 4096 | 16384 | d_ffn | 32 | 32 | 128 | 1/2 | GEGLU | 0
GLaM [44] (2021) | 64B/1.2T | 2/64 | 8192 | 32768 | d_ffn | 64 | 128 | 128 | 1/2 | GEGLU | 0
DeepSpeed-MoE [121] (2022) | 350M/13B | 2/128 | 1024 | 4 d_model | d_ffn | 24 | 16 | 64 | 1/2 | GeLU | 0
DeepSpeed-MoE [121] (2022) | 1.3B/52B | 2/128 | 2048 | 4 d_model | d_ffn | 24 | 16 | 128 | 1/2 | GeLU | 0
DeepSpeed-MoE [121] (2022) | PR-350M/4B | 2/32-2/64 | 1024 | 4 d_model | d_ffn | 24 | 16 | 64 | 1/2, 10L-32E, 2L-64E | GeLU | 1
DeepSpeed-MoE [121] (2022) | PR-1.3B/31B | 2/64-2/128 | 2048 | 4 d_model | d_ffn | 24 | 16 | 128 | 1/2, 10L-64E, 2L-128E | GeLU | 1
ST-MoE [197] (2022) | 0.8B/4.1B | 2/32 | 1024 | 2816 | d_ffn | 27 | 16 | 64 | 1/4, add extra FFN | GEGLU | 0
ST-MoE [197] (2022) | 32B/269B | 2/64 | 5120 | 20480 | d_ffn | 27 | 64 | 128 | 1/4, add extra FFN | GEGLU | 0
Mixtral [74] (2023) | 13B/47B | 2/8 | 4096 | 14336 | d_ffn | 32 | 32 | 128 | 1 | SwiGLU | 0
Mixtral [74] (2023) | 39B/141B | 2/8 | 6144 | 16384 | d_ffn | 56 | 48 | 128 | 1 | SwiGLU | 0
LLAMA-MoE [149] (2023) | 3.0B/6.7B | 2/16 | 4096 | 11008 | 688 | 32 | 32 | 128 | 1 | SwiGLU | 0
LLAMA-MoE [149] (2023) | 3.5B/6.7B | 4/16 | 4096 | 11008 | 688 | 32 | 32 | 128 | 1 | SwiGLU | 0
LLAMA-MoE [149] (2023) | 3.5B/6.7B | 2/8 | 4096 | 11008 | 1376 | 32 | 32 | 128 | 1 | SwiGLU | 0
DeepSeekMoE [30] (2024) | 0.24B/1.89B | 8/64 | 1280 | - | 1/4 d_ffn | 9 | 10 | 128 | 1 | SwiGLU | 1
DeepSeekMoE [30] (2024) | 2.8B/16.4B | 8/66 | 2048 | 10944 | 1408 | 28 | 16 | 128 | 1, except 1st layer | SwiGLU | 2
DeepSeekMoE [30] (2024) | 22B/145B | 16/132 | 4096 | - | 1/8 d_ffn | 62 | 32 | 128 | 1, except 1st layer | SwiGLU | 4
OpenMoE [172] (2024) | 339M/650M | 2/16 | 768 | 3072 | d_ffn | 12 | 12 | 64 | 1/4 | SwiGLU | 1
OpenMoE [172] (2024) | 2.6B/8.7B | 2/32 | 2048 | 8192 | d_ffn | 24 | 24 | 128 | 1/6 | SwiGLU | 1
OpenMoE [172] (2024) | 6.8B/34B | 2/32 | 3072 | 12288 | d_ffn | 32 | 24 | 128 | 1/4 | SwiGLU | 1
Qwen1.5-MoE [151] (2024) | 2.7B/14.3B | 8/64 | 2048 | 5632 | 1408 | 24 | 16 | 128 | 1 | SwiGLU | 4
DBRX [34] (2024) | 36B/132B | 4/16 | 6144 | 10752 | d_ffn | 40 | 48 | 128 | 1 | SwiGLU | 0
Jamba [94] (2024) | 12B/52B | 2/16 | 4096 | 14336 | d_ffn | 32 | 32 | 128 | 1/2, 1:7 Attention:Mamba | SwiGLU | 0
Skywork-MoE [154] (2024) | 22B/146B | 2/16 | 4608 | 12288 | d_ffn | 52 | 36 | 128 | 1 | SwiGLU | 0
Yuan 2.0-M32 [166] (2024) | 3.7B/40B | 2/32 | 2048 | 8192 | d_ffn | 24 | 16 | 256 | 1 | SwiGLU | 0

Expert Size. To scale the model effectively, GLaM [44] prioritizes the expansion of the inter-
mediate hidden dimension per expert while standardizing the expert count at 64, a strategy that
often requires the implementation of tensor parallelism across multiple accelerators to maintain
computational efficiency [44, 49, 121]. From this period forward, MoE models [34, 74, 154, 197]
typically featured larger expert dimensions. Differently, DeepSeekMoE [30, 36] introduces the
concept of fine-grained expert segmentation by subdividing the intermediate hidden dimension
of FFN expert, while preserving the overall parameter count. Specifically, DeepSeekMoE-145B
employs a reduced intermediate hidden dimension at one-eighth that of its dense FFN counterpart,
increasing both the number of experts (from 16 to 128) and the number of active experts (from top-2
to top-16) by a factor of eight. They believe that this strategy not only refines the decomposition
of knowledge across experts, facilitating more precise learning, but also enhances the flexibility


Table 3. A collection of recent open-source models detailing activated and total parameter counts, alongside
performance benchmarks such as MMLU [61] (5-shot), GSM8K [27] (5-shot), MATH [62] (4-shot), and
HumanEval [14] (0-shot), unless specified otherwise.
Name | Time | Affiliation | Activ. Params | Total Params | MMLU | GSM8K | MATH | HumanEval | Link
Mixtral-8x7B-v0.1 | 2023.12 | Mistral | 13B | 47B | 70.6 | 58.4, 74.4 (8-shot) | 28.4 | 40.2 | https://huggingface.co/mistralai/Mixtral-8x7B-v0.1
DeepSeekMoE-16B-Base | 2024.1 | DeepSeek | 3B | 16B | 45.0 | 18.8 (8-shot) | 4.3 | 26.8 | https://huggingface.co/deepseek-ai/deepseek-moe-16b-base
Grok-1 | 2024.3 | xAI | 86B | 314B | 73.0 | 62.9 | 23.9 | 63.2 | https://github.com/xai-org/grok-1
Qwen1.5-MoE-A2.7B | 2024.3 | Alibaba | 3B | 14B | 62.5 | 61.5 (8-shot) | - | 34.2 | https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B
DBRX Instruct | 2024.3 | Databricks | 36B | 132B | 73.7 | 72.8 | - | 70.1 | https://huggingface.co/databricks/dbrx-instruct
Jamba-v0.1 | 2024.3 | AI21 Labs | 12B | 52B | 67.4 | 59.9 (3-shot) | - | 29.3 | https://huggingface.co/ai21labs/Jamba-v0.1
Mixtral-8x22B-v0.1 | 2024.4 | Mistral | 39B | 141B | 77.8 | 78.6, 88.4 (8-shot) | 41.8 | 45.1 | https://huggingface.co/mistralai/Mixtral-8x22B-v0.1
Arctic Instruct | 2024.4 | Snowflake | 17B | 480B | 67.3 | 74.2 | - | - | https://huggingface.co/Snowflake/snowflake-arctic-instruct
DeepSeek-V2 | 2024.5 | DeepSeek | 21B | 236B | 78.5 | 79.2 (8-shot) | 43.6 | 48.8 | https://huggingface.co/deepseek-ai/DeepSeek-V2
DeepSeek-V2-Chat (RL) | 2024.5 | DeepSeek | 21B | 236B | 77.8 | 92.2 (8-shot) | 53.9 | 81.1 | https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat
Yuan 2.0-M32 | 2024.5 | IEIT | 4B | 40B | 72.2 | 92.7 (8-shot) | 55.9 (8-shot) | 74.4 | https://huggingface.co/IEITYuan/Yuan2-M32
Skywork-MoE-Base | 2024.6 | Kunlun | 22B | 146B | 77.4 | 76.1 | 31.9 | 43.9 | https://huggingface.co/Skywork/Skywork-MoE-Base

of expert activation combinations, allowing for more specialized and targeted knowledge capture.
Qwen1.5-MoE [151] and DBRX [34] adopt a similar fine-grained expert segmentation strategy.
Results from LLAMA-MoE [149], which evenly splits the parameters of dense FFNs into non-overlapping
experts to maintain a consistent parameter count, indicate that activating 4 out of 16 experts with
d_expert = 688 marginally outperforms activating 2 out of 8 experts with d_expert = 1376.
Frequency of MoE Layers. Sparse MoE models typically evolve from dense architectures by
interspersing MoE layers in place of the dense FFN layers at regular intervals. Although a higher
frequency of MoE layers can enlarge the model size, it also introduces greater system overhead. In
practice, most MoE models features alternate FFN replacement (1/2) with MoE layers [6, 44, 86, 121].
Nevertheless, variations exist, with some models incorporating MoE layers every fourth layer (1/4)
[172, 197] or in every layer (1/1) [30, 49]. Following the introduction of Mixtral 8x7B [74], the trend
appears to shift towards placing MoE in every layer of the model, with a common choice of only 8 or
16 experts, each mirroring the dimensionality of a dense FFN [30, 34, 151, 154].
Research into the optimal configuration of MoE layers has been extensive. V-MoE [128] employs
MoE in the last few even-numbered Transformer layers, noting that, despite using fewer MoE
layers, the impact on performance is minimal while computational speed is significantly enhanced.
DeepSeekMoE-16B/-145B [30] replaces all FFNs with MoE layers, excluding the first, due to the
observed slower convergence of load balance status in the first layer. MoE-LLaVA [95], a recently
popular open Large Vision-Language Model (LVLM), demonstrates that alternating MoE placement
yields superior model quality and execution efficiency on multimodal tasks, compared to every-
layer MoE placement or "First-Half" and "Second-Half" configurations. ST-MoE [197] found that
adding a dense FFN adjacent to each MoE layer can improve model quality. Brainformers [192]
introduce a nonuniform architecture that integrates MoE layers, dense FFNs, attention mechanisms,
and a variety of layer normalizations and activation functions without strict sequential layering,
trading architectural regularity for the flexibility of sub-layer composition. Jamba [94] integrates
the architecture of Mamba [55] by adopting a 1:7 ratio of attention-to-Mamba layers.
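As a rough illustration of these placement choices, the sketch below interleaves MoE layers into a layer stack at a configurable frequency; `make_dense_ffn` and `make_moe_layer` are hypothetical factory callables, and the `skip_first` flag mimics the DeepSeekMoE choice of keeping the first layer dense.

```python
import torch.nn as nn

def interleave_moe_layers(num_layers: int, moe_interval: int,
                          make_dense_ffn, make_moe_layer, skip_first: bool = False):
    """Replace the dense FFN with an MoE layer every `moe_interval` blocks
    (2 gives the common 1/2 frequency, 1 places MoE in every layer), optionally
    keeping the first block dense."""
    blocks = []
    for idx in range(num_layers):
        is_moe = ((idx + 1) % moe_interval == 0) and not (skip_first and idx == 0)
        blocks.append(make_moe_layer() if is_moe else make_dense_ffn())
    return nn.ModuleList(blocks)
```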
4.2.3 Activation Function. Building upon dense Transformer architectures, sparse MoE models
have adopted a progression of activation functions paralleling those in leading dense large language
models, including BERT [38], T5 [120], GPT [11], LLAMA [155] and so on. The evolution of
activation functions has seen a shift from ReLU [52] to more advanced options such as GeLU [63],


GeGLU [133], and SwiGLU [133]. This trend extends to other components of MoE models, which
now frequently incorporate Root Mean Square Layer Normalization (RMSNorm) [180], Grouped
Query Attention (GQA) [2], and Rotary Position Embeddings (RoPE) [144].

4.2.4 Shared Expert. DeepSpeed-MoE [121] innovatively introduces the Residual-MoE architecture,
wherein each token is processed by a fixed expert and another selected through gating, achieving
two experts engagement per layer without increasing the communication cost beyond that of top-1
gating. This approach considers the gating-selected MoE expert as an error-correcting adjunct to the
fixed dense FFN. A conceptually similar approach, Conditional MoE Routing (CMR), is employed
in NLLB [29], which also combines the outputs of dense FFN and MoE layers. This paradigm
of integrating fixed FFN with sparse MoE, often referred to as shared expert and illustrated in
Figure 5 (b), has gained traction in recent language models such as DeepSeekMoE [30], OpenMoE
[172], Qwen1.5-MoE [151], and MoCLE [53], indicating its ascension to a mainstream configuration.
Instead of using a single shared expert, DeepSeekMoE [30] and Qwen1.5-MoE [151] employ multiple
shared experts, due to their fine-grained expert segmentation design.
However, the shared expert configuration, while effective in NLP tasks, has not demonstrated the
same level of enhancement in vision tasks. Empirical evidence from ScMoE [12] indicates that
pairing one shared expert with one gating-selected expert yields only comparable performance to
standard top-1 MoE in SwinV2-MoE models. Additionally, based on the design of shared expert,
ScMoE decouples the MoE process to separately handle the representations from preceding layers
and integrate them with the outputs processed by the shared expert of the current layer, thus
improving efficiency by facilitating overlap in communication and computation. A comparable
method to enhance overlapping is employed in the Dense-MoE hybrid transformer architecture,
as delineated in Snowflake Arctic [152], which bears resemblance to the Lora MoE framework
discussed in Section 4.3.3 and illustrated in Figure 6 (d).
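A minimal sketch of the shared expert pattern discussed above is given below: every token always passes through a fixed shared FFN expert, while a gate selects one routed expert whose probability-weighted output is added to it. This is an illustrative simplification of the configuration in Figure 5 (b), not the implementation of any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedExpertMoE(nn.Module):
    """Shared-expert sketch: a fixed always-active expert plus one top-1 routed expert."""
    def __init__(self, d_model: int, d_ff: int, num_routed_experts: int):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.shared_expert = ffn()
        self.routed_experts = nn.ModuleList(ffn() for _ in range(num_routed_experts))
        self.gate = nn.Linear(d_model, num_routed_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [num_tokens, d_model]
        scores = F.softmax(self.gate(x), dim=-1)      # routing probabilities
        top_p, top_idx = scores.max(dim=-1)           # top-1 routing
        routed_out = torch.zeros_like(x)
        for e, expert in enumerate(self.routed_experts):
            mask = top_idx == e
            if mask.any():
                routed_out[mask] = top_p[mask, None] * expert(x[mask])
        # The shared expert output is always added, as in Residual-MoE-style designs.
        return self.shared_expert(x) + routed_out

# Usage: y = SharedExpertMoE(d_model=512, d_ff=2048, num_routed_experts=8)(torch.randn(16, 512))
```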

4.3 Mixture of Parameter-Efficient Experts


LLMs pretrained on generic massive datasets have demonstrated impressive abilities, enabling their
deployment across diverse tasks [51]. However, to tailor a pretrained LLM for a specific downstream
task, fine-tuning is essential. Traditional full fine-tuning, which updates all the parameters of the
base model, is computationally intensive, especially as model sizes continue to grow [40]. To
address this issue, research into parameter-efficient fine-tuning (PEFT) has emerged, intending to
reduce computational demands during the adaptation of a generic pre-trained model to particular
tasks [57]. PEFT methods only update a small set of parameters while maintaining the rest of the
base model untouched [93]. These techniques have achieved state-of-the-art performance across
numerous NLP tasks [66, 96].
Despite these successes, PEFT approaches often struggle with generalizing across multiple tasks
due to their limited scope of trainable parameters and the potential for catastrophic forgetting [88].
To mitigate these limitations, a line of mixture of parameter-efficient experts (MoPE) research has
emerged, focusing on integrating the MoE framework with PEFT [88, 97]. MoPE incorporates the
MoE’s gating mechanism and multi-expert architecture, but with each expert constructed using
PEFT techniques [114]. The subtle combination boosts PEFT’s performance under the multi-task
scenario [196]. Additionally, by leveraging PEFT for constructing experts, MoPE operates with
fewer parameters, achieving greater resource efficiency compared to traditional MoE models [178].
MoPE harnesses the best of both fields: the task versatility of MoE and the resource efficiency of
PEFT [88], positioning it as a promising area of study that pushes the boundaries of both fields. In
the following subsection, we will give a taxonomy of MoPE, as depicted in Figure 6, based on their


[Figure 6 shows four subfigures illustrating MoPE placement: (a) Attention, (b) FFN, (c) Transformer Block, (d) Every Layer.]

Fig. 6. The illustration of the taxonomy of MoPEs based on placement within the Transformer model architecture.
(a) exemplifies the integration of MoPE with the Key and Value modules of the attention mechanism. (b)
represents the application of MoPE to the FFN. (c) refers to the MoPE integration at the level of the Transformer
block, where separate groups of experts are allocated to the attention and the FFN, each regulated by its own
gating mechanism. (d) illustrates a layer-wise integration of MoPE, in which each Transformer layer is regarded
as a unified entity with a gating mechanism orchestrating the interplay among experts.

placement within the Transformer model architecture. We will then review recent MoPE research,
summarizing the methodologies and contributions of each study.
4.3.1 Feed-Forward Network. Following the conventional MoE structure, a series of studies in-
troduce the MoPE framework to the FFN layer of every Transformer block. During the training
process, the focus is on optimizing the parameter-efficient experts and the gating mechanism,
leaving the rest of the pre-trained model intact. As illustrated in Figure 6(b), the forward process
under the MoPE framework integrated with FFN can be expressed as:
\[
\mathrm{FFN}^{MoE}(\mathbf{x}') = \mathrm{FFN}(\mathbf{x}') + \sum_{i=1}^{n} \mathbf{x}' \Delta\mathbf{W}_i^{ffn} \cdot G^{ffn}(\mathbf{x}')_i, \tag{4.1}
\]
\[
\mathbf{x}' = \mathrm{LayerNorm}(\mathrm{SA}(\mathbf{x}) + \mathbf{x}), \tag{4.2}
\]
where \(\Delta\mathbf{W}^{ffn}\) and \(G^{ffn}(\mathbf{x})\) denote the parameter-efficient expert and the gating function applied to the
FFN layer, respectively.
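The following sketch instantiates Eq. (4.1) with LoRA-style experts: the pre-trained FFN is frozen, and only the low-rank experts and the gate receive gradients. Class and parameter names (e.g., `LoRAExpert`, `rank`) and the dense softmax gating are illustrative assumptions rather than the API or routing of any specific MoPE method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAExpert(nn.Module):
    """A parameter-efficient expert Delta W = B A with low rank r (LoRA-style)."""
    def __init__(self, d_model: int, rank: int):
        super().__init__()
        self.A = nn.Linear(d_model, rank, bias=False)
        self.B = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.B.weight)  # start as a zero update so the frozen FFN is preserved

    def forward(self, x):
        return self.B(self.A(x))

class MoPEFFN(nn.Module):
    """Sketch of Eq. (4.1): frozen dense FFN plus a gated sum of LoRA experts."""
    def __init__(self, frozen_ffn: nn.Module, d_model: int, num_experts: int, rank: int = 8):
        super().__init__()
        self.ffn = frozen_ffn
        for p in self.ffn.parameters():
            p.requires_grad_(False)        # only the experts and the gate are trained
        self.experts = nn.ModuleList(LoRAExpert(d_model, rank) for _ in range(num_experts))
        self.gate = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x):
        g = F.softmax(self.gate(x), dim=-1)                     # G^{ffn}(x'), shape [tokens, n]
        delta = torch.stack([e(x) for e in self.experts], -1)   # [tokens, d_model, n]
        return self.ffn(x) + torch.einsum("tdn,tn->td", delta, g)
```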


One of the pioneering works in this domain, LoRAMoE [43], efficiently applies the MoPE
structure to FFN. LoRAMoE integrates a few plug-in LoRA experts into the FFN layer, employing
a gating mechanism to orchestrate the experts’ contributions. Realizing the diversity in data
distributions, LoRAMoE separates the experts into two distinct groups: one focuses on learning
various downstream tasks, and the other is dedicated to aligning pretrained world knowledge with
human instructions. To ensure that each group of experts maintains its focus, LoRAMoE defines
a localized balancing constraint loss, which preserves the importance of each expert within its
group while allowing different groups to concentrate on their respective tasks. This design enables
LoRAMoE to effectively resolve the knowledge forgetting issue and enhance model performance
on downstream tasks. In a similar vein, AdaMix [160] injects a set of Adapter [65] experts after the
FFN layer in each Transformer block. Adapter tuning is a PEFT method that integrates a pair of
feed-forward up and down projection matrices into the Transformer block. During fine-tuning,
only the incremental Adapter blocks are updated, with the rest of the model unchanged. AdaMix
utilizes a stochastic routing policy that randomly selects the projection matrices during training,
maintaining computational costs equivalent to a single adapter. To minimize service costs during
inference, AdaMix averages the outputs of all experts.
Taking a different approach, MixDA[39] includes two training stages to leverage domain-specific
knowledge while preserving learned information. During the first stage, MixDA only fine-tunes
the domain-adapters that work parallel to the FFN to acquire domain-specific knowledge and keep
the world knowledge simultaneously. In the second stage, MixDA introduces a gating network and
task-adapters on top of the FFN layer for tailoring the model to specific downstream tasks. This
strategy allows for a more nuanced adaptation to the task at hand. LLaVA-MoLE[15] extends the
application of MoPE to multimodal tasks. It creates a set of LoRA experts for the FFN layer to handle
inputs from different domains, enhancing the model’s versatility. LLaVA-MoLE adopts a top-1
routing strategy, activating the most relevant expert based on the router’s output distribution, thus
maintaining computational costs close to a standard FFN with LoRA. This framework is effective
in addressing data conflicts and consistently surpasses plain-LoRA baselines across diverse data
configurations.
Contrasting with the MoPE implementations we have discussed, MixLoRA[88] creates a LoRA-
MoE framework that closely aligns with the conventional MoE models. Rather than just plugging
in multiple lightweight experts, MixLoRA fuses LoRA experts with the shared FFN layer. By
leveraging the base weights from a single FFN of the base model, MixLoRA streamlines the creation
of the MoPE architecture. Furthermore, MixLoRA implements a high-throughput framework that
significantly reduces token computation latency and memory usage during both training and
inference, optimizing performance and efficiency.

4.3.2 Attention. A branch of research has been exploring the application of the MoPE framework
with the attention mechanism. These studies typically involve augmenting the attention mechanism
by incorporating a gating network and a set of parallel experts. The MoPE framework can be applied
to the Query, Key, Value, and Output projection modules, individually or in various combinations,
within the attention mechanism. During the fine-tuning process, only the parameters of the activated
experts and the gating network are updated, while the remaining parameters of the model are
kept frozen. For example, as shown in Figure 6(a), the integration of MoPE with the Key and Value
module of the attention mechanism can be formalized as follows:

\[
\mathrm{SA}^{MoE}(\mathbf{x}) = \mathrm{Softmax}\!\left(\frac{\mathbf{Q}\left(\mathbf{K}^{T} + \sum_{i=1}^{n} \mathbf{x} \Delta\mathbf{W}_i^{k} \cdot G^{k}(\mathbf{x})_i\right)}{\sqrt{d_{head}}}\right)\left(\mathbf{V} + \sum_{i=1}^{n} \mathbf{x} \Delta\mathbf{W}_i^{v} \cdot G^{v}(\mathbf{x})_i\right), \tag{4.3}
\]


where \(\mathbf{Q}\), \(\mathbf{K}\), and \(\mathbf{V}\) represent the Query, Key, and Value modules, respectively. \(\Delta\mathbf{W}^{k}\) and \(G^{k}(\mathbf{x})\) denote
the parameter-efficient expert and its corresponding gating function for the Key module. Similarly,
\(\Delta\mathbf{W}^{v}\) and \(G^{v}(\mathbf{x})\) indicate the expert and the gating function for the Value module. Here, \(n\) is the
number of experts, and \(d_{head}\) is the head dimension in the Multi-head Attention mechanism.
Recent studies have demonstrated the effectiveness of extending MoE to the attention layer
[139, 140, 182]. Additionally, a new line of research has focused on the fusion of MoPE with
the attention mechanism to enhance the model’s efficiency and adaptability. For instance, MoELoRA
[100] applies MoE to the attention mechanism in a resource-efficient manner by leveraging a PEFT
method. MoELoRA adopts LoRA [66] to construct the experts. LoRA introduces two low-rank
matrices to receive incremental updates associated with the task-specific fine-tuning. Only the
LoRA matrices are updated while the base model is kept untouched during fine-tuning. Specifically,
MoELoRA sets multiple LoRA experts to the Query and Value matrices of the attention mechanism,
and utilizes a gating network to activate the top 𝑘 experts related to the specific tasks during both
training and inference phases. To alleviate routing randomness, MoELoRA employs a contrastive
learning loss to control the training of experts. The contrastive learning loss is designed to accentuate
the differences in output distributions between experts, thereby encouraging them to capture diverse
features relevant to the downstream tasks. MoELoRA offers a solution for flexibly combining various
computational modules tailored to downstream tasks.
Another framework, MoCLE[53], aims to resolve task conflicts that arise from the diversity of
training tasks of different sources and formats. MoCLE utilizes a clustering model to categorize
different tasks and then leverages a router to direct the clustered input to LoRA experts inserted
into the Query and Value modules of the attention mechanism. These LoRA experts contain a group
of multiple task experts and a universal expert. Each task expert is dedicated to a particular task
to reduce task conflicts, while the universal expert, trained on all tasks, helps to maintain model
generalization. SiRA[196] introduces several lightweight LoRA adapters as experts, along with a
top 𝑘 gating mechanism. To mitigate load imbalance and over-fitting issues, SiRA incorporates
a capacity constraint that limits the number of tokens each expert can process. Additionally, it
employs an auxiliary loss to promote load balancing and an expert dropout mechanism to equalize
the gating distribution. SiRA provides an efficient and fine-grained approach to improving the
quality of LoRA.
4.3.3 Transformer Block. The integration of MoPE with the Transformer architecture has received
substantial attention in recent research. This approach involves creating two groups of experts: one
for the attention mechanism, and another for the FFN within the Transformer block. Each group is
regulated by its gating mechanism to control the activation of the experts. As exhibited in Figure
6(c), the forward process under the MoPE framework integrated with the Transformer block can
be denoted as:
\[
y = \mathrm{LayerNorm}(\mathbf{x}' + \mathrm{FFN}^{MoE}(\mathbf{x}')), \tag{4.4}
\]
\[
\mathbf{x}' = \mathrm{LayerNorm}(\mathrm{SA}^{MoE}(\mathbf{x}) + \mathbf{x}). \tag{4.5}
\]
MoV [178] is one of the notable attempts that combine MoPE with the Transformer block to pursue
parameter efficiency. Utilizing the PEFT method, (IA)3 [96], MoV introduces tunable vectors that re-
scale the Key and Value modules in the attention mechanism, as well as the activation within the FFN.
By substituting conventional experts with (IA)3 vectors and updating only these lightweight experts
and their corresponding gating during fine-tuning, MoV significantly reduces the computational
burden associated with gradient calculations and lessens the memory footprint required for model
storage. Similarly, MoLORA [178] employs multiple LoRA experts to the attention and FFN blocks,
outperforming the standard LoRA approach. UniPELT [103] proposed a hybrid framework that


integrates three representative PEFT methods as experts, namely Adapter [65], Prefix-tuning [91],
and LoRA [66]. Prefix-tuning is a method that freezes the base model and optimizes the continuous
task-specific vectors prepended to the input of the attention. Within the UniPELT framework,
LoRA matrices are applied to the weight matrices of Query and Key in the attention mechanism,
Prefix vectors are added to the Key and Value modules, and the Adapter block is inserted after the
FFN layer. UniPELT leverages different gating mechanisms to dynamically activate the experts,
efficiently finding the approaches that best suit the given task.
Further broadening the scope of the LoRA-MoE framework, Omni-SMoLA[164] extends the MoPE
with three sets of LoRA experts, each tailored to handle text tokens, visual tokens, and multimodal
tokens, respectively. The specialization enables the architecture to enhance performance across
various vision-and-language tasks. In the context of MoPE research, the number of experts emerges
as a critical hyperparameter influencing downstream task performance [168, 178]. Additionally, the
use of many experts will lead to redundancy [18]. MoLA [51] is one of the pioneering works that
explores the expert allocation issue. It proposes a LoRA-MoE framework with a Layer-wise Expert
Allocation, which enables the flexible employment of varying numbers of experts across different
layers. The expert allocation strategy proposed by MoLA further improves the effectiveness of the
LoRA-MoE framework. In the specialized field of medical applications, MOELoRA[97] tackles the
challenges of task variety and high adaptation cost. It integrates LoRA experts and task-motivated
gate functions into the attention and FFN of each layer. The gating utilizes task identity to modulate
expert contributions, creating unique parameter sets tailored to individual tasks. The design of
MOELoRA combines the strengths of both MOE and LoRA, strengthening LLM’s capability in
medical domains.

4.3.4 Every Layer. There has been considerable interest in incorporating MoPE into fundamental
components such as the attention, FFN, and Transformer block. Existing work often approaches the
attention mechanism and FFN independently, employing distinct gating mechanisms to modulate
them separately. Rather than treating these elements isolated, there is a new direction that considers
the Transformer layer as an integrated whole. This shift in perspective allows for the application
of the MoPE framework to the entire Transformer layer, capturing the combined dynamics of the
attention and FFN within a unified approach. As illustrated in Figure 6(d), the forward process
under the MoPE framework integrated with every layer is as follows:

\[
y = \mathrm{LayerNorm}(\mathbf{x}' + \mathrm{FFN}(\mathbf{x}')) + \sum_{i=1}^{n} \mathbf{x} \Delta\mathbf{W}_i^{layer} \cdot G^{layer}(\mathbf{x})_i, \tag{4.6}
\]
\[
\mathbf{x}' = \mathrm{LayerNorm}(\mathrm{SA}(\mathbf{x}) + \mathbf{x}), \tag{4.7}
\]
where \(\Delta\mathbf{W}^{layer}\) and \(G^{layer}(\mathbf{x})\) denote the parameter-efficient expert and the gating function applied to the
entire layer, respectively.
In this context, the approach presented by MoLE[168] provides innovative insights. MoLE
identifies that various layers within LoRA exhibit unique features. In response to this finding,
MoLE pursues to enhance the composition effect of trained LoRAs by dynamically adjusting the
layer-specific weights according to the desired objective. This is achieved by plugging a set of
trained LoRAs alongside a gating function into each layer. MoLE treats each layer of trained LoRAs
as an individual expert and only trains the gating to learn the optimal composition weights for a
specified domain. This dynamic linear composition strategy significantly extends the versatility of
LoRA, enabling its application across a broader spectrum of practical scenarios.


[Figure 7 shows four subfigures: (a) Original, (b) Expert Models Merging (using BTX as an example), (c) Dense-to-Sparse, (d) Sparse-to-Dense.]

Fig. 7. Schematic representation of training and inference schemes related to MoE. It provides an abstracted
view of model transition, without focusing on specific model states during training or inference. Subfigure (a)
depicts the original scheme without architectural transformation. Subfigure (b) depicts the merging of distinct
expert models, exemplified by BTX [145]. Subfigure (c) depicts the transition from a dense model to a sparse
model. Subfigure (d) depicts the inverse process, where a sparse model is converted to a dense model.

4.4 Training & Inference Scheme


The architectural advancements of Mixture-of-Experts (MoE) models have been complemented
by extensive research into training and inference schemes, with the objective of optimizing both
computational efficiency and model quality.
Original Training & Inference Scheme. Initial training methodologies, as established in
seminal works [49, 86, 128, 135, 197], involve constructing an MoE model and training it from
scratch, with inference directly following the model configurations of training.
The advent of MoE models has introduced novel paradigms for training and inference, enabling
a flexible approach that synergizes the strengths of dense and sparse models while mitigating their
respective weaknesses. As depicted in Figure 7, we divide the emerging schemes into three distinct
categories: Dense-to-Sparse, which entails initiating with dense model training and progressively
transitioning to a sparse MoE configuration [17, 45, 82, 95, 110, 117, 154, 165]; Sparse-to-Dense,
which involves degrading a sparse MoE model to a dense form that is more conducive to hardware
implementation for inference [16, 68, 170]; and Expert Models Merging, a process of integrating
multiple pre-trained dense expert models into a unified MoE model [89, 145, 158].
4.4.1 Dense-to-Sparse. To mitigate the training overhead associated with vision MoE transformer
models, the Residual Mixture-of-Experts (RMoE) approach [165] commences with training a dense,
non-MoE model on the upstream task, followed by an efficient fine-tuning stage to transition
into a MoE model. This process reveals that directly inheriting expert weights from a pre-trained
non-MoE model’s FFN can lead to suboptimal performance, necessitating an alignment between
the MoE and FFN outputs during the fine-tuning phase. Similarly, Dua et al. [45] advocate for
initially training a dense model, subsequently introducing sparsity by incorporating a randomly


initialized gating module, and further training the model’s feed-forward layers under sparsity
conditions—specifically, by updating the weights locally within each compute node rather than
averaging gradients across nodes.
Nie et al. [110] present EvoMoE, an efficient end-to-end MoE training framework. EvoMoE decou-
ples the joint learning of experts and the sparse gate, emphasizing the acquisition of foundational
knowledge through a single expert at the inception of training. Subsequently, it spawns multiple
diverse experts and advances the diversification of experts by training with the novel Dense-to-
Sparse gate (DTS-Gate). The DTS-Gate initially operates as a dense activation of all experts, then
progressively and adaptively constricts to route tokens to a reduced number of experts. A similar
strategy is employed in the development of the MoE-LLaVA [95] large vision-language model,
which commences with a dense model, subsequently multiplies the feedforward network (FFN) to
create expert initializations, and proceeds to train exclusively the MoE layers, while keeping the
remaining model components static.
Komatsuzaki et al. [82] highlight the efficiency of sparse models in terms of quality and com-
putational cost, yet acknowledge their significant data requirements and the expense of training
from scratch at scale. To address this, they introduce a scheme termed "sparse upcycling," which
leverages pre-existing training investments by initializing a sparsely activated MoE model from
a pre-trained dense checkpoint. This involves transferring all parameters—and optionally their
associated optimizer states—from the original checkpoint, with the exception of the MoE gating
network parameters, which are not present in the dense model. Notably, the new MoE layers are
populated with identical copies of the original dense model’s FFN layers, and the gating mechanism
weights are initialized randomly. A critical obstacle in model upcycling is the initial performance
decrease resulting from structural modifications to a trained network. To mitigate this performance
regression during upcycling, the researchers propose normalizing the gate combine weights for
each token to 1. This approach is grounded in the notion that, in the dense model, each token
was processed by a singular "expert" FFN. While this normalization proved beneficial for upcycled
vision models, it was found to be detrimental to the performance of upcycled language models.
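A minimal sketch of the sparse upcycling initialization described above is shown below: each expert in the new MoE layer starts as an identical copy of the pre-trained dense FFN, while the gate, which has no dense counterpart, is initialized randomly. The helper name and its return structure are hypothetical simplifications rather than the released recipe.

```python
import copy
import torch.nn as nn

def upcycle_ffn_to_moe(dense_ffn: nn.Module, d_model: int, num_experts: int):
    """Sparse-upcycling sketch: experts are identical copies of the pre-trained
    dense FFN; only the gating network is freshly (randomly) initialized."""
    experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
    gate = nn.Linear(d_model, num_experts, bias=False)  # no dense counterpart exists
    return experts, gate
```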
Building upon the sparse upcycling technique [82], the Skywork-MoE model [154] leverages
the foundational architecture of its pre-developed Skywork-13B model [163], adopting its dense
checkpoints as a foundation for initial states. Their empirical evidence indicates that the decision
between sparse upcycling and training from scratch should be informed by both the performance of
available dense checkpoints and the MoE-specific training resources, as models trained from scratch
consistently surpass their upcycled counterparts in performance. The study observes a decline in
average expert similarity throughout the training of upcycled MoEs, suggesting a diversification of
experts emerges during the process. Importantly, the Skywork-MoE analysis reveals that models
with greater expert similarity tend to underperform, establishing expert similarity as a potential
diagnostic tool during MoE training for upcycled models. Conversely, the expert similarity in
models trained from scratch remains minimal, implying that non-uniform expert initialization
promotes diversification.
Pan et al. [117] posit that the parameter inefficiency observed in MoE models stems from
conventional sparse training methodologies, where only a select group of experts is engaged and
refined for each input token. To counteract this, they introduce a hybrid framework for MoE models,
denoted as DS-MoE, which integrates dense training (activating all experts) with sparse inference
(sparse expert activation) to achieve higher computation and parameter efficiency. Notably, DS-MoE
maintains activation for all self-attention experts (MoA [182]) during inference but selectively
activates FFN experts, reflecting the observation that self-attention layers manifest considerably
less sparsity compared to FFN layers.


Chen et al. [17] introduce SMoE-Dropout, an innovative plug-and-play training framework,


which initially modularizes the FFN into a sequence of smaller FFNs and then employs a random policy
parameterized by fixed weights to route tokens to the k experts with the largest responses. Progressively,
the framework activates an increasing number of experts, preventing overfitting to the amount of
network capacity used during training. MoEfication [184] investigates various strategies for
expert construction in T5 models, including random splitting, parameter clustering, and building
co-activation graphs. MoEBERT [199] implements an importance-based method for adapting
FFNs into experts within BERT models. LLaMA-MoE [149] conducts an extensive examination
of different expert construction methodologies, ultimately proposing a straightforward random
division approach that partitions the parameters of FFNs into non-overlapping experts.
4.4.2 Sparse-to-Dense. Switch Transformer [49] studies the distillation of large sparse models
into smaller dense counterparts to achieve parameter efficiency for deployment. The study reports
that initializing the corresponding weights of the dense model from the non-expert layers of the MoE model
modestly enhances performance, facilitated by the consistent dimension of non-expert layers.
Furthermore, an improvement in distillation is observed when employing a mixture of 0.25 for the
teacher probabilities and 0.75 for the ground truth label. Leveraging both methods, the distillation
preserves approximately 30% of the sparse model’s quality gains using only about 5% of the
parameters. Similarly, Xue et al. [170] address the challenges of overfitting, deployment difficulty,
and hardware constraints associated with sparse MoE models. Drawing inspiration from human
learning paradigms, they propose a new concept referred to as ’knowledge integration’ aimed at
creating a dense student model (OneS) that encapsulates the expertise of a sparse MoE model. Their
framework first implements knowledge gathering, explored through a variety of methods such
as summation, averaging, top-𝑘 Knowledge Gathering, and their Singular Value Decomposition
Knowledge Gathering. Then, they refine the dense student model by knowledge distillation to
mitigate noise introduced from the knowledge gathering. The OneS model retains 61.7% of the
MoE’s benefits on ImageNet and 88.2% on NLP datasets. Further investigations into MoE model
distillation are also conducted by other researchers [29, 121].
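As a sketch of the sparse-to-dense distillation recipe reported for Switch Transformer, the snippet below mixes the sparse teacher's probabilities (weight 0.25) with the one-hot ground-truth labels (weight 0.75) and trains the dense student against the resulting soft targets; the function names and the use of KL divergence as the student loss are illustrative assumptions, not the original implementation.

```python
import torch
import torch.nn.functional as F

def distillation_targets(teacher_logits: torch.Tensor, labels: torch.Tensor,
                         teacher_weight: float = 0.25) -> torch.Tensor:
    """Convex combination of the sparse teacher's probabilities (0.25) and the
    one-hot ground-truth labels (0.75)."""
    num_classes = teacher_logits.size(-1)
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    one_hot = F.one_hot(labels, num_classes).to(teacher_probs.dtype)
    return teacher_weight * teacher_probs + (1.0 - teacher_weight) * one_hot

def distillation_loss(student_logits, teacher_logits, labels):
    # Train the dense student against the mixed soft targets.
    targets = distillation_targets(teacher_logits, labels)
    return F.kl_div(F.log_softmax(student_logits, dim=-1), targets, reduction="batchmean")
```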
Chen et al. [16] highlight the challenges associated with deploying MoE models on resource-
constrained platforms, such as cloud or mobile environments. Observing that only a fraction of
experts contribute significantly to MoE fine-tuning and inference, they propose a method for the
progressive elimination of non-essential experts. This approach retains the advantages of MoE
while simplifying the model into a single-expert dense structure for the target downstream task.
Similarly, ModuleFormer [140] applies a comparable pruning technique, removing task-unrelated
experts for a lightweight deployment. Huang et al. [68] separate the training and inference stages
for Vision Transformers (ViTs). They substitute certain FFNs in the ViT with custom-designed,
efficient MoEs during training. These MoEs assign tokens to experts using a random uniform
partition and incorporate Experts Weights Averaging (EWA) on these MoEs at the end of each
iteration. After training, the MoEs are converted back to FFNs through averaging of expert weights,
thus reverting the model to its original dense ViT for inference.
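The conversion from MoE back to a dense FFN by weight averaging, as in the EWA-style scheme above, can be sketched as follows; it assumes all experts share an identical architecture and simply averages corresponding parameters, which is a simplification of the published procedure.

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def average_experts_to_ffn(experts: nn.ModuleList) -> nn.Module:
    """Collapse an MoE layer into a single dense FFN by averaging the corresponding
    parameters of all experts (assumes identical expert architectures)."""
    dense = copy.deepcopy(experts[0])
    for name, param in dense.named_parameters():
        stacked = torch.stack([dict(e.named_parameters())[name] for e in experts])
        param.copy_(stacked.mean(dim=0))
    return dense
```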
4.4.3 Expert Models Merging. Li et al. [89] introduce the Branch-Train-Merge (BTM) algorithm, a
method for the communication-efficient training of language models (LMs). BTM independently
trains a set of expert LMs (ELMs), each tailored to a specific domain within the training corpus, such
as scientific or legal text. These ELMs, which operate without shared parameters, can be ensembled
or parameter-averaged at inference to coalesce into a singular LM. Expanding on this concept,
Sukhbaatar et al. [145] present Branch-Train-MiX (BTX), designed to combine the strengths of
BTM and Mixture-of-Experts while addressing their respective limitations. BTX maintains separate
training for multiple expert LLMs, akin to BTM, but subsequently integrates these experts within a


unified MoE model. Specifically, it consolidates the FFNs from all ELMs into a singular MoE module
at each layer, with a gating network determining the appropriate FFN expert for each token. Other
components, such as the self-attention layers from ELMs, are merged by averaging their weights.
The resulting model then undergoes MoE-finetuning on all the combined data to enable the gate to
effectively mix the FFN experts.
Wang et al. [158] point out that while the emergence of Foundation Models has made it easier to obtain
expert models tailored to specific tasks, the heterogeneity of data at test time necessitates more
than a single expert. Accordingly, they explore the Fusion of Experts (FoE) challenge, which aims
to integrate outputs from expert models that provide diverse but complementary insights into the
data distribution, formulating it as an instance of supervised learning.

4.5 Derivatives
Building upon the principles of algorithm design highlighted earlier, numerous studies have drawn
inspiration from the Mixture of Experts (MoE) framework, proposing a range of MoE variants.
We categorize these innovative models as derivatives of the MoE. For instance, Xue et al. [171]
introduced WideNet, an approach that increases model width by substituting the feed-forward
network (FFN) with an MoE layer while maintaining shared trainable parameters across Transformer
layers, except for the normalization layers. Subsequently, Tan et al. [146] presented the Sparse
Universal Transformer (SUT), an efficient enhancement of the Universal Transformer, which
is characterized by parameter-sharing across its layers. SUT incorporates a Sparse Mixture of
Experts and a novel stick-breaking-based dynamic halting mechanism, thus reducing computational
complexity without compromising parameter efficiency or generalization capabilities. Moreover,
the traditional MoE models often employ discrete matching between experts and tokens [6, 44,
49, 86, 135, 193, 197], a practice associated with training instability and uneven expert utilization.
Addressing these challenges, Antoniak et al. [5] innovatively propose the Mixture of Tokens (MoT),
which blends tokens from different examples before presenting them to the experts. Thus, MoT
enables the model to benefit from a wider array of token-expert combinations.
Recently, the MoE’s principle of assigning specialized knowledge to individual experts has been
adapted to parameter-efficient fine-tuning (PEFT). Choi et al. [23] propose the sparse mixture-of-
prompts (SMoP), a method that utilizes a gating mechanism to train multiple short soft prompts,
each adept at processing distinct subsets of data. This addresses the inefficiencies encountered with
long soft prompts during prompt tuning. The MoE framework has also been integrated into lifelong
learning (LLL), which seeks to facilitate continuous learning from an ongoing stream of data. The
Lifelong-MoE model [19] dynamically expands model capacity by adding experts with regularized
pretraining, effectively mitigating the issue of catastrophic forgetting [81] typically associated with
straightforward fine-tuning. In a recent development, the MoE concept of conditional computation
has been further refined to optimize resource allocation in transformer-based language models
(LMs). The Mixture-of-Depths (MoD) [124] employs a binary gating network to decide whether
a token should be processed by a given Transformer layer. As a result, MoD transformers can
dynamically allocate computational resources (FLOPs) to specific sequence positions, achieving a
lower overall FLOP footprint compared to vanilla or MoE-based transformers.
In summary, the evolution of MoE derivatives reveals a trend in which models either adopt the
conditional computation aspect of the gating mechanism or merge the MoE structure with other tasks
by assigning specialized knowledge to individual experts, as in the aforementioned prompt tuning [23]
and lifelong learning [19] approaches, demonstrating the versatility and adaptability of the MoE
architecture across different domains.


[Figure 8 shows four subfigures illustrating (a) Data + Expert Parallelism, (b) Data + Expert + Tensor Parallelism, (c) Data + Expert + Pipeline Parallelism, and (d) Expert + Tensor Parallelism.]

Fig. 8. Schematic depiction of diverse parallel strategies for MoE. For clarity and conciseness, this illustration
omits some All-to-All, All-Reduce, Point-to-Point communication within parallelism, and Normalization,
Encode, Decode, Gate in subfigures (b), (c), and (d).

5 SYSTEM DESIGN OF MIXTURE OF EXPERTS


While Mixture of Experts (MoE) has been increasingly leveraged to enhance the capabilities of
large language models, its adoption introduces new challenges to existing training and inference
systems, due to the inherently sparse and dynamic nature of its computational workload. Since the
introduction of expert parallelism by GShard[86], which implements parallel gating and expert
computation by dispatching partitioned local tokens with the load balancing limit of expert capacity,
this paradigm has emerged as a fundamental strategy to facilitate the efficient scaling of MoE models.
This approach can be viewed as an augmentation of data parallelism [122, 123, 126], where each
expert in an MoE layer is assigned to a distinct device, while all non-expert layers are duplicated
across devices. As depicted in Figure 8(a), the process flow of expert parallelism consists of the
following sequential operations: gate routing, input encode, All-to-All dispatch, expert computation,
All-to-All combine, and output decode. In general, the input size for GEMM needs to be large enough
to achieve optimal utilization and throughput on the computing device. Therefore, input
encode is employed to aggregate the input tokens assigned to the same expert into a contiguous memory space,
as determined by the token-expert mapping from gate routing. Subsequently, the All-to-All dispatch
is employed to send the input tokens to their corresponding experts across the distributed devices.
Following the localized computation by the experts, the inverse process - All-to-All combine and
output decode, reinstates the original data layout according to the gating indices.
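To make the gate routing, encode, All-to-All dispatch, expert computation, All-to-All combine, and output decode pipeline concrete, the following is a deliberately simplified sketch using `torch.distributed`. It assumes an already-initialized process group on a backend that supports `all_to_all`, one expert per rank, top-1 gating, and no expert-capacity limit, and it ignores device placement; it is not the GShard implementation, only an illustration of the data flow.

```python
import torch
import torch.distributed as dist

def expert_parallel_forward(x, gate_logits, local_expert, world_size):
    """Simplified expert-parallel MoE forward pass, one expert per rank, top-1 gating."""
    # 1. Gate routing: each token picks the rank hosting its top-1 expert.
    dest = gate_logits.argmax(dim=-1)                                  # [num_tokens]
    order = torch.argsort(dest)                                        # encode: group tokens by destination
    send = x[order]
    send_counts = torch.bincount(dest, minlength=world_size).tolist()

    # 2. Exchange per-destination token counts so every rank can size its receive buffers.
    recv_counts = [torch.zeros(1, dtype=torch.long) for _ in range(world_size)]
    dist.all_to_all(recv_counts, [torch.tensor([c]) for c in send_counts])
    recv_counts = [int(c.item()) for c in recv_counts]

    # 3. All-to-All dispatch: send each token to the rank owning its expert.
    recv = [torch.empty(c, x.size(-1)) for c in recv_counts]
    dist.all_to_all(recv, [t.contiguous() for t in send.split(send_counts)])

    # 4. Local expert computation on all received tokens.
    out = local_expert(torch.cat(recv))

    # 5. All-to-All combine: return expert outputs to the ranks that own the tokens.
    back = [torch.empty(c, x.size(-1)) for c in send_counts]
    dist.all_to_all(back, [t.contiguous() for t in out.split(recv_counts)])

    # 6. Output decode: undo the sort so outputs align with the original token order.
    y = torch.empty_like(x)
    y[order] = torch.cat(back)
    return y
```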
Furthermore, the synergy of expert parallelism [49, 69, 102, 142, 186] with other existing parallel
strategies (tensor [108, 141, 143], pipeline [67, 107, 119], sequence parallelism [72, 83, 90]) has
been investigated to enhance the scalability and efficiency of MoE models within large-scale
distributed environments. As shown in Figure 8, we illustrate several examples of hybrid parallelism,
encompassing (b) data + expert + tensor parallelism [49, 69, 121, 142, 179], (c) data + expert + pipeline
parallelism [60, 69, 179], (d) expert + tensor parallelism [154]. It is imperative to recognize that the
choice of distributed parallelism strategies shapes a complex interplay between computation
efficiency, communication overhead, and memory occupation, which is further affected by various hardware
configurations. Consequently, the deployment strategies for practical applications necessitate
nuanced trade-offs and bespoke designs tailored to specific use-case scenarios.


Table 4. Comparative overview of the open-source MoE system frameworks, arranged chronologically by
reference publication date from newest to oldest. We give the count of GitHub stars as of June 2024.
Reference   Affiliation   Optimizations (Computation / Communication / Storage)   Link   GitHub Stars
OpenMoE [172] Colossal-AI ✓ ✓ https://github.com/hpcaitech/ColossalAI 38K
ScatterMoE [147] Mila Quebec ✓ https://github.com/shawntan/scattermoe 140
Megablocks [50] Stanford University ✓ https://github.com/stanford-futuredata/megablocks 1.1K
Tutel [69] Microsoft ✓ ✓ https://github.com/microsoft/tutel 672
SE-MOE [136] Baidu ✓ ✓ ✓ https://github.com/PaddlePaddle/Paddle 21K
HetuMoE [112] Peking University ✓ ✓ https://github.com/PKU-DAIR/Hetu 236
Deepspeed-MoE [121] Microsoft ✓ ✓ https://github.com/microsoft/DeepSpeed 33K
FastMoE [59] Tsinghua University ✓ ✓ https://github.com/laekov/fastmoe 1.4K
Fairseq [6, 115] Meta https://github.com/facebookresearch/fairseq/tree/moe 29K
Mesh-TensorFlow [134] Google https://github.com/tensorflow/mesh 1.6K

In the subsequent discussion, we delineate the challenges introduced by MoE models from com-
putation, communication, and storage aspects, concurrently reviewing existing research addressing
these issues. Table 4 shows an overview of the open-source MoE frameworks.

5.1 Computation
Although MoE is designed to scale model parameters efficiently without increasing computational
demand, it encounters challenges pertaining to computational efficiency. One concern is the
imbalance of computational load across distributed devices employing expert parallelism, which
incurs significant synchronization overhead as the system awaits the processing completion of the
most heavily loaded expert. Such issues are typically addressed through algorithmic strategies, such
as optimized gating mechanisms and expert capacity adjustments, as discussed in the preceding
section. Besides, solutions like SE-MoE [136], Tutel [69], FlexMoE [111] and SmartMoE [179] have
introduced dynamic expert placement strategies to distribute the workload as equally as possible
among devices. Additionally, FasterMoE [60] has implemented a novel dynamic shadowed expert
strategy, replicating experts on multiple devices to mitigate severe load imbalance. These model
placement related strategies impact both computation and communication efficiency.
Another concern is that MoE introduces additional computational overhead through operations
including gate routing, input encode and output decode. Unlike expert computations, which mirror
operations in dense models and benefit from extensive optimization on prevalent hardware such
as GPUs, these MoE operations are characterized by redundant computation and memory move-
ment, resulting in low efficiency on computing devices. Therefore, recent studies like DeepSpeed-
MoE[121], FastMoE [59], HetuMoE [112] and Tutel [69] have focused on the development of tailored
GPU kernels to enhance the efficiency of MoE operations.
In contexts where multiple experts are deployed on a single GPU device, MegaBlocks [50]
reformulates MoE computation in terms of block-sparse operations, developing specialized block-
sparse GPU kernels that efficiently handle the dynamic workloads without dropping tokens. Zheng
et al. [187] propose PIT, a deep-learning compiler tailored for dynamic sparsity of MoE, which can
find feasible PIT rules for all the operators within a model and generate optimized GPU kernels for
them. PIT employs a novel tiling mechanism, utilizing the Permutation Invariant Transformation
(PIT), a mathematically proven property, to transform multiple sparsely located micro-tiles into
a GPU-efficient dense tile without changing the computation results, thus achieving both high
GPU utilization and low coverage waste. Despite these advancements, Tan et al. [147] highlight
remaining optimization potential within current MoE frameworks such as MegaBlocks and PIT,
which commence with an initial scatter-to-group data copy that increases memory footprint and
requires a translation of the MoE problem into the sparse matrix format. Although this translation


contributes minimally to computation overhead, it imposes limitations on the transparency and


adaptability of extending MegaBlocks to modules beyond the FFN. To address these issues, Tan et
al. [147] propose ScatterMoE, a MoE implementation designed to effectively minimize the memory
footprint. ScatterMoE leverages ParallelLinear, a linear module capable of executing grouped matrix
operations on scattered groups. This approach yields intermediate representations (e.g., the hidden
states of an SMoE MLP) that are directly accessible as standard PyTorch tensors, allowing for easy
extensions of MoE methods to other types of expert modules.

5.2 Communication
In expert parallelism, All-to-All communication is invoked four times per MoE layer (dispatch and
combine, in both the forward and backward passes), causing significant overhead that can even
emerge as the primary constraint on efficiency. The All-to-All communication paradigm encom-
passes both intra-node (via PCIe, pre-4th-generation NVLink) and inter-node (Ethernet, Infiniband,
4th-generation NVLink) communication channels. The efficiency of such communication is con-
tingent upon a multitude of factors, including the heterogeneity of channel bandwidths, network
topology, and the collective communication algorithms. Moreover, load imbalances intrinsic to
MoE may exacerbate these inefficiencies by inducing synchronization delays.
To optimize the use of high intra-node bandwidth and low inter-node bandwidth, DeepSpeed-
MoE [121] and HetuMoE [112] have introduced a hierarchical All-to-All communication strategy
that enhances intra-node process and reduces inter-node data exchanges. Besides, FasterMoE
[60], TA-MoE [13] and SE-MoE [136] have introduced topology-aware routing strategies aimed
at mitigating cross-node expert selection, thereby reducing inter-node communication burdens.
Additionally, ExFlow [174] exploits expert affinity, anticipating expert allocation across layers to
maximize the retention of token processing within local GPU confines. The strategic allocation
of experts to minimize network traffic and leverage high-bandwidth connections is a prevalent
approach in distributed MoE system [121, 142, 154]. And this is often integrated with the placement
design of non-expert modules to optimize overall system performance.
Given the concurrent feature of communication and computation, pipelining [67, 107, 119]
is commonly employed to overlap their execution, thereby reducing the total time cost. This
technique, which is integrated in systems such as Tutel [69], FasterMoE [60], and MPipeMoE [185],
orchestrates overlapping between All-to-All communication and expert computation. Notably,
Lancet [75] underscores the inherent constraints of these pipelining methods, particularly the
bounded duration for which expert computation and communication can overlap. To address this
limitation, Lancet partitions non-MoE computations and integrates them into the pipeline during
forward pass, and strategically schedules gradient weight computations to augment overlap in the
backward pass. With the same objective of extending the overlap duration, ScMoE [12] restructures
the MoE architecture to simultaneously process representations from preceding layers while
engaging with current-layer representations. This decoupling of communication dependencies
facilitates substantial, and in certain cases, complete overlapping between communication and
computation. Snowflake Arctic [152] employs a similar design, utilizing a Dense-MoE hybrid
transformer architecture to effectively overlap communication with computation.

5.3 Storage
The ever-increasing parameters in MoE models exacerbate the constraints posed by memory
capacity in compute devices, a challenge already pronounced in dense models. While expert
parallelism offers a mitigation strategy through the distribution of experts across multiple devices,
individual devices may still struggle to accommodate numerous experts, particularly in inference


contexts where device capacity—such as that of edge devices (PCs, smartphones, IoTs)—is inherently
more restricted.
Considering the hierarchical storage pyramid, solutions like SE-MoE [136], Pre-gated MoE
[70], and EdgeMoE [176] selectively retain only essential non-expert parameters and the active
expert parameters within the GPU’s High-Bandwidth Memory (HBM), offloading inactive expert
parameters to CPU memory or SSDs. These patterns incur additional overhead from data transfer
across the storage hierarchy, thus they integrate expert selection forecasting and expert parameter
prefetching techniques to overlap parameter access with computation.
In addition, MPipeMoE [185] introduces a strategy to reduce the memory overhead associated
with activations and temporary buffers. This is achieved by sharing buffers across various partitions
of tensors, while leveraging recomputation/communication and CPU offloading to recover the
requisite activations in the backward pass.

6 APPLICATIONS OF MIXTURE OF EXPERTS MODELS


In the current landscape dominated by Transformer-based large language models (LLMs), the
mixture of experts (MoE) paradigm offers a compelling method to significantly expand model
capacity while avoiding a corresponding surge in computational demands during training and
inference phases. These models have been instrumental in enhancing the performance of LLMs
across a spectrum of downstream tasks, with some applications achieving results that eclipse
human performance [30, 48, 74]. Rumors suggest that the formidable GPT-4 may employ an MoE
architecture with an array of 8 × 220B experts, trained on diverse datasets and tasks, and utilizing a
16-iteration inference process 1 . Given these, MoE has garnered widespread adoption across fields
such as natural language processing, computer vision, recommender systems, and multimodal
applications. The essence of these applications lies in leveraging conditional computation to
significantly boost the number of model parameters, thereby augmenting model capacities with a
fixed computational cost, or implementing dynamic expert selection through gating mechanisms
for efficient multi-task learning. In the following, we will explore several representative applications
of MoE in various domains, to provide a overall understanding of how MoE can be utilized to
specific tasks.
Natural Language Processing. The integration of MoE architectures with LLMs has unlocked
extraordinary capabilities in a range of natural language understanding (NLU) and generation
(NLG) tasks, including machine translation [29, 135], open-domain question answering [6, 44], code
generation [30, 74, 150, 154], and mathematical problem-solving [30, 36, 74, 150]. The methods
of integrating MoE into LLMs have been thoroughly discussed and analyzed in the preceding
algorithm design section 4 and system design section 5, and will not be reiterated in depth here.
Computer Vision. The great success of sparsely-gated Mixture of Experts networks (MoE) in
NLP has inspired their application in computer vision. For example, Riquelme et al. [128] introduced
Vision MoE (V-MoE), which incorporates a sparsely activated mixture of MLPs into select ViT
[41] blocks. In image recognition tasks, V-MoE rivals the performance of state-of-the-art networks
while requiring substantially less computational power during inference. This demonstrates the
potential of MoE to discern distinct image semantics through specialized experts. Hwang et al.
[69] develop Tutel, a scalable stack design and implementation for MoE with dynamic parallelism
and pipelining, which they demonstrate with SwinV2-MoE, built upon Swin Transformer V2 [98].
Moreover, Zhang et al. [183] explore adversarial robustness in CNN-based MoE models, proposing
a novel router-expert alternating adversarial training framework called ADVMOE. In most recent
work, Chowdhury et al. [25] introduce the concept of patch-level routing in MoE (pMoE) that
1 https://x.com/soumithchintala/status/1671267150101721090


segments each input image into 𝑛 patches (or tokens) and allocates 𝑙 patches (𝑙 ≪ 𝑛) to each expert
for processing through prioritized routing to enhance efficiency.
Recommender System. Recommender systems are quintessential in various large-scale ap-
plications where they are required to balance and optimize multiple objectives simultaneously
[188]. A prime example is in the domain of movie recommendations, where the aim is not only
to suggest movies that align with users' immediate preferences but also to ensure subsequent user
satisfaction for the selected movies [101]. The effectiveness of multi-task models hinges on the
intricate interplay between task-specific goals and the relationships between tasks. Consequently,
understanding the trade-offs inherent in these relationships is crucial. Mixture-of-experts (MoE)
models with gating mechanisms have emerged as a popular paradigm for tackling the complex-
ities of multi-task learning in recommender systems. Ma et al. [101] introduce the multi-gate
mixture-of-experts (MMOE) approach, which capitalizes on the concept of shared expert submod-
els across all tasks, guided by a gating network tailored to each individual task. Addressing the
“seesaw phenomenon” where the improvement of one task’s performance can detrimentally affect
another is another challenge in multi-task learning. To counteract this, Tang et al. [148] propose
the Progressive Layered Extraction (PLE) model for personalized recommendations. PLE distinctly
segregates shared and task-specific components and employs a progressive routing mechanism
to incrementally extract and refine the semantic knowledge, thereby enhancing the efficacy of
joint representation learning and the routing of information across tasks. Recently, in the pursuit
of capturing both the long-term and short-term user preferences that are particularly salient in
sequential recommendation scenarios, a novel method named AdaMCT [77] has been proposed.
AdaMCT utilizes layer-aware adaptive mixture units to dynamically blend CNN and Transformer
experts, thereby tailoring the recommendations to individual user patterns.
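A minimal sketch of the multi-gate mixture-of-experts idea is given below: all tasks share a pool of expert networks, while each task owns its gate and prediction tower. The layer sizes and the two-task setup are illustrative assumptions rather than the configuration used in [101].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMoE(nn.Module):
    """Multi-gate mixture-of-experts sketch: shared experts, one gate and tower per task."""
    def __init__(self, d_in: int, d_expert: int, num_experts: int, num_tasks: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_in, d_expert), nn.ReLU()) for _ in range(num_experts)
        )
        self.gates = nn.ModuleList(nn.Linear(d_in, num_experts, bias=False) for _ in range(num_tasks))
        self.towers = nn.ModuleList(nn.Linear(d_expert, 1) for _ in range(num_tasks))

    def forward(self, x):
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)   # [B, E, d_expert]
        outputs = []
        for gate, tower in zip(self.gates, self.towers):
            w = F.softmax(gate(x), dim=-1).unsqueeze(-1)                # per-task expert weights [B, E, 1]
            outputs.append(tower((w * expert_out).sum(dim=1)))          # task-specific prediction
        return outputs                                                   # one output per task

# Example: two tasks (e.g., click and post-click satisfaction) over a 128-d feature vector.
preds = MMoE(d_in=128, d_expert=64, num_experts=4, num_tasks=2)(torch.randn(32, 128))
```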
Multimodal Applications. Multimodal models are designed to process and integrate various
data types within a single neural network framework [109]. These models often simultaneously
encompass two primary data modalities: images and text [7, 156, 191]. The Mixture of Experts (MoE)
architecture has gained considerable traction as the foundation of multimodal models due to its
capacity for expert layers to learn distinct modality partitioning [106]. One notable implementation
of this approach is the LIMoE model [106], a sparse mixture of expert models tailored for multimodal
learning. LIMoE is trained on both images and text data, employing contrastive loss and an entropy-
based regularization technique to address load balancing challenges inherent in MoE systems.
Subsequently, Shen et al. [138] and Lin et al. [95] have further investigated the potential of MoE
for scaling vision-language models, offering valuable insights that contribute to the development
of more efficient and effective multimodal learning systems. Furthermore, to address the specific
issue of task conflicts in instruction tuning of Large Vision-Language Models (LVLMs), MoCLE
[53] integrates MoE with LoRA [66] experts and a distinct universal expert to activate task-specific
model parameters based on clusters of instructions. In parallel, to mitigate data conflicts, LLaVA-
MoLE [15] deploys a set of LoRA experts, specifically for the MLP layer, combined with a top-1
gating mechanism to refine instruction tuning in Multimodal Large Language Models (MLLMs).
While MLLMs employing MoE architectures have demonstrated impressive performance,
they generally involve a limited number of experts and modalities [92]. To address this limitation,
Li et al. [92] introduce the pioneering Uni-MoE, a unified MLLM with MoE architecture capable
of managing an extensive range of modalities. They introduce a progressive training strategy to
bolster expert collaboration and generalization across modalities, and they utilize LoRA [66], a
lightweight fine-tuning methodology, to minimize computational demands.
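As an illustration of this family of designs, the sketch below augments a frozen linear projection with a small set of LoRA experts selected by a top-1 router, in the spirit of LLaVA-MoLE [15]. The rank, the number of experts, and the per-token routing granularity are assumptions chosen for brevity rather than details of the cited systems.

import torch
import torch.nn as nn

class Top1LoRAMoE(nn.Module):
    """Frozen base projection plus a top-1-routed set of LoRA experts (illustrative sketch)."""
    def __init__(self, d_model=64, rank=8, n_experts=4):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)
        for p in self.base.parameters():
            p.requires_grad_(False)                    # the pretrained weights stay frozen
        self.router = nn.Linear(d_model, n_experts)
        self.lora_A = nn.Parameter(torch.randn(n_experts, d_model, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(n_experts, rank, d_model))

    def forward(self, x):                              # x: (tokens, d_model)
        logits = self.router(x)
        expert = logits.argmax(dim=-1)                 # top-1 expert index per token
        gate = torch.softmax(logits, dim=-1).gather(-1, expert.unsqueeze(-1))
        delta = torch.einsum('td,tdr->tr', x, self.lora_A[expert])
        delta = torch.einsum('tr,trd->td', delta, self.lora_B[expert])
        return self.base(x) + gate * delta             # base output plus the gated low-rank update

out = Top1LoRAMoE()(torch.randn(16, 64))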

7 CHALLENGES & OPPORTUNITIES


Mixture of Experts (MoE) models present a compelling approach for significantly increasing model
capacity at a constant computational cost. Despite their promise, several intrinsic challenges remain.
In this section, we identify critical challenges and promising directions for future investigation as
follows:
Training Stability and Load Balancing. MoE models that utilize sparse gating have become a
popular means to expand model capacity without proportionally increasing computational demands.
However, the discrete nature of assigning a fixed number of experts to tokens leads to significant
challenges in maintaining balanced expert workloads and training stability across varying
inputs [5, 49, 86, 135, 193]. Load imbalances, where certain experts become over-utilized while
others are underutilized, can hinder expert specialization and further degrade model performance.
Although current efforts [30, 44, 49, 74, 86, 94, 154] have attempted to address this challenge by
incorporating auxiliary loss functions to encourage even token distribution across experts, these
solutions can still lead to training instability [197] and often neglect the relative importance of
different tokens [193]. Therefore, future studies should focus on more effective regularization
techniques [197] or innovative gating algorithms [5, 105, 189, 193] that encourage equitable load
distribution among experts and enhance model training stability.
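A representative form of such an auxiliary loss, in the style of Switch Transformers [49], is the scaled dot product between the fraction of tokens dispatched to each expert and the mean router probability assigned to it, which is minimized when tokens are spread uniformly. The sketch below assumes top-1 routing and an illustrative loss coefficient.

import torch

def load_balancing_loss(router_logits, top1_idx, n_experts, alpha=0.01):
    """Switch-style auxiliary loss: alpha * N * sum_i f_i * P_i, assuming top-1 routing."""
    probs = torch.softmax(router_logits, dim=-1)                    # (tokens, n_experts)
    # f_i: fraction of tokens dispatched to expert i (non-differentiable count).
    f = torch.bincount(top1_idx, minlength=n_experts).float() / top1_idx.numel()
    # P_i: mean router probability assigned to expert i (differentiable).
    P = probs.mean(dim=0)
    return alpha * n_experts * torch.dot(f, P)                      # minimal when the load is uniform

logits = torch.randn(1024, 8)                                       # 1024 tokens, 8 experts
aux_loss = load_balancing_loss(logits, logits.argmax(dim=-1), n_experts=8)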
Scalability and Communication Overhead. As the escalating sizes of LLMs with MoE necessi-
tate more expansive distributed systems, the imperative for efficient communication during model
training becomes increasingly critical, as elaborated in Section 5.2. The trade-off between model
complexity, indicated by the number of parameters, and the communication overhead represents
a significant bottleneck in distributed training processes [86]. To address these challenges, it is
essential to develop and implement effective strategies that enhance the efficiency of information
transfer from the system perspective, or that streamline information exchange without compromising
model performance from the algorithm perspective. Innovations such as DeepSpeed [121], FasterMoE [60], and
ScMoE [12] are at the forefront of minimizing communication overhead. For example, the shared
expert approach [30, 121, 151, 152], advancing MoE with parameter-sharing frameworks, holds
promise for reducing the volume of data transmitted between distributed systems while concur-
rently enhancing model performance in natural language processing tasks. Such innovations are
pivotal in facilitating more scalable and efficient distributed training architectures for MoE models.
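As a rough illustration of the magnitude of this overhead, the back-of-envelope sketch below estimates the all-to-all traffic of a single MoE layer under expert parallelism. The token count, hidden size, precision, and top-k value are illustrative assumptions, and real systems further overlap, pipeline, or compress this traffic.

# Rough all-to-all traffic per MoE layer under expert parallelism (illustrative numbers).
tokens_per_device = 8192          # batch_size * sequence_length on one device
hidden_dim = 4096
bytes_per_elem = 2                # bf16 activations
top_k = 2                         # each token is dispatched to k routed experts

# Dispatch sends each token's hidden state to its experts; combine sends the results back.
dispatch_bytes = tokens_per_device * top_k * hidden_dim * bytes_per_elem
combine_bytes = dispatch_bytes
print(f"~{(dispatch_bytes + combine_bytes) / 1e9:.2f} GB of all-to-all traffic per layer")
# Replicated shared experts are computed locally, so their share of the computation never enters
# the all-to-all, which is one way parameter-sharing designs cut communication volume.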
Expert Specialization and Collaboration. Expert specialization refers to the concept where
each expert develops non-overlapping and focused knowledge. Encouraging experts to concentrate
their skills on distinct sub-tasks or domains has been shown to enhance the performance and
generalization of the MoE model. The prevailing strategy involves designating a select number
of experts as shared ones, with the goal of capturing commonalities in knowledge and reducing
redundancy among those experts that are routed dynamically [30, 121, 151, 172]. However, fostering
effective collaboration among these specialized experts is an ongoing challenge. Relying solely on
a sparsely computed weighted sum of outputs from the top-𝑘 experts can overlook the intricate
internal relationships that exist across the entire set of experts. Consequently, exploring new mechanisms
for enhancing both the specialization and collaboration among experts is crucial for the development
of more integrated and powerful MoE models.
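The sketch below outlines this shared-plus-routed pattern in simplified form, loosely following the formulation of DeepSeekMoE [30]: shared experts process every token, while the remaining experts are selected per token by top-k gating. The expert counts, hidden sizes, and the dense loop over routed experts are simplifications for clarity.

import torch
import torch.nn as nn

class SharedRoutedMoE(nn.Module):
    """Shared experts applied to every token plus top-k routed experts (illustrative sketch)."""
    def __init__(self, d_model=64, d_ff=128, n_shared=1, n_routed=8, top_k=2):
        super().__init__()
        make_ffn = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.shared = nn.ModuleList([make_ffn() for _ in range(n_shared)])
        self.routed = nn.ModuleList([make_ffn() for _ in range(n_routed)])
        self.gate = nn.Linear(d_model, n_routed)
        self.top_k = top_k

    def forward(self, x):                                     # x: (tokens, d_model)
        out = sum(e(x) for e in self.shared)                  # shared experts capture common knowledge
        scores = torch.softmax(self.gate(x), dim=-1)
        w, idx = scores.topk(self.top_k, dim=-1)
        sparse_w = torch.zeros_like(scores).scatter(-1, idx, w)  # only the top-k weights are non-zero
        for e, expert in enumerate(self.routed):              # dense loop for clarity; real systems
            out = out + sparse_w[:, e:e + 1] * expert(x)      # dispatch only the selected tokens
        return out

y = SharedRoutedMoE()(torch.randn(32, 64))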
Sparse Activation and Computational Efficiency. One of the primary benefits of MoE models
lies in their capacity for sparse activations, which theoretically enhances computational efficiency.
Nevertheless, realizing this efficiency in practice poses substantial challenges. This is attributed
to the non-uniformity of sparse operations within hardware accelerators [32, 85]. Furthermore,
optimizing how a select top-𝑘 subset of experts is activated from the entire pool entails intricate
coordination. This optimization is crucial for ensuring that each expert
develops a specialized niche [30]. Thus, there is a pressing need for further research into hardware
optimization techniques that more adeptly accommodate sparse computations. Such advancements
would not only preserve the model’s capacity but could also significantly enhance the performance
and efficiency of MoE models.
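One common way to map sparse expert computation onto dense-friendly kernels is to permute tokens so that all tokens assigned to the same expert become contiguous, letting each expert run a single dense batched matrix multiplication. The sketch below illustrates only this grouping step; the function and sizes are illustrative rather than drawn from a specific system.

import torch

def group_tokens_by_expert(x, expert_idx, n_experts):
    """Sort tokens by assigned expert so each expert runs one dense matmul (illustrative)."""
    order = torch.argsort(expert_idx)                 # permutation that groups tokens per expert
    counts = torch.bincount(expert_idx, minlength=n_experts)
    grouped = x[order]                                # contiguous slices, one per expert
    return grouped.split(counts.tolist()), order      # keep `order` to un-permute the outputs

x = torch.randn(16, 64)
expert_idx = torch.randint(0, 4, (16,))
per_expert_inputs, order = group_tokens_by_expert(x, expert_idx, n_experts=4)
# Each slice can now be fed to its expert as one dense batch, avoiding scattered sparse kernels.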
Generalization and Robustness. MoE models have demonstrated increased computational
efficiency during pre-training phases. However, there is a notable propensity for sparse MoE
architectures to overfit to specific tasks or datasets, which undermines their ability to generalize
effectively [42, 49, 137, 197]. To enhance the generalization and robustness of MoE models when
encountering unseen data and diverse input variations, various strategies have been explored.
These include regularization techniques such as dropout [49] and token dropping [197], as well as
multi-task instruction tuning [42, 137]. Looking ahead, there is potential for further advancements
in this challenge. Future endeavors could explore innovative regularization methods, refined multi-
task learning frameworks, or the incorporation of meta-learning concepts that bolster the MoE
models’ robustness and extend their generalization capabilities across an even broader spectrum of
downstream tasks.
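As an example of the token-dropping regularization mentioned above, the sketch below applies a capacity factor that bounds how many tokens each expert may process, with overflow tokens bypassing the experts through the residual connection. The capacity factor and the first-come-first-kept policy are illustrative simplifications.

import torch

def keep_within_capacity(expert_idx, n_experts, capacity_factor=1.25):
    """Mark which tokens fit within each expert's capacity; the rest are dropped (sketch)."""
    n_tokens = expert_idx.numel()
    capacity = int(capacity_factor * n_tokens / n_experts)
    keep = torch.zeros(n_tokens, dtype=torch.bool)
    for e in range(n_experts):
        positions = (expert_idx == e).nonzero(as_tuple=True)[0]
        keep[positions[:capacity]] = True          # tokens beyond capacity skip the experts
    return keep                                    # dropped tokens pass through the residual path

keep_mask = keep_within_capacity(torch.randint(0, 8, (1024,)), n_experts=8)
print(int(keep_mask.sum()), "of 1024 tokens processed by experts")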
Interpretability and Transparency. The inherent complexity of MoE models, coupled with
their dynamic gating of inputs to specialized experts, poses significant challenges to interpretability.
This becomes particularly problematic in contexts where comprehending the rationale behind
the model’s decisions is essential. Enhancing the interpretability of MoE models is therefore
critical, not only to facilitate a clearer understanding of their decision-making processes but also to
address underlying challenges such as load balancing [44, 49, 86] and the mitigation of knowledge
redundancy [30, 151]. In light of these considerations, there is a pressing need for future studies
focused on the development of methods and tools that can effectively visualize and explain the
behavior of individual experts within MoE models, as well as the nature of their interactions. Such
advancements would significantly improve our grasp of MoE models and bolster their ongoing
development, ensuring their gating decisions are transparent and trustworthy.
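A lightweight starting point for such analysis is to track simple routing statistics per layer, for instance the per-expert load and the entropy of the routing distribution. The diagnostic below is an illustrative sketch rather than a method proposed in the cited works; skewed loads or near-zero entropy can flag collapsed or redundant experts.

import torch

def routing_diagnostics(router_logits):
    """Per-expert load and average routing entropy, a simple lens on gating behavior (sketch)."""
    probs = torch.softmax(router_logits, dim=-1)                       # (tokens, n_experts)
    load = torch.bincount(probs.argmax(dim=-1), minlength=probs.shape[-1])
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()
    return load, entropy

load, entropy = routing_diagnostics(torch.randn(1024, 8))
print(load.tolist(), float(entropy))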
Optimal Expert Architecture. The design of MoE architectures, encompassing the selection
of network types and the quantity of experts, significantly influences the efficacy of multi-task
learning across various domains. A plethora of network architectures has been adopted as experts,
including LSTM [135], CNN [25, 183], FFNs (MLPs) [49, 86, 117, 197], Attention [140, 182], and LoRA
[43, 88, 100]. Among these, FFNs as experts remain the most prevalent. Despite their considerable
achievements, the exploration of hybrid network types within experts (motivated by the distinct
feature-processing capabilities of different architectures) and the development
of innovative expert architectures remain nascent areas of research. Furthermore, the strategic
allocation of a varying number of experts across different layers of the model presents an area
ripe for investigation. This is due to two primary considerations: 1) different layers of the model
capture semantic information at varying levels of granularity; 2) an excessive number of experts
can complicate the training process and augment computational costs, while an insufficient number
of experts might lead to knowledge redundancy and diminish the specialization capabilities of the
experts. To navigate these challenges, the development of automated architecture search methods
specifically designed for MoE models is imperative [192]. Such approaches could systematically
identify optimal configurations, balancing the trade-offs between computational efficiency and the
specialization of experts.
Integration with Existing Frameworks. Ensuring seamless integration of MoE models into
existing large language models (LLMs) is crucial for their broad adoption. It is particularly vital
to enable adaptation of LLMs to MoE architecture without necessitating training from scratch,
as it can significantly reduce resource consumption. Recent studies [15, 43, 51, 88, 100, 168, 178]
have demonstrated the efficacy of combining Parameter-efficient Fine-tuning (PEFT) techniques
with MoE frameworks, offering a promising method for incorporating MoE into established LLMs.
However, these methods may compromise model performance or complicate the existing parallelization
strategies used in pretraining and inference [57]. Advancing the development of modular and
plug-and-play MoE components is essential. Additionally, optimizing these components for training
and deployment across diverse computing environments and hardware platforms will expand their
applicability. Such advancements are expected to enhance the versatility and efficiency of MoE
models, making them more accessible for a wide range of applications and platforms.
By addressing these challenges, we can unlock the full potential of MoE models, paving the way
for more efficient and powerful machine learning systems, particularly large language models
(LLMs), capable of handling the ever-growing complexity and diversity of real-world tasks.

8 CONCLUSION
In this survey, we present a systematic and comprehensive review of the literature on MoE models,
serving as a valuable compendium for researchers exploring the landscape of MoE technologies. We
introduce a new taxonomy for MoE models and provide an in-depth analysis that encompasses three
distinct vantage points: algorithm design, system design, and practical applications, complemented
by a curated collection of open-source implementations, detailed hyperparameter configurations,
and thorough empirical assessments. Moreover, we highlight the critical challenges faced in the
field and outline the most promising avenues for future investigation. To support the continuous
dissemination of knowledge and advancements, we have established a dedicated resource repository
to facilitate ongoing updates and the sharing of cutting-edge developments in MoE research. We
hope this survey can serve as an essential reference for researchers seeking to rapidly acquaint
themselves with MoE models, and that it will actively contribute to the vibrant progression of this field.

REFERENCES
[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida,
Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774
(2023).
[2] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. 2023.
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. In Proceedings of the
2023 Conference on Empirical Methods in Natural Language Processing. 4895–4901.
[3] Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. 2017. Expert gate: Lifelong learning with a network of
experts. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3366–3375.
[4] Amjad Almahairi, Nicolas Ballas, Tim Cooijmans, Yin Zheng, Hugo Larochelle, and Aaron Courville. 2016. Dynamic
capacity networks. In International Conference on Machine Learning. PMLR, 2549–2558.
[5] Szymon Antoniak, Sebastian Jaszczur, Michał Krutul, Maciej Pióro, Jakub Krajewski, Jan Ludziejewski, Tomasz
Odrzygóźdź, and Marek Cygan. 2023. Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation. arXiv
preprint arXiv:2310.15961 (2023).
[6] Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du,
Srinivasan Iyer, Ramakanth Pasunuru, et al. 2021. Efficient large scale language modeling with mixtures of experts.
arXiv preprint arXiv:2112.10684 (2021).
[7] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2018. Multimodal machine learning: A survey and
taxonomy. IEEE transactions on pattern analysis and machine intelligence 41, 2 (2018), 423–443.
[8] Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. 2015. Conditional computation in neural
networks for faster models. arXiv preprint arXiv:1511.06297 (2015).
[9] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. 2013. Estimating or propagating gradients through stochastic
neurons for conditional computation. arXiv preprint arXiv:1308.3432 (2013).
[10] Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi
Du, Zhe Fu, et al. 2024. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint
arXiv:2401.02954 (2024).
[11] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural
information processing systems 33 (2020), 1877–1901.
[12] Weilin Cai, Juyong Jiang, Le Qin, Junwei Cui, Sunghun Kim, and Jiayi Huang. 2024. Shortcut-connected Expert
Parallelism for Accelerating Mixture-of-Experts. arXiv preprint arXiv:2404.05019 (2024).
[13] Chang Chen, Min Li, Zhihua Wu, Dianhai Yu, and Chao Yang. 2022. Ta-moe: Topology-aware large scale mixture-of-
expert training. Advances in Neural Information Processing Systems 35 (2022), 22173–22186.
[14] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards,
Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv
preprint arXiv:2107.03374 (2021).
[15] Shaoxiang Chen, Zequn Jie, and Lin Ma. 2024. Llava-mole: Sparse mixture of lora experts for mitigating data conflicts
in instruction finetuning mllms. arXiv preprint arXiv:2401.16160 (2024).
[16] Tianyu Chen, Shaohan Huang, Yuan Xie, Binxing Jiao, Daxin Jiang, Haoyi Zhou, Jianxin Li, and Furu Wei. 2022.
Task-specific expert pruning for sparse mixture-of-experts. arXiv preprint arXiv:2206.00277 (2022).
[17] Tianlong Chen, Zhenyu Zhang, AJAY KUMAR JAISWAL, Shiwei Liu, and Zhangyang Wang. 2022. Sparse MoE as the
New Dropout: Scaling Dense and Self-Slimmable Transformers. In The Eleventh International Conference on Learning
Representations.
[18] Tianlong Chen, Zhenyu Zhang, AJAY KUMAR JAISWAL, Shiwei Liu, and Zhangyang Wang. 2023. Sparse MoE as the
New Dropout: Scaling Dense and Self-Slimmable Transformers. In The Eleventh International Conference on Learning
Representations. https://openreview.net/forum?id=w1hwFUb_81
[19] Wuyang Chen, Yanqi Zhou, Nan Du, Yanping Huang, James Laudon, Zhifeng Chen, and Claire Cui. 2023. Lifelong
language pretraining with distribution-specialized experts. In International Conference on Machine Learning. PMLR,
5383–5395.
[20] Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, and Yuanzhi Li. 2022. Towards understanding mixture of experts
in deep learning. arXiv preprint arXiv:2208.02813 (2022).
[21] Zitian Chen, Yikang Shen, Mingyu Ding, Zhenfang Chen, Hengshuang Zhao, Erik G Learned-Miller, and Chuang
Gan. 2023. Mod-squad: Designing mixtures of experts as modular multi-task learners. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. 11828–11837.
[22] Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia
Song, Xian-Ling Mao, et al. 2022. On the representation collapse of sparse mixture of experts. Advances in Neural
Information Processing Systems 35 (2022), 34600–34613.
[23] Joon-Young Choi, Junho Kim, Jun-Hyung Park, Wing-Lam Mok, and SangKeun Lee. 2023. SMoP: Towards Efficient
and Effective Prompt Tuning with Sparse Mixture-of-Prompts. In The 2023 Conference on Empirical Methods in Natural
Language Processing.
[24] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham,
Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways.
Journal of Machine Learning Research 24, 240 (2023), 1–113.
[25] Mohammed Nowaz Rabbani Chowdhury, Shuai Zhang, Meng Wang, Sijia Liu, and Pin-Yu Chen. 2023. Patch-
level routing in mixture-of-experts is provably sample-efficient for convolutional neural networks. In International
Conference on Machine Learning. PMLR, 6074–6114.
[26] Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc,
Blake Hechtman, Trevor Cai, Sebastian Borgeaud, et al. 2022. Unified scaling laws for routed language models. In
International conference on machine learning. PMLR, 4057–4086.
[27] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry
Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint
arXiv:2110.14168 (2021).
[28] Ronan Collobert, Samy Bengio, and Yoshua Bengio. 2001. A parallel mixture of SVMs for very large scale problems.
Advances in Neural Information Processing Systems 14 (2001).
[29] Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi,
Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine
translation. arXiv preprint arXiv:2207.04672 (2022).
[30] Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai
Yu, Y Wu, et al. 2024. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models.
arXiv preprint arXiv:2401.06066 (2024).
[31] Damai Dai, Li Dong, Shuming Ma, Bo Zheng, Zhifang Sui, Baobao Chang, and Furu Wei. 2022. StableMoE: Stable
Routing Strategy for Mixture of Experts. In Proceedings of the 60th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers). 7085–7095.
[32] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashattention: Fast and memory-efficient
exact attention with io-awareness. Advances in Neural Information Processing Systems 35 (2022), 16344–16359.
[33] Do Huu Dat, Po Yuan Mao, Tien Hoang Nguyen, Wray Buntine, and Mohammed Bennamoun. 2023. HOMOE: A
Memory-Based and Composition-Aware Framework for Zero-Shot Learning with Hopfield Network and Soft Mixture
of Experts. arXiv preprint arXiv:2311.14747 (2023).
[34] Databricks. 2024. Introducing DBRX: A New State-of-the-Art Open LLM. https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
[35] Andrew Davis and Itamar Arel. 2013. Low-rank approximations for conditional feedforward computation in deep
neural networks. arXiv preprint arXiv:1312.4461 (2013).
[36] DeepSeek-AI. 2024. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv preprint arXiv:2405.04434 (2024).
[37] Marc Deisenroth and Jun Wei Ng. 2015. Distributed gaussian processes. In International conference on machine
learning. PMLR, 1481–1490.
[38] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[39] Shizhe Diao, Tianyang Xu, Ruijia Xu, Jiawei Wang, and Tong Zhang. 2023. Mixture-of-Domain-Adapters: Decoupling
and Injecting Domain Knowledge to Pre-trained Language Models’ Memories. In The 61st Annual Meeting Of The
Association For Computational Linguistics.
[40] Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min
Chan, Weize Chen, et al. 2022. Delta tuning: A comprehensive study of parameter efficient methods for pre-trained
language models. arXiv preprint arXiv:2203.06904 (2022).
[41] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner,
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words:
Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
[42] Shihan Dou, Enyu Zhou, Yan Liu, Songyang Gao, Jun Zhao, Wei Shen, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Xiaoran
Fan, et al. 2023. The Art of Balancing: Revolutionizing Mixture of Experts for Maintaining World Knowledge in
Language Model Alignment. arXiv preprint arXiv:2312.09979 (2023).
[43] Shihan Dou, Enyu Zhou, Yan Liu, Songyang Gao, Jun Zhao, Wei Shen, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Xiaoran
Fan, et al. 2023. Loramoe: Revolutionizing mixture of experts for maintaining world knowledge in language model
alignment. arXiv preprint arXiv:2312.09979 (2023).
[44] Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou,
Adams Wei Yu, Orhan Firat, et al. 2022. Glam: Efficient scaling of language models with mixture-of-experts. In
International Conference on Machine Learning. PMLR, 5547–5569.
[45] Dheeru Dua, Shruti Bhosale, Vedanuj Goswami, James Cross, Mike Lewis, and Angela Fan. 2022. Tricks for Training
Sparse Translation Models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies. 3340–3345.
[46] David Eigen, Marc’Aurelio Ranzato, and Ilya Sutskever. 2013. Learning factored representations in a deep mixture of
experts. arXiv preprint arXiv:1312.4314 (2013).
[47] Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur
Celebi, Guillaume Wenzek, Vishrav Chaudhary, et al. 2021. Beyond english-centric multilingual machine translation.
Journal of Machine Learning Research 22, 107 (2021), 1–48.
[48] William Fedus, Jeff Dean, and Barret Zoph. 2022. A review of sparse expert models in deep learning. arXiv preprint
arXiv:2209.01667 (2022).
[49] William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with
simple and efficient sparsity. Journal of Machine Learning Research 23, 120 (2022), 1–39.
[50] Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia. 2023. Megablocks: Efficient sparse training with
mixture-of-experts. Proceedings of Machine Learning and Systems 5 (2023).
[51] Chongyang Gao, Kezhen Chen, Jinmeng Rao, Baochen Sun, Ruibo Liu, Daiyi Peng, Yawen Zhang, Xiaoyuan Guo, Jie
Yang, and VS Subrahmanian. 2024. Higher Layers Need More LoRA Experts. arXiv preprint arXiv:2402.08562 (2024).
[52] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In Proceedings of the
fourteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings,
315–323.
[53] Yunhao Gou, Zhili Liu, Kai Chen, Lanqing Hong, Hang Xu, Aoxue Li, Dit-Yan Yeung, James T Kwok, and Yu Zhang. 2023.
Mixture of cluster-conditional lora experts for vision-language instruction tuning. arXiv preprint arXiv:2312.12379
(2023).
[54] Sam Gross, Marc’Aurelio Ranzato, and Arthur Szlam. 2017. Hard mixtures of experts for large scale weakly supervised
vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6865–6873.
[55] Albert Gu and Tri Dao. 2023. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint
arXiv:2312.00752 (2023).
[56] Suchin Gururangan, Mike Lewis, Ari Holtzman, Noah A Smith, and Luke Zettlemoyer. 2022. DEMix Layers: Dis-
entangling Domains for Modular Language Modeling. In Proceedings of the 2022 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies. 5557–5576.
[57] Zeyu Han, Chao Gao, Jinyang Liu, Sai Qian Zhang, et al. 2024. Parameter-efficient fine-tuning for large models: A
comprehensive survey. arXiv preprint arXiv:2403.14608 (2024).
[58] Hussein Hazimeh, Zhe Zhao, Aakanksha Chowdhery, Maheswaran Sathiamoorthy, Yihua Chen, Rahul Mazumder,
Lichan Hong, and Ed Chi. 2021. Dselect-k: Differentiable selection in the mixture of experts with applications to
multi-task learning. Advances in Neural Information Processing Systems 34 (2021), 29335–29347.
[59] Jiaao He, Jiezhong Qiu, Aohan Zeng, Zhilin Yang, Jidong Zhai, and Jie Tang. 2021. Fastmoe: A fast mixture-of-expert
training system. arXiv preprint arXiv:2103.13262 (2021).
[60] Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. 2022. Fastermoe:
modeling and optimizing training of large-scale dynamic pre-trained models. In Proceedings of the 27th ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming. 120–134.
[61] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020.
Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300 (2020).
[62] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob
Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874
(2021).
[63] Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415 (2016).
[64] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las
Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language
models. arXiv preprint arXiv:2203.15556 (2022).
[65] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo,
Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International Conference on
Machine Learning. PMLR, 2790–2799.
[66] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2021. LoRA:
Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations.
[67] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam,
Quoc V Le, Yonghui Wu, et al. 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism.
Advances in neural information processing systems 32 (2019).
[68] Yongqi Huang, Peng Ye, Xiaoshui Huang, Sheng Li, Tao Chen, and Wanli Ouyang. 2023. Experts weights averaging:
A new general training scheme for vision transformers. arXiv preprint arXiv:2308.06093 (2023).
[69] Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat
Ram, et al. 2023. Tutel: Adaptive mixture-of-experts at scale. Proceedings of Machine Learning and Systems 5 (2023).
[70] Ranggi Hwang, Jianyu Wei, Shijie Cao, Changho Hwang, Xiaohu Tang, Ting Cao, Mao Yang, and Minsoo Rhu. 2023.
Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference. arXiv preprint
arXiv:2308.12066 (2023).
[71] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts.
Neural computation 3, 1 (1991), 79–87.
[72] Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Leon Song, Samyam Rajbhandari, and Yuxiong
He. 2023. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer
models. arXiv preprint arXiv:2309.14509 (2023).
[73] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las
Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint
arXiv:2310.06825 (2023).
[74] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh
Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. arXiv preprint
arXiv:2401.04088 (2024).
[75] Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, and Yida Wang. 2024. Lancet: Accelerating Mixture-
of-Experts Training via Whole Graph Computation-Communication Overlapping. arXiv preprint arXiv:2404.19429
(2024).
[76] Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language Models for
Code Generation. arXiv preprint arXiv:2406.00515 (2024).
[77] Juyong Jiang, Peiyan Zhang, Yingtao Luo, Chaozhuo Li, Jae Boum Kim, Kai Zhang, Senzhang Wang, Xing Xie, and
Sunghun Kim. 2023. AdaMCT: adaptive mixture of CNN-transformer for sequential recommendation. In Proceedings
of the 32nd ACM International Conference on Information and Knowledge Management. 976–986.
[78] Michael I Jordan and Robert A Jacobs. 1994. Hierarchical mixtures of experts and the EM algorithm. Neural computation
6, 2 (1994), 181–214.
[79] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford,
Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).
[80] Young Jin Kim, Ammar Ahmad Awan, Alexandre Muzio, Andres Felipe Cruz Salinas, Liyang Lu, Amr Hendy, Samyam
Rajbhandari, Yuxiong He, and Hany Hassan Awadalla. 2021. Scalable and efficient moe training for multitask
multilingual models. arXiv preprint arXiv:2109.10465 (2021).
[81] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan,
John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural
networks. Proceedings of the national academy of sciences 114, 13 (2017), 3521–3526.
[82] Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay,
Mostafa Dehghani, and Neil Houlsby. 2022. Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints.
In The Eleventh International Conference on Learning Representations.
[83] Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi,
and Bryan Catanzaro. 2023. Reducing activation recomputation in large transformer models. Proceedings of Machine
Learning and Systems 5 (2023).
[84] Sneha Kudugunta, Yanping Huang, Ankur Bapna, Maxim Krikun, Dmitry Lepikhin, Minh-Thang Luong, and Orhan
Firat. 2021. Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference. In Findings of the Association
for Computational Linguistics: EMNLP 2021. 3577–3599.
[85] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang,
and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In
Proceedings of the 29th Symposium on Operating Systems Principles. 611–626.
[86] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam
Shazeer, and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding.
arXiv preprint arXiv:2006.16668 (2020).
[87] Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. 2021. Base layers: Simplifying
training of large, sparse models. In International Conference on Machine Learning. PMLR, 6265–6274.
[88] Dengchun Li, Yingzi Ma, Naizheng Wang, Zhiyuan Cheng, Lei Duan, Jie Zuo, Cal Yang, and Mingjie Tang. 2024.
MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA based Mixture of Experts. arXiv preprint
arXiv:2404.15159 (2024).
[89] Margaret Li, Suchin Gururangan, Tim Dettmers, Mike Lewis, Tim Althoff, Noah A Smith, and Luke Zettlemoyer. 2022.
Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models. In First Workshop on Interpolation
Regularizers and Beyond at NeurIPS 2022.
[90] Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. 2021. Sequence parallelism: Long sequence
training from system perspective. arXiv preprint arXiv:2105.13120 (2021).
[91] Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings
of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference
on Natural Language Processing (Volume 1: Long Papers). 4582–4597.
[92] Yunxin Li, Shenyuan Jiang, Baotian Hu, Longyue Wang, Wanqi Zhong, Wenhan Luo, Lin Ma, and Min Zhang. 2024.
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts. arXiv preprint arXiv:2405.11273 (2024).
[93] Vladislav Lialin, Vijeta Deshpande, and Anna Rumshisky. 2023. Scaling down to scale up: A guide to parameter-efficient
fine-tuning. arXiv preprint arXiv:2303.15647 (2023).
[94] Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom,
Yonatan Belinkov, Shai Shalev-Shwartz, et al. 2024. Jamba: A hybrid transformer-mamba language model. arXiv
preprint arXiv:2403.19887 (2024).
[95] Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, and Li Yuan. 2024. Moe-llava:
Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947 (2024).
[96] Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel.
2022. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural
Information Processing Systems 35 (2022), 1950–1965.
[97] Qidong Liu, Xian Wu, Xiangyu Zhao, Yuanshao Zhu, Derong Xu, Feng Tian, and Yefeng Zheng. 2023. Moelora: An
moe-based parameter efficient fine-tuning method for multi-task medical applications. arXiv preprint arXiv:2310.18339
(2023).
[98] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin
transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international
conference on computer vision. 10012–10022.
[99] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic represen-
tations for vision-and-language tasks. Advances in neural information processing systems 32 (2019).
[100] Tongxu Luo, Jiahe Lei, Fangyu Lei, Weihao Liu, Shizhu He, Jun Zhao, and Kang Liu. 2024. Moelora: Contrastive
learning guided mixture of experts on parameter-efficient fine-tuning for large language models. arXiv preprint
arXiv:2402.12851 (2024).
[101] Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. 2018. Modeling task relationships in multi-task
learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD international conference on
knowledge discovery & data mining. 1930–1939.
[102] Zixuan Ma, Jiaao He, Jiezhong Qiu, Huanqi Cao, Yuanwei Wang, Zhenbo Sun, Liyan Zheng, Haojie Wang, Shizhi
Tang, Tianyu Zheng, et al. 2022. BaGuaLu: targeting brain scale pretrained models with over 37 million cores. In
Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 192–204.
[103] Yuning Mao, Lambert Mathias, Rui Hou, Amjad Almahairi, Hao Ma, Jiawei Han, Scott Yih, and Madian Khabsa. 2022.
UniPELT: A Unified Framework for Parameter-Efficient Language Model Tuning. In Proceedings of the 60th Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 6253–6264.
[104] Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah,
Xianzhi Du, Futang Peng, Floris Weers, et al. 2024. Mm1: Methods, analysis & insights from multimodal llm pre-
training. arXiv preprint arXiv:2403.09611 (2024).
[105] Mohammed Muqeeth, Haokun Liu, and Colin Raffel. 2023. Soft merging of experts with adaptive routing. arXiv
preprint arXiv:2306.03745 (2023).
[106] Basil Mustafa, Carlos Riquelme, Joan Puigcerver, Rodolphe Jenatton, and Neil Houlsby. 2022. Multimodal contrastive
learning with limoe: the language-image mixture of experts. Advances in Neural Information Processing Systems 35
(2022), 9564–9576.
[107] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B
Gibbons, and Matei Zaharia. 2019. PipeDream: generalized pipeline parallelism for DNN training. In Proceedings of
the 27th ACM symposium on operating systems principles. 1–15.
[108] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri
Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. 2021. Efficient large-scale language model
training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance
Computing, Networking, Storage and Analysis. 1–15.
[109] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. 2011. Multimodal deep
learning. In Proceedings of the 28th international conference on machine learning (ICML-11). 689–696.
[110] Xiaonan Nie, Xupeng Miao, Shijie Cao, Lingxiao Ma, Qibin Liu, Jilong Xue, Youshan Miao, Yi Liu, Zhi Yang, and Bin
Cui. 2021. Evomoe: An evolutional mixture-of-experts training framework via dense-to-sparse gate. arXiv preprint
arXiv:2112.14397 (2021).
[111] Xiaonan Nie, Xupeng Miao, Zilong Wang, Zichao Yang, Jilong Xue, Lingxiao Ma, Gang Cao, and Bin Cui. 2023.
Flexmoe: Scaling large-scale sparse pre-trained model training via dynamic device placement. Proceedings of the ACM
on Management of Data 1, 1 (2023), 1–19.
[112] Xiaonan Nie, Pinxue Zhao, Xupeng Miao, Tong Zhao, and Bin Cui. 2022. HetuMoE: An efficient trillion-scale
mixture-of-expert distributed training system. arXiv preprint arXiv:2203.14685 (2022).
[113] OpenAI. 2022. ChatGPT: Optimizing language models for dialogue. https://openai.com/blog/chatgpt.
[114] Oleksiy Ostapenko, Lucas Caccia, Zhan Su, Nicolas Le Roux, Laurent Charlin, and Alessandro Sordoni. 2023. A Case
Study of Instruction Tuning with Mixture of Parameter-Efficient Experts. In NeurIPS 2023 Workshop on Instruction
Tuning and Instruction Following.
[115] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli.
2019. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038 (2019).
[116] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini
Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback.
Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
[117] Bowen Pan, Yikang Shen, Haokun Liu, Mayank Mishra, Gaoyuan Zhang, Aude Oliva, Colin Raffel, and Rameswar
Panda. 2024. Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models. arXiv
preprint arXiv:2404.05567 (2024).
[118] Joan Puigcerver, Carlos Riquelme Ruiz, Basil Mustafa, and Neil Houlsby. 2023. From Sparse to Soft Mixtures of
Experts. In The Twelfth International Conference on Learning Representations.
[119] Penghui Qi, Xinyi Wan, Guangxing Huang, and Min Lin. 2023. Zero Bubble Pipeline Parallelism. In The Twelfth
International Conference on Learning Representations.
[120] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and
Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine
learning research 21, 140 (2020), 1–67.
[121] Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan,
Jeff Rasley, and Yuxiong He. 2022. Deepspeed-moe: Advancing mixture-of-experts inference and training to power
next-generation ai scale. In International conference on machine learning. PMLR, 18332–18346.
[122] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. Zero: Memory optimizations toward
training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking,
Storage and Analysis. IEEE, 1–16.
[123] Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. 2021. Zero-infinity: Breaking the
gpu memory wall for extreme scale deep learning. In Proceedings of the international conference for high performance
computing, networking, storage and analysis. 1–14.
[124] David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, and Adam Santoro.
2024. Mixture-of-Depths: Dynamically allocating compute in transformer-based language models. arXiv preprint
arXiv:2404.02258 (2024).
[125] Carl Rasmussen and Zoubin Ghahramani. 2001. Infinite mixtures of Gaussian process experts. Advances in neural
information processing systems 14 (2001).
[126] Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong
Li, and Yuxiong He. 2021. {Zero-offload}: Democratizing {billion-scale} model training. In 2021 USENIX Annual
Technical Conference (USENIX ATC 21). 551–564.
[127] Xiaozhe Ren, Pingyi Zhou, Xinfan Meng, Xinjing Huang, Yadao Wang, Weichao Wang, Pengfei Li, Xiaoda Zhang,
Alexander Podolskiy, Grigory Arshinov, et al. 2023. PanGu-Σ: Towards trillion parameter language model
with sparse heterogeneous computing. arXiv preprint arXiv:2303.10845 (2023).
[128] Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel
Keysers, and Neil Houlsby. 2021. Scaling vision with sparse mixture of experts. Advances in Neural Information
Processing Systems 34 (2021), 8583–8595.
[129] Stephen Roller, Sainbayar Sukhbaatar, Jason Weston, et al. 2021. Hash layers for large sparse models. Advances in
Neural Information Processing Systems 34 (2021), 17555–17566.
[130] Clemens Rosenbaum, Ignacio Cases, Matthew Riemer, and Tim Klinger. 2019. Routing networks and the challenges
of modular and compositional computation. arXiv preprint arXiv:1904.12774 (2019).
[131] Clemens Rosenbaum, Tim Klinger, and Matthew Riemer. 2017. Routing networks: Adaptive selection of non-linear
functions for multi-task learning. arXiv preprint arXiv:1711.01239 (2017).
[132] Babak Shahbaba and Radford Neal. 2009. Nonlinear models using Dirichlet process mixtures. Journal of Machine
Learning Research 10, 8 (2009).
[133] Noam Shazeer. 2020. Glu variants improve transformer. arXiv preprint arXiv:2002.05202 (2020).
[134] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins,
HyoukJoong Lee, Mingsheng Hong, Cliff Young, et al. 2018. Mesh-tensorflow: Deep learning for supercomputers.
Advances in neural information processing systems 31 (2018).
[135] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017.
Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538
(2017).
[136] Liang Shen, Zhihua Wu, WeiBao Gong, Hongxiang Hao, Yangfan Bai, HuaChao Wu, Xinxuan Wu, Jiang Bian, Haoyi
Xiong, Dianhai Yu, et al. 2022. Se-moe: A scalable and efficient mixture-of-experts distributed training and inference
system. arXiv preprint arXiv:2205.10034 (2022).
[137] Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William
Fedus, Xinyun Chen, et al. 2023. Mixture-of-experts meets instruction tuning: A winning combination for large
language models. arXiv preprint arXiv:2305.14705 (2023).
[138] Sheng Shen, Zhewei Yao, Chunyuan Li, Trevor Darrell, Kurt Keutzer, and Yuxiong He. 2023. Scaling vision-language
models with sparse mixture of experts. arXiv preprint arXiv:2303.07226 (2023).
[139] Yikang Shen, Zhen Guo, Tianle Cai, and Zengyi Qin. 2024. JetMoE: Reaching Llama2 Performance with 0.1 M Dollars.
arXiv preprint arXiv:2404.07413 (2024).
[140] Yikang Shen, Zheyu Zhang, Tianyou Cao, Shawn Tan, Zhenfang Chen, and Chuang Gan. 2023. Moduleformer:
Learning modular large language models from uncurated data. arXiv preprint arXiv:2306.04640 (2023).
[141] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019.
Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint
arXiv:1909.08053 (2019).
[142] Siddharth Singh, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He, and Abhinav Bhatele.
2023. A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training. In Proceedings of
the 37th International Conference on Supercomputing. 203–214.
[143] Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun
Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. 2022. Using deepspeed and megatron to train
megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022).
[144] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. Roformer: Enhanced transformer
with rotary position embedding. Neurocomputing 568 (2024), 127063.
[145] Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel
Li, Wen-tau Yih, Jason Weston, et al. 2024. Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM.
arXiv preprint arXiv:2403.07816 (2024).
[146] Shawn Tan, Yikang Shen, Zhenfang Chen, Aaron Courville, and Chuang Gan. 2023. Sparse Universal Transformer. In
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 169–179.
[147] Shawn Tan, Yikang Shen, Rameswar Panda, and Aaron Courville. 2024. Scattered Mixture-of-Experts Implementation.
arXiv preprint arXiv:2403.08245 (2024).
[148] Hongyan Tang, Junning Liu, Ming Zhao, and Xudong Gong. 2020. Progressive layered extraction (ple): A novel
multi-task learning (mtl) model for personalized recommendations. In Proceedings of the 14th ACM Conference on
Recommender Systems. 269–278.
[149] LLaMA-MoE Team. 2023. LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training.
https://github.com/pjlab-sys4nlp/llama-moe
[150] Qwen Team. 2024. Introducing Qwen1.5. https://qwenlm.github.io/blog/qwen1.5/
[151] Qwen Team. 2024. Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters. https://qwenlm.github.io/blog/qwen-moe/
[152] Snowflake AI Research Team. 2024. Snowflake Arctic: The Best LLM for Enterprise AI — Efficiently Intelligent, Truly
Open. https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/
[153] Lucas Theis and Matthias Bethge. 2015. Generative image modeling using spatial lstms. Advances in neural information
processing systems 28 (2015).
[154] Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, Weiwei Lu, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang,
Liang Zeng, Xiaokun Wang, Yutuan Ma, Rui Hu, Shuicheng Yan, Han Fang, and Yahui Zhou. 2024. Skywork-MoE: A
Deep Dive into Training Techniques for Mixture-of-Experts Language Models.
[155] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov,
Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models.
arXiv preprint arXiv:2307.09288 (2023).
[156] Shagun Uppal, Sarthak Bhagat, Devamanyu Hazarika, Navonil Majumder, Soujanya Poria, Roger Zimmermann, and
Amir Zadeh. 2022. Multimodal research in vision and language: A review of current and emerging trends. Information
Fusion 77 (2022), 149–171.
[157] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
[158] Hongyi Wang, Felipe Maia Polo, Yuekai Sun, Souvik Kundu, Eric Xing, and Mikhail Yurochkin. 2023. Fusing Models
with Complementary Expertise. In The Twelfth International Conference on Learning Representations.
[159] Xin Wang, Fisher Yu, Lisa Dunlap, Yi-An Ma, Ruth Wang, Azalia Mirhoseini, Trevor Darrell, and Joseph E Gonzalez.
2020. Deep mixture of experts via shallow embedding. In Uncertainty in artificial intelligence. PMLR, 552–562.
[160] Yaqing Wang, Sahaj Agarwal, Subhabrata Mukherjee, Xiaodong Liu, Jing Gao, Ahmed Hassan Awadallah, and
Jianfeng Gao. 2022. AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning. In Proceedings of the
2022 Conference on Empirical Methods in Natural Language Processing, Yoav Goldberg, Zornitsa Kozareva, and Yue
Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 5744–5760. https://doi.org/10.18653/v1/2022.emnlp-main.388
[161] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma,
Denny Zhou, Donald Metzler, et al. 2022. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682
(2022).
[162] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022.
Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing
Systems 35 (2022), 24824–24837.
[163] Tianwen Wei, Liang Zhao, Lichang Zhang, Bo Zhu, Lijie Wang, Haihua Yang, Biye Li, Cheng Cheng, Weiwei Lü, Rui
Hu, et al. 2023. Skywork: A more open bilingual foundation model. arXiv preprint arXiv:2310.19341 (2023).
[164] Jialin Wu, Xia Hu, Yaqing Wang, Bo Pang, and Radu Soricut. 2023. Omni-SMoLA: Boosting Generalist Multimodal
Models with Soft Mixture of Low-rank Experts. arXiv preprint arXiv:2312.00968 (2023).
[165] Lemeng Wu, Mengchen Liu, Yinpeng Chen, Dongdong Chen, Xiyang Dai, and Lu Yuan. 2022. Residual mixture of
experts. arXiv preprint arXiv:2204.09636 (2022).
[166] Shaohua Wu, Jiangang Luo, Xi Chen, Lingjun Li, Xudong Zhao, Tong Yu, Chao Wang, Yue Wang, Fei Wang, Weixu
Qiao, et al. 2024. Yuan 2.0-M32: Mixture of Experts with Attention Router. arXiv preprint arXiv:2405.17976 (2024).
[167] Xun Wu, Shaohan Huang, and Furu Wei. 2023. MoLE: Mixture of LoRA Experts. In The Twelfth International Conference
on Learning Representations.
[168] Xun Wu, Shaohan Huang, and Furu Wei. 2024. Mixture of LoRA Experts. In The Twelfth International Conference on
Learning Representations. https://openreview.net/forum?id=uWvKBCYh4S
[169] xAI. 2024. Grok-1. https://github.com/xai-org/grok-1
[170] Fuzhao Xue, Xiaoxin He, Xiaozhe Ren, Yuxuan Lou, and Yang You. 2022. One student knows all experts know: From
sparse to dense. arXiv preprint arXiv:2201.10890 (2022).
[171] Fuzhao Xue, Ziji Shi, Futao Wei, Yuxuan Lou, Yong Liu, and Yang You. 2022. Go wider instead of deeper. In Proceedings
of the AAAI Conference on Artificial Intelligence, Vol. 36. 8779–8787.
[172] Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, and Yang You. 2024. Openmoe: An
early effort on open mixture-of-experts language models. arXiv preprint arXiv:2402.01739 (2024).
[173] An Yang, Junyang Lin, Rui Men, Chang Zhou, Le Jiang, Xianyan Jia, Ang Wang, Jie Zhang, Jiamang Wang, Yong Li,
et al. 2021. M6-t: Exploring sparse expert models and beyond. arXiv preprint arXiv:2105.15082 (2021).
[174] Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, et al. 2024. Exploiting Inter-Layer Expert Affinity for
Accelerating Mixture-of-Experts Model Inference. arXiv preprint arXiv:2401.08383 (2024).
[175] Qinyuan Ye, Juan Zha, and Xiang Ren. 2022. Eliciting and Understanding Cross-task Skills with Task-level Mixture-
of-Experts. In Findings of the Association for Computational Linguistics: EMNLP 2022. 2567–2592.
[176] Rongjie Yi, Liwei Guo, Shiyun Wei, Ao Zhou, Shangguang Wang, and Mengwei Xu. 2023. Edgemoe: Fast on-device
inference of moe-based large language models. arXiv preprint arXiv:2308.14352 (2023).
[177] Kang Min Yoo, Jaegeun Han, Sookyo In, Heewon Jeon, Jisu Jeong, Jaewook Kang, Hyunwook Kim, Kyung-Min Kim,
Munhyong Kim, Sungju Kim, et al. 2024. HyperCLOVA X Technical Report. arXiv preprint arXiv:2404.01954 (2024).
[178] Ted Zadouri, Ahmet Üstün, Arash Ahmadian, Beyza Ermiş, Acyr Locatelli, and Sara Hooker. 2023. Pushing mixture of
experts to the limit: Extremely parameter efficient moe for instruction tuning. arXiv preprint arXiv:2309.05444 (2023).
[179] Mingshu Zhai, Jiaao He, Zixuan Ma, Zan Zong, Runqing Zhang, and Jidong Zhai. 2023. {SmartMoE}: Efficiently
Training {Sparsely-Activated} Models through Combining Offline and Online Parallelization. In 2023 USENIX Annual
Technical Conference (USENIX ATC 23). 961–975.
[180] Biao Zhang and Rico Sennrich. 2019. Root mean square layer normalization. Advances in Neural Information Processing
Systems 32 (2019).
[181] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao.
2021. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition. 5579–5588.
[182] Xiaofeng Zhang, Yikang Shen, Zeyu Huang, Jie Zhou, Wenge Rong, and Zhang Xiong. 2022. Mixture of Attention
Heads: Selecting Attention Heads Per Token. In Proceedings of the 2022 Conference on Empirical Methods in Natural
Language Processing. 4150–4162.
[183] Yihua Zhang, Ruisi Cai, Tianlong Chen, Guanhua Zhang, Huan Zhang, Pin-Yu Chen, Shiyu Chang, Zhangyang Wang,
and Sijia Liu. 2023. Robust Mixture-of-Expert Training for Convolutional Neural Networks. In Proceedings of the
IEEE/CVF International Conference on Computer Vision. 90–101.
[184] Zhengyan Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. 2022. MoEfication: Transformer
Feed-forward Layers are Mixtures of Experts. In Findings of the Association for Computational Linguistics: ACL 2022.
877–890.
[185] Zheng Zhang, Yaqi Xia, Hulin Wang, Donglin Yang, Chuang Hu, Xiaobo Zhou, and Dazhao Cheng. 2024. MPMoE:
Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism. IEEE Transactions on Parallel and
Distributed Systems (2024).
[186] Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong
Xu, Danyang Zhuo, Eric P Xing, et al. 2022. Alpa: Automating inter-and {Intra-Operator} parallelism for distributed
deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 559–578.
[187] Ningxin Zheng, Huiqiang Jiang, Quanlu Zhang, Zhenhua Han, Lingxiao Ma, Yuqing Yang, Fan Yang, Chengruidong
Zhang, Lili Qiu, Mao Yang, et al. 2023. Pit: Optimization of dynamic sparse deep learning models via permutation
invariant transformation. In Proceedings of the 29th Symposium on Operating Systems Principles. 331–347.
[188] Yong Zheng and David Xuejun Wang. 2022. A survey of recommender systems with multi-objective optimization.
Neurocomputing 474 (2022), 141–153.
[189] Zexuan Zhong, Mengzhou Xia, Danqi Chen, and Mike Lewis. 2024. Lory: Fully Differentiable Mixture-of-Experts for
Autoregressive Language Model Pre-training. arXiv preprint arXiv:2405.03133 (2024).
[190] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. 2022. Learning to prompt for vision-language
models. International Journal of Computer Vision 130, 9 (2022), 2337–2348.
[191] Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, and Jianfeng Gao. 2020. Unified vision-language
pre-training for image captioning and vqa. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34.
13041–13049.
[192] Yanqi Zhou, Nan Du, Yanping Huang, Daiyi Peng, Chang Lan, Da Huang, Siamak Shakeri, David So, Andrew M Dai,
Yifeng Lu, et al. 2023. Brainformers: Trading simplicity for efficiency. In International Conference on Machine Learning.
PMLR, 42531–42542.
[193] Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon,
et al. 2022. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems 35
(2022), 7103–7114.
[194] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. Minigpt-4: Enhancing vision-language
understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023).
[195] Jinguo Zhu, Xizhou Zhu, Wenhai Wang, Xiaohua Wang, Hongsheng Li, Xiaogang Wang, and Jifeng Dai. 2022. Uni-
perceiver-moe: Learning sparse generalist models with conditional moes. Advances in Neural Information Processing
Systems 35 (2022), 2664–2678.
[196] Yun Zhu, Nevan Wichers, Chu-Cheng Lin, Xinyi Wang, Tianlong Chen, Lei Shu, Han Lu, Canoee Liu, Liangchen Luo,
Jindong Chen, et al. 2023. Sira: Sparse mixture of low rank adaptation. arXiv preprint arXiv:2311.09179 (2023).
[197] Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus.
2022. St-moe: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906 (2022).
[198] Simiao Zuo, Xiaodong Liu, Jian Jiao, Young Jin Kim, Hany Hassan, Ruofei Zhang, Jianfeng Gao, and Tuo Zhao.
2021. Taming Sparsely Activated Transformer with Stochastic Experts. In International Conference on Learning
Representations.
[199] Simiao Zuo, Qingru Zhang, Chen Liang, Pengcheng He, Tuo Zhao, and Weizhu Chen. 2022. MoEBERT: from BERT to
Mixture-of-Experts via Importance-Guided Adaptation. In Proceedings of the 2022 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies. 1610–1623.