Communication Optimization for Distributed Training

Abstract—The past few years have witnessed the flourishing of large-scale deep neural network models with ever-growing parameter numbers. Training such large-scale models typically requires massive memory and computing resources that exceed those of a single GPU, necessitating distributed training. As GPU performance has rapidly evolved in recent years, computation time has shrunk, thereby increasing the proportion of communication in the overall training time. Therefore, optimizing communication for distributed training has become an urgent issue. In this article, we briefly introduce the general architecture of distributed deep neural network training and analyze the relationships among Parallelization Strategy, Collective Communication Library, and Network from the perspective of communication optimization, which forms a three-layer paradigm. We then review current representative research advances within this three-layer paradigm. We find that the layers in the current three-layer paradigm are relatively independent, but there is a rich design space for cross-layer collaborative optimization in distributed training scenarios. Therefore, we further advocate a communication-efficient five-layer paradigm underlining opportunities for collaboration designs and look forward to the perspectives of "Vertical", "Horizontal", "Intra-Inter" and "Host-Net" collaboration designs. We hope this article can shed some light on future research on communication optimization for distributed training.

Index Terms—Deep Neural Network, Distributed Training, Parallelization Strategy, Collective Communication Library, Network Protocols and Topologies.

Y. Wei, T. Hu, C. Liang and Y. Cui are with the Department of Computer Science and Technology, Tsinghua University, Beijing, China. Yong Cui ([email protected]) is the corresponding author.

I. INTRODUCTION

Large-scale deep neural network (DNN) models ("large models" for short) have become ubiquitous, and their capabilities have significantly advanced in recent years. The latest prominent large models such as GPT¹, LLaMA², and GLM³ have demonstrated unprecedented performance, enabling significant changes in production and lifestyle.

¹ https://openai.com/gpt-4
² https://llama.meta.com/llama2
³ https://open.bigmodel.cn

The rapid development of large models has inevitably led to a significant increase in the scale of parameters and training data. Due to the limited memory and computing power of a single GPU (for simplicity, we use GPU to signify other AI hardware such as TPUs and NPUs as well), training large models with a single GPU has already become a thing of the past. For instance, even merely considering the computation time, training a GPT-3 model with 175 billion parameters on a single Nvidia V100 GPU would require approximately 288 years [1]. Consequently, training large models with multiple GPUs is natural and inevitable, creating a new demand for large-scale, high-performance GPU clusters.

However, enhancing GPU performance and enlarging cluster size do not necessarily lead to linear performance improvement of distributed deep neural network training ("distributed training" for short) systems, as one might naturally expect. This is because the time overhead of distributed training comes not only from computation but also from communication. When the time spent on computation is reduced, the communication time is exposed more and gradually becomes a bottleneck. In fact, communication tasks account for up to 60% of a DNN training iteration's time in Meta's production environment [2]. Optimizing communication can significantly reduce the overall training time, so there is an urgent need to enhance communication efficiency for distributed training.

Systematically optimizing distributed training is not a straightforward task. A distributed deep neural network training architecture consists of hardware and software parts. The software part includes the parallelization strategy, deep learning library, collective communication library (CCL), and network protocols, while the hardware part comprises CPUs, GPUs, RAM, I/O, and network infrastructure. Attaining efficiency in distributed training systems necessitates effective collaboration among the various system components [3], [4].

Many components of the distributed training architecture are closely related to communication. For instance, the parallelization strategy determines the communication demand, the CCL generates the communication traffic, and the network affects the efficiency of performing communication tasks. To sum up, there is a "Parallelization Strategy, CCL, and Network" three-layer paradigm of communication optimization for distributed training.

There are many research advances optimizing communication within the current three-layer paradigm. For example, PTD-P [1] uses a novel interleaved pipeline parallelism optimization scheme to overlap communication and computation as much as possible. TACCL [5] generates communication primitive algorithms tailored to specific training tasks and topologies to enhance efficiency. TopoOpt [2] leverages the reconfigurability of optical switches to optimize topologies and parallelization strategies collaboratively, providing customized topologies for efficient communication.

Different from general high-performance computing (HPC) scenarios, distributed training has its own characteristics. For example, the communication traffic of distributed training exhibits periodic repetition, and the traffic pattern of each training iteration is relatively consistent [6]. Moreover, most communication tasks in distributed training are pre-determined, which contrasts significantly with traditional tasks where traffic flows arrive stochastically. Meanwhile, the current three-layer paradigm has its shortcomings. Each layer of the paradigm is relatively independent, making it difficult for the layers to cooperate for communication optimization. Therefore, we advocate a communication-efficient five-layer paradigm, which is detailed in Section IV, aiming to achieve cross-layer collaborative design with (logical) schedulers as middleware. We prospect optimization opportunities from four perspectives: "Vertical", "Horizontal", "Intra-Inter", and "Host-Net" co-design.

The rest of this article is arranged as follows: Section II presents the general architecture of distributed training and the current three-layer paradigm from the perspective of communication optimization. Section III reviews representative research advances on communication optimization for distributed training. Section IV advocates a communication-efficient five-layer paradigm, looking into cross-layer collaborative design opportunities from four promising research directions. Section V concludes the article.
[Fig. 1. General architecture of distributed training.]

[Fig. 2. Four common parallelization strategies: a) Data parallelism; b) Pipeline parallelism; c) Tensor parallelism; d) MoE parallelism.]

[Fig. 3. Four common collective communication primitives: a) Broadcast; b) All-Gather; c) All-to-All; d) All-Reduce.]
II. ARCHITECTURE

This section first provides an overview of the distributed deep neural network training architecture, followed by an in-depth introduction of three components pertinent to communication optimization: parallelization strategy, collective communication library, and network. Finally, the relationships between these three components and communication optimization are elaborated.

A. Overview

The general architecture of the distributed deep neural network training system is shown in Fig. 1. At the top of the architecture is a large deep neural network model to be trained. As analyzed in Section I, a deep learning model needs distributed training when it is too large to fit into the memory of a single GPU or too slow to train with a single GPU.

Parallelization strategy is an essential part of distributed training deployment, determining how a model is partitioned for distributed training. Commonly used parallelization strategies are detailed in Section II-B.

Deep learning models are often implemented with a deep learning library (such as TensorFlow⁴ and PyTorch⁵), which generates execution task graphs including computing tasks and corresponding collective communication tasks. Communication compression techniques such as gradient quantization are more relevant to training methods and are not detailed in this article due to the limited space.

The deep learning library often invokes the Collective Communication Library to implement collective communication tasks that transmit activations (in forward propagation) or synchronize gradients (in backward propagation) between different GPUs. Commonly used collective communication primitives implemented in CCL are detailed in Section II-C.

The collective communication primitives generate the actual communication data traffic, which is injected into the underlying network. A variety of network protocols and topologies for distributed training are detailed in Section II-D. As the end-host of the underlying network, a computing node is a supercomputer equipped with high-performance GPUs and CPUs, a large RAM, and high-speed I/O. These hardware components also need to collaborate efficiently.

Parallelization strategy, CCL, and network are three critical components in the architecture that affect communication efficiency. The interplay among them and their impacts on communication efficiency are analyzed in Section II-E.

⁴ https://www.tensorflow.org
⁵ https://pytorch.org
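To make this division of labor concrete, the following minimal sketch (ours, not taken from any surveyed system) shows the software path just described: the deep learning library produces local gradients as a computing task and then invokes the CCL — here NCCL through PyTorch's torch.distributed interface — to synchronize them with an All-Reduce communication task. It assumes a launch with torchrun on GPU-equipped hosts so that the rank, world-size, and local-rank environment variables are already set.

```python
# Minimal sketch: the deep learning library produces local gradients
# (computing task), then hands a collective communication task to the
# CCL (NCCL via torch.distributed). Assumes a torchrun launch so that
# RANK/WORLD_SIZE/LOCAL_RANK are set in the environment.
import os
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module) -> None:
    """Backward-propagation communication: average gradients across GPUs."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # CCL primitive
            p.grad.div_(world_size)

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")                # CCL layer
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
    model = torch.nn.Linear(1024, 1024).cuda()             # toy model
    model(torch.randn(32, 1024, device="cuda")).sum().backward()  # compute
    sync_gradients(model)                                  # communicate
    dist.destroy_process_group()
```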
B. Common Parallelization Strategies

Commonly used parallelization strategies include data parallelism, model parallelism (primarily including pipeline parallelism and tensor parallelism), and the emerging Mixture-of-Expert (MoE) parallelism, as illustrated by Fig. 2, where each rectangle refers to a GPU node and each rounded rectangle refers to (part of) a model.

Data parallelism is one of the most commonly used parallelization strategies, whose basic idea is to distribute multiple copies of a model to different GPUs. Each GPU receives a subset of the training data (a mini-batch) during training, and gradients are averaged to update the global parameters at the end of each iteration. Model parallelism, including pipeline and tensor parallelism, means splitting the model onto different GPUs. Pipeline parallelism allocates different layers of the model to different GPUs, so there is mainly point-to-point communication between layers (GPUs). Tensor parallelism splits the same layer of the model onto different GPUs and uses distributed matrix computing techniques for collaboration [7], which is a communication-intensive operation. MoE parallelism involves dividing a portion of a model into multiple expert components. Each expert specializes in a specific task domain and is allocated to a certain GPU.

In practice, the above parallelization strategies are often not used independently, but in a hybrid manner. For example, we can first divide the model by layers and split each layer, then distribute the divided model to multiple sets of GPUs, thereby achieving pipeline-tensor-data three-dimensional hybrid parallelism [1]. MoE parallelism is another example of hybrid parallelism: the idea of data parallelism is implicit in MoE parallelism, and the expert model itself can also be split across multiple GPUs for parallel computing.
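As a concrete illustration of hybrid parallelism, the sketch below partitions a cluster's ranks into tensor-, pipeline-, and data-parallel process groups with torch.distributed. The (dp, pp, tp) sizes and the rank layout are assumptions chosen for illustration; production systems such as Megatron-LM and PTD-P implement considerably richer group machinery.

```python
# A simplified sketch (ours) of carving ranks into pipeline-tensor-data
# ("3D") hybrid-parallel process groups. Every process calls new_group
# with the same rank lists in the same order, as required.
import torch.distributed as dist

def build_3d_groups(tp: int, pp: int, dp: int):
    """Split ranks [0, tp*pp*dp) into tensor/pipeline/data process groups."""
    assert dist.get_world_size() == tp * pp * dp
    rank = dist.get_rank()
    tp_group = pp_group = dp_group = None
    # Ranks are laid out as a (dp, pp, tp) grid, tensor dimension fastest.
    for d in range(dp):
        for p in range(pp):                      # tensor-parallel groups
            ranks = [d * pp * tp + p * tp + t for t in range(tp)]
            g = dist.new_group(ranks)
            if rank in ranks:
                tp_group = g
    for d in range(dp):
        for t in range(tp):                      # pipeline-parallel groups
            ranks = [d * pp * tp + p * tp + t for p in range(pp)]
            g = dist.new_group(ranks)
            if rank in ranks:
                pp_group = g
    for p in range(pp):
        for t in range(tp):                      # data-parallel groups
            ranks = [d * pp * tp + p * tp + t for d in range(dp)]
            g = dist.new_group(ranks)
            if rank in ranks:
                dp_group = g
    return tp_group, pp_group, dp_group
```

Each group then carries a different kind of traffic: tensor-parallel groups host the communication-intensive intra-layer collectives, pipeline-parallel groups exchange point-to-point activations and gradients, and data-parallel groups run the gradient All-Reduce.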
C. Common Collective Communication Primitives

Commonly used collective communication primitives implemented in CCL for distributed training are illustrated in Fig. 3, where each rectangle represents a GPU node and each rounded rectangle represents a data chunk.

Broadcast distributes data from a particular node to all other nodes, which can be used in data parallelism and some model parallelism schemes [8]. All-Gather is a many-to-many collective communication primitive where data from different nodes are distributed to all nodes. All-to-All transmits data among various nodes, such as the data distribution in MoE parallelism [9], [10]. All-Reduce is a sum operation over the corresponding data chunks of each node, as shown in Fig. 3(d), where each grey rounded rectangle represents the aggregated result of the data chunks at the corresponding position of each GPU node. Common All-Reduce scenarios include data parallelism and some model parallelism such as Megatron-lm [7].
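The sketch below exercises these four primitives through torch.distributed, which dispatches them to the underlying CCL (NCCL assumed). It is an illustrative example of ours, assuming one GPU per process and a torchrun launch; the tensor shapes are arbitrary.

```python
# An illustrative sketch (ours) of the four primitives in Fig. 3.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

# a) Broadcast: rank 0's data chunk is copied to every other node.
x = torch.full((4,), float(rank), device="cuda")
dist.broadcast(x, src=0)

# b) All-Gather: every node ends up with the chunks of all nodes.
gathered = [torch.empty(4, device="cuda") for _ in range(world)]
dist.all_gather(gathered, x)

# c) All-to-All: node i sends its j-th chunk to node j (e.g., MoE dispatch).
send = [t.contiguous() for t in
        torch.full((world, 4), float(rank), device="cuda").unbind(0)]
recv = [torch.empty(4, device="cuda") for _ in range(world)]
dist.all_to_all(recv, send)

# d) All-Reduce: element-wise sum over all nodes, result left on every node.
y = torch.full((4,), float(rank), device="cuda")
dist.all_reduce(y, op=dist.ReduceOp.SUM)

dist.destroy_process_group()
```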
D. Underlying Network

Protocols and topologies are the two main factors that affect network performance. Besides the general TCP/IP protocol, distributed training often uses RDMA (such as RoCE or Infiniband) for lower overhead and higher bandwidth.

The traffic flow within the network is also closely related to the network topology. Common network topologies used for distributed training include Fat-tree and its variants, Torus, as well as Ring and Full-Mesh topologies. These topologies can be combined according to practical needs. For example, the NVLink topology of Nvidia's DGX-1⁶ is a combination of Ring and Full-Mesh.

⁶ https://images.nvidia.com/content/pdf/dgx1-system-architecture-whitepaper1.pdf

E. Communication Paradigm in the Architecture

Parallelization strategy, CCL, and network form a three-layer paradigm of communication optimization for distributed training, as shown by the colored rectangles in Fig. 1. All of them directly or indirectly affect the communication performance of the training process. Different parallelization strategies determine the primarily used collective communication primitives in task graphs, which affects the traffic pattern. Various implementations of CCL directly affect the actual network traffic. The same traffic demand often exhibits different performance under different network infrastructures. The topologies designed for distributed training are closely related to the algorithm implementation of collective communication primitives. For example, the execution process of the Ring-based All-Reduce algorithm results in a ring communication mode. That is, each node transmits data to its logical neighbors while all the nodes in the communicator (a communicator is a set of nodes used to implement a collective communication task) form a ring. Nvidia's DGX-1 and Google's Torus [4] topology both contain numerous ring structures, which are well-suited to satisfy the communication needs of Ring-based All-Reduce.

However, the layers are relatively independent in the current three-layer communication paradigm. Although some methods introduced in Section III involve cross-layer design, collaborative design has not yet become mainstream in general, which limits the end-to-end training performance [6]. Therefore, we advocate a communication-efficient five-layer paradigm with additional (logical) schedulers (dashed boxes in Fig. 1) over the current architecture. The inter-layer schedulers imply collaborative design, which is elaborated in Section IV.
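To make the ring communication mode tangible, the following toy, single-process simulation (ours, not a real CCL implementation) mimics Ring-based All-Reduce on a logical ring: every node repeatedly passes one chunk to its neighbor, first reduce-scattering partial sums and then all-gathering the fully reduced chunks.

```python
# A toy, single-process simulation of Ring-based All-Reduce: in every
# step each node passes exactly one chunk to its logical neighbor
# (rank + 1) mod N.
def ring_all_reduce(data):
    """data[rank][chunk] -> value; returns data with every chunk summed."""
    n = len(data)                         # nodes in the communicator
    # Phase 1: reduce-scatter. After n-1 steps, node r holds the full
    # sum of chunk (r + 1) % n.
    for step in range(n - 1):
        for rank in range(n):
            chunk = (rank - step) % n
            data[(rank + 1) % n][chunk] += data[rank][chunk]
    # Phase 2: all-gather. The reduced chunks circulate around the ring.
    for step in range(n - 1):
        for rank in range(n):
            chunk = (rank + 1 - step) % n
            data[(rank + 1) % n][chunk] = data[rank][chunk]
    return data

if __name__ == "__main__":
    nodes = [[float(r)] * 4 for r in range(4)]   # 4 nodes, 4 chunks each
    print(ring_all_reduce(nodes))                # every row becomes [6.0]*4
```

All 2(N-1) steps use only neighbor-to-neighbor transfers, which is why ring-rich topologies such as the DGX-1 NVLink interconnect or a Torus match this primitive so well.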
III. OVERVIEW OF RECENT ADVANCES

Communication optimization in distributed training encompasses numerous aspects. We delve into three of the most pertinent elements: parallelization strategy, CCL, and network, which make up the current three-layer communication paradigm. We use exposed communication to denote the communication time that cannot overlap with computation time. In this section, our focus is confined to each study's contribution to communication optimization, that is, how it reduces exposed communication in distributed training. Table I provides an overview of the research advances with some primary concerns.

[TABLE I. Advances on communication optimization in distributed deep neural network training.]
A. Parallelization Strategy

The communication overhead in distributed training mainly arises from parallelization. Different implementations of parallelization strategies affect the communication pattern and network traffic during the model training process [2].

The most straightforward and most widely used parallelization strategy is data parallelism. As mentioned in Section II, multiple copies of the same model are replicated across different nodes in data parallelism. Each iteration requires all nodes to synchronize gradients, which are used for global parameter updates. Consequently, data parallelism results in a significant amount of All-Reduce communication.

To train a large-scale model that cannot fit into the memory of a single GPU, researchers introduced model parallelism, which includes pipeline and tensor parallelism. Tensor parallelism divides models within a layer and requires intensive communication in training. Researchers from Nvidia proposed an efficient "intra-layer parallelism" approach called Megatron-lm [7] to reduce the communication traffic in tensor parallelism. Pipeline parallelism necessitates only the transfer of activations/gradients between adjacent nodes in the logical pipeline, which is a point-to-point communication pattern that significantly reduces the communication cost compared to data parallelism and tensor parallelism. Because overlapping communication and computation is hard to achieve in naive pipeline parallelism, an interleaved pipeline scheduling strategy [1] was proposed where each device is assigned multiple data chunks to gain the chance of communication-computation overlap. However, pipeline parallelism, as an "inter-layer parallelism" approach, is not without its flaws. It faces issues such as "synchronous/asynchronous" parameter updates and pipeline bubble problems, which can adversely affect the accuracy of the model and the efficiency of training [7].
Various parallelization strategies are often used in conjunction. For example, PTD-P [1] not only combines Megatron-lm's tensor parallelism and pipeline parallelism but also introduces data parallelism to scale to thousands of GPUs. AlpaComm [8] identifies the "cross-mesh resharding" problem that arises when combining tensor and pipeline parallelism and proposes a communication optimization method based on broadcasting and an "overlapping-friendly" pipeline scheduling scheme to accelerate end-to-end training.

As an emerging model structure, the MoE model is gradually gaining popularity due to its fast training/inference speed. The training process of the MoE model selects several appropriate experts for a data sample, which inherently has good compatibility with parallelization. MoE parallelism can be achieved by simply dispersing the "experts" on different GPUs, introducing All-to-All traffic when data is distributed between different GPUs. Additionally, the model parameters of the non-expert part are globally shared in MoE parallelism, so there will also be All-Reduce traffic. Lina [9] uses a scheduling strategy that prioritizes All-to-All traffic and further splits the All-Reduce tasks to overlap with computation as much as possible, thus reducing exposed communication and achieving training acceleration. Janus [10] studies MoE parallelism acceleration from a new perspective. Since the size of expert models and training datasets varies, Janus innovatively proposes a "data-centric" model that "moves experts instead of data". This significantly reduces the amount of communication during the training process under certain conditions (e.g., when the scale of experts is smaller than the scale of data), thus accelerating the training process.
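The common thread in these systems is overlapping communication with computation. The sketch below illustrates the general idea in PyTorch (it is our simplification, not Lina's or PTD-P's actual mechanism): a large gradient All-Reduce is split into chunks issued asynchronously, and the handles are awaited only when the synchronized gradients are needed, leaving room for other computation in between. It assumes torch.distributed has already been initialized with the NCCL backend and that grad is a flat, contiguous gradient buffer.

```python
# A minimal sketch of communication-computation overlap via chunked,
# asynchronous All-Reduce (not any surveyed system's implementation).
import torch
import torch.distributed as dist

def overlapped_all_reduce(grad: torch.Tensor, num_chunks: int = 4):
    """Launch chunked, asynchronous All-Reduce; return handles to wait on."""
    return [dist.all_reduce(c, op=dist.ReduceOp.SUM, async_op=True)
            for c in grad.chunk(num_chunks)]

def train_step(grad: torch.Tensor, compute_fn):
    handles = overlapped_all_reduce(grad)   # communication starts here...
    compute_fn()                            # ...while computation continues
    for h in handles:                       # block only when results needed
        h.wait()
    grad.div_(dist.get_world_size())        # average across replicas
```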
B. Collective Communication Library

Communication tasks of distributed training are often implemented by CCL. The current predominant NVIDIA Collective Communication Library (NCCL)⁷ dynamically selects established primitive algorithms based on different situations. For instance, NCCL selects the Ring and Double Binary Tree algorithms for All-Reduce⁸ for different workloads and topologies to accelerate the training process. However, NCCL's optimization for specific hardware underscores the need for a more universal and adaptable CCL to speed up distributed training across different hardware and topologies. Therefore, many researchers took different approaches to implement communication primitives more efficiently.

For example, Blink [11] dynamically generates optimal communication primitives by packing spanning trees rather than writing and optimizing them manually. Blink uses integer linear programming (ILP) to reduce the number of trees and improve performance, and also leverages heterogeneous communication channels to utilize the full bandwidth. At the time, Blink achieved up to 40% speedup compared to NCCL in end-to-end DNN training iterations.

SCCL [12] takes a more unified approach to achieve efficiency: automatically synthesizing high-performance communication primitives for a given topology. To achieve this, SCCL designs a cost model to evaluate the latency and bandwidth cost of algorithms and then searches for algorithms on the Pareto frontier. However, SCCL also faces problems of high complexity. The communication primitive algorithm generation problem is encoded into a Mixed Integer Linear Program (MILP), which is unfortunately NP-hard.

To reduce the complexity of generating collective communication algorithms, TACCL [5] is a representative work that introduces a "human-in-the-loop" approach, whose paradigm is shown in Fig. 4. TACCL incorporates high-level inputs from an algorithm designer, such as logical topologies, switch hyper-edges, and algorithm symmetry, to efficiently synthesize collective communication algorithms for heterogeneous topologies. These human inputs greatly constrain the search space of algorithms, reducing search time to an acceptable amount, in most cases from seconds to a few minutes. Compared with NCCL, TACCL achieves up to 2.36x end-to-end training speedup for BERT and 1.94x for Transformer-XL.

[Fig. 4. TACCL's novel synthesizer takes as input a communication sketch, profiled topology, and target collective along with synthesizer hyperparameters to generate an algorithm for the collective. The synthesized algorithm is implemented in the hardware cluster using TACCL's backend [5].]

Different from previous methods focused on speeding up single primitives, SYNDICATE [13] proposes a novel abstraction to overlap primitives in a fine-grained pattern, breaking large communication operations into smaller units, thus allowing higher flexibility. SYNDICATE then develops a joint optimizer based on Markov Chain Monte Carlo (MCMC) search to optimize the joint action of the control plane and the data plane.

⁷ https://developer.nvidia.com/nccl
⁸ https://developer.nvidia.com/blog/massively-scale-deep-learning-training-nccl-2-4
C. Network

Distributed training focuses on GPU-to-GPU communication, which needs to consider both intra-host and inter-host communication protocols. Common intra-host communication protocols include the general PCIe protocol and Nvidia's NVLink/NVSwitch. Inter-host communication protocols include traditional TCP/IP and the more efficient RDMA (e.g., RoCE and Infiniband).

Based on certain communication protocols, the underlying network topology also significantly influences the communication efficiency of distributed training. For instance, as mentioned in Section II, Nvidia utilizes a Ring and fully connected Mesh topology in their DGX series, and Google employs a 3D-Torus topology in their TPUv4 [4]. These topologies contain a large number of ring structures, which enhance the efficiency of Ring-based All-Reduce and point-to-point communications (such as pipeline parallelism or broadcast-based hybrid parallelism [8]). However, static topologies can rarely adapt to dynamically changing job requirements, leading researchers to propose TopoOpt [2]. TopoOpt takes advantage of the reconfigurability of optical switches, enabling a scheme that co-optimizes the topology and parallelization strategy. Nevertheless, current optical switching technology still has issues of high reconfiguration latency, making it challenging to perform dynamic topology adjustments between iterations. As such, TopoOpt remains a relatively coarse-grained optimization, that is, reconfiguring the interconnection between the training servers of each job only before the job starts and keeping the topology until the training is complete.

Overall, research on communication optimization for distributed training from a network perspective is still insufficient, which is prospected in Section IV.

D. Collaboration Design in Current Advances

As observed in Table I, the research advances mentioned above are primarily limited to optimizations within a layer, with less cross-layer co-design. Current advances involving cross-layer optimization can be categorized into two types. The first type is passive collaborative design. For instance, AlpaComm [8] and Lina [9] optimize communication for a specific parallelization strategy, necessitating modifications to communication primitives. Another example is that generative CCLs (such as Blink [11], SCCL [12] and TACCL [5]) have to be aware of network topologies because their main significance is to customize collective communication algorithms for specific topologies. The second type of work involves proactive attempts at collaborative optimization yet remains exploratory. For example, SYNDICATE [13] jointly optimizes the implementation and scheduling of communication primitives, allowing parallel execution of primitives. TopoOpt [2] collaboratively optimizes network topology and parallelization strategies, obtaining optimized combination strategies through alternating optimization techniques.

In summary, collaborative communication optimization in distributed training is still a domain that needs further exploration. Section IV details the importance of co-design and potential research opportunities.

IV. OPPORTUNITIES

As noted in Section III, collaborative design has not received widespread attention in current research advances. However, different from general tasks in data centers, the communication traffic patterns in distributed training are known in most cases [6], [14], providing motivations and opportunities for cross-layer collaboration. Furthermore, the optimization objectives for communication tasks are no longer the general-purpose flow completion time (FCT) but rather the job completion time (JCT). In this context, communication tasks are merely a component of the entire distributed training process. Optimizing JCT necessitates considering the dependency between communication and computation instead of just optimizing each communication task. Communication tasks should be optimized to minimize computation blocking time, which can only be achieved through "cross-layer" scheduling to meet this new optimization objective.

To this end, we advocate a communication-efficient five-layer paradigm with (logical) schedulers as the middleware, as illustrated in Fig. 5. The schedulers not only allow lower-layer facilities to perceive the requirements of higher-layer applications but also enable higher-layer applications to be aware of the capabilities of lower-layer facilities. We refer to this as "Vertical" co-design. The schedulers can also perform global scheduling across multiple jobs, achieving "Horizontal" collaborative optimization. Considering the heterogeneous topologies of distributed training and the development of emerging technologies such as in-network aggregation, we also provide a prospect for "Intra-Inter" and "Host-Net" co-design.

"Vertical" co-design: Refers to exchanging necessary information between layers to optimize communication efficiency. For example, the CCL can acquire the global traffic demand from the parallelization strategy and generate the optimal communication primitive algorithms in conjunction with the network state. Furthermore, cross-layer schedulers can be constructed to strategically schedule communication tasks based on traffic load or optimize task graphs according to the traffic patterns of relevant communication primitives. Echelon [14] is a flow scheduling work that embodies "Vertical" co-design, where novel communication optimization objectives are designed based on the dependency between communication and computation under various parallelization strategies, aiming to optimize communication towards minimizing JCT.

Another example of "Vertical" co-design is that the network can also use traffic demand for optimization in distributed training. As the general infrastructure, the network is often unaware of traffic demand. This is natural in common data center scenarios, where the network can only implement passive adjustments such as RTT-based congestion control due to the random nature of flow arrivals. However, traffic patterns are often predictable in distributed training scenarios [6]. Consequently, the network itself can also proactively adjust by taking the traffic demands of higher-layer training tasks into account, thus enhancing goodput. This in turn reduces communication time and accelerates the distributed training process.
[Fig. 5. The communication-efficient five-layer paradigm with (logical) task and flow schedulers as middleware, illustrating "Horizontal" and "Host-Net" co-design.]
"Horizontal" co-design: Primarily focuses on the coordination among different training jobs or various communication tasks. Within a single training job, the traffic from multiple concurrent collective communication primitives competes for network resources. Similarly, in the multi-job scenario, the traffic from numerous concurrent training jobs also competes for network resources. Consequently, optimization strategies tailored to a single communication primitive or training job may suffer varying degrees of inefficacy. Therefore, it is imperative to employ horizontal co-design at different granularities to fully utilize network resources, ultimately reducing the JCT of training tasks. CASSINI [6] attempts horizontal co-design through the perspective of multi-task scheduling. It analyzes the consequences of non-coordinated job scheduling in multi-job scenarios, highlighting the potential traffic conflicts arising from the periodic patterns of network resource utilization by different training jobs. Subsequently, CASSINI introduces a design for "staggering peak" scheduling for different jobs, achieving high utilization of network resources and thus enhancing the performance of multiple training jobs.
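The intuition behind such "staggering peak" scheduling can be seen in a toy offset search (ours, far simpler than CASSINI's actual placement algorithm): if each job alternates compute and communication phases with a fixed period, shifting job start times so that their communication windows interleave removes most of the bandwidth contention.

```python
# A toy illustration of the "staggering peak" idea: brute-force start
# offsets that minimize the time during which more than one job
# occupies the network at once. Periods and phase lengths are made up.
from itertools import product

def comm_overlap(jobs, offsets, horizon=1000):
    """Time units in which two or more jobs communicate simultaneously."""
    overlap = 0
    for t in range(horizon):
        talking = sum(1 for (period, comm_len), off in zip(jobs, offsets)
                      if (t - off) % period < comm_len)
        overlap += talking > 1
    return overlap

jobs = [(10, 4), (10, 4)]          # (iteration period, comm-phase length)
best = min(product(range(10), repeat=len(jobs)),
           key=lambda offs: comm_overlap(jobs, offs))
print("staggered start offsets:", best, "overlap:", comm_overlap(jobs, best))
```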
"Intra-Inter" co-design: Refers to the coordinated utilization of multidimensional network resources, primarily including the "intra-host" and "inter-host" networks. The current networks used in distributed training exhibit substantial heterogeneity. The interconnection patterns between different GPUs within the same host include PCIe or NVLink, while those between different hosts include RoCE or Infiniband, etc. Given that the GPU (rather than the host) is the minimum unit in the communication tasks of distributed training, two modes of GPU-to-GPU communication exist: within the same host and across different hosts. In this context, the issue of heterogeneous link bandwidths within and between hosts is inevitably confronted. An example of "Intra-Inter" co-design is that TACCL [5] utilizes a profiling approach to perceive the characteristics of heterogeneous links, which serves as a basis for optimizing communication primitive algorithms.

"Host-Net" co-design: Primarily refers to the collaboration of computing resources between the end-host and programmable switches within a network. In recent years, with the development of programmable switch technology, in-network aggregation has gradually become a new computing paradigm. In-network aggregation not only fully utilizes the computing capabilities of programmable switches within the network but also reduces the traffic running in the network, achieving multiple purposes in one stroke. However, since existing protocols are designed for end-to-end communication, implementing in-network aggregation usually requires the support of new protocols. For instance, ATP [15] is an in-network aggregation transmission protocol designed for multi-tenant scenarios, which can make full use of the computing capabilities of multi-level programmable switches and degrade to host aggregation when the in-network computing resources are exhausted. This approach also reduces the network traffic for a collective communication task. ATP embodies the concept of "Host-Net" co-design and represents an exemplary work of in-network computing in the field of distributed training. Nevertheless, in-network aggregation has not been widely applied in production environments due to its complexity. "Host-Net" co-design remains a direction worthy of long-term research.

V. CONCLUSION

Due to the dramatic increase in the number of parameters of large-scale deep neural network models, it is imperative to develop effective communication optimization methods for distributed training. In this article, we first provide a general distributed training architecture and the current three-layer paradigm of communication optimization consisting of parallelization strategy, CCL, and network. We then review the latest advances in the current three-layer paradigm. Furthermore, we analyze the shortcomings of the current paradigm and advocate a communication-efficient five-layer paradigm, highlighting great opportunities for "cross-layer" co-design. We hope to shed some light on future research on communication optimization for distributed training.

REFERENCES

[1] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro et al., "Efficient large-scale language model training on GPU clusters using Megatron-LM," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1-15.
[2] W. Wang, M. Khazraee, Z. Zhong, M. Ghobadi, Z. Jia, D. Mudigere, Y. Zhang, and A. Kewitsch, "TopoOpt: Co-optimizing network topology and parallelization strategy for distributed training jobs," in 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 739-767.
[3] Y. Jiang, Y. Zhu, C. Lan, B. Yi, Y. Cui, and C. Guo, "A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters," in 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), 2020, pp. 463-479.
[4] N. Jouppi, G. Kurian, S. Li, P. Ma, R. Nagarajan, L. Nai, N. Patil, S. Subramanian, A. Swing, B. Towles et al., "TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings," in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1-14.
[5] A. Shah, V. Chidambaram, M. Cowan, S. Maleki, M. Musuvathi, T. Mytkowicz, J. Nelson, O. Saarikivi, and R. Singh, "TACCL: Guiding collective algorithm synthesis using communication sketches," in 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 593-612.