Communication Optimization for Distributed Training

Abstract—The past few years have witnessed the flourishing of large-scale deep neural network models with ever-growing parameter numbers. Training such large-scale models typically requires massive memory and computing resources that exceed those of a single GPU, necessitating distributed training. As GPU performance has rapidly evolved in recent years, computation time has shrunk, thereby increasing the proportion of communication in the overall training time. Therefore, optimizing communication for distributed training has become an urgent issue. In this article, we briefly introduce the general architecture of distributed deep neural network training and analyze the relationships among Parallelization Strategy, Collective Communication Library, and Network from the perspective of communication optimization, which forms a three-layer paradigm. We then review current representative research advances within this three-layer paradigm. We find that the layers in the current three-layer paradigm are relatively independent, but there is a rich design space for cross-layer collaborative optimization in distributed training scenarios. Therefore, we further advocate a communication-efficient five-layer paradigm underlining opportunities for collaboration designs and look forward to the perspectives of "Vertical", "Horizontal", "Intra-Inter" and "Host-Net" collaboration designs. We hope this article can shed some light on future research on communication optimization for distributed training.

Index Terms—Deep Neural Network, Distributed Training, Parallelization Strategy, Collective Communication Library, Network Protocols and Topologies.

Y. Wei, T. Hu, C. Liang and Y. Cui are with the Department of Computer Science and Technology, Tsinghua University, Beijing, China. Yong Cui ([email protected]) is the corresponding author.

I. INTRODUCTION

Large-scale deep neural network (DNN) models ("large models" for short) have become ubiquitous, and their capabilities have significantly advanced in recent years. The latest prominent large models such as GPT¹, LLaMA², and GLM³ have demonstrated unprecedented performance, enabling significant changes in production and lifestyle.

¹ https://openai.com/gpt-4
² https://llama.meta.com/llama2
³ https://open.bigmodel.cn

The rapid development of large models has inevitably led to a significant increase in the scale of parameters and training data. Due to the limited memory and computing power of a single GPU (for simplicity, we use GPU to signify other AI hardware such as TPUs and NPUs as well), training large models with a single GPU has already become a thing of the past. For instance, even merely considering the computation time, training a GPT-3 model with 175 billion parameters on a single Nvidia V100 GPU would require approximately 288 years [1]. Consequently, training large models with multiple GPUs is natural and inevitable, creating a new demand for large-scale, high-performance GPU clusters.

However, enhancing GPU performance and enlarging cluster size do not necessarily lead to linear performance improvement of distributed deep neural network training ("distributed training" for short) systems, as one might naturally expect. This is because the time overhead of distributed training comes not only from computation but also from communication. When the time spent on computation is reduced, the communication time is exposed more and gradually becomes a bottleneck. In fact, communication tasks account for up to 60% of a DNN training iteration's time in Meta's production environment [2]. Optimizing communication can significantly reduce the overall training time, so there is an urgent need to enhance communication efficiency for distributed training.

Systematically optimizing distributed training is not a straightforward task. A distributed deep neural network training architecture consists of hardware and software parts. The software part includes the parallelization strategy, deep learning library, collective communication library (CCL), and network protocols, while the hardware part comprises CPUs, GPUs, RAM, I/O, and network infrastructure. Attaining efficiency in distributed training systems necessitates effective collaboration among the various system components [3], [4].

Many components of the distributed training architecture are closely related to communication. For instance, the parallelization strategy determines the communication demand, the CCL generates the communication traffic, and the network affects the efficiency of performing communication tasks. To sum up, there is a "Parallelization Strategy, CCL, and Network" three-layer paradigm of communication optimization for distributed training.

There are many research advances optimizing communication within the current three-layer paradigm. For example, PTD-P [1] uses a novel interleaved pipeline parallelism optimization scheme to overlap communication and computation as much as possible. TACCL [5] generates communication primitive algorithms tailored to specific training tasks and topologies to enhance efficiency. TopoOpt [2] leverages the reconfigurability of optical switches to optimize topologies and parallelization strategies collaboratively, providing customized topologies for efficient communication.

Different from general high-performance computing (HPC) scenarios, distributed training has its own characteristics. For example, the communication traffic of distributed training exhibits periodic repetition, and the traffic pattern of each training iteration is relatively consistent [6]. Moreover, most communication tasks in distributed training are pre-determined, which contrasts significantly with traditional tasks where traffic flows arrive stochastically. Meanwhile, the current three-layer paradigm has its shortcomings. Each layer of the paradigm is relatively independent, making it difficult for the layers to cooperate for communication optimization. Therefore, we advocate a communication-efficient five-layer paradigm, which is detailed in Section IV, aiming to achieve cross-layer collaborative design with (logical) schedulers as middleware. We prospect optimization opportunities from four perspectives: "Vertical", "Horizontal", "Intra-Inter", and "Host-Net" co-design.

The rest of this article is arranged as follows: Section II presents the general architecture of distributed training and the current three-layer paradigm from the perspective of communication optimization. Section III reviews representative research advances on communication optimization for distributed training. Section IV advocates a communication-efficient five-layer paradigm, looking into cross-layer collaborative design opportunities from four promising research directions. Section V concludes the article.
[Fig. 1. General architecture of distributed training.]

[Fig. 2. Four common parallelization strategies: a) Data parallelism; b) Pipeline parallelism; c) Tensor parallelism; d) MoE parallelism.]

[Fig. 3. Four common collective communication primitives: a) Broadcast; b) All-Gather; c) All-to-All; d) All-Reduce.]
II. ARCHITECTURE

This section first provides an overview of the distributed deep neural network training architecture, followed by an in-depth introduction of three components pertinent to communication optimization: parallelization strategy, collective communication library, and network. Finally, the relationships between these three components and communication optimization are elaborated.

A. Overview

The general architecture of the distributed deep neural network training system is shown in Fig. 1. At the top of the architecture is a large deep neural network model to be trained. As analyzed in Section I, a deep learning model needs distributed training when it is too large to fit into the memory of a single GPU or too slow to train with a single GPU.

Parallelization strategy is an essential part of distributed training deployment, determining how a model is partitioned for distributed training. Commonly used parallelization strategies are detailed in Section II-B.

Deep learning models are often implemented with a deep learning library (such as TensorFlow⁴ and PyTorch⁵), which generates execution task graphs including computing tasks and corresponding collective communication tasks. Communication compression techniques such as gradient quantization are more relevant to training methods and are not detailed in this article due to the limited space.

The deep learning library often invokes the Collective Communication Library to implement collective communication tasks that transmit activations (in forward propagation) or synchronize gradients (in backward propagation) between different GPUs. Commonly used collective communication primitives implemented in CCL are detailed in Section II-C.

The collective communication primitives generate the actual communication data traffic, which is injected into the underlying network. A variety of network protocols and topologies for distributed training are detailed in Section II-D. As the end-host of the underlying network, a computing node is a supercomputer equipped with high-performance GPUs and CPUs, a large RAM, and high-speed I/O. These hardware components also need to collaborate efficiently.

Parallelization strategy, CCL, and network are three critical components in the architecture that affect communication efficiency. The interplay among them and their impacts on communication efficiency are analyzed in Section II-E.

⁴ https://www.tensorflow.org
⁵ https://pytorch.org
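To make this division of labor concrete, the following minimal sketch (ours, not taken from any surveyed system) shows the software path just described: the deep learning library produces local gradients as a computing task and then invokes the CCL — here NCCL through PyTorch's torch.distributed interface — to synchronize them with an All-Reduce communication task. It assumes a launch with torchrun on GPU-equipped hosts so that the rank, world-size, and local-rank environment variables are already set.

```python
# Minimal sketch: the deep learning library produces local gradients
# (computing task), then hands a collective communication task to the
# CCL (NCCL via torch.distributed). Assumes a torchrun launch so that
# RANK/WORLD_SIZE/LOCAL_RANK are set in the environment.
import os
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module) -> None:
    """Backward-propagation communication: average gradients across GPUs."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # CCL primitive
            p.grad.div_(world_size)

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")                # CCL layer
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
    model = torch.nn.Linear(1024, 1024).cuda()             # toy model
    model(torch.randn(32, 1024, device="cuda")).sum().backward()  # compute
    sync_gradients(model)                                  # communicate
    dist.destroy_process_group()
```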
B. Common Parallelization Strategies

Commonly used parallelization strategies include data parallelism, model parallelism (primarily including pipeline parallelism and tensor parallelism), and the emerging Mixture-of-Expert (MoE) parallelism, as illustrated by Fig. 2, where each rectangle refers to a GPU node and each rounded rectangle refers to (part of) a model.

Data parallelism is one of the most commonly used parallelization strategies, whose basic idea is to distribute multiple copies of a model to different GPUs. Each GPU receives a subset of the training data (a mini-batch) during training, and gradients are averaged to update the global parameters at the end of each iteration. Model parallelism, including pipeline and tensor parallelism, means splitting the model onto different GPUs. Pipeline parallelism allocates different layers of the model to different GPUs, so there is mainly point-to-point communication between layers (GPUs). Tensor parallelism splits the same layer of the model onto different GPUs and uses distributed matrix computing techniques for collaboration [7], which is a communication-intensive operation. MoE parallelism involves dividing a portion of a model into multiple expert components. Each expert specializes in a specific task domain and is allocated to a certain GPU.

In practice, the above parallelization strategies are often not used independently, but in a hybrid manner. For example, we can first divide the model by layers and split each layer, then distribute the divided model to multiple sets of GPUs, thereby achieving pipeline-tensor-data three-dimensional hybrid parallelism [1]. MoE parallelism is another example of hybrid parallelism: the idea of data parallelism is implicit in MoE parallelism, and the expert model itself can also be split across multiple GPUs for parallel computing.
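As a concrete illustration of hybrid parallelism, the sketch below partitions a cluster's ranks into tensor-, pipeline-, and data-parallel process groups with torch.distributed. The (dp, pp, tp) sizes and the rank layout are assumptions chosen for illustration; production systems such as Megatron-LM and PTD-P implement considerably richer group machinery.

```python
# A simplified sketch (ours) of carving ranks into pipeline-tensor-data
# ("3D") hybrid-parallel process groups. Every process calls new_group
# with the same rank lists in the same order, as required.
import torch.distributed as dist

def build_3d_groups(tp: int, pp: int, dp: int):
    """Split ranks [0, tp*pp*dp) into tensor/pipeline/data process groups."""
    assert dist.get_world_size() == tp * pp * dp
    rank = dist.get_rank()
    tp_group = pp_group = dp_group = None
    # Ranks are laid out as a (dp, pp, tp) grid, tensor dimension fastest.
    for d in range(dp):
        for p in range(pp):                      # tensor-parallel groups
            ranks = [d * pp * tp + p * tp + t for t in range(tp)]
            g = dist.new_group(ranks)
            if rank in ranks:
                tp_group = g
    for d in range(dp):
        for t in range(tp):                      # pipeline-parallel groups
            ranks = [d * pp * tp + p * tp + t for p in range(pp)]
            g = dist.new_group(ranks)
            if rank in ranks:
                pp_group = g
    for p in range(pp):
        for t in range(tp):                      # data-parallel groups
            ranks = [d * pp * tp + p * tp + t for d in range(dp)]
            g = dist.new_group(ranks)
            if rank in ranks:
                dp_group = g
    return tp_group, pp_group, dp_group
```

Each group then carries a different kind of traffic: tensor-parallel groups host the communication-intensive intra-layer collectives, pipeline-parallel groups exchange point-to-point activations and gradients, and data-parallel groups run the gradient All-Reduce.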
C. Common Collective Communication Primitives

Commonly used collective communication primitives implemented in CCL for distributed training are illustrated in Fig. 3, where each rectangle represents a GPU node and each rounded rectangle represents a data chunk.

Broadcast distributes data from a particular node to all other nodes, which can be used in data parallelism and some model parallelism schemes [8]. All-Gather is a many-to-many collective communication primitive where data from different nodes are distributed to all nodes. All-to-All transmits data among various nodes, such as the data distribution in MoE parallelism [9], [10]. All-Reduce is a sum operation over the corresponding data chunks of each node, as shown in Fig. 3(d), where each grey rounded rectangle represents the aggregated result of the data chunks at the corresponding position of each GPU node. Common All-Reduce scenarios include data parallelism and some model parallelism such as Megatron-lm [7].
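The sketch below exercises these four primitives through torch.distributed, which dispatches them to the underlying CCL (NCCL assumed). It is an illustrative example of ours, assuming one GPU per process and a torchrun launch; the tensor shapes are arbitrary.

```python
# An illustrative sketch (ours) of the four primitives in Fig. 3.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

# a) Broadcast: rank 0's data chunk is copied to every other node.
x = torch.full((4,), float(rank), device="cuda")
dist.broadcast(x, src=0)

# b) All-Gather: every node ends up with the chunks of all nodes.
gathered = [torch.empty(4, device="cuda") for _ in range(world)]
dist.all_gather(gathered, x)

# c) All-to-All: node i sends its j-th chunk to node j (e.g., MoE dispatch).
send = [t.contiguous() for t in
        torch.full((world, 4), float(rank), device="cuda").unbind(0)]
recv = [torch.empty(4, device="cuda") for _ in range(world)]
dist.all_to_all(recv, send)

# d) All-Reduce: element-wise sum over all nodes, result left on every node.
y = torch.full((4,), float(rank), device="cuda")
dist.all_reduce(y, op=dist.ReduceOp.SUM)

dist.destroy_process_group()
```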
D. Underlying Network

Protocols and topologies are the two main factors that affect network performance. Besides the general TCP/IP protocol, distributed training often uses RDMA (such as RoCE or Infiniband) for lower overhead and higher bandwidth.

The traffic flow within the network is also closely related to the network topology. Common network topologies used for distributed training include Fat-tree and its variants, Torus, as well as Ring and Full-Mesh topologies. These topologies can be combined according to practical needs. For example, the NVLink topology of Nvidia's DGX-1⁶ is a combination of Ring and Full-Mesh.

⁶ https://images.nvidia.com/content/pdf/dgx1-system-architecture-whitepaper1.pdf

E. Communication Paradigm in the Architecture

Parallelization strategy, CCL, and network form a three-layer paradigm of communication optimization for distributed training, as shown by the colored rectangles in Fig. 1. All of them directly or indirectly affect the communication performance of the training process. Different parallelization strategies determine the primarily used collective communication primitives in task graphs, which affects the traffic pattern. Various implementations of CCL directly affect the actual network traffic. The same traffic demand often exhibits different performance under different network infrastructures. The topologies designed for distributed training are closely related to the algorithm implementation of collective communication primitives. For example, the execution process of the Ring-based All-Reduce algorithm results in a ring communication mode. That is, each node transmits data to its logical neighbors while all the nodes in the communicator (a communicator is a set of nodes used to implement a collective communication task) form a ring. Nvidia's DGX-1 and Google's Torus [4] topology both contain numerous ring structures, which are well-suited to satisfy the communication needs of Ring-based All-Reduce.

However, the layers are relatively independent in the current three-layer communication paradigm. Although some methods introduced in Section III involve cross-layer design, collaborative design has not yet become mainstream in general, which limits the end-to-end training performance [6]. Therefore, we advocate a communication-efficient five-layer paradigm with additional (logical) schedulers (dashed boxes in Fig. 1) over the current architecture. The inter-layer schedulers imply collaborative design, which is elaborated in Section IV.
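To make the ring communication mode tangible, the following toy, single-process simulation (ours, not a real CCL implementation) mimics Ring-based All-Reduce on a logical ring: every node repeatedly passes one chunk to its neighbor, first reduce-scattering partial sums and then all-gathering the fully reduced chunks.

```python
# A toy, single-process simulation of Ring-based All-Reduce: in every
# step each node passes exactly one chunk to its logical neighbor
# (rank + 1) mod N.
def ring_all_reduce(data):
    """data[rank][chunk] -> value; returns data with every chunk summed."""
    n = len(data)                         # nodes in the communicator
    # Phase 1: reduce-scatter. After n-1 steps, node r holds the full
    # sum of chunk (r + 1) % n.
    for step in range(n - 1):
        for rank in range(n):
            chunk = (rank - step) % n
            data[(rank + 1) % n][chunk] += data[rank][chunk]
    # Phase 2: all-gather. The reduced chunks circulate around the ring.
    for step in range(n - 1):
        for rank in range(n):
            chunk = (rank + 1 - step) % n
            data[(rank + 1) % n][chunk] = data[rank][chunk]
    return data

if __name__ == "__main__":
    nodes = [[float(r)] * 4 for r in range(4)]   # 4 nodes, 4 chunks each
    print(ring_all_reduce(nodes))                # every row becomes [6.0]*4
```

All 2(N-1) steps use only neighbor-to-neighbor transfers, which is why ring-rich topologies such as the DGX-1 NVLink interconnect or a Torus match this primitive so well.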
III. OVERVIEW OF RECENT ADVANCES

Communication optimization in distributed training encompasses numerous aspects. We delve into three of the most pertinent elements: parallelization strategy, CCL, and network, which make up the current three-layer communication paradigm. We use exposed communication to denote the communication time that cannot overlap with computation time. In this section, our focus is confined to each study's contribution to communication optimization, that is, how it reduces exposed communication in distributed training. Table I provides an overview of the research advances with some primary concerns.

[TABLE I. Advances on communication optimization in distributed deep neural network training.]
A. Parallelization Strategy

The communication overhead in distributed training mainly arises from parallelization. Different implementations of parallelization strategies affect the communication pattern and network traffic during the model training process [2].

The most straightforward and most widely used parallelization strategy is data parallelism. As mentioned in Section II, multiple copies of the same model are replicated across different nodes in data parallelism. Each iteration requires all nodes to synchronize gradients, which are used for global parameter updates. Consequently, data parallelism results in a significant amount of All-Reduce communication.

To train a large-scale model that cannot fit into the memory of a single GPU, researchers introduced model parallelism, which includes pipeline and tensor parallelism. Tensor parallelism divides models within a layer and requires intensive communication in training. Researchers from Nvidia proposed an efficient "intra-layer parallelism" approach called Megatron-lm [7] to reduce the communication traffic in tensor parallelism. Pipeline parallelism necessitates only the transfer of activations/gradients between adjacent nodes in the logical pipeline, which is a point-to-point communication pattern that significantly reduces the communication cost compared to data parallelism and tensor parallelism. Because overlapping communication and computation is hard to achieve in naive pipeline parallelism, an interleaved pipeline scheduling strategy [1] was proposed where each device is assigned multiple data chunks to gain the chance of communication-computation overlap. However, pipeline parallelism, as an "inter-layer parallelism" approach, is not without its flaws. It faces issues such as "synchronous/asynchronous" parameter updates and pipeline bubble problems, which can adversely affect the accuracy of the model and the efficiency of training [7].
Various parallelization strategies are often used in conjunction. For example, PTD-P [1] not only combines Megatron-lm's tensor parallelism and pipeline parallelism but also introduces data parallelism to scale to thousands of GPUs. AlpaComm [8] identifies the "cross-mesh resharding" problem that arises when combining tensor and pipeline parallelism and proposes a communication optimization method based on broadcasting and an "overlapping-friendly" pipeline scheduling scheme to accelerate end-to-end training.

As an emerging model structure, the MoE model is gradually gaining popularity due to its fast training/inference speed. The training process of the MoE model selects several appropriate experts for a data sample, which inherently has good compatibility with parallelization. MoE parallelism can be achieved by simply dispersing the "experts" on different GPUs, introducing All-to-All traffic when data is distributed between different GPUs. Additionally, the model parameters of the non-expert part are globally shared in MoE parallelism, so there will also be All-Reduce traffic. Lina [9] uses a scheduling strategy that prioritizes All-to-All traffic and further splits the All-Reduce tasks to overlap with computation as much as possible, thus reducing exposed communication and achieving training acceleration. Janus [10] studies MoE parallelism acceleration from a new perspective. Since the size of expert models and training datasets varies, Janus innovatively proposes a "data-centric" model that "moves experts instead of data". This significantly reduces the amount of communication during the training process under certain conditions (e.g., when the scale of experts is smaller than the scale of data), thus accelerating the training process.
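The common thread in these systems is overlapping communication with computation. The sketch below illustrates the general idea in PyTorch (it is our simplification, not Lina's or PTD-P's actual mechanism): a large gradient All-Reduce is split into chunks issued asynchronously, and the handles are awaited only when the synchronized gradients are needed, leaving room for other computation in between. It assumes torch.distributed has already been initialized with the NCCL backend and that grad is a flat, contiguous gradient buffer.

```python
# A minimal sketch of communication-computation overlap via chunked,
# asynchronous All-Reduce (not any surveyed system's implementation).
import torch
import torch.distributed as dist

def overlapped_all_reduce(grad: torch.Tensor, num_chunks: int = 4):
    """Launch chunked, asynchronous All-Reduce; return handles to wait on."""
    return [dist.all_reduce(c, op=dist.ReduceOp.SUM, async_op=True)
            for c in grad.chunk(num_chunks)]

def train_step(grad: torch.Tensor, compute_fn):
    handles = overlapped_all_reduce(grad)   # communication starts here...
    compute_fn()                            # ...while computation continues
    for h in handles:                       # block only when results needed
        h.wait()
    grad.div_(dist.get_world_size())        # average across replicas
```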
B. Collective Communication Library

Communication tasks of distributed training are often implemented by CCL. The current predominant NVIDIA Collective Communication Library (NCCL)⁷ dynamically selects established primitive algorithms based on different situations. For instance, NCCL selects the Ring and Double Binary Tree algorithms for All-Reduce⁸ for different workloads and topologies to accelerate the training process. However, NCCL's optimization for specific hardware underscores the need for a more universal and adaptable CCL to speed up distributed training across different hardware and topologies. Therefore, many researchers took different approaches to implement communication primitives more efficiently.

For example, Blink [11] dynamically generates optimal communication primitives by packing spanning trees rather than writing and optimizing them manually. Blink uses integer linear programming (ILP) to reduce the number of trees and improve performance, and also leverages heterogeneous communication channels to utilize the full bandwidth. At the time, Blink achieved up to 40% speedup compared to NCCL in end-to-end DNN training iterations.

SCCL [12] takes a more unified approach to achieve efficiency: automatically synthesizing high-performance communication primitives for a given topology. To achieve this, SCCL designs a cost model to evaluate the latency and bandwidth cost of algorithms and then searches for algorithms on the Pareto frontier. However, SCCL also faces problems of high complexity. The communication primitive algorithm generation problem is encoded into a Mixed Integer Linear Program (MILP), which is unfortunately NP-hard.

To reduce the complexity of generating collective communication algorithms, TACCL [5] is a representative work that introduces a "human-in-the-loop" approach, whose paradigm is shown in Fig. 4. TACCL incorporates high-level inputs from an algorithm designer, such as logical topologies, switch hyper-edges, and algorithm symmetry, to efficiently synthesize collective communication algorithms for heterogeneous topologies. These human inputs greatly constrain the search space of algorithms, reducing search time to an acceptable amount, in most cases from seconds to a few minutes. Compared with NCCL, TACCL achieves up to 2.36x end-to-end training speedup for BERT and 1.94x for Transformer-XL.

[Fig. 4. TACCL's novel synthesizer takes as input a communication sketch, profiled topology, and target collective along with synthesizer hyperparameters to generate an algorithm for the collective. The synthesized algorithm is implemented in the hardware cluster using TACCL's backend [5].]

Different from previous methods focused on speeding up single primitives, SYNDICATE [13] proposes a novel abstraction to overlap primitives in a fine-grained pattern, breaking large communication operations into smaller units, thus allowing higher flexibility. SYNDICATE then develops a joint optimizer based on Markov Chain Monte Carlo (MCMC) search to optimize the joint action of the control plane and the data plane.

⁷ https://developer.nvidia.com/nccl
⁸ https://developer.nvidia.com/blog/massively-scale-deep-learning-training-nccl-2-4
C. Network

Distributed training focuses on GPU-to-GPU communication, which needs to consider both intra-host and inter-host communication protocols. Common intra-host communication protocols include the general PCIe protocol and Nvidia's NVLink/NVSwitch. Inter-host communication protocols include traditional TCP/IP and the more efficient RDMA (e.g., RoCE and Infiniband).

Based on certain communication protocols, the underlying network topology also significantly influences the communication efficiency of distributed training. For instance, as mentioned in Section II, Nvidia utilizes a Ring and fully connected Mesh topology in their DGX series, and Google employs a 3D-Torus topology in their TPUv4 [4]. These topologies contain a large number of ring structures, which enhance the efficiency of Ring-based All-Reduce and point-to-point communications (such as pipeline parallelism or broadcast-based hybrid parallelism [8]). However, static topologies can rarely adapt to dynamically changing job requirements, leading researchers to propose TopoOpt [2]. TopoOpt takes advantage of the reconfigurability of optical switches, enabling a scheme that co-optimizes the topology and parallelization strategy. Nevertheless, current optical switching technology still has issues of high reconfiguration latency, making it challenging to perform dynamic topology adjustments between iterations. As such, TopoOpt remains a relatively coarse-grained optimization, that is, reconfiguring the interconnection between the training servers of each job only before the job starts and keeping the topology until the training is complete.

Overall, research on communication optimization for distributed training from a network perspective is still insufficient, which is prospected in Section IV.

D. Collaboration Design in Current Advances

As observed in Table I, the research advances mentioned above are primarily limited to optimizations within a layer, with less cross-layer co-design. Current advances involving cross-layer optimization can be categorized into two types. The first type is passive collaborative design. For instance, AlpaComm [8] and Lina [9] optimize communication for a specific parallelization strategy, necessitating modifications to communication primitives. Another example is that generative CCLs (such as Blink [11], SCCL [12] and TACCL [5]) have to be aware of network topologies because their main significance is to customize collective communication algorithms for specific topologies. The second type of work involves proactive attempts at collaborative optimization yet remains exploratory. For example, SYNDICATE [13] jointly optimizes the implementation and scheduling of communication primitives, allowing parallel execution of primitives. TopoOpt [2] collaboratively optimizes network topology and parallelization strategies, obtaining optimized combination strategies through alternating optimization techniques.

In summary, collaborative communication optimization in distributed training is still a domain that needs further exploration. Section IV details the importance of co-design and potential research opportunities.

IV. OPPORTUNITIES

As noted in Section III, collaborative design has not received widespread attention in current research advances. However, different from general tasks in data centers, the communication traffic patterns in distributed training are known in most cases [6], [14], providing motivations and opportunities for cross-layer collaboration. Furthermore, the optimization objectives for communication tasks are no longer the general-purpose flow completion time (FCT) but rather the job completion time (JCT). In this context, communication tasks are merely a component of the entire distributed training process. Optimizing JCT necessitates considering the dependency between communication and computation instead of just optimizing each communication task. Communication tasks should be optimized to minimize computation blocking time, which can only be achieved through "cross-layer" scheduling to meet this new optimization objective.

To this end, we advocate a communication-efficient five-layer paradigm with (logical) schedulers as the middleware, as illustrated in Fig. 5. The schedulers not only allow lower-layer facilities to perceive the requirements of higher-layer applications but also enable higher-layer applications to be aware of the capabilities of lower-layer facilities. We refer to this as "Vertical" co-design. The schedulers can also perform global scheduling across multiple jobs, achieving "Horizontal" collaborative optimization. Considering the heterogeneous topologies of distributed training and the development of emerging technologies such as in-network aggregation, we also provide a prospect for "Intra-Inter" and "Host-Net" co-design.

"Vertical" co-design: Refers to exchanging necessary information between layers to optimize communication efficiency. For example, the CCL can acquire the global traffic demand from the parallelization strategy and generate the optimal communication primitive algorithms in conjunction with the network state. Furthermore, cross-layer schedulers can be constructed to strategically schedule communication tasks based on traffic load or optimize task graphs according to the traffic patterns of relevant communication primitives. Echelon [14] is a flow scheduling work that embodies "Vertical" co-design, where novel communication optimization objectives are designed based on the dependency between communication and computation under various parallelization strategies, aiming to optimize communication towards minimizing JCT.

Another example of "Vertical" co-design is that the network can also use traffic demand for optimization in distributed training. As the general infrastructure, the network is often unaware of traffic demand. This is natural in common data center scenarios, where the network can only implement passive adjustments such as RTT-based congestion control due to the random nature of flow arrivals. However, traffic patterns are often predictable in distributed training scenarios [6]. Consequently, the network itself can also proactively adjust by taking the traffic demands of higher-layer training tasks into account, thus enhancing goodput. This in turn reduces communication time and accelerates the distributed training process.
[Fig. 5. The communication-efficient five-layer paradigm with (logical) task and flow schedulers as middleware, illustrating "Horizontal" and "Host-Net" co-design.]
"Horizontal" co-design: Primarily focuses on the coordination among different training jobs or various communication tasks. Within a single training job, the traffic from multiple concurrent collective communication primitives competes for network resources. Similarly, in the multi-job scenario, the traffic from numerous concurrent training jobs also competes for network resources. Consequently, optimization strategies tailored to a single communication primitive or training job may suffer varying degrees of inefficacy. Therefore, it is imperative to employ horizontal co-design at different granularities to fully utilize network resources, ultimately reducing the JCT of training tasks. CASSINI [6] attempts horizontal co-design through the perspective of multi-task scheduling. It analyzes the consequences of non-coordinated job scheduling in multi-job scenarios, highlighting the potential traffic conflicts arising from the periodic patterns of network resource utilization by different training jobs. Subsequently, CASSINI introduces a design for "staggering peak" scheduling for different jobs, achieving high utilization of network resources and thus enhancing the performance of multiple training jobs.
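The intuition behind such "staggering peak" scheduling can be seen in a toy offset search (ours, far simpler than CASSINI's actual placement algorithm): if each job alternates compute and communication phases with a fixed period, shifting job start times so that their communication windows interleave removes most of the bandwidth contention.

```python
# A toy illustration of the "staggering peak" idea: brute-force start
# offsets that minimize the time during which more than one job
# occupies the network at once. Periods and phase lengths are made up.
from itertools import product

def comm_overlap(jobs, offsets, horizon=1000):
    """Time units in which two or more jobs communicate simultaneously."""
    overlap = 0
    for t in range(horizon):
        talking = sum(1 for (period, comm_len), off in zip(jobs, offsets)
                      if (t - off) % period < comm_len)
        overlap += talking > 1
    return overlap

jobs = [(10, 4), (10, 4)]          # (iteration period, comm-phase length)
best = min(product(range(10), repeat=len(jobs)),
           key=lambda offs: comm_overlap(jobs, offs))
print("staggered start offsets:", best, "overlap:", comm_overlap(jobs, best))
```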
"Intra-Inter" co-design: Refers to the coordinated utilization of multidimensional network resources, primarily including the "intra-host" and "inter-host" networks. The current networks used in distributed training exhibit substantial heterogeneity. The interconnection patterns between different GPUs within the same host include PCIe or NVLink, while those between different hosts include RoCE or Infiniband, etc. Given that the GPU (rather than the host) is the minimum unit in the communication tasks of distributed training, two modes of GPU-to-GPU communication exist: within the same host and across different hosts. In this context, the issue of heterogeneous link bandwidths within and between hosts is inevitably confronted. An example of "Intra-Inter" co-design is that TACCL [5] utilizes a profiling approach to perceive the characteristics of heterogeneous links, which serves as a basis for optimizing communication primitive algorithms.

"Host-Net" co-design: Primarily refers to the collaboration of computing resources between the end-host and programmable switches within a network. In recent years, with the development of programmable switch technology, in-network aggregation has gradually become a new computing paradigm. In-network aggregation not only fully utilizes the computing capabilities of programmable switches within the network but also reduces the traffic running in the network, achieving multiple purposes in one stroke. However, since existing protocols are designed for end-to-end communication, implementing in-network aggregation usually requires the support of new protocols. For instance, ATP [15] is an in-network aggregation transmission protocol designed for multi-tenant scenarios, which can make full use of the computing capabilities of multi-level programmable switches and degrade to host aggregation when the in-network computing resources are exhausted. This approach also reduces the network traffic for a collective communication task. ATP embodies the concept of "Host-Net" co-design and represents an exemplary work of in-network computing in the field of distributed training. Nevertheless, in-network aggregation has not been widely applied in production environments due to its complexity. "Host-Net" co-design remains a direction worthy of long-term research.

V. CONCLUSION

Due to the dramatic increase in the number of parameters of large-scale deep neural network models, it is imperative to develop effective communication optimization methods for distributed training. In this article, we first provide a general distributed training architecture and the current three-layer paradigm of communication optimization consisting of parallelization strategy, CCL, and network. We then review the latest advances in the current three-layer paradigm. Furthermore, we analyze the shortcomings of the current paradigm and advocate a communication-efficient five-layer paradigm, highlighting great opportunities for "cross-layer" co-design. We hope to shed some light on future research on communication optimization for distributed training.

REFERENCES

[1] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro et al., "Efficient large-scale language model training on GPU clusters using Megatron-LM," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1-15.
[2] W. Wang, M. Khazraee, Z. Zhong, M. Ghobadi, Z. Jia, D. Mudigere, Y. Zhang, and A. Kewitsch, "TopoOpt: Co-optimizing network topology and parallelization strategy for distributed training jobs," in 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 739-767.
[3] Y. Jiang, Y. Zhu, C. Lan, B. Yi, Y. Cui, and C. Guo, "A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters," in 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), 2020, pp. 463-479.
[4] N. Jouppi, G. Kurian, S. Li, P. Ma, R. Nagarajan, L. Nai, N. Patil, S. Subramanian, A. Swing, B. Towles et al., "TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings," in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1-14.
[5] A. Shah, V. Chidambaram, M. Cowan, S. Maleki, M. Musuvathi, T. Mytkowicz, J. Nelson, O. Saarikivi, and R. Singh, "TACCL: Guiding collective algorithm synthesis using communication sketches," in 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 593-612.