
Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training

arXiv:2110.14883v3 [cs.LG] 5 Oct 2023

Shenggui Li (HPC-AI Technology Inc., Singapore), Hongxin Liu (HPC-AI Technology Inc., China), Zhengda Bian (HPC-AI Technology Inc., China), Jiarui Fang (HPC-AI Technology Inc., China), Haichen Huang (HPC-AI Technology Inc., China), Yuliang Liu (HPC-AI Technology Inc., China), Boxiang Wang (HPC-AI Technology Inc., Singapore), Yang You (National University of Singapore, Singapore)

ABSTRACT
The success of Transformer models has pushed the deep learning model scale to billions of parameters, but the memory limitation of a single GPU has led to an urgent need for training on multi-GPU clusters. However, the best practice for choosing the optimal parallel strategy is still lacking, as it requires domain expertise in both deep learning and parallel computing. The Colossal-AI system addresses this challenge by introducing a unified interface that scales sequential model training code to distributed environments. It supports parallel training methods such as data, pipeline, tensor, and sequence parallelism, and is integrated with heterogeneous training and the zero redundancy optimizer. Compared to the baseline systems, Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.

CCS CONCEPTS
• Computing methodologies → Machine learning algorithms; Parallel computing methodologies.

KEYWORDS
datasets, neural networks, gaze detection, text tagging

ACM Reference Format:
Shenggui Li, Hongxin Liu, Zhengda Bian, Jiarui Fang, Haichen Huang, Yuliang Liu, Boxiang Wang, and Yang You. 2023. Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training. In 52nd International Conference on Parallel Processing (ICPP 2023), August 7–10, 2023, Salt Lake City, UT, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3605573.3605613

1 INTRODUCTION
Deep learning has been successful in many applications and has brought breakthroughs in difficult problems. With large amounts of data, neural networks such as BERT [8] and Vision Transformer [9] are capable of learning high-dimensional features and making predictions on a level even humans cannot match. As powerful hardware becomes available, neural networks have more diverse architectures and larger numbers of parameters. The AI community has seen a trend of deep learning models becoming larger, with an array of large-scale models ranging from BERT-Large and GPT-2 [28] (1.5 billion parameters) to GPT-3 [5] (175 billion parameters) and GLM [10] (1.75 trillion parameters). These large-scale models require more data and computing resources but also have better generality and performance. As more robust computing hardware and larger datasets become available, the trend is expected to continue, traditional training methods will become less effective, and distributed training will become necessary for large-scale model training.

The limited fast memory of commonly used accelerator hardware, such as the GPU, is a bottleneck for scaling models to billions of parameters. The memory consumption in deep learning comes from model parameters, layer activations, gradients, and optimizer states. We refer to model parameters, gradients, and optimizer states as model data, and to layer activations as non-model data. When training with adaptive optimizers [11, 18], the total memory consumption of model data can be several times larger than that consumed by the parameters alone, making a single GPU no longer sufficient for large-scale model training. 10 billion parameters in FP16 format already consume 20 GB of model memory, while a typical GPU has only 16 or 32 GB of memory. Without any optimization, training a model of 10 billion parameters with even one data sample can cost more than 80 GB of memory, which is far more than a typical GPU provides.

Data parallelism [31] has scaled models such as ResNet [14] to multi-GPU training, and other methods such as activation checkpointing [7] were proposed to reduce the non-model data by trading computation for memory. However, these methods fail to cope with billion-parameter model data. Parallelization techniques such as pipeline parallelism [15, 24] and tensor parallelism [32] were explored to shard the model data, making it possible to train models at a larger scale.
The current state-of-the-art systems which provide a solution to the scaling challenge include GShard [19], FairScale [2], Megatron-LM [26], and DeepSpeed [29]. Among these systems, Megatron-LM and DeepSpeed are the most popular in the open-source community and deliver the best performance; thus, they are chosen as the baselines of our experiments. Megatron-LM trains Transformer-based models by utilizing optimized pipeline and tensor parallelism. Meanwhile, DeepSpeed proposed an efficient method to partition the model-related data to fully eliminate memory redundancy in data parallel training. These two efficient methods paved the way to scale model training to hundreds of devices and billions of parameters.

As most deep learning engineers and researchers are used to writing non-distributed code, it is reasonably difficult for them to adapt to parallel and distributed programming. The existing systems either introduce extra complexity in parallelizing the model training or offer insufficient parallelization techniques. We have thus developed Colossal-AI, an open-source system that democratizes complicated distributed training in the AI community by unifying an array of training acceleration techniques in one deep learning system. In this system, we also include novel parallelism methods such as multi-dimensional tensor parallelism and sequence parallelism. Colossal-AI aims to make distributed training easy by providing user-friendly APIs while allowing users to maintain their coding habit of writing single-node programs. In a nutshell, we bring the following major contributions to large-scale distributed training in this work:

• Colossal-AI is a unified deep learning system that provides the fullest set of acceleration techniques for the AI community. With its modular design, as shown in Figure 1, Colossal-AI allows free combination of these techniques to achieve the best training speedup. The details of the system architecture are discussed in the Implementation section.
• Optimized parallelism and heterogeneous training methods are provided in Colossal-AI. These methods achieve better system performance than the baseline systems and are exposed to the user via friendly APIs with minimal code changes.
• An in-depth analysis was conducted to investigate the suitable parallelism strategies under different hardware conditions.

Figure 1: Architecture of Colossal-AI (figure: a parallel context manager; distributed operators for 1D, 2D, 2.5D, 3D, and sequence parallelism; acceleration modules for mixed precision, offload, model sharding, and optimizer sharding; and an execution engine with schedules and hooks)

2 BACKGROUND
Thanks to the advent of the Transformer architecture, deep learning models have gained unprecedented performance improvements in domains such as Computer Vision and Natural Language Processing. The typical architecture of a Transformer layer consists of a Multi-Head Attention block and a Feed Forward block, as shown in Figure 2. This architecture can scale to billions of parameters, and larger models can deliver more impressive performance improvements. For example, GPT-3 [5] outperforms smaller models by an 18% absolute increase in prediction accuracy on the LAMBADA language task [27].

Figure 2: Architecture of the Transformer layer (figure: embedding, QKV projection, scaled dot-product attention, output projection, and a feed-forward block, each followed by an add-and-normalize step)

To cope with the increasing model size, AI engineers have explored distributed training in pursuit of lower time costs. Various techniques were proposed to accelerate distributed training, and they are discussed below.

2.1 Data Parallelism
Data parallelism is the most common parallelism technique due to its simplicity. In data parallel training, the model is replicated across the devices and the dataset is split into several shards. Each dataset shard is fed to the model on one device, as shown in Figure 3a. Collective communication is required to synchronize the parameter gradients after backward propagation [22]. Data parallelism makes it easy to train a model on multiple devices and scales sub-linearly with the number of devices.

One problem of data parallelism is that each device holds a copy of the model parameters, optimizer states, and gradients, leading to memory redundancy. When using stateful optimizers such as Adam [17], the optimizer states (i.e. momentum and variance) can occupy a memory space three times larger than that occupied by the model parameters [12, 17]. To eliminate such redundancy, the Zero Redundancy Optimizer was proposed in DeepSpeed [29] to partition this redundant model data over different devices during data parallel training. As each device only holds a partition of the gradients, optimizer states, and parameters, it only updates its partition of the parameters instead of the full model parameters.
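To make the data-parallel workflow concrete, the sketch below uses plain PyTorch DistributedDataParallel, which performs exactly the gradient all-reduce described above. The model, dataset, and hyperparameters are placeholders chosen for illustration and are not part of Colossal-AI's API.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def train_data_parallel():
    # One process per GPU; torchrun sets RANK / WORLD_SIZE / LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())

    # The model is replicated on every device.
    model = torch.nn.Linear(1024, 1024).to(device)
    ddp_model = DDP(model, device_ids=[device.index])
    optimizer = torch.optim.Adam(ddp_model.parameters(), lr=1e-3)

    # Each rank sees a different shard of the dataset.
    dataset = TensorDataset(torch.randn(4096, 1024), torch.randn(4096, 1024))
    loader = DataLoader(dataset, batch_size=32, sampler=DistributedSampler(dataset))

    for data, target in loader:
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(ddp_model(data), target)
        loss.backward()   # DDP all-reduces the gradients here
        optimizer.step()  # every replica applies the same update

if __name__ == "__main__":
    # launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
    train_data_parallel()

A ZeRO-style optimizer keeps the same training loop but additionally partitions the optimizer states, gradients, and parameters across the ranks instead of replicating them.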
2.2 Model Parallelism
To go beyond data parallel training, more techniques were explored to shard the model parameters over a larger number of devices. As a result, model parallelism was proposed to tackle this problem. There are generally two types of model parallelism: tensor parallelism and pipeline parallelism.

Figure 3: Existing parallelism for distributed training (figure: (a) data parallel, (b) tensor parallel, (c) pipeline parallel, each shown across GPU 0 and GPU 1)

1) Tensor Parallelism
Tensor parallelism shards a tensor over an array of devices and requires a distributed matrix-matrix multiplication algorithm for arithmetic computation, as shown in Figure 3b. Megatron-LM [32] proposed 1D tensor parallelism, which splits a linear layer along the row or column dimension for the Transformer architecture [35]. More advanced tensor sharding mechanisms such as 2D [39], 2.5D [36], and 3D [4] were proposed to shard tensors in more dimensions. Collective communication is required among devices to ensure arithmetic correctness.

In Megatron-LM, the tensors are sharded in one dimension. Taking the Feed Forward module of the Transformer layer as an example, we can view the module as a matrix multiplication Y = W2 W1 X as shown in Figure 4, where X is the input and W1 and W2 are the model parameters. W1 and W2 can be sharded vertically and horizontally respectively and produce a partial result of Y on each device. An all-reduce operation can then be applied to the partial results to obtain the correct final result of the matrix multiplication. In this way, each device only holds 1/N of the parameters when training on N devices. This allows the model size to scale beyond the memory capacity of a single device.

Figure 4: Megatron-LM MLP module (figure: each GPU holds one shard of W1 and W2, computes a partial result of Y from the full input X, and an all-reduce recovers the final output)
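The following single-process sketch mimics the Megatron-style 1D sharding just described: the first weight is split column-wise and the second row-wise, each shard produces a partial output, and the final all-reduce is emulated here by a plain sum (in a real run it would be dist.all_reduce across GPUs). The nn.Linear convention (x times the transposed weight) and the ReLU between the two projections are illustrative additions, not details taken from the paper.

import torch

def megatron_mlp_1d(x, w1, w2, n_devices):
    # Column-parallel shard of W1 and row-parallel shard of W2 on each "device".
    w1_shards = torch.chunk(w1, n_devices, dim=0)   # W1: [4h, h], split along the output dim
    w2_shards = torch.chunk(w2, n_devices, dim=1)   # W2: [h, 4h], split along the input dim
    partials = []
    for w1_i, w2_i in zip(w1_shards, w2_shards):
        h_i = torch.relu(x @ w1_i.t())   # each device holds a slice of the hidden activation
        partials.append(h_i @ w2_i.t())  # partial result of Y on this device
    # All-reduce (emulated by a sum) recovers the full output on every device.
    return sum(partials)

# Quick check against the unsharded computation.
h = 8
x = torch.randn(2, h)
w1 = torch.randn(4 * h, h)
w2 = torch.randn(h, 4 * h)
y_ref = torch.relu(x @ w1.t()) @ w2.t()
assert torch.allclose(megatron_mlp_1d(x, w1, w2, n_devices=4), y_ref, atol=1e-4)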
One of the major problems of the 1D method is that it assumes the interconnect between devices has uniform bandwidth. This makes it friendly only to machines with fully connected NVLink among the GPUs on a single node, as shown in Figure 9a. However, such high-end hardware is expensive and scarce. Many machines, even some in supercomputing centers, only have partially connected GPUs, as shown in Figure 9b. With this kind of GPU topology, the communication bandwidth between distant devices via the PCIe bus is much lower than that of directly connected GPUs. Therefore, the low communication bandwidth can hinder the efficiency of all-reduce operations in 1D tensor parallelism.

In addition, 1D tensor parallelism has redundant memory usage in layer inputs and outputs. Taking the Feed Forward layer in the Transformer architecture as an example, the input X and output Y of the MLP layer are duplicated across different devices as shown in Figure 4. Such memory redundancy limits the maximum model size which can be trained on limited hardware resources and is not helpful for the democratization of large-scale distributed training.

Besides 1D tensor parallelism, more advanced tensor parallelism has been introduced for large-scale model training, namely 2D, 2.5D, and 3D tensor parallelism [4, 36, 39]. These methods split the input, weight, and output tensors and thus have advantages in memory and communication efficiency, better coping with different hardware specifications. This provides the user with the option of using the most suitable tensor parallelism technique for their machines.

2D tensor parallelism [39] relies on the SUMMA and Cannon matrix multiplication algorithms [3, 6, 34] and splits a tensor along two different dimensions. Given N devices arranged in a square network topology, a tensor of shape [P, Q] will be partitioned into a chunk tensor of shape [P/√N, Q/√N]. 2.5D tensor parallelism [36] was inspired by the 2.5D matrix multiplication algorithm [33] and proposed to further parallelize 2D tensor parallelism. It adds an optional depth dimension of the matrix for parallelization. When depth = 1, it is close to 2D tensor parallelism. When depth > 1, it partitions the matrix 3 times and adds one more degree of parallelization. Given N devices, the tensor is split such that N = S² ∗ D, where S is the size of one side of the square and D is the depth of the cuboid. 3D tensor parallelism [4] was proposed based on the 3D matrix multiplication algorithm [1]. 3D tensor parallelism splits a tensor in a cubic manner. As not every tensor has 3 dimensions, we choose to partition only the first and last dimensions, where the first dimension is partitioned twice. For example, a tensor of shape [P, Q] will be partitioned into a chunk tensor of shape [P/∛(N²), Q/∛N].

As the advanced tensor parallelism methods require different network topologies, the user needs to choose the method based on the number of GPUs. The 1D method can work with any number of GPUs, while the 2D, 2.5D, and 3D methods require n², a ∗ n², and n³ GPUs respectively, where a and n are positive integers. The user can fall back to 1D tensor parallelism when the number of GPUs does not fulfill the requirement. These advanced tensor parallelism methods provide lower communication volume when scaling to a larger number of devices [1, 3, 6, 33], and this will be further discussed in Section 3.1.
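For concreteness, the chunk shapes quoted above can be computed as in the short helper below; 2.5D is omitted here because its chunk layout additionally depends on how the depth dimension is arranged, which the text does not spell out.

import math

def chunk_shape(p_rows, q_cols, n_devices, mode):
    # Shard shapes quoted in the text: [P/sqrt(N), Q/sqrt(N)] for 2D and
    # [P/N^(2/3), Q/N^(1/3)] for 3D (the first dimension is partitioned twice).
    if mode == "2d":
        side = math.isqrt(n_devices)          # N = side^2
        return p_rows // side, q_cols // side
    if mode == "3d":
        side = round(n_devices ** (1 / 3))    # N = side^3
        return p_rows // side**2, q_cols // side
    raise ValueError("only 2D and 3D shapes are given explicitly in the text")

print(chunk_shape(4096, 4096, 16, "2d"))   # -> (1024, 1024)
print(chunk_shape(4096, 4096, 64, "3d"))   # -> (256, 1024)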
2) Pipeline Parallelism
Methods such as PipeDream [25], GPipe [16], and Chimera [20] were proposed to split the model into several chunks of consecutive layers, with each chunk allocated to a device as shown in Figure 3c. Intermediate activations and gradients are passed between pipeline stages to complete the forward and backward passes. As a result, this method reduces cross-node communication. Pipeline parallelism allows multiple devices to compute simultaneously, leading to higher throughput. One drawback of pipeline parallel training is that there will be some bubble time, where some devices are idle while others are engaged in computation, leading to a waste of computational resources [25].
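As an illustration of the layer chunking described above, the sketch below assigns consecutive layers to pipeline stages and pushes micro-batches through them on a single process. A real pipeline runtime would place each stage on its own device, overlap the micro-batches to shrink the bubble, and pass activations and gradients between stages.

import torch
import torch.nn as nn

def split_into_stages(layers, num_stages):
    # Assign consecutive chunks of layers to each pipeline stage (device).
    per_stage = (len(layers) + num_stages - 1) // num_stages
    return [nn.Sequential(*layers[i:i + per_stage])
            for i in range(0, len(layers), per_stage)]

layers = [nn.Linear(256, 256) for _ in range(8)]
stages = split_into_stages(layers, num_stages=4)   # e.g. one chunk per GPU

# GPipe-style forward: each micro-batch flows through the stages in order;
# in a real system stage i starts micro-batch t+1 while stage i+1 works on t.
micro_batches = torch.randn(16, 256).chunk(4)
outputs = []
for mb in micro_batches:
    act = mb
    for stage in stages:
        act = stage(act)        # inter-stage activation transfer in a real run
    outputs.append(act)
y = torch.cat(outputs)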
2.3 Sequence Parallelism
Tensor parallelism mainly tackles the memory bottleneck brought by model data. However, the non-model data can be the bottleneck in applications such as AlphaFold and document-level text understanding, because these applications rely on long-sequence data. As the self-attention module in the Transformer layer has quadratic complexity with respect to the sequence length, long-sequence data increases the memory consumed by the intermediate activations, limiting the training capability of the devices.

Sequence parallelism [21] is proposed to enable long-sequence modeling by breaking the memory wall brought by the large sequence dimension. In sequence parallelism, the model is replicated across devices just like data parallelism. The input data is split along the sequence dimension and each device only keeps a sub-sequence. The self-attention module is replaced with the Ring Self-Attention module such that the partial query, key, and value embeddings are exchanged among devices to complete the self-attention calculation.
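The sketch below simulates the Ring Self-Attention idea on one process: the query, key, and value tensors are split along the sequence dimension, and each simulated device accumulates attention scores and outputs as the key and value sub-sequences circulate around the ring. It is a numerical illustration of the algorithm, not the distributed implementation used by Colossal-AI.

import torch

def ring_self_attention(q, k, v, world_size):
    # Split along the sequence dimension: each "device" keeps one sub-sequence.
    q_chunks = q.chunk(world_size, dim=0)
    k_chunks = k.chunk(world_size, dim=0)
    v_chunks = v.chunk(world_size, dim=0)
    d = q.shape[-1]
    outputs = []
    for rank in range(world_size):
        # Ring pass 1: collect attention scores against every key sub-sequence.
        scores = []
        for step in range(world_size):
            remote = (rank + step) % world_size          # chunk received at this ring step
            scores.append(q_chunks[rank] @ k_chunks[remote].t() / d ** 0.5)
        probs = torch.softmax(torch.cat(scores, dim=-1), dim=-1)
        # Ring pass 2: combine with the value sub-sequences in the same order.
        out = torch.zeros_like(q_chunks[rank])
        prob_blocks = probs.chunk(world_size, dim=-1)
        for step in range(world_size):
            remote = (rank + step) % world_size
            out += prob_blocks[step] @ v_chunks[remote]
        outputs.append(out)
    return torch.cat(outputs, dim=0)

s, d = 16, 8
q, k, v = torch.randn(s, d), torch.randn(s, d), torch.randn(s, d)
ref = torch.softmax(q @ k.t() / d ** 0.5, dim=-1) @ v
assert torch.allclose(ring_self_attention(q, k, v, world_size=4), ref, atol=1e-4)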
2.4 Heterogeneous Training
To further expand the memory capacity of a single device, DeepSpeed proposed ZeRO-Offload [30], which moves tensors from GPU to CPU or NVMe disks when they are not in use to make room for larger models. CPU memory is often much larger than GPU memory on machines such as the Nvidia DGX-1 workstation. By utilizing high-performance heterogeneous storage devices and appropriately swapping tensors between different hardware devices, it became possible to train a model with billions of parameters on a single GPU. This is especially friendly to users with limited computing resources and essential for the democratization of large-scale model training.
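A minimal sketch of the offloading idea, assuming a CUDA device is available: a parameter lives in pinned CPU memory and is copied to the GPU only around the operation that needs it. Real systems such as ZeRO-Offload and Colossal-AI manage this at a much finer granularity and overlap the transfers with computation.

import torch

class OffloadedTensor:
    """Keeps a tensor in pinned CPU memory and moves it to the GPU only while in use."""
    def __init__(self, tensor):
        self.cpu_copy = tensor.to("cpu").pin_memory()

    def fetch(self, device="cuda"):
        # non_blocking=True lets the host-to-device copy overlap with other work
        return self.cpu_copy.to(device, non_blocking=True)

    def release(self, gpu_tensor):
        # Copy any updates back so the GPU copy can be freed.
        self.cpu_copy.copy_(gpu_tensor.detach().to("cpu"))

weight = OffloadedTensor(torch.randn(4096, 4096))
x = torch.randn(32, 4096, device="cuda")
w = weight.fetch()        # bring the parameter in just before it is needed
y = x @ w.t()
weight.release(w)
del w                     # the GPU copy can now be reclaimed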
2.5 Automatic Parallelization
The latest advance in parallel training is the automatic selection and execution of parallelization strategies, as demonstrated in FlexFlow [23] and Alpa [42]. Alpa was proposed recently to automatically search for a suitable parallelization plan, including data and model parallelism, given the cluster mesh. It then compiles the computation graph into a distributed sharded graph with communication operators and runs the compiled executable on the cluster. However, Alpa is not made to be hardware-aware and does not automatically consider the network topology. Meanwhile, it does not search over other optimization techniques such as activation checkpointing, leading to suboptimal results.

3 DESIGN
Colossal-AI features an array of acceleration techniques constructed in a modular way, which can cover a wide range of training settings to achieve maximal performance. It addresses the difficulty of achieving consistent acceleration in deep learning training under diverse hardware conditions, a difficulty met by Megatron-LM and DeepSpeed as well. This section discusses the implementation and analysis of the acceleration techniques integrated in Colossal-AI.

3.1 Multi-Dimensional Model Parallelism
First of all, Colossal-AI provides an array of model parallelism methods to cater to the needs of distributed training. Thus, it allows the model size to scale to billions of parameters by sharding the model over devices. In Colossal-AI, all existing tensor parallelism methods are supported so that the user can choose one method based on their training requirements and the number of GPUs, while Megatron-LM only supports 1D tensor splitting. As tensor parallelism is mainly applied to matrix-matrix multiplication, it is highly suitable for the acceleration of Transformer models, which make wide use of linear layers.

Among all tensor parallelism methods, one prominent advantage of advanced tensor parallelism, namely 2D, 2.5D, and 3D tensor parallelism, is its lower communication cost compared to 1D tensor parallelism. Table 1 shows the total communication volume when computing a matrix multiplication Y = W X, where X is of shape (b, s, h), W is of shape (h, h), and Y is of shape (b, s, h). In Table 1, the following notations are used.

• p: the total number of GPUs
• j: the number of GPUs on one side of the square-shaped network topology, where p = j²
• k: the number of GPUs on one side of the front square of the cuboid-shaped network topology, where p = d ∗ k²
• d: the number of GPUs in the depth dimension of the cuboid-shaped network topology
• l: the number of GPUs on one side of the cube-shaped network topology, where p = l³
• S_x: the number of elements in the input matrix X; the same semantics apply to S_W and S_Y
• b: the batch size of the input matrix X
• s: the sequence length of the input matrix X
• h: the hidden size of the weight W

Mode | Total Communication Volume (number of elements transferred)
1D   | 2(p − 1) ∗ S_x
2D   | 3(j − 1) ∗ (S_x + S_W)
2.5D | 3(k − 1) ∗ (S_x/d + S_W)
3D   | 2(l − 1)/l ∗ (S_x + S_W + S_Y)

Table 1: Communication Volume of Tensor Parallelism
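The helper below simply evaluates the closed-form expressions of Table 1; it assumes the number of GPUs p matches the topology required by the chosen mode (a square for 2D, a cuboid of the given depth for 2.5D, a cube for 3D).

import math

def tp_comm_volume(mode, p, b, s, h, depth=1):
    # Element counts from Table 1; S_x, S_W, S_Y follow the shapes (b, s, h), (h, h), (b, s, h).
    s_x, s_w, s_y = b * s * h, h * h, b * s * h
    if mode == "1d":
        return 2 * (p - 1) * s_x
    if mode == "2d":
        j = math.isqrt(p)                      # p = j^2
        return 3 * (j - 1) * (s_x + s_w)
    if mode == "2.5d":
        k = math.isqrt(p // depth)             # p = depth * k^2
        return 3 * (k - 1) * (s_x / depth + s_w)
    if mode == "3d":
        l = round(p ** (1 / 3))                # p = l^3
        return 2 * (l - 1) / l * (s_x + s_w + s_y)
    raise ValueError(mode)

# Reproduce the setting of Figure 5 (h = 1024, s = 512, b = 32) for 64 GPUs.
for mode in ("1d", "2d", "2.5d", "3d"):
    print(mode, tp_comm_volume(mode, p=64, b=32, s=512, h=1024, depth=4))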

As shown in Figure 5, the communication volume of the advanced tensor parallelism is significantly lower than that of 1D tensor parallelism, especially when a large number of nodes is used. The underlying reason for this communication efficiency is that advanced tensor parallelism only incurs communication on a sub-group of the computing nodes. For example, in 2D parallelism, collective communication only involves the nodes in one row or one column of the square-shaped network. In contrast, 1D tensor parallelism involves all computing nodes in one collective communication call. Therefore, advanced tensor parallelism allows scaling beyond one node, while 1D tensor parallelism is often restricted to intra-node computing.

Figure 5: Scaling performance of tensor parallelism in theoretic analysis (h = 1024, s = 512, b = 32) (figure: theoretical communication volume in number of elements versus the number of GPUs for 1D, 2D, 2.5D, and 3D tensor parallelism)

Besides tensor parallelism, Colossal-AI has also included sequence parallelism and pipeline parallelism, so that hybrid parallelism is available out of the box to accelerate model training on large-scale clusters.

3.2 Enhanced Sharding and Offloading
Zero redundancy data parallel training and offloading proposed by DeepSpeed enable large-scale model training. However, they are still bound by CPU-GPU and GPU-GPU communication, and the rigid implementation leads to poor extensibility. Colossal-AI has re-designed the tensor sharding and offloading mechanism for better performance. Colossal-AI proposes a unified sharded tensor interface and supports customizable sharding strategies and life-cycle hooks for easy modification of the training workflow. As such, zero-redundancy data parallelism can be easily supported and extended. Meanwhile, it also integrates the chunk strategy proposed in PatrickStar [12] to arrange tensors in chunks, further improving communication bandwidth utilization and memory usage and making tensor offloading more efficient.

Such a flexible design brings several benefits. Firstly, it enables the re-use of FP16 storage space in memory so that larger models can be trained. In the forward pass, we hold FP16 parameters. In the backward pass, when the gradients are computed, the FP16 parameters are no longer needed. We can thus store the FP16 gradients in the same storage space that held the FP16 parameters during the forward pass, as shown in Figure 6. In this way, Colossal-AI further reduces redundancy and peak memory usage, and the CPU memory can afford to accommodate larger models.

Figure 6: Memory space reuse (figure: FP32 master weights persist across the forward, backward, and post-backward phases, while FP16 gradients reuse the storage that held the FP16 weights once the backward pass no longer needs them)

Secondly, adaptive tensor placement and parameter updates can be enabled during heterogeneous training. DeepSpeed's zero-offload provides an implementation of CPU Adam to update the model parameters on the CPU. However, this method requires all the FP32 master model weights to be placed in CPU memory. In Colossal-AI, we implemented an adaptive hybrid Adam optimizer instead. During heterogeneous training, Colossal-AI's hybrid Adam optimizer monitors the available memory space on the GPU. It does not statically keep all FP32 weights in CPU memory; instead, it dynamically keeps part of the parameters and gradients on the GPU as long as there is space left. In this way, parameters are updated on both CPU and GPU, leading to better resource utilization and lower communication cost.
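A greedy sketch of the placement decision described above, assuming a CUDA device is available: FP32 master weights are kept on the GPU while free memory (minus a safety reserve) allows, and the remainder is spilled to the CPU. The actual hybrid Adam in Colossal-AI makes this decision dynamically during training and also manages gradients and optimizer states, so the function below is only an illustration of the idea.

import torch

def place_fp32_master_weights(params, reserve_bytes=2 * 1024**3):
    # Greedy placement: keep FP32 master weights on the GPU while the free
    # memory budget (minus a reserve) allows, and spill the rest to the CPU.
    free_bytes, _ = torch.cuda.mem_get_info()
    budget = max(free_bytes - reserve_bytes, 0)
    placement = {}
    for name, p in params:
        nbytes = p.numel() * 4                  # FP32 master copy
        if nbytes <= budget:
            placement[name] = "cuda"            # updated by GPU Adam
            budget -= nbytes
        else:
            placement[name] = "cpu"             # updated by CPU Adam
    return placement

model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(8)])
print(place_fp32_master_weights(list(model.named_parameters())))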

3.3 Automatic Parallelization on Dynamic Computation Graph
Inspired by Alpa [42], Colossal-AI has included an experimental automatic parallelism feature that improves upon the Alpa project. One challenge in automatic parallelization is sharded tensor conversion. For example, a tensor sharded along its 0th dimension can be converted to one sharded along the last dimension. Alpa hardcodes a conversion table, but this limits the number of sharded dimensions in order to keep the table reasonably small. We implemented a greedy search algorithm to speed up sharding conversion and increase the number of sharding dimensions. Moreover, we integrate activation checkpointing into the search problem such that a model can be both sharded and activation-checkpointed to achieve maximum performance. As this feature is still experimental, it will be discussed separately in another paper as future work.

4 IMPLEMENTATION
The overall architecture of Colossal-AI is shown in Figure 1. It has a parallel context manager that manages the meta information of the complex hybrid parallel distributed environment and automatically switches to the corresponding parallel mode based on the parallel context. It has a user-friendly interface for building tensor-parallel models and various acceleration tools, including activation checkpointing and mixed precision training. It also has an execution engine and trainer that provide extensibility for user customization, allowing users to define their own training schedules and hooks at the operator or trainer level.

4.0.1 Modularity. The principle of modularity and extensibility is upheld throughout the development, and the different acceleration techniques can easily be combined in pursuit of maximal performance.

4.0.2 Extensibility. As a system under constant development, Colossal-AI provides various interfaces for implementing customized functions in future extensions. For example, the sharding module allows the user to define their own sharding strategy and life-cycle hooks in an attempt to explore more efficient training methods.

4.0.3 User-Friendliness. To minimize changes to user code, Colossal-AI provides user-friendly APIs for model training. The user only needs to prepare a configuration that specifies the desired features by following a pre-defined schema. Colossal-AI will then inject the acceleration features into the execution engine with `colossalai.initialize`, as shown in Listing 1. Colossal-AI also provides parallelized popular model components such as BERT [8], GPT [28], and ViT [9], which users can use directly. This does not require users to have domain expertise, so they do not have to manually design their parallelism strategy as in GShard [19].

import colossalai

# specify using 1D tensor parallelism with parallel size 4
config = dict(
    parallel=dict(
        tensor=dict(
            size=4,
            mode='1d'
        )
    )
)

# launch the distributed environment
colossalai.launch_from_torch(config=config)

# define your training components (model, optimizer, criterion, trainloader)
...

# initialize with Colossal-AI
engine, trainloader, _ = colossalai.initialize(model,
                                               optimizer,
                                               criterion,
                                               trainloader)

# run training
for data, label in trainloader:
    engine.zero_grad()
    output = engine(data)
    train_loss = engine.criterion(output, label)
    engine.backward(train_loss)
    engine.step()

Listing 1: Colossal-AI Usage

5 EVALUATION

5.1 Experiment Setup
To holistically evaluate the system performance of Colossal-AI, we have conducted various experiments on different hardware. The system specification is listed in Table 2. Due to resource constraints, we only tested a portion of the prominent features on each system, as stated in the Experiment Item column. We used Megatron-LM and DeepSpeed as our baselines for the experiments, and Megatron-LM tensor parallelism is annotated as 1D tensor parallelism in the results.

System ID | #GPUs per node | #Nodes | GPU Model | GPU Interconnect | Cross-node Interconnect | Experiment Item
I   | 8 | 1  | Nvidia A100 (80GB) | NVLink | N/A | Tensor Parallelism
II  | 8 | 1  | Nvidia A100 (80GB) | NVLink between adjacent GPUs, PCIe between distant GPUs | N/A | Tensor Parallelism, ZeRO
III | 4 | 16 | Nvidia A100 (40GB) | NVLink | InfiniBand HDR (200Gb/s) and Dragonfly network topology | Tensor Parallelism, Sequence Parallelism
IV  | 1 | 64 | Nvidia P100 (16GB) | RDMA | Cray Aries routing and communications ASIC, and Dragonfly network topology | Tensor Parallelism

Table 2: System Specification for Experiments

5.2 Multi-Dimensional Tensor Parallelism
1) Convergence
Experiments were conducted with Vision Transformer (ViT) [9] on the ImageNet-1k dataset to verify the arithmetic correctness and numerical stability of multi-dimensional tensor parallelism on System III. The ViT model has 12 Transformer layers with a hidden size of 384 and 6 attention heads. We used Jax initialization and the AdamW optimizer with a 0.003 learning rate and 0.3 weight decay. The input image is of shape 224 and the patch size is 16. The global batch size is 4k and the model is trained for 250 epochs. As shown in Figure 7, the test accuracy curves of multi-dimensional tensor parallelism align well with that of PyTorch data parallel training.

Figure 7: Convergence performance of ViT on ImageNet-1K (figure: test accuracy over 250 epochs for non-tensor-parallel data parallel training and for 1D, 2D, 2.5D, and 3D tensor parallelism)

2) Memory Efficiency
As 2D, 2.5D, and 3D tensor parallelism partition the input data, layer weights, and output activations while 1D tensor parallelism only partitions the layer weights, the former are expected to have lower memory consumption. As a result, the first three methods allow the GPUs to accommodate larger models. To demonstrate the memory efficiency, we conducted two range tests which scale by batch size and by hidden size on System I. In these range tests, we created a model which consists of two linear layers. We ran 1D, 2D, and 2.5D experiments on 4 GPUs, and 1D, 2.5D (depth=2), and 3D experiments on 8 GPUs. We measure the maximum allocated CUDA memory during the forward and backward pass, and the results are shown in Figure 8. The memory consumption of 1D tensor parallelism is much higher than those of 2D, 2.5D, and 3D tensor parallelism. With a batch size of 512 on 8 GPUs, the memory consumption of 2.5D and 3D is 44% and 65% lower than that of 1D tensor parallelism respectively, as shown in Figure 8b. With a hidden size of 16384 on 8 GPUs, the memory performance of 2.5D and 3D tensor parallelism is 62% and 74.2% better than that of 1D tensor parallelism respectively, as shown in Figure 8d. Therefore, more advanced tensor parallelism is a better option when scaling to super large-scale models.

Figure 8: Range test for memory consumption of tensor parallelism with 4/8 GPUs (figure: max allocated memory in GB versus (a) batch size on 4 GPUs, (b) batch size on 8 GPUs, (c) hidden size on 4 GPUs, (d) hidden size on 8 GPUs)

3) Hardware Compatibility
Experiments were conducted on Systems I and II to further investigate the impact of GPU interconnect on the performance of tensor parallelism. System I and System II were selected because the former has fully connected NVLink between any pair of GPUs, as shown in Figure 9a, while the latter only has NVLink between 4 pairs of adjacent GPUs, as shown in Figure 9b. The communication bandwidth of System I is consistently high regardless of whether it is measured for a pair of GPUs or a group of GPUs, as shown in Figure 10. However, the communication bandwidth drops significantly from 184 GB/s to 15 GB/s when the communication occurs among non-adjacent GPUs, as only adjacent GPUs have high-performance NVLink.

Figure 9: Common network topology on GPU nodes (figure: (a) fully connected GPUs, (b) partially connected GPUs with NVLink between adjacent pairs and PCIe between distant GPUs)

Figure 10: Communication bandwidth on System I and II, broadcasting 125 MB of data using the NCCL bandwidth test tool (figure: (a) communication bandwidth between GPU pairs, (b) collective communication bandwidth for 2, 4, and 8 GPUs)

The GPU topology of System II is therefore not friendly to 1D tensor parallelism, which relies on all-reduce operations across all the GPUs via PCIe. Instead, the 2D and 2.5D methods only communicate between pairs of GPUs instead of across all GPUs. This allows part of the communication to still utilize the high NVLink bandwidth between adjacent GPUs.

We trained ViT on the ImageNet-1k dataset with different configurations for 4 GPUs and 8 GPUs on both Systems I and II. On 4 GPUs, the ViT model has 64 Transformer layers with a hidden size of 3072 and 48 attention heads. On 8 GPUs, the hidden size and the number of attention heads are adjusted to 4096 and 64 respectively, as there is more memory available. The model is trained with increasing batch size until the out-of-memory problem occurs. As such, we present the best throughput for each tensor parallelism method.

Figure 11: Throughput of ViT training on System I and II (figure: (a) System I, (b) System II; throughput in img/sec for 1D, 2D, 2.5D, and 3D tensor parallelism on 4 and 8 GPUs)

In Figure 11a, the throughput of 2D, 2.5D, and 3D tensor parallelism cannot compete with 1D tensor parallelism on either 4 GPUs or 8 GPUs. This is expected for two reasons.

The first reason is that 1D tensor parallelism can utilize the high communication bandwidth with all GPUs involved on System I. The second reason is that 2D, 2.5D, and 3D tensor parallelism have more communication volume with a small number of processors and only surpass 1D tensor parallelism when the number of processors increases.

However, when the experiment is switched to System II in Figure 11b, 1D tensor parallelism encounters a bottleneck due to the low collective communication bandwidth across 4 and 8 GPUs. Meanwhile, 2D and 2.5D can deliver a throughput that is 40% higher than that of 1D tensor parallelism with 4 GPUs. With 8 GPUs, 2.5D tensor parallelism can still outperform 1D tensor parallelism by 20.6%. 3D tensor parallelism still delivers lower performance than 1D tensor parallelism due to its low scaling.

4) Throughput Comparison
To test the performance of tensor parallelism with more GPUs, we trained ViT on System IV. As System IV only has 16 GB of GPU memory, we adjusted the configuration of the ViT model accordingly. The model is set to have 24 layers with a hidden size of 2048 and 32 attention heads for 4 and 8 GPUs. From 16 GPUs onwards, the model is set to have 32 layers with a hidden size of 4096 and 64 attention heads.

#GPUs | Mode | #Transformer Layers | Hidden Size | #Attention Heads | Batch Size | Throughput (img/sec) | Speedup over 1D (%)
4  | 1D   | 24 | 2048 | 32 | 128 | 5.06  | -
4  | 2D   | 24 | 2048 | 32 | 256 | 6.18  | 22.1
4  | 2.5D | 24 | 2048 | 32 | 256 | 6.73  | 33.0
8  | 1D   | 24 | 2048 | 32 | 256 | 7.46  | -
8  | 2.5D | 24 | 2048 | 32 | 384 | 6.57  | -11.9
8  | 3D   | 24 | 2048 | 32 | 512 | 8.38  | 12.3
16 | 1D   | 32 | 4096 | 64 | 64  | 3.42  | -
16 | 2D   | 32 | 4096 | 64 | 256 | 5.33  | 55.8
16 | 2.5D | 32 | 4096 | 64 | 256 | 5.46  | 59.6
32 | 1D   | 32 | 4096 | 64 | 128 | 4.22  | -
32 | 2.5D | 32 | 4096 | 64 | 256 | 5.46  | 50.6
64 | 1D   | 32 | 4096 | 64 | 128 | 4.63  | -
64 | 2D   | 32 | 4096 | 64 | 512 | 12.76 | 275.5
64 | 2.5D | 32 | 4096 | 64 | 512 | 4.93  | 6.5
64 | 3D   | 32 | 4096 | 64 | 512 | 8.63  | 86.4

Table 3: Performance of Tensor Parallelism with Different Numbers of GPUs

The results for 4 to 64 GPUs are shown in Table 3. It can be observed that as the number of GPUs increases, the speedup of advanced tensor parallelism over 1D tensor parallelism increases up to 2.76. This can be attributed to the lower communication volume of advanced tensor parallelism methods when scaling to more processors. Together with their memory efficiency and low communication volume, 2D, 2.5D, and 3D tensor parallelism are a better option for large-scale distributed training.

5.3 Sequence Parallelism
In this section, we compare Sequence Parallelism with 1D tensor parallelism for memory efficiency and training throughput. As Sequence Parallelism is designed for situations where activations consume more memory than model data, BERT-Base is chosen as our experiment model and trained on the Wikipedia dataset [13]. We conducted the experiments on System III. It should be noted that, since 1D tensor parallelism requires the number of attention heads (12) to be divisible by the parallel size, we can only use 4, 6, and 12 GPUs, where the 6-GPU experiment uses 2 nodes with 3 GPUs from each node. Meanwhile, Sequence Parallelism is not limited by the number of attention heads, so we conducted experiments on 4, 8, and 12 GPUs.

1) Memory Efficiency
We increase the batch size and sequence length until the out-of-memory problem occurs for both 1D tensor parallelism and Sequence Parallelism. The sequence length is fixed at 512 for the batch size test, while the batch size is fixed at 64 for the sequence length test.

Figure 12: Memory efficiency of Sequence Parallelism over 1D tensor parallelism (figure: (a) maximum batch size for BERT-Base, (b) maximum sequence length for BERT-Base, each versus the number of GPUs)

As shown in Figure 12a, Sequence Parallelism can reach a larger batch size than 1D tensor parallelism. This is because 1D tensor parallelism has a memory bottleneck in the duplicated activations, whereas the activation is split along the sequence dimension in Sequence Parallelism. The maximum batch size of Sequence Parallelism is 4.44 times larger than that of 1D tensor parallelism with 12 GPUs. The same pattern is observed in the maximum sequence length test, as shown in Figure 12b. The maximum sequence length of Sequence Parallelism is 1.18 times larger than that of 1D tensor parallelism.

If linear-complexity attention modules [37, 40] are used instead of the quadratic-complexity self-attention in BERT, Sequence Parallelism can achieve linear scaling of the maximum sequence length with the number of GPUs, better supporting document-level text understanding.

2) Throughput Comparison
To evaluate the training speed of Sequence Parallelism, we trained BERT-Base with a sequence length of 512 and its maximum batch size. As shown in Figure 13a, Sequence Parallelism is up to 1.43 times faster than 1D tensor parallelism.

As sequence parallelism splits the input data and activations, it is naturally compatible with Pipeline Parallelism. While 1D tensor parallelism splits the activation before transferring the tensor to the next stage and gathers it back afterward, Sequence Parallelism requires no such communication between pipeline stages. We further scaled the training with Pipeline Parallelism. The parallel size for both Sequence and 1D tensor parallelism is fixed at 4 and we scale the number of pipeline stages from 1 to 4. As shown in Figure 13b, Sequence Parallelism can train 1.55 times faster than 1D tensor parallelism with 4 pipeline stages.

Figure 13: Training throughput of BERT-Base (figure: throughput in tokens/sec for 1D tensor parallelism and Sequence Parallelism, (a) without pipeline parallelism versus the number of GPUs, (b) with pipeline parallelism versus the pipeline size)

5.4 Sharding and Offloading
In this section, we evaluated our own sharding and offloading module, as discussed in Section 3.2, against DeepSpeed. We used DeepSpeed Stage 3 as the baseline, which partitions model parameters, gradients, and optimizer states in data parallel training. To demonstrate the capability of dynamic tensor placement in Colossal-AI, we trained a GPT-2 model with 10 billion parameters on the Wikipedia dataset on System II. We set the batch size to 4 and scaled the data parallel training from 1 GPU to 8 GPUs. As the batch size is small, the GPU memory is not completely used up. However, DeepSpeed's static policy still offloads all model data to CPU memory, leading to low memory efficiency and high communication overhead. Instead, Colossal-AI dynamically determines whether a tensor should be placed on the GPU or the CPU depending on memory availability. In this case, since Colossal-AI detects that there is still free memory on the GPU, it only offloads a small portion of the model data, leading to better utilization of the hardware resources and better training throughput, as shown in Figure 14. We have also performed training on the OPT model [41] of 13 billion parameters with a batch size per GPU of 32. With a larger batch size, both systems utilized almost all GPU memory, and Colossal-AI can still achieve 1.33 times speedup over DeepSpeed on 8 GPUs.

Figure 14: Throughput of GPT training with sharding and offloading with batch size 4 (figure: throughput in tokens/sec for Colossal-AI and DeepSpeed from 1 to 8 GPUs)

6 FUTURE WORK
The Colossal-AI system is open-sourced and maintained on GitHub. One future work is to design a hardware-aware and efficient algorithm to automatically search for the optimal parallelization strategy, as mentioned in Section 3.3. As an open-source project, we will actively integrate with model zoos such as Hugging Face Transformers [38]. We expect Colossal-AI to be capable of parallelizing models in state-of-the-art model zoos so that distributed training can be more accessible to the Deep Learning community.

7 CONCLUSION
In this work, we designed and implemented Colossal-AI, which integrates a vast number of advanced acceleration techniques into one unified system for large-scale distributed training. Colossal-AI comes with a flexible system design that supports an easy combination of different parallelism methods. In addition, its acceleration techniques provide robust performance under different hardware conditions and deliver superior performance compared to the baseline systems. In our experiments, we have demonstrated that Colossal-AI can achieve up to 2.76x speedup over the baseline systems.

REFERENCES
[1] R. C. Agarwal, S. M. Balle, F. G. Gustavson, M. Joshi, and P. Palkar. 1995. A three-dimensional approach to parallel matrix multiplication. IBM Journal of Research and Development 39, 5 (1995), 575–582. https://doi.org/10.1147/rd.395.0575
[2] Mandeep Baines, Shruti Bhosale, Vittorio Caggiano, Naman Goyal, Siddharth Goyal, Myle Ott, Benjamin Lefaudeux, Vitaliy Liptchinsky, Mike Rabbat, Sam Sheiffer, Anjali Sridhar, and Min Xu. 2021. FairScale: A general purpose modular PyTorch library for high performance and large scale training. https://github.com/facebookresearch/fairscale
[3] Jarle Berntsen. 1989. Communication efficient matrix multiplication on hypercubes. Parallel Computing 12, 3 (1989), 335–342.
[4] Zhengda Bian, Qifan Xu, Boxiang Wang, and Yang You. 2021. Maximizing Parallelism in Distributed Training for Huge Neural Networks. arXiv preprint arXiv:2105.14450 (2021).
[5] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020).
[6] Lynn Elliot Cannon. 1969. A cellular computer to implement the Kalman filter algorithm. Montana State University.
[7] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training Deep Nets with Sublinear Memory Cost. https://doi.org/10.48550/ARXIV.1604.06174
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).

[10] Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2021. All NLP Tasks Are Generation Tasks: A General Pretraining Framework. arXiv preprint arXiv:2103.10360 (2021).
[11] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, 7 (2011).
[12] Jiarui Fang, Yang Yu, Zilin Zhu, Shenggui Li, Yang You, and Jie Zhou. 2021. PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management. https://doi.org/10.48550/ARXIV.2108.05818
[13] Wikimedia Foundation. [n. d.]. Wikimedia Downloads. https://dumps.wikimedia.org
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778. https://doi.org/10.1109/CVPR.2016.90
[15] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, and Zhifeng Chen. 2019. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. In Advances in Neural Information Processing Systems, Vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2019/file/093f65e080a295f8076b1c5722a46aa2-Paper.pdf
[16] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. 2019. GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism. Curran Associates Inc., Red Hook, NY, USA.
[17] Diederik Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations (12 2014).
[18] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[19] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2021. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. In International Conference on Learning Representations. https://openreview.net/forum?id=qrwe7XHTmYb
[20] Shigang Li and Torsten Hoefler. 2021. Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). Association for Computing Machinery, New York, NY, USA, Article 27, 14 pages. https://doi.org/10.1145/3458817.3476145
[21] Shenggui Li, Fuzhao Xue, Yongbin Li, and Yang You. 2021. Sequence Parallelism: Long Sequence Training from System Perspective. https://doi.org/10.48550/ARXIV.2105.13120
[22] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. 2020. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. Proc. VLDB Endow. 13, 12 (Aug 2020), 3005–3018. https://doi.org/10.14778/3415478.3415530
[23] Wenyan Lu, Guihai Yan, Jiajun Li, Shijun Gong, Yinhe Han, and Xiaowei Li. 2017. FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). 553–564. https://doi.org/10.1109/HPCA.2017.29
[24] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. 2019. PipeDream: Generalized Pipeline Parallelism for DNN Training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP '19). Association for Computing Machinery, New York, NY, USA, 1–15. https://doi.org/10.1145/3341301.3359646
[25] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. 2019. PipeDream: Generalized Pipeline Parallelism for DNN Training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP '19). Association for Computing Machinery, New York, NY, USA, 1–15. https://doi.org/10.1145/3341301.3359646
[26] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. 2021. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). Association for Computing Machinery, New York, NY, USA, Article 58, 15 pages. https://doi.org/10.1145/3458817.3476209
[27] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. 1525–1534. https://doi.org/10.18653/v1/P16-1144
[28] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners.
[29] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3505–3506.
[30] Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing Billion-Scale Model Training. arXiv:2101.06840 [cs.DC]
[31] Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. CoRR abs/1802.05799 (2018). arXiv:1802.05799 http://arxiv.org/abs/1802.05799
[32] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019).
[33] Edgar Solomonik and James Demmel. 2011. Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms. In Euro-Par.
[34] Robert A. van de Geijn and Jerrell Watts. 1995. SUMMA: Scalable Universal Matrix Multiplication Algorithm. Technical Report. USA.
[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
[36] Boxiang Wang, Qifan Xu, Zhengda Bian, and Yang You. 2021. 2.5-dimensional distributed model training. arXiv preprint arXiv:2105.14500 (2021).
[37] Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-Attention with Linear Complexity. arXiv preprint arXiv:2006.04768 (2020).
[38] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv:1910.03771 [cs.CL]
[39] Qifan Xu, Shenggui Li, Chaoyu Gong, and Yang You. 2021. An Efficient 2D Method for Training Super-Large Deep Learning Models. arXiv preprint arXiv:2104.05343 (2021).
[40] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. Big Bird: Transformers for Longer Sequences. In Advances in Neural Information Processing Systems, Vol. 33. Curran Associates, Inc., 17283–17297. https://proceedings.neurips.cc/paper/2020/file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf
[41] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. OPT: Open Pre-trained Transformer Language Models. arXiv:2205.01068 [cs.CL]
[42] Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, and Ion Stoica. 2022. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 559–578. https://www.usenix.org/conference/osdi22/presentation/zheng-lianmin
