Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training

Shenggui Li ([email protected]), HPC-AI Technology Inc., Singapore
Hongxin Liu ([email protected]), HPC-AI Technology Inc., China
Zhengda Bian ([email protected]), HPC-AI Technology Inc., China
ABSTRACT
The success of Transformer models has pushed the deep learning model scale to billions of parameters, but the memory limitation of a single GPU has led to an urgent need for training on multi-GPU clusters. However, the best practice for choosing the optimal parallel strategy is still lacking, as it requires domain expertise in both deep learning and parallel computing. The Colossal-AI system addresses this challenge by introducing a unified interface that scales sequential model training code to distributed environments. It supports parallel training methods such as data, pipeline, tensor, and sequence parallelism, and is integrated with heterogeneous training and the zero redundancy optimizer. Compared to the baseline systems, Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.

CCS CONCEPTS
• Computing methodologies → Machine learning algorithms; Parallel computing methodologies.

KEYWORDS
datasets, neural networks, gaze detection, text tagging

ACM Reference Format:
Shenggui Li, Hongxin Liu, Zhengda Bian, Jiarui Fang, Haichen Huang, Yuliang Liu, Boxiang Wang, and Yang You. 2023. Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training. In 52nd International Conference on Parallel Processing (ICPP 2023), August 7–10, 2023, Salt Lake City, UT, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3605573.3605613

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ICPP 2023, August 7–10, 2023, Salt Lake City, UT, USA
© 2023 Association for Computing Machinery.
ACM ISBN 978-1-4503-XXXX-X/18/06. . . $15.00
https://doi.org/10.1145/3605573.3605613

1 INTRODUCTION
Deep learning has been successful in many applications and has brought breakthroughs in difficult problems. With large amounts of data, neural networks like BERT [8] and Vision Transformer [9] are capable of learning high-dimensional features and making predictions on a level even humans cannot match. As powerful hardware becomes available, neural networks have more diverse architectures and a larger number of parameters. The AI community has seen a trend of deep learning models becoming larger, with an array of large-scale models ranging from BERT-Large and GPT-2 [28] (1.5 billion parameters) to GPT-3 [5] (175 billion parameters) and GLM [10] (1.75 trillion parameters). These large-scale models require more data and computing resources but also deliver better generality and performance. As more robust computing hardware and larger datasets become available, the trend is expected to continue, and traditional training methods will become less effective, making distributed training necessary for large-scale model training.

The limited fast memory of commonly used accelerator hardware, such as the GPU, is a bottleneck for scaling the model to billions of parameters. The memory consumption in deep learning comes from model parameters, layer activations, gradients, and optimizer states. We refer to model parameters, gradients, and optimizer states as model data, and to layer activations as non-model data. When training with adaptive optimizers [11, 18], the total memory consumption of model data can be several times larger than that consumed by parameters alone, making a single GPU no longer sufficient for large-scale model training. 10 billion parameters in FP16 format already consume 20 GB of model memory, while a typical GPU only has 16 or 32 GB of memory. Without any optimization, training a model of 10 billion parameters with even one data sample can cost more than 80 GB of memory, which is far more than a typical GPU provides.
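To make the memory arithmetic above concrete, the following back-of-the-envelope sketch estimates model-data memory for a 10-billion-parameter model. The per-parameter byte counts (FP16 parameters and gradients plus FP32 master weights and Adam states) are a common rule of thumb for mixed-precision Adam training and are our assumption here, not figures taken from this paper.

```python
# Back-of-the-envelope model-data memory estimate for mixed-precision Adam training.
# Assumed bytes per parameter: FP16 weights (2) + FP16 gradients (2)
# + FP32 master weights (4) + FP32 momentum (4) + FP32 variance (4).
GB = 1e9  # decimal gigabytes, matching "20 GB for 10 billion FP16 parameters"

def model_data_memory_gb(num_params: float) -> dict:
    fp16_params = 2 * num_params
    fp16_grads = 2 * num_params
    optimizer_states = (4 + 4 + 4) * num_params
    total = fp16_params + fp16_grads + optimizer_states
    return {
        "fp16_params_gb": fp16_params / GB,
        "fp16_grads_gb": fp16_grads / GB,
        "optimizer_states_gb": optimizer_states / GB,
        "total_model_data_gb": total / GB,
    }

# A 10-billion-parameter model: 20 GB just for FP16 weights; the total model data
# (before counting any activations) is several times larger, consistent with the
# observation that such a model cannot fit on a 16-32 GB GPU without optimization.
print(model_data_memory_gb(10e9))
```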
Data parallelism [31] has scaled models such as ResNet [14] to multi-GPU training, and other methods such as activation checkpointing [7] were proposed to reduce the non-model data by trading computation for memory. However, these methods fail to cope with billion-parameter model data. Parallelization techniques such as pipeline parallelism [15, 24] and tensor parallelism [32] were explored to shard the model data, making it possible to train models at a larger scale. The current state-of-the-art systems which provide a solution to the scaling challenge include GShard [19], FairScale [2], Megatron-LM [26], and DeepSpeed [29]. Among these systems, Megatron-LM and DeepSpeed are the most popular in the open-source community and deliver the best performance; thus, they are chosen as the baselines of our experiments. Megatron-LM trains Transformer-based models by utilizing optimized pipeline and tensor parallelism. Meanwhile, DeepSpeed proposed an efficient method to partition the model-related data to fully eliminate memory redundancy in data parallel training. These two efficient methods paved the way to scale model training to hundreds of devices and billions of parameters.

As most deep learning engineers and researchers are used to writing non-distributed code, it is reasonably difficult for them to write distributed training code.

2 BACKGROUND
Thanks to the advent of the Transformer architecture, deep learning models have gained unprecedented performance improvements in domains such as Computer Vision and Natural Language Processing. The typical architecture of a Transformer layer consists of a Multi-head Attention block and a Feed Forward block, as shown in Figure 2. This architecture can scale to billions of parameters, and larger models can deliver more impressive performance improvements. For example, GPT-3 [5] outperforms smaller models by an 18% absolute increase in prediction accuracy on the LAMBADA language task [27].

[Figure 2: a Transformer layer, consisting of a Multi-head Attention block and a Feed Forward block, each followed by an Add and Normalize step.]
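As a concrete reference for the layer structure described above, here is a minimal PyTorch sketch of a single Transformer layer with a Multi-head Attention block and a Feed Forward block, each followed by an Add and Normalize step. The hidden size, head count, and expansion ratio are illustrative values of our choosing rather than settings taken from the paper.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """Multi-head attention block + feed-forward block, each with Add & Norm."""
    def __init__(self, hidden_size: int = 1024, num_heads: int = 16, ffn_ratio: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden_size)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, ffn_ratio * hidden_size),
            nn.GELU(),
            nn.Linear(ffn_ratio * hidden_size, hidden_size),
        )
        self.norm2 = nn.LayerNorm(hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)   # self-attention
        x = self.norm1(x + attn_out)       # add & normalize
        x = self.norm2(x + self.ffn(x))    # feed forward, add & normalize
        return x

# Example: a batch of 2 sequences of length 128 with hidden size 1024.
layer = TransformerLayer()
out = layer(torch.randn(2, 128, 1024))
```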
2.2 Model Parallelism
Model parallelism was proposed to tackle the memory limitation of data parallel training. There are generally two types of model parallelism: tensor parallelism and pipeline parallelism.

1) Tensor Parallelism
Tensor parallelism shards the tensor over an array of devices and requires a distributed matrix-matrix multiplication algorithm for arithmetic computation, as shown in Figure 3b. Megatron-LM [32] proposed 1D tensor parallelism, which splits the linear layer in the row or column dimension for the Transformer architecture [35]. More advanced tensor sharding mechanisms such as 2D [39], 2.5D [36], and 3D [4] were proposed to shard tensors in more dimensions. Collective communication is required among devices to ensure arithmetic correctness.

In Megatron-LM, the tensors are sharded in one dimension. Taking the Feed Forward module of the Transformer layer as an example, we can view the module as a matrix multiplication Y = W2 W1 X, as shown in Figure 4, where X is the input and W1 and W2 are the model parameters. W1 and W2 can be sharded vertically and horizontally respectively, and each device produces a partial result of Y. An all-reduce operation can be applied to the partial results to obtain the correct final result of the matrix multiplication. In this way, each device only holds 1/N of the parameters when training on N devices. This allows the model size to scale beyond the memory capacity of a single device.

[Figure 4: Megatron-LM MLP Module. GPU 0 and GPU 1 each hold a shard of W1 and W2, compute a partial result of Y from the full input X, and combine the partial results with an all-reduce.]

One of the major problems of the 1D method is that it assumes the interconnect of devices has the same bandwidth. This makes it friendly only to machines with fully connected NVLinks among the GPUs on a single node, as shown in Figure 9a. However, such high-end hardware is expensive and scarce. Many machines, even some in supercomputing centers, only have partially connected GPUs, as shown in Figure 9b. With this kind of GPU topology, the communication bandwidth between distant devices via the PCIe bus is much lower than that of directly connected GPUs. Therefore, the low communication bandwidth can hinder the efficiency of all-reduce operations in 1D tensor parallelism.

In addition, 1D tensor parallelism has redundant memory usage in layer inputs and outputs. Taking the Feed Forward layer in the Transformer architecture as an example, the input X and output Y of the MLP layer are duplicated across different devices, as shown in Figure 4. Such memory redundancy limits the maximum model size which can be trained on limited hardware resources, and is not helpful for the democratization of large-scale distributed training.

Besides 1D tensor parallelism, more advanced tensor parallelism has been introduced for large-scale model training, namely 2D, 2.5D, and 3D tensor parallelism [4, 36, 39]. These methods split the input, weight, and output tensors and thus have advantages in memory and communication efficiency, coping better with different hardware specifications. This provides the user with the option of using the most suitable tensor parallelism technique for their machines.

2D tensor parallelism [39] relies on the SUMMA and Cannon matrix multiplication algorithms [3, 6, 34] and splits a tensor along two different dimensions. Given N devices arranged in a square network topology, a tensor of shape [P, Q] is partitioned into chunk tensors of shape [P/√N, Q/√N]. 2.5D tensor parallelism [36] was inspired by the 2.5D matrix multiplication algorithm [33] and was proposed to further parallelize 2D tensor parallelism. It adds an optional depth dimension of the matrix for parallelization. When depth = 1, it is close to 2D tensor parallelism. When depth > 1, it partitions the matrix 3 times and adds one more degree of parallelization. Given N devices, the tensor is split such that N = S² ∗ D, where S is the size of one side of the square and D is the depth of the cuboid. 3D tensor parallelism [4] was proposed based on the 3D matrix multiplication algorithm [1] and splits a tensor in a cubic manner. As not every tensor has 3 dimensions, we choose to partition only the first and last dimensions, where the first dimension is partitioned twice. For example, a tensor of shape [P, Q] is partitioned into chunk tensors of shape [P/(∛N)², Q/∛N].

As the advanced tensor parallelism methods require different network topologies, the user needs to choose the method based on the number of GPUs. The 1D method can work with any number of GPUs, while the 2D, 2.5D, and 3D methods require n², a ∗ n², and n³ GPUs respectively, where a and n are positive integers. The user can fall back to 1D tensor parallelism when the number of GPUs does not fulfill the requirement. These advanced tensor parallelism methods provide lower communication volume when scaling to
a larger number of devices [1, 3, 6, 33], and this will be further discussed in Section 3.1.

2) Pipeline Parallelism
Methods such as PipeDream [25], GPipe [16], and Chimera [20] were proposed to split the model into several chunks of consecutive layers, with each chunk allocated to a device, as shown in Figure 3c. Intermediate activations and gradients are passed between pipeline stages to complete the forward and backward pass. As a result, this method reduces cross-node communication. Pipeline parallelism allows multiple devices to compute simultaneously, leading to higher throughput. One drawback of pipeline parallel training is that there is some bubble time, where some devices are idle while others are engaged in computation, leading to a waste of computational resources [25].

2.3 Sequence Parallelism
Tensor parallelism mainly tackles the memory bottleneck brought by model data. However, non-model data can be the bottleneck in applications such as AlphaFold and document-level text understanding, because these applications rely on long-sequence data. As the self-attention module in the Transformer layer has quadratic complexity with respect to the sequence length, long-sequence data increases the memory usage consumed by the intermediate activations, limiting the training capability of the devices.

Sequence parallelism [21] was proposed to enable long-sequence modeling by breaking the memory wall brought by the large sequence dimension. In sequence parallelism, the model is replicated across devices just like in data parallelism. The input data is split along the sequence dimension, and each device only keeps a sub-sequence. The self-attention module is replaced with the Ring Self-Attention module, in which the partial query, key, and value embeddings are exchanged among devices to complete the self-attention calculation.

2.4 Heterogeneous Training
To further expand the memory capacity of a single device, DeepSpeed proposed ZeRO-Offload [30], which moves tensors from GPU to CPU or NVMe disks when they are not in use to make room for larger models. It is often seen that CPU memory is much larger than the GPU memory on machines such as the Nvidia DGX-1 workstation. By utilizing high-performance heterogeneous storage devices and appropriately swapping tensors between different hardware devices, it becomes possible to train a model with billions of parameters on a single GPU. This is especially friendly to users with limited computing resources and essential for the democratization of large-scale model training.

2.5 Automatic Parallelization
The latest advance in parallel training is the automatic selection and execution of parallelization strategies, as demonstrated in FlexFlow [23] and Alpa [42]. Alpa was proposed recently to automatically search for a suitable parallelization plan, including data and model parallelism, given the cluster mesh. It then compiles the computation graph into a distributed sharded graph with communication operators and runs the compiled executable on the cluster. However, Alpa is not made to be hardware-aware and does not automatically consider the network topology. Meanwhile, it does not search for other optimization techniques such as activation checkpointing, leading to suboptimal results.

3 DESIGN
Colossal-AI features an array of acceleration techniques constructed in a modular way, which can cover a wide range of training settings to achieve maximal performance. It addresses the difficulty, also met by Megatron-LM and DeepSpeed, of achieving consistent acceleration in deep learning training across diverse hardware conditions. This section discusses the implementation and analysis of the acceleration techniques integrated in Colossal-AI.

3.1 Multi-dimensional model parallelism
First of all, Colossal-AI provides an array of model parallelism methods to cater to the needs of distributed training. Thus, it allows the model size to scale to billions of parameters by sharding the model over devices. In Colossal-AI, all existing tensor parallelism methods are supported, so the user can choose a method based on their training requirements and the number of GPUs, while Megatron-LM only supports 1D tensor splitting. As tensor parallelism is mainly applied to matrix-matrix multiplication, it is highly suitable for accelerating Transformer models, which widely use linear layers.

Among all tensor parallelism methods, one prominent advantage of the advanced tensor parallelism methods, namely 2D, 2.5D, and 3D tensor parallelism, is their lower communication cost compared to 1D tensor parallelism. Table 1 shows the total communication volume when computing a matrix multiplication Y = W X, where X is of shape (b, s, h), W is of shape (h, h), and Y is of shape (b, s, h). In Table 1, the following notations are used.
• p: the total number of GPUs
• j: the number of GPUs on one side of the square-shaped network topology, where p = j²
• k: the number of GPUs on one side of the front square of the cuboid-shaped network topology, where p = d ∗ k²
• d: the number of GPUs in the depth dimension of the cuboid-shaped network topology
• l: the number of GPUs on one side of the cube-shaped network topology, where p = l³
• S_x: the number of elements in the input matrix X; the same semantic applies to S_w and S_y
• b: the batch size of the input matrix X
• s: the sequence length of the input matrix X
• h: the hidden size of the weight W

Mode | Total Communication Volume (number of elements transferred)
1D   | 2(p − 1) ∗ S_x
2D   | 3(j − 1) ∗ (S_x + S_w)
2.5D | 3(k − 1) ∗ (S_x/d + S_w)
3D   | 2(l − 1)/l ∗ (S_x + S_w + S_y)
Table 1: Communication Volume of Tensor Parallelism
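The formulas in Table 1 can be evaluated directly. The short script below is a sketch of such an evaluation for the setting used in Figure 5 (h = 1024, s = 512, b = 32); the particular values of j, k, d, and l used for 64 GPUs are our own illustrative choices.

```python
# Evaluate the Table 1 communication-volume formulas for Y = W X,
# with X and Y of shape (b, s, h) and W of shape (h, h).
b, s, h = 32, 512, 1024          # the setting used in Figure 5
S_x = S_y = b * s * h            # elements in X and Y
S_w = h * h                      # elements in W

def comm_1d(p):        return 2 * (p - 1) * S_x                      # p GPUs
def comm_2d(j):        return 3 * (j - 1) * (S_x + S_w)              # p = j**2 GPUs
def comm_2_5d(k, d):   return 3 * (k - 1) * (S_x / d + S_w)          # p = d * k**2 GPUs
def comm_3d(l):        return 2 * (l - 1) / l * (S_x + S_w + S_y)    # p = l**3 GPUs

# 64 GPUs arranged as 64 (1D), 8 x 8 (2D), depth 4 with a 4 x 4 square (2.5D), and 4 x 4 x 4 (3D).
print(f"1D  : {comm_1d(64):.3e}")
print(f"2D  : {comm_2d(8):.3e}")
print(f"2.5D: {comm_2_5d(4, 4):.3e}")
print(f"3D  : {comm_3d(4):.3e}")
```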
Figure 5: Scaling Performance of Tensor Parallelism in Theoretic Analysis (h = 1024, s = 512, b = 32). The plot shows the communication volume (10⁷ to 10⁹ elements) of the 1D, 2D, 2.5D, and 3D modes against the number of GPUs (0 to 500).

As shown in Figure 5, the communication volume of the advanced tensor parallelism methods becomes lower than that of 1D tensor parallelism as the number of GPUs grows.

Besides tensor parallelism, Colossal-AI also includes sequence parallelism and pipeline parallelism, so that hybrid parallelism is available out of the box to accelerate model training in large-scale clusters.

3.2 Enhanced Sharding and Offloading
Zero redundancy data parallel training and offloading proposed by DeepSpeed enable large-scale model training. However, it is still bound by the CPU-GPU and GPU-GPU communication, and its rigid implementation leads to poor extensibility. Colossal-AI has re-designed the tensor sharding and offloading mechanism for better performance. Colossal-AI proposes a unified sharded tensor interface and supports customizable sharding strategies and life-cycle hooks for easy modification of the training workflow. As such, zero-redundancy data parallelism can be easily supported and extended. Meanwhile, it also integrates the chunk strategy proposed in PatrickStar [12] to arrange tensors in chunks, further improving communication bandwidth utilization and memory usage and making tensor offloading more efficient.

Such a flexible design brings several benefits. Firstly, it enables the re-use of FP16 storage space in memory so that larger models can be trained. In the forward pass, we hold FP16 parameters. In the backward pass, once the gradients are computed, the FP16 parameters are no longer needed. We can thus save the FP16 gradients in the same storage space which held the FP16 parameters during the forward pass, as shown in Figure 6. In this way, Colossal-AI further reduces redundancy and peak memory usage, and the CPU memory can afford to accommodate larger models.

[Figure 6: re-use of the FP16 parameter storage across the Forward, Backward, and Post-Backward phases.]

Secondly, DeepSpeed's offloading requires all the FP32 master model weights to be placed in the CPU memory. In Colossal-AI, we implemented an adaptive hybrid Adam optimizer instead. During heterogeneous training, Colossal-AI's hybrid Adam optimizer monitors the available memory space on the GPU. It does not statically keep all FP32 weights in the CPU memory; instead, it dynamically keeps part of the parameters and gradients on the GPU as long as there is space left. In this way, parameters are updated on both the CPU and the GPU, leading to better resource utilization and lower communication cost.

3.3 Automatic Parallelization on Dynamic Computation Graph
Inspired by Alpa [42], Colossal-AI includes an experimental automatic parallelism feature that improves upon the Alpa project. One challenge in automatic parallelization is sharded tensor conversion. For example, a tensor sharded on its 0th dimension can be converted to one sharded in the last dimension. Alpa hardcodes a conversion table, but this limits the number of sharded dimensions in order to keep the table reasonably small. We implemented a greedy search algorithm to speed up sharding conversion and increase the number of sharding dimensions. Moreover, we integrate activation checkpointing into the search problem such that a model can be both sharded and activation-checkpointed to achieve maximum performance. As this feature is still experimental, it will be discussed separately in another paper as future work.

4 IMPLEMENTATION
The overall architecture of Colossal-AI is shown in Figure 1. It has a parallel context manager that manages the meta information of the complex hybrid parallel distributed environment and automatically switches to the corresponding parallel mode based on the parallel context. It has a user-friendly interface for building tensor-parallel models and various acceleration tools, including activation checkpointing and mixed precision training. It also has an execution engine and trainer that provide extensibility for user customization, allowing users to define their own training schedules and hooks at the operator or trainer level.

4.0.1 Modularity. The principle of modularity and extensibility is upheld throughout the development, and the different acceleration techniques can easily be combined in pursuit of maximal performance.
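To illustrate the kind of life-cycle hooks and trainer-level extensibility described in Sections 3.2 and 4, here is a deliberately simplified, hypothetical sketch. The hook names and the train_step helper are our own inventions for illustration and are not the Colossal-AI API; they only show how acceleration techniques can be attached to a generic training loop without modifying the loop itself.

```python
import torch
import torch.nn.functional as F

class TrainerHook:
    """Hypothetical life-cycle hook interface; real systems expose richer events."""
    def before_forward(self, model): ...
    def after_backward(self, model): ...
    def after_step(self, model): ...

class GradientClippingHook(TrainerHook):
    def __init__(self, max_norm: float = 1.0):
        self.max_norm = max_norm
    def after_backward(self, model):
        torch.nn.utils.clip_grad_norm_(model.parameters(), self.max_norm)

def train_step(model, optimizer, batch, loss_fn, hooks=()):
    inputs, targets = batch
    for h in hooks: h.before_forward(model)
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    for h in hooks: h.after_backward(model)   # e.g. clipping, gradient handling
    optimizer.step()
    optimizer.zero_grad()
    for h in hooks: h.after_step(model)       # e.g. offloading or logging logic
    return loss.item()

# Usage: pass any combination of hooks; each technique stays modular.
model = torch.nn.Linear(128, 10)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
batch = (torch.randn(4, 128), torch.randint(0, 10, (4,)))
train_step(model, opt, batch, F.cross_entropy, hooks=(GradientClippingHook(),))
```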
System ID | #GPUs per node | #Nodes | GPU Model | GPU Interconnect | Cross-node Interconnect | Experiment Item
I | 8 | 1 | Nvidia A100 (80GB) | NVLink | N/A | Tensor Parallelism
II | 8 | 1 | Nvidia A100 (80GB) | NVLink between adjacent GPUs, PCIe between distant GPUs | N/A | Tensor Parallelism, ZeRO
III | 4 | 16 | Nvidia A100 (40GB) | NVLink | InfiniBand HDR (200 Gb/s), Dragonfly network topology | Tensor Parallelism, Sequence Parallelism
IV | 1 | 64 | Nvidia P100 (16GB) | RDMA | Cray Aries routing and communications ASIC, Dragonfly network topology | Tensor Parallelism
Table 2: System Specification for Experiments
Figure 8: Range Test for Memory Consumption of Tensor Parallelism with 4/8 GPUs. (a) 4 GPUs by Batch Size; (b) 8 GPUs by Batch Size; (c) 4 GPUs by Hidden Size; (d) 8 GPUs by Hidden Size. Each panel plots the maximum allocated memory (GB) of 1D, 2D, 2.5D, and 3D tensor parallelism (WS = world size, TP = tensor parallel size).
#GPUs | Mode | #Transformer Layers | Hidden Size | #Attention Heads | Batch Size | Throughput (img/sec) | Speedup over 1D (%)
4 | 1D | 24 | 2048 | 32 | 128 | 5.06 | -
4 | 2D | 24 | 2048 | 32 | 256 | 6.18 | 22.1
4 | 2.5D | 24 | 2048 | 32 | 256 | 6.73 | 33.0
8 | 1D | 24 | 2048 | 32 | 256 | 7.46 | -
8 | 2.5D | 24 | 2048 | 32 | 384 | 6.57 | -11.9
8 | 3D | 24 | 2048 | 32 | 512 | 8.38 | 12.3
16 | 1D | 32 | 4096 | 64 | 64 | 3.42 | -
16 | 2D | 32 | 4096 | 64 | 256 | 5.33 | 55.8
16 | 2.5D | 32 | 4096 | 64 | 256 | 5.46 | 59.6
32 | 1D | 32 | 4096 | 64 | 128 | 4.22 | -
32 | 2.5D | 32 | 4096 | 64 | 256 | 5.46 | 50.6
64 | 1D | 32 | 4096 | 64 | 128 | 4.63 | -
64 | 2D | 32 | 4096 | 64 | 512 | 12.76 | 275.5
64 | 2.5D | 32 | 4096 | 64 | 512 | 4.93 | 6.5
64 | 3D | 32 | 4096 | 64 | 512 | 8.63 | 86.4
Table 3: Performance of Tensor Parallelism with Different Number of GPUs
On System II, the communication bandwidth is lower between non-adjacent GPUs, as only adjacent GPUs have high-performance NVLink.

The GPU topology of System II is therefore not friendly to 1D tensor parallelism, which relies on all-reduce operations across all the GPUs via PCIe. In contrast, the 2D and 2.5D methods only communicate between pairs of GPUs instead of across all GPUs. This allows part of the communication to still utilize the high NVLink bandwidth between adjacent GPUs.

We trained ViT on the ImageNet-1k dataset with different configurations for 4 GPUs and 8 GPUs on both System I and II. On 4 GPUs, the ViT model has 64 Transformer layers with a hidden size of 3072 and 48 attention heads. On 8 GPUs, the hidden size and the number of attention heads are adjusted to 4096 and 64 respectively, as there is more memory available. The model is trained with increasing batch size until the out-of-memory problem occurs. As such, we present the best throughput for each tensor parallelism method. In Figure 11a, the throughput of 2D, 2.5D, and 3D tensor parallelism cannot compete with 1D tensor parallelism on either 4 GPUs or 8 GPUs. This is expected for two reasons.
Figure 10: Communication Bandwidth on System I and II (broadcasting 125 MB data using the NCCL Bandwidth Test tool). (a) Communication Bandwidth between GPU Pairs; (b) Communication Bandwidth for Collective Communication.

[Figure 11: Throughput of ViT (img/sec) with 1D, 2D, 2.5D, and 3D tensor parallelism on 4 and 8 GPUs. (a) System I; (b) System II.]

The first reason is that 1D tensor parallelism can utilize the high communication bandwidth with all GPUs involved in System I. The second reason is that 2D, 2.5D, and 3D tensor parallelism have more communication volume with a small number of processors and only surpass 1D tensor parallelism when the number of processors increases.

However, when the experiment is switched to System II in Figure 11b, 1D tensor parallelism encounters a bottleneck due to the low communication bandwidth of collective communication across 4 and 8 GPUs. Meanwhile, 2D and 2.5D can deliver a throughput that is 40% higher than that of 1D tensor parallelism with 4 GPUs. With 8 GPUs, 2.5D tensor parallelism can still outperform 1D tensor parallelism by 20.6%. 3D tensor parallelism still delivers lower performance than 1D tensor parallelism due to the low scaling.

4) Throughput Comparison
To test the performance of tensor parallelism with more GPUs, we trained ViT on System IV. As System IV only has 16 GB of GPU memory ...

In this section, we compare Sequence Parallelism with 1D tensor parallelism in terms of memory efficiency and training throughput. As Sequence Parallelism is designed for situations where activations consume more memory than model data, BERT-Base is chosen as our experiment model and trained on the Wikipedia dataset [13]. We conducted the experiments on System III. It should be noted that, since 1D tensor parallelism requires the number of attention heads (12) to be divisible by the parallel size, we can only use 4, 6, and 12 GPUs, where the 6-GPU experiment uses 2 nodes with 3 GPUs from each node. Meanwhile, Sequence Parallelism is not limited by the number of attention heads; thus we conducted experiments on 4, 8, and 12 GPUs.

[Figure 12: Maximum Batch Size and Maximum Sequence Length for BERT-Base with 1D tensor parallelism and Sequence Parallelism on 4 to 12 GPUs. (a) Max Batch Size; (b) Max Sequence Length.]

1) Memory Efficiency
We increase the batch size and sequence length until the out-of-memory problem occurs for both 1D tensor parallelism and Sequence Parallelism. The sequence length is fixed at 512 for the batch size test, while the batch size is fixed at 64 for the sequence length test.

As shown in Figure 12a, Sequence Parallelism can reach a larger batch size than 1D tensor parallelism. This is because 1D tensor parallelism has a memory bottleneck in its duplicated activations, whereas the activation is split along the sequence dimension in Sequence Parallelism. The maximum batch size of Sequence Parallelism is 4.44 times larger than that of 1D tensor parallelism with 12 GPUs. The same pattern is observed in the maximum sequence length test, as shown in Figure 12b. The maximum sequence length of Sequence Parallelism is 1.18 times larger than that of 1D tensor parallelism.
If a linear-complexity attention mechanism is used instead of the quadratic-complexity self-attention in BERT, Sequence Parallelism can achieve linear scaling of the maximum sequence length with the number of GPUs, better supporting document-level text understanding.

[Figure: Training Throughput for BERT-Base and Training Throughput for BERT-Base with Pipeline Parallelism, comparing 1D tensor parallelism and Sequence Parallelism (throughput in tokens/sec against the number of GPUs and the pipeline size).]

Figure 14: Throughput of GPT Training with Sharding and Offloading with Batch Size 4. The plot compares the throughput (tokens/sec) of DeepSpeed and Colossal-AI on 1 to 8 GPUs.
REFERENCES
[10] Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2021. All NLP Tasks Are Generation Tasks: A General Pretraining Framework. arXiv preprint arXiv:2103.10360 (2021).
[11] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, 7 (2011).
[12] Jiarui Fang, Yang Yu, Zilin Zhu, Shenggui Li, Yang You, and Jie Zhou. 2021. PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management. https://doi.org/10.48550/ARXIV.2108.05818
[13] Wikimedia Foundation. [n. d.]. Wikimedia Downloads. https://dumps.wikimedia.org
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778. https://doi.org/10.1109/CVPR.2016.90
[15] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. 2019. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2019/file/093f65e080a295f8076b1c5722a46aa2-Paper.pdf
[16] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. 2019. GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism. Curran Associates Inc., Red Hook, NY, USA.
[17] Diederik Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations (12 2014).
[18] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[19] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2021. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. In International Conference on Learning Representations. https://openreview.net/forum?id=qrwe7XHTmYb
[20] Shigang Li and Torsten Hoefler. 2021. Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (St. Louis, Missouri) (SC '21). Association for Computing Machinery, New York, NY, USA, Article 27, 14 pages. https://doi.org/10.1145/3458817.3476145
[21] Shenggui Li, Fuzhao Xue, Yongbin Li, and Yang You. 2021. Sequence Parallelism: Long Sequence Training from System Perspective. https://doi.org/10.48550/ARXIV.2105.13120
[22] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. 2020. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. Proc. VLDB Endow. 13, 12 (Aug 2020), 3005–3018. https://doi.org/10.14778/3415478.3415530
[23] Wenyan Lu, Guihai Yan, Jiajun Li, Shijun Gong, Yinhe Han, and Xiaowei Li. 2017. FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). 553–564. https://doi.org/10.1109/HPCA.2017.29
[24] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. 2019. PipeDream: Generalized Pipeline Parallelism for DNN Training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada) (SOSP '19). Association for Computing Machinery, New York, NY, USA, 1–15. https://doi.org/10.1145/3341301.3359646
[25] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. 2019. PipeDream: Generalized Pipeline Parallelism for DNN Training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada) (SOSP '19). Association for Computing Machinery, New York, NY, USA, 1–15. https://doi.org/10.1145/3341301.3359646
[26] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. 2021. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (St. Louis, Missouri) (SC '21). Association for Computing Machinery, New York, NY, USA, Article 58, 15 pages. https://doi.org/10.1145/3458817.3476209
[27] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. 1525–1534. https://doi.org/10.18653/v1/P16-1144
[28] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners.
[29] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3505–3506.
[30] Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing Billion-Scale Model Training. arXiv:2101.06840 [cs.DC]
[31] Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. CoRR abs/1802.05799 (2018). arXiv:1802.05799 http://arxiv.org/abs/1802.05799
[32] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019).
[33] Edgar Solomonik and James Demmel. 2011. Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms. In Euro-Par.
[34] Robert A. van de Geijn and Jerrell Watts. 1995. SUMMA: Scalable Universal Matrix Multiplication Algorithm. Technical Report. USA.
[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
[36] Boxiang Wang, Qifan Xu, Zhengda Bian, and Yang You. 2021. 2.5-dimensional distributed model training. arXiv preprint arXiv:2105.14500 (2021).
[37] Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-Attention with Linear Complexity. arXiv preprint arXiv:2006.04768 (2020).
[38] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv:1910.03771 [cs.CL]
[39] Qifan Xu, Shenggui Li, Chaoyu Gong, and Yang You. 2021. An Efficient 2D Method for Training Super-Large Deep Learning Models. arXiv preprint arXiv:2104.05343 (2021).
[40] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. Big Bird: Transformers for Longer Sequences. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 17283–17297. https://proceedings.neurips.cc/paper/2020/file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf
[41] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. OPT: Open Pre-trained Transformer Language Models. arXiv:2205.01068 [cs.CL]
[42] Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, and Ion Stoica. 2022. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 559–578. https://www.usenix.org/conference/osdi22/presentation/zheng-lianmin