Arxiv - 20210823 - Deepak Narayanan - Efficient Large-Scale Language Model Training On GPU Clusters Using Megatron-LM

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Deepak Narayanan‡★, Mohammad Shoeybi†, Jared Casper†, Patrick LeGresley†,
Mostofa Patwary†, Vijay Korthikanti†, Dmitri Vainbrand†, Prethvi Kashinkunti†,
Julie Bernauer†, Bryan Catanzaro†, Amar Phanishayee∗, Matei Zaharia‡
†NVIDIA ‡Stanford University ∗Microsoft Research


arXiv:2104.04473v5 [cs.CL] 23 Aug 2021

★Work done as an intern at NVIDIA.

ABSTRACT
Large language models have led to state-of-the-art accuracies across several tasks. However, training these models efficiently is challenging because: a) GPU memory capacity is limited, making it impossible to fit large models on even a multi-GPU server, and b) the number of compute operations required can result in unrealistically long training times. Consequently, new methods of model parallelism such as tensor and pipeline parallelism have been proposed. Unfortunately, naive usage of these methods leads to scaling issues at thousands of GPUs. In this paper, we show how tensor, pipeline, and data parallelism can be composed to scale to thousands of GPUs. We propose a novel interleaved pipelining schedule that can improve throughput by 10+% with memory footprint comparable to existing approaches. Our approach allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs (per-GPU throughput of 52% of theoretical peak).

1 INTRODUCTION
Transformer-based language models [13, 27, 33–35, 42, 46] in Natural Language Processing (NLP) have driven rapid progress in recent years as computation at scale has become more available and datasets have become larger. Recent work [11, 40] has shown large language models to be effective zero- or few-shot learners, with high accuracy on many NLP tasks and datasets. These large language models have a number of exciting downstream applications such as client feedback summarization, automatic dialogue generation, semantic search, and code autocompletion [1, 4, 5]. As a result, the number of parameters in state-of-the-art NLP models has grown at an exponential rate (Figure 1). Training such models, however, is challenging for two reasons: (a) it is no longer possible to fit the parameters of these models in the main memory of even the largest GPU (NVIDIA recently released 80GB-A100 cards), and (b) even if we are able to fit the model in a single GPU (e.g., by swapping parameters between host and device memory [38]), the high number of compute operations required can result in unrealistically long training times (e.g., training GPT-3 with 175 billion parameters [11] would require approximately 288 years with a single V100 NVIDIA GPU). This calls for parallelism. Data-parallel scale-out usually works well, but suffers from two limitations: a) beyond a point, the per-GPU batch size becomes too small, reducing GPU utilization and increasing communication cost, and b) the maximum number of devices that can be used is the batch size, limiting the number of accelerators that can be used for training.

Figure 1: Trend of sizes of state-of-the-art Natural Language Processing (NLP) models with time. The number of floating-point operations to train these models is increasing at an exponential rate.

Various model parallelism techniques have been proposed to address these two challenges. For example, recent work [39, 40] has shown how tensor (intra-layer) model parallelism, where matrix multiplications within each transformer layer are split over multiple GPUs, can be used to overcome these limitations. Although this approach works well for models of sizes up to 20 billion parameters on NVIDIA DGX A100 servers (with 8 80GB-A100 GPUs), it breaks down for larger models. Larger models need to be split across multiple multi-GPU servers, which leads to two problems: (a) the all-reduce communication required for tensor parallelism needs to go through inter-server links, which are slower than the high-bandwidth NVLink [9] available within a multi-GPU server, and (b) a high degree of model parallelism can create small matrix multiplications (GEMMs), potentially decreasing GPU utilization.

Pipeline model parallelism [14, 20, 23, 29, 30, 45] is another technique to support the training of large models, where layers of a model are striped over multiple GPUs. A batch is split into smaller microbatches, and execution is pipelined across these microbatches. Layers can be assigned to workers in various ways, and various schedules for the forward and backward passes of inputs can be used. The layer assignment and scheduling strategy results in different performance tradeoffs. Regardless of schedule, to preserve strict optimizer semantics, optimizer steps need to be synchronized across devices, leading to a pipeline flush at the end of every batch, where microbatches are allowed to complete execution (and no new microbatches are injected). As much as 50% of time can be spent flushing the pipeline depending on the number of microbatches injected into the pipeline. The larger the ratio of number of microbatches to the pipeline size, the smaller the time spent in the pipeline flush. Therefore, to achieve high efficiency, a larger batch size is often necessary. In this work, we also introduce a new pipeline schedule that improves efficiency at small batch sizes.

Users can thus train their large models using various techniques, each with different tradeoffs. Moreover, these techniques can be
combined. However, combining these techniques leads to non-trivial interactions, which need to be reasoned through carefully for good performance. In this paper, we address the following question:

How should parallelism techniques be combined to maximize the training throughput of large models given a batch size while retaining strict optimizer semantics?

In particular, we show how to combine pipeline, tensor, and data parallelism, a technique we call PTD-P, to train large language models with good computational performance (52% of peak device throughput) on 1000s of GPUs. Our method leverages the combination of pipeline parallelism across multi-GPU servers, tensor parallelism within a multi-GPU server, and data parallelism, to practically train models with a trillion parameters with graceful scaling in an optimized cluster environment with high-bandwidth links between GPUs on the same server and across servers. We can use similar ideas to train larger models as well, given more training resources. In our experiments, we demonstrate close to linear scaling to 3072 A100 GPUs, with an achieved end-to-end training throughput of 163 teraFLOP/s per GPU (including communication, data processing, and optimization), and an aggregate throughput of 502 petaFLOP/s, on a GPT model [11] with a trillion parameters using mixed precision. This throughput facilitates practical training times: we estimate end-to-end training of this model to take ∼3 months. We believe this is the fastest training throughput achieved for this size of model: past systems [29, 40] cannot train such large models since they do not combine pipeline and tensor parallelism. We also compared to ZeRO [36], and found that our approach outperforms ZeRO-3 by 70% for models with 175 and 530 billion parameters due to less cross-node communication. These models are too large to fit on a multi-GPU server.

Achieving this throughput at scale required innovation and careful engineering along multiple axes: efficient kernel implementations that allowed most of the computation to be compute-bound as opposed to memory-bound, smart partitioning of computation graphs over the devices to reduce the number of bytes sent over network links while also limiting device idle periods, domain-specific communication optimization, and fast hardware (state-of-the-art GPUs and high-bandwidth links between GPUs on the same and different servers). We are hopeful that our open-sourced software (available at will enable other groups to train large NLP models efficiently at scale.

In addition, we studied the interaction between the various components affecting throughput, both empirically and analytically when possible. Based on these studies, we offer the following guiding principles on how to configure distributed training:
• Different forms of parallelism interact in non-trivial ways: the parallelization strategy has an impact on the amount of communication, the compute efficiency with which kernels are executed, as well as the idle time workers spend waiting for computation due to pipeline flushes (pipeline bubbles). For example, in our experiments, we found that sub-optimal combinations of tensor and pipeline model parallelism can lead to up to 2× lower throughput, even with high-bandwidth network links between servers; tensor model parallelism is effective within a multi-GPU server, but pipeline model parallelism must be used for larger models.
• The schedule used for pipeline parallelism has an impact on the amount of communication, the pipeline bubble size, and memory used to store activations. We propose a novel interleaved schedule that can improve throughput by as much as 10% compared to previously-proposed schedules [20, 30] with comparable memory footprint.
• Values of hyperparameters such as microbatch size have an impact on the memory footprint, the arithmetic efficiency of kernels executed on the worker, and the pipeline bubble size. In our experiments, the optimal value of the microbatch size is problem-dependent and can increase throughput by 15%.
• At scale, distributed training is communication-intensive. When training a trillion-parameter model on 3072 GPUs, our implementation used an effective bisection bandwidth of 892 GB/s for pipeline-parallel communication, and 13 TB/s for data-parallel communication. Using slower inter-node interconnects or more communication-intensive partitionings would hinder scaling performance.

We should note that we do not automatically explore the search space of parallelism strategies (as FlexFlow [22], PipeDream [29], Tarnawski et al. [41], and DAPPLE [14] do), but instead suggest heuristics (in §3) that we found work well in practice.

2 MODES OF PARALLELISM
In this section, we discuss the parallelism techniques that facilitate the efficient training of large models that do not fit in the memory of a single GPU. In this work, we combine pipeline model parallelism and tensor model parallelism (combination shown in Figure 2) with data parallelism. We call this PTD-P for short.

Figure 2: Combination of tensor and pipeline model parallelism (MP) used in this work for transformer-based models.

2.1 Data Parallelism
With data parallelism [25, 43], each worker has a copy of the full model, the input dataset is sharded, and workers aggregate their gradients periodically to ensure that all workers see a consistent version of the weights. For large models which do not fit on a single worker, data parallelism can be used on smaller model shards.

2.2 Pipeline Model Parallelism
With pipeline parallelism, the layers of a model are sharded across multiple devices. When used on models with the same transformer block repeated, each device can be assigned an equal number of transformer layers. We do not consider more asymmetric model architectures, where assignment of layers to pipeline stages is harder; we defer to related work [22, 29, 41] to solve this problem.
A batch is split into smaller microbatches; execution is then pipelined across microbatches. Pipelining schemes need to ensure that inputs see consistent weight versions across forward and backward passes for well-defined synchronous weight update semantics. Specifically, naive pipelining can lead to an input seeing weight updates in the backward pass not seen in the forward pass.
To retain strict optimizer semantics exactly, we introduce periodic pipeline flushes so that optimizer steps are synchronized across devices. At the start and end of every batch, devices are idle. We call this idle time the pipeline bubble, and want to make it as small as possible. Asynchronous and bounded-staleness approaches such as PipeMare, PipeDream, and PipeDream-2BW [23, 29, 30, 45] do away with flushes completely, but relax weight update semantics. We defer consideration of such schemes to future work.
There are several possible ways of scheduling forward and backward microbatches across devices; each approach offers different tradeoffs between pipeline bubble size, communication, and memory footprint. We discuss two such approaches in this section.

Figure 3: GPipe pipeline schedule with forward passes (blue) for all microbatches (represented by numbers) followed by backward passes (green). The gray area represents the pipeline bubble. For simplicity, we assume that the backward pass takes twice as long as the forward pass. The efficiency of the pipeline schedule does not depend on this factor. Each batch in this example consists of 8 microbatches, and the numbers in each blue or green box are unique identifiers given to the corresponding microbatch (in particular, the first batch consists of microbatches 1-8, the second batch consists of microbatches 9-16, and so on). The optimizer is stepped and weight parameters updated at the pipeline flush to ensure strict optimizer semantics, leading to idle devices and a pipeline bubble.

Figure 4: Default and interleaved 1F1B pipeline schedules. The top figure shows the default non-interleaved 1F1B schedule. The bottom figure shows the interleaved 1F1B schedule, where each device is assigned multiple chunks (in this case, 2). Dark colors show the first chunk and light colors show the second chunk. The size of the pipeline bubble is smaller (the pipeline flush happens sooner in the interleaved timeline).

2.2.1 Default Schedule. GPipe [20] proposes a schedule where the forward passes for all microbatches in a batch are first executed, followed by backward passes for all microbatches (shown in Figure 3). We can quantify the size of GPipe's pipeline bubble ($t_{pb}$). We denote the number of microbatches in a batch as $m$, the number of pipeline stages (number of devices used for pipeline parallelism) as $p$, the ideal time per iteration as $t_{id}$ (assuming perfect or ideal scaling), and the time to execute a single microbatch's forward and backward pass as $t_f$ and $t_b$. In this schedule, the pipeline bubble consists of $p - 1$ forward passes at the start of a batch, and $p - 1$ backward passes at the end. The total amount of time spent in the
pipeline bubble is then $t_{pb} = (p - 1) \cdot (t_f + t_b)$. The ideal processing time for the batch is $t_{id} = m \cdot (t_f + t_b)$. Therefore, the fraction of ideal computation time spent in the pipeline bubble is:

$$\text{Bubble time fraction (pipeline bubble size)} = \frac{t_{pb}}{t_{id}} = \frac{p - 1}{m}.$$

For the bubble time fraction to be small, we thus need $m \gg p$. However, for such large $m$, this approach has a high memory footprint as it requires stashed intermediate activations (or just input activations for each pipeline stage when using activation recomputation) to be kept in memory for all $m$ microbatches through the lifetime of a training iteration.
Instead, we use the PipeDream-Flush schedule [30]. In this schedule, we first enter a warm-up phase where workers perform differing numbers of forward passes as shown in Figure 4 (top). This schedule limits the number of in-flight microbatches (the number of microbatches for which the backward pass is outstanding and activations need to be maintained) to the depth of the pipeline, instead of the number of microbatches in a batch. After the warm-up phase, each worker then enters a steady state, where workers perform one forward pass followed by one backward pass (1F1B for short). Finally, at the end of a batch, we complete backward passes for all remaining in-flight microbatches. The time spent in the bubble is the same for this new schedule, but the number of outstanding forward passes is at most the number of pipeline stages for the PipeDream-Flush schedule. As a result, this schedule requires activations to be stashed for $p$ or fewer microbatches (compared to $m$ microbatches for the GPipe schedule). Consequently, when $m \gg p$, PipeDream-Flush is much more memory-efficient than GPipe.

2.2.2 Schedule with Interleaved Stages. To reduce the size of the pipeline bubble, each device can perform computation for multiple subsets of layers (each called a model chunk), instead of a single contiguous set of layers. For example, if each device had 4 layers before (i.e., device 1 had layers 1-4, device 2 had layers 5-8, and so on), we could have each device perform computation for two model chunks (each with 2 layers), i.e., device 1 has layers 1, 2, 9, 10; device 2 has layers 3, 4, 11, 12; and so on. With this scheme, each device in the pipeline is assigned multiple pipeline stages (each pipeline stage has less computation compared to before).
As before, we can use an "all-forward, all-backward" version of this schedule, but this has a high memory footprint (proportional to $m$). Instead, we developed an interleaved schedule that adapts the memory-efficient 1F1B schedule from before. This new schedule is shown in Figure 4, and requires the number of microbatches in a batch to be an integer multiple of the degree of pipeline parallelism (number of devices in the pipeline). For example, with 4 devices, the number of microbatches in a batch must be a multiple of 4.
As shown in Figure 4, the pipeline flush for the same batch size happens sooner in the new schedule. If each device has $v$ stages (or model chunks), then the forward and backward time for a microbatch for each stage or chunk will now be $t_f/v$ and $t_b/v$. The pipeline bubble time thus reduces to $t_{pb}^{\text{int.}} = \frac{(p - 1) \cdot (t_f + t_b)}{v}$, and the bubble time fraction is then:

$$\text{Bubble time fraction (pipeline bubble size)} = \frac{t_{pb}^{\text{int.}}}{t_{id}} = \frac{1}{v} \cdot \frac{p - 1}{m}.$$

This means that the new schedule reduces the bubble time by a factor of $v$. This reduced pipeline bubble size, however, does not come for free: this schedule requires extra communication. Quantitatively, the amount of communication also increases by a factor of $v$. In the next section, we discuss how we can utilize the 8 InfiniBand networking cards in a multi-GPU server (e.g., a DGX A100 node) to reduce the impact of this extra communication.

2.3 Tensor Model Parallelism
With tensor model parallelism, individual layers of the model are partitioned over multiple devices. In this paper, we use the particular partitioning strategy used by Megatron [40] for transformer layers, the bedrock of language models. We can apply similar ideas to other types of models, like CNNs, as well. We briefly outline this strategy, illustrated in Figure 5, below.
A transformer layer consists of a self-attention block followed by a two-layer multi-layer perceptron (MLP). Further details of the transformer layer can be found in Vaswani et al. [42].
The MLP block consists of two GEMMs and a GeLU non-linearity:

$$Y = \text{GeLU}(XA), \quad Z = \text{Dropout}(YB).$$

We can split $A$ along its columns $A = [A_1, A_2]$. This partitioning allows the GeLU non-linearity to be independently applied to the output of each partitioned GEMM:

$$[Y_1, Y_2] = [\text{GeLU}(XA_1), \text{GeLU}(XA_2)].$$

This is advantageous as it removes the need for synchronization (needed if $A$ is split along its rows since GeLU is non-linear).
The second weight matrix $B$ can then be split along its rows to remove the need for any communication between the GEMMs (shown in Figure 5a), as shown below:

$$B = \begin{bmatrix} B_1 \\ B_2 \end{bmatrix}, \quad Y = [Y_1, Y_2].$$

The output of the second GEMM is then reduced across the GPUs before the dropout layer.
We exploit the inherent parallelism in the multi-head attention operation to partition the self-attention block (shown in Figure 5b). The key ($K$), query ($Q$), and value ($V$) matrices can be partitioned in a column-parallel fashion. The output linear layer can then directly operate on the partitioned output of the attention operation (weight matrix partitioned across rows).
This approach splits GEMMs in the MLP and self-attention blocks across GPUs while requiring only two all-reduce operations in the forward pass ($g$ operator) and two all-reduces in the backward pass ($f$ operator). We implemented $f$ and $g$ in a few lines of code.

3 PERFORMANCE ANALYSIS OF PARALLELIZATION CONFIGURATIONS
In this section, we consider the performance implications of combining pipeline and tensor model parallelism with data parallelism. Given a fixed budget of GPUs and batch size, one can use different degrees of the parallelism types in PTD-P to train models; each dimension exposes tradeoffs between memory footprint, device utilization, and amount of communication.
We discuss these tradeoffs in the rest of this section, and then show empirical results in §5.4. We present analytical models where
relevant for the pipeline bubble size. We qualitatively describe how communication time behaves and present cost models for amount of communication; however, we do not present direct cost models for communication time, which is harder to model for a hierarchical network topology where interconnects between GPUs on the same server have higher bandwidth than interconnects between servers. To the best of our knowledge, this is the first work to analyze the performance interactions of these parallelization dimensions.

(a) MLP. (b) Self-Attention.
Figure 5: Blocks of transformer model partitioned with tensor model parallelism (figures borrowed from Megatron [40]). $f$ and $g$ are conjugate. $f$ is the identity operator in the forward pass and all-reduce in the backward pass, while $g$ is the reverse.

3.1 Notation
We use the following notation in this section:
• $(p, t, d)$: Parallelization dimensions. $p$ for the pipeline-model-parallel size, $t$ for the tensor-model-parallel size, and $d$ for the data-parallel size.
• $n$: Number of GPUs. We require $p \cdot t \cdot d = n$.
• $B$: Global batch size (provided as input).
• $b$: Microbatch size.
• $m = \frac{1}{b} \cdot \frac{B}{d}$: Number of microbatches in a batch per pipeline.

3.2 Tensor and Pipeline Model Parallelism
Tensor and pipeline model parallelism can both be used to partition a model's parameters over multiple GPUs. As stated earlier, using pipeline parallelism with periodic flushes results in a pipeline bubble of size $(p - 1)/m$. Let us assume that $d = 1$ (data-parallel size); consequently, $t \cdot p = n$. The pipeline bubble size in terms of $t$ is:

$$\frac{p - 1}{m} = \frac{n/t - 1}{m}.$$

As $t$ increases, the pipeline bubble thus decreases for fixed $B$, $b$, and $d$ ($m = B/(b \cdot d)$ is fixed as well).
The amount of communication performed between different GPUs is also affected by the values of $p$ and $t$. Pipeline model parallelism features cheaper point-to-point communication. Tensor model parallelism, on the other hand, uses all-reduce communication (two all-reduce operations each in the forward and backward pass, see §2.3). With pipeline parallelism, the total amount of communication that needs to be performed between every pair of consecutive devices (for either the forward or backward pass) for each microbatch is $bsh$, where $s$ is the sequence length and $h$ is the hidden size. With tensor model parallelism, tensors of total size $bsh$ need to be all-reduced among $t$ model replicas twice each in the forward and backward pass for each layer, leading to a total communication of $8bsh\left(\frac{t - 1}{t}\right)$ per layer per device for each microbatch (four all-reduces, each moving $2bsh\left(\frac{t - 1}{t}\right)$ per device, assuming a bandwidth-optimal ring implementation). Each device typically has multiple layers; the total amount of tensor-parallel communication per device for each microbatch is then $l^{\text{stage}} \cdot \left(8bsh\left(\frac{t - 1}{t}\right)\right)$, where $l^{\text{stage}}$ is the number of layers in a pipeline stage.
Consequently, we see that tensor model parallelism increases the amount of communication between devices. Thus, when $t$ is larger than the number of GPUs in a single node, the overhead of performing tensor model parallelism across slower inter-node links can be impractical. We see these results empirically in §5.4.

Takeaway #1: When considering different forms of model parallelism, tensor model parallelism should generally be used up to degree $g$ when using $g$-GPU servers, and then pipeline model parallelism can be used to scale up to larger models across servers.

3.3 Data and Model Parallelism
We also want to consider the interaction between data parallelism and the two types of model parallelism. In this section, we consider these interactions independently for simplicity.

Figure 6: Fraction of time spent idling due to pipeline flush (pipeline bubble size) versus data-parallel size ($d$), for different numbers of GPUs ($n$) and ratio of batch size to microbatch size ($b' = B/b$).

3.3.1 Pipeline Model Parallelism. Let $t = 1$ (tensor-model-parallel size). The number of microbatches per pipeline is $m = B/(d \cdot b) = b'/d$, where $b' := B/b$. With total number of GPUs $n$, the number of pipeline stages is $p = n/(t \cdot d) = n/d$. The pipeline bubble size is:

$$\frac{p - 1}{m} = \frac{n/d - 1}{b'/d} = \frac{n - d}{b'}.$$
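As a rough numerical illustration of the expressions in §2.2, §3.2, and §3.3.1, the following Python sketch evaluates the bubble fraction and the communication volumes. The function names and the example values ($n = 64$, $B = 512$, $b = 1$) are illustrative assumptions and are not taken from the paper's experiments or from the Megatron-LM codebase; communication is counted in tensor elements rather than bytes.

```python
from fractions import Fraction


def bubble_fraction(p: int, m: int, v: int = 1) -> Fraction:
    """Pipeline bubble size (1/v) * (p - 1) / m.

    p: pipeline-parallel size, m: microbatches per batch per pipeline,
    v: model chunks per device (v = 1 is the non-interleaved schedule).
    """
    if p < 1 or m < 1 or v < 1:
        raise ValueError("p, m and v must be positive integers")
    return Fraction(p - 1, v * m)


def microbatches_per_pipeline(B: int, b: int, d: int) -> int:
    """m = B / (b * d); the global batch must divide evenly."""
    if B % (b * d) != 0:
        raise ValueError("B must be divisible by b * d")
    return B // (b * d)


def bubble_fraction_data_parallel(n: int, d: int, B: int, b: int) -> Fraction:
    """Bubble size for t = 1, where p = n / d; equals (n - d) / b' with b' = B / b."""
    if n % d != 0:
        raise ValueError("n must be divisible by d")
    p = n // d
    m = microbatches_per_pipeline(B, b, d)
    return bubble_fraction(p, m)


def tensor_parallel_elements_per_layer(b: int, s: int, h: int, t: int) -> Fraction:
    """All-reduce volume 8 * b * s * h * (t - 1) / t per layer, per device, per microbatch."""
    return Fraction(8 * b * s * h * (t - 1), t)


def pipeline_elements_per_microbatch(b: int, s: int, h: int) -> int:
    """Point-to-point volume b * s * h between consecutive stages, one direction."""
    return b * s * h


if __name__ == "__main__":
    n, B, b = 64, 512, 1  # illustrative values only (assumption)
    for d in (1, 2, 4, 8, 16, 32, 64):
        frac = bubble_fraction_data_parallel(n, d, B, b)
        print(f"d={d:2d}  bubble={float(frac):.4f}")
```

With these illustrative values, the bubble fraction falls from 63/512 at $d = 1$ to 0 at $d = n$, in line with $(n - d)/b'$. Multiplying the element counts by the bytes per element of the chosen precision gives a byte volume.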






As $d$ becomes larger, $n - d$ becomes smaller, and thus the pipeline bubble becomes smaller. Figure 6 shows the behavior of the pipeline bubble size for various values of $d$, $n$, and $b'$. It might not be possible to increase $d$ all the way to $n$ for all models, since a model's full training memory footprint might be larger than the memory capacity of a single accelerator.
Overall throughput will thus increase if the all-reduce communication needed for data parallelism does not drastically increase with higher $d$, which should hold since the communication time for a ring-based implementation scales with $\frac{d - 1}{d} = 1 - \frac{1}{d}$.
We can also analyze the impact of increasing the batch size $B$. For a given parallel configuration, as the batch size $B$ increases, $b' = B/b$ increases, $(n - d)/b'$ decreases, consequently increasing throughput. All-reduce communication required by data parallelism also becomes more infrequent, further increasing throughput.

3.3.2 Data and Tensor Model Parallelism. With tensor model parallelism, all-reduce communication needs to be performed for every microbatch. This can be expensive across multi-GPU servers. On the other hand, data parallelism only needs to perform expensive all-reduce communication once per batch. Moreover, with tensor model parallelism, each model-parallel rank performs a subset of the computation in each model layer, and thus for insufficiently-large layers, modern GPUs might not perform these sub-matrix computations with peak efficiency.

Takeaway #2: When using data and model parallelism, a total model-parallel size of $M = t \cdot p$ should be used so that the model's parameters and intermediate metadata fit in GPU memory; data parallelism can be used to scale up training to more GPUs.

3.4 Microbatch Size
The choice of the microbatch size $b$ also affects model-training throughput. For example, we see in Figure 7 that per-GPU throughput increases by up to 1.3× with a larger microbatch size on a single GPU. We now want to determine the optimal microbatch size $b$ given a parallel configuration $(p, t, d)$ and batch size $B$. The amount of data-parallel communication will be the same regardless of the microbatch size. Given functions $t_f(b)$ and $t_b(b)$ that map the microbatch size to the forward and backward computation times for a single microbatch, the total time spent computing a batch, ignoring communication cost, is (in this subsection, define $b'$ as $B/d$, so that $b'/b = m$):

$$\left(b'/b + p - 1\right) \cdot \left(t_f(b) + t_b(b)\right). \quad (1)$$

The microbatch size thus affects both the arithmetic intensity of operations as well as the pipeline bubble size (by affecting $m$). Figure 8 shows estimated throughput (equation (1) used to estimate processing time) for a GPT model with a billion parameters and $(p, t) = (8, 8)$. The optimal $b$ for both batch sizes is 4.

Figure 7: Per-GPU throughput versus microbatch size for a GPT model with a billion parameters (128 attention heads, hidden size of 4096, 4 transformer layers).

Figure 8: Behavior of normalized estimated throughput (time computed as $t = (b'/b + p - 1) \cdot (t_f(b) + t_b(b))$) with respect to the microbatch size $b$ for the same GPT model from Figure 7.

Takeaway #3: The optimal microbatch size $b$ depends on the throughput and memory footprint characteristics of the model, as well as the pipeline depth $p$, data-parallel size $d$, and batch size $B$.

3.5 Activation Recomputation
Activation recomputation [12, 18, 20, 21] is an optional technique that trades an increase in the number of compute operations performed for a reduced memory footprint, by running the forward pass a second time just before the backward pass (and stashing only the input activations for a given pipeline stage, as opposed to the entire set of intermediate activations, which is much larger). Activation recomputation is required to train reasonably large models with pipeline parallelism to keep memory footprint acceptably low. Previous work like PipeDream-2BW [30] has looked at the performance ramifications of activation recomputation.
The number of activation checkpoints does not impact throughput, but impacts memory footprint. Let $A^{\text{input}}$ be the size of the input activations of a layer, and $A^{\text{intermediate}}$ be the size of intermediate activations per layer. If a model stage has $l$ layers, and if $c$ is the number of checkpoints, the total memory footprint is going to be $c \cdot A^{\text{input}} + l/c \cdot A^{\text{intermediate}}$. The minimum value of this function is obtained when $c = \sqrt{l \cdot \left(A^{\text{intermediate}}/A^{\text{input}}\right)}$. In practice, we measure $A^{\text{intermediate}}$ empirically. For most cases, checkpointing every 1 or 2 transformer layers is optimal.
Other techniques such as activation partitioning [36] can also be used in conjunction with tensor model parallelism to reduce the memory footprint due to activations further.

4 IMPLEMENTATION
We implemented PTD-P as an extension to the Megatron-LM codebase. Our implementation is built using PyTorch [32]. We use NCCL [7] for communication between devices. To obtain good performance, we implemented optimizations targeting both communication and computation, which we outline below.
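The two cost models of §3.4 (equation (1)) and §3.5 are simple enough to evaluate directly. The sketch below is one possible way to do so in Python; the function names and the toy timing functions (a fixed per-microbatch overhead plus a term linear in $b$, with the backward pass twice as long as the forward pass) are assumptions made purely for illustration and are not measurements from the paper.

```python
import math
from typing import Callable, Iterable


def estimated_batch_time(
    B: int, d: int, p: int, b: int,
    t_f: Callable[[int], float], t_b: Callable[[int], float],
) -> float:
    """Equation (1): (b'/b + p - 1) * (t_f(b) + t_b(b)), with b' = B / d."""
    b_prime = B / d
    return (b_prime / b + p - 1) * (t_f(b) + t_b(b))


def best_microbatch_size(
    B: int, d: int, p: int, candidates: Iterable[int],
    t_f: Callable[[int], float], t_b: Callable[[int], float],
) -> int:
    """Candidate microbatch size with the smallest estimated batch time."""
    return min(candidates, key=lambda b: estimated_batch_time(B, d, p, b, t_f, t_b))


def checkpoint_memory(c: float, l: int, a_input: float, a_intermediate: float) -> float:
    """Footprint c * A_input + (l / c) * A_intermediate from Section 3.5."""
    return c * a_input + (l / c) * a_intermediate


def optimal_checkpoints(l: int, a_input: float, a_intermediate: float) -> float:
    """Minimizer c = sqrt(l * A_intermediate / A_input) of checkpoint_memory."""
    return math.sqrt(l * a_intermediate / a_input)


# Hypothetical timing model (assumption, for illustration only).
def toy_forward_time(b: int) -> float:
    return 1.0 + 0.5 * b


def toy_backward_time(b: int) -> float:
    return 2.0 * toy_forward_time(b)


if __name__ == "__main__":
    B, d, p = 128, 1, 8  # illustrative values, not from the paper
    best = best_microbatch_size(
        B, d, p, [1, 2, 4, 8, 16], toy_forward_time, toy_backward_time
    )
    print("best microbatch size under the toy model:", best)
    print("optimal number of checkpoints:", optimal_checkpoints(16, 1.0, 4.0))
```

Under this toy timing model the estimate is minimized at $b = 8$; with measured $t_f(b)$ and $t_b(b)$ the optimum would generally differ, which is the point of Takeaway #3. The continuous minimizer for $c$ would in practice be rounded to a feasible number of checkpoints.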
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

[Diagram omitted. Panels: (a) W/o scatter/gather optimization. (b) With scatter/gather
optimization. Labels: InfiniBand, NVLink, scatter, all-gather; ranks 1 to 4.]
Figure 9: Scatter/gather communication optimization. Light blue blocks are layers in the first
pipeline stage, and dark blue blocks are layers in the second pipeline stage. Without the
scatter/gather optimization, the same tensor is sent redundantly over inter-node InfiniBand links.
Instead, at the sender, we can scatter the tensor into smaller chunks, reducing the sizes of
tensors sent over InfiniBand links. The final tensor can then be rematerialized at the receiver
using a gather operation.

4.1 Communication Optimizations
When using pipeline parallelism, we want to send and receive tensors in the forward and backward
direction in parallel. Each DGX A100 is equipped with 8 InfiniBand (IB) networking cards.
Unfortunately, sends and receives are point-to-point, and only happen between a pair of GPUs on
two servers, making it hard to leverage all 8 cards for a single communication call within the
pipeline.
However, we can leverage the fact that we use both tensor model parallelism and pipeline model
parallelism to reduce the overhead of cross-node communication. In particular, we note that the
output of each transformer layer is replicated (after 𝑔 in MLP block, see Figure 5a) across the
tensor-parallel ranks. As a result, ranks in two consecutive pipeline stages that are performing
tensor model parallelism send and receive the exact same set of tensors (Figure 9a).
For large enough models, we use a tensor-model-parallel size of 8. This means we are sending the
same set of tensors 8 times between corresponding GPUs on adjacent multi-GPU servers. To reduce
this redundancy, we can instead split the tensor on the send side into equal-sized chunks, and
then only send one chunk to the corresponding rank on the next node using the rank’s own
InfiniBand card (e.g., rank 1 sends to rank 3 and rank 2 sends to rank 4 in Figure 9). With 8
tensor-model-parallel ranks, each chunk is one-eighth the size of the full tensor. Then, on the
receive side, we can perform an all-gather over NVLink, which is much faster than the InfiniBand
interconnect, to re-materialize the full tensor. This is shown in Figure 9b. We call this the
scatter/gather communication optimization. This optimization helps better leverage the multiple
IB cards on the DGX A100 servers, and makes more communication-intensive schedules such as the
interleaved one feasible.
Quantitatively, with the scatter/gather communication optimization, the total amount of
communication that needs to be performed between every pair of consecutive stages is reduced to
𝑏𝑠ℎ/𝑡, where 𝑏 is the microbatch size, 𝑡 is the tensor-model-parallel size, 𝑠 is the sequence
length, and ℎ is the hidden size (𝑡 = 8 in our experiments).

4.2 Computation Optimizations
We implemented three model-specific optimizations to the computation graph to attain high
performance. First, we changed the data layout in the transformer layer to avoid memory-intensive
transpose operations, and to enable the use of strided batched GEMM kernels. Specifically, we
changed the data layout from [𝑏, 𝑠, 𝑎, ℎ] to [𝑠, 𝑏, 𝑎, ℎ], where 𝑏, 𝑠, 𝑎, and ℎ are batch,
sequence, attention-head, and hidden-size dimensions, respectively. Second, we generated fused
kernels for a sequence of element-wise operations (bias + GeLU and bias + dropout + add) using
PyTorch JIT [10]. Third, we created two custom kernels to enable the fusion of scale, mask, and
softmax (reduction) operations: one to support general masking (used in models such as BERT) and
another to support implicit causal masking (used in auto-regressive models such as GPT). We
quantify the effect of these optimizations in the next section.

5 EVALUATION
In this section, we seek to answer the following questions:
• How well does PTD-P perform? Does it result in realistic end-to-end training times?
• How well does pipeline parallelism scale for a given model and batch size? How much impact does
  the interleaved schedule have on performance?
• How do different parallelization dimensions interact with each other? What is the impact of
  hyperparameters such as microbatch size?
• What is the impact of the scatter/gather communication optimization? What types of limits do we
  put on hardware when running training iterations at scale?

All of our results are run with mixed precision on the Selene supercomputer [8]. Each cluster node
has 8 NVIDIA 80-GB A100 GPUs [6], connected to each other by NVLink and NVSwitch [9]. Each node has
eight NVIDIA Mellanox 200 Gbps HDR InfiniBand HCAs for application communication, with an
additional two HCAs per node for dedicated storage. The nodes are connected in a three-level
(leaf, spine, core) fat-tree topology with 850 switches. This topology allows efficient all-reduce
communication (the dominant communication pattern in deep learning training). The cluster uses an
all-NVMe shared parallel filesystem for high-performance data access and storage. The peak device
throughput of an A100 GPU with 16-bit precision is 312 teraFLOP/s. For most of our results, we
report throughput per GPU. Aggregate throughput can be computed by multiplying by the number of
GPUs used.
For our experiments, we use GPT models of appropriate sizes. In particular, for any given
microbenchmark, the model needs to fit on the number of model-parallel GPUs used in the
experiment. We use standard model architectures such as GPT-3 [11] when appropriate.

5.1 End-to-End Performance
We consider the end-to-end performance of our system on GPT models ranging from a billion to a
trillion parameters, using tensor, pipeline, and data parallelism (degrees picked using heuristics
described in §3). In particular, we use the interleaved pipeline schedule with the scatter/gather
optimization enabled. All models use a vocabulary size (denoted by 𝑉) of 51,200 (multiple of 1024)
and a sequence length (𝑠) of 2048. We vary hidden size (ℎ), number of attention heads, and number
of layers (𝑙). The number of parameters in a model, 𝑃, can be computed as:

$$ P = 12 l h^2 \left(1 + \frac{13}{12h} + \frac{V + s}{12 l h}\right). \tag{2} $$
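As a quick sanity check of equation (2), the following sketch evaluates the parameter-count formula for a few (number of layers, hidden size) pairs taken from Table 1. The function and variable names are illustrative and not part of the Megatron-LM codebase.

```python
def gpt_parameter_count(num_layers: int, hidden_size: int,
                        vocab_size: int = 51_200, seq_length: int = 2_048) -> float:
    """Equation (2): P = 12 l h^2 (1 + 13/(12h) + (V + s)/(12 l h))."""
    l, h, V, s = num_layers, hidden_size, vocab_size, seq_length
    return 12 * l * h**2 * (1 + 13 / (12 * h) + (V + s) / (12 * l * h))


# (number of layers, hidden size) pairs copied from Table 1.
table1_configs = [(24, 2304), (40, 6144), (80, 12288), (128, 25600)]

for layers, hidden in table1_configs:
    billions = gpt_parameter_count(layers, hidden) / 1e9
    print(f"l={layers:4d}  h={hidden:6d}  P = {billions:7.1f} billion")
```

The computed values should agree with the "Number of parameters" column of Table 1 to the precision reported there.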
Number of    Attention  Hidden  Number     Tensor model-  Pipeline model-  Number   Batch  Achieved    Percentage of  Achieved
parameters   heads      size    of layers  parallel size  parallel size    of GPUs  size   teraFLOP/s  theoretical    aggregate
(billion)                                                                                  per GPU     peak FLOP/s    petaFLOP/s
1.7          24         2304    24         1              1                32       512    137         44%            4.4
3.6          32         3072    30         2              1                64       512    138         44%            8.8
7.5          32         4096    36         4              1                128      512    142         46%            18.2
18.4         48         6144    40         8              1                256      1024   135         43%            34.6
39.1         64         8192    48         8              2                512      1536   138         44%            70.8
76.1         80         10240   60         8              4                1024     1792   140         45%            143.8
145.6        96         12288   80         8              8                1536     2304   148         47%            227.1
310.1        128        16384   96         8              16               1920     2160   155         50%            297.4
529.6        128        20480   105        8              35               2520     2520   163         52%            410.2
1008.0       160        25600   128        8              64               3072     3072   163         52%            502.0

Table 1: Weak-scaling throughput for GPT models ranging from 1 billion to 1 trillion parameters.
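The last three columns of Table 1 are related by simple arithmetic: the percentage of peak is the per-GPU throughput divided by the 312 teraFLOP/s A100 peak, and the aggregate throughput is the per-GPU throughput multiplied by the number of GPUs. The sketch below reproduces this for two rows; the helper names are illustrative. Small differences from the reported aggregate values are to be expected, presumably because the per-GPU figures in the table are rounded.

```python
A100_PEAK_TFLOPS = 312.0  # peak 16-bit throughput of one A100, as stated in the text


def percent_of_peak(per_gpu_tflops: float) -> float:
    """Per-GPU throughput as a percentage of the A100 peak."""
    return 100.0 * per_gpu_tflops / A100_PEAK_TFLOPS


def aggregate_pflops(per_gpu_tflops: float, num_gpus: int) -> float:
    """Aggregate throughput in petaFLOP/s (1 petaFLOP/s = 1000 teraFLOP/s)."""
    return per_gpu_tflops * num_gpus / 1000.0


# (number of GPUs, achieved teraFLOP/s per GPU) for the smallest and largest models in Table 1.
for num_gpus, tflops in [(32, 137), (3072, 163)]:
    print(f"{num_gpus:5d} GPUs: {percent_of_peak(tflops):.0f}% of peak, "
          f"{aggregate_pflops(tflops, num_gpus):.1f} petaFLOP/s aggregate")
```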

As the model size increases, we also increase the batch size (𝐵) and the number of GPUs (𝑛). The
majority of floating-point operations in the model are performed in the matrix multiplications
(GEMMs) in the transformer and logit layers. Considering just these GEMMs, the number of FLOPs per
iteration is (more details in the Appendix):

$$ F = 96 B s l h^2 \left(1 + \frac{s}{6h} + \frac{V}{16 l h}\right). \tag{3} $$

This is a lower bound for the true FLOP count but should be close to the actual value. We count a
FLOP as a floating-point operation regardless of precision. We also note that equation (3) assumes
activation recomputation and takes into account the floating-point operations associated with the
extra forward pass.
Table 1 shows the model configurations along with the achieved FLOP/s (both per GPU and aggregate
over all GPUs). We see super-linear scaling to 3072 A100 GPUs (384 DGX A100 nodes), since GPU
utilization improves as the models get larger (larger matrix multiplications) without significant
increase in the communication time relative to computation time. Note that throughput is measured
for end-to-end training, i.e., includes all operations including data loading, optimizer steps,
communication, and logging. We achieve 52% of peak device throughput for the largest model, and
44% of peak device throughput for the smallest model.
Training Time Estimates. Given these throughputs, we can also estimate the total amount of time
needed for end-to-end training on 𝑇 tokens. Training requires 𝐼 = 𝑇/(𝐵 · 𝑠) iterations. Using the
value of 𝐹 from equation (3) and empirical end-to-end throughputs from Table 1 (denoted by 𝑋), we
can estimate total training time. We note that for the configurations in Table 1, we have 6ℎ ≫ 𝑠,
16𝑙ℎ ≫ (𝑉 + 𝑠), and 12𝑙ℎ ≫ 𝑉. Combining these observations with equations (2) and (3), we arrive
at

$$ \text{End-to-end training time} \approx \frac{8TP}{nX}. \tag{4} $$

Let us consider the GPT-3 model with 𝑃 = 175 billion parameters as an example. This model was
trained on 𝑇 = 300 billion tokens. On 𝑛 = 1024 A100 GPUs using batch size 1536, we achieve 𝑋 = 140
teraFLOP/s per GPU. As a result, the time required to train this model is 34 days. For the 1
trillion parameter model, we assume that 450 billion tokens are needed for end-to-end training.
With 3072 A100 GPUs, we can achieve a per-GPU throughput of 163 teraFLOP/s, and end-to-end training
time of 84 days. We believe these training times (using a reasonable number of GPUs) are
practical.

[Plot omitted; x-axis: number of GPUs; series: ZeRO-3 and PTD-P.]
Figure 10: Throughput per GPU of PTD-P and ZeRO-3 for two different GPT models (the 175B GPT-3
model is shown with dotted lines, and the 530B model is shown with solid lines). Global batch
sizes are fixed and ZeRO-3 is used without any model parallelism.

5.2 Comparison to ZeRO-3
We compare PTD-P to ZeRO-3 [36, 37] in Table 2 and Figure 10 for the standard GPT-3 model
architecture, as well as the 530-billion-parameter model from Table 1. The results provide a point
of comparison to a method that does not use model parallelism. We integrated ZeRO into our
codebase using the DeepSpeed Python library [3]. We keep the global batch size the same as we
increase the number of GPUs. With fewer GPUs and a microbatch size of 4, PTD-P results in 6% and
24% higher throughput for the 175- and 530-billion-parameter models respectively. As we increase
the number of GPUs, PTD-P scales more gracefully than ZeRO-3 in isolation (see Figure 10). For
example, by doubling the number of GPUs (keeping the batch size the same), PTD-P outperforms
ZeRO-3 by 70% for both models due to less cross-node communication. We note that we have only
considered ZeRO-3 without tensor parallelism. ZeRO-3 can be combined with model parallelism to
potentially improve its scaling behavior.

5.3 Pipeline Parallelism
We now evaluate the weak-scaling performance of pipeline parallelism in isolation, and also
compare the performance of the non-interleaved schedule to the interleaved schedule.

5.3.1 Weak Scaling. We evaluate the scaling of the default non-interleaved pipeline-parallel
schedule using a weak scaling setup, a GPT model with 128 attention heads and a hidden size of
20480,

Scheme                             Number of    Model-     Batch  Number   Microbatch  Achieved    Training time
                                   parameters   parallel   size   of GPUs  size        teraFLOP/s  for 300B
                                   (billion)    size                                   per GPU     tokens (days)
ZeRO-3 without Model Parallelism   174.6        1          1536   384      4           144         90
                                   174.6        1          1536   768      2           88          74
                                   174.6        1          1536   1536     1           44          74
                                   529.6        1          2560*  640      4           138         169
                                   529.6        1          2240   1120     2           98          137
                                   529.6        1          2240   2240     1           48          140
PTD Parallelism                    174.6        96         1536   384      1           153         84
                                   174.6        96         1536   768      1           149         43
                                   174.6        96         1536   1536     1           141         23
                                   529.6        280        2240   560      1           171         156
                                   529.6        280        2240   1120     1           167         80
                                   529.6        280        2240   2240     1           159         42

Table 2: Comparison of PTD Parallelism to ZeRO-3 (without model parallelism). The 530-billion-parameter GPT model did not fit on 560
GPUs when using a microbatch size of 4 with ZeRO-3, so we increased the number of GPUs used to 640 and global batch size to 2560 to provide
a throughput estimate (relevant row marked in table with a *).
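The relative speedups quoted in §5.2 can be recomputed directly from the per-GPU throughputs in Table 2. The sketch below does this for the configurations with the fewest GPUs and for the configurations with twice as many GPUs; the dictionary layout and helper name are illustrative, and the numbers are copied from the table.

```python
# Achieved teraFLOP/s per GPU from Table 2, keyed by (scheme, model size in billions, GPUs).
throughput = {
    ("ZeRO-3", 175, 384): 144, ("PTD-P", 175, 384): 153,
    ("ZeRO-3", 530, 640): 138, ("PTD-P", 530, 560): 171,
    ("ZeRO-3", 175, 768): 88,  ("PTD-P", 175, 768): 149,
    ("ZeRO-3", 530, 1120): 98, ("PTD-P", 530, 1120): 167,
}


def speedup_percent(ptd_tflops: float, zero_tflops: float) -> float:
    """Percentage by which PTD-P per-GPU throughput exceeds ZeRO-3."""
    return 100.0 * (ptd_tflops / zero_tflops - 1.0)


comparisons = [
    ("175B, fewest GPUs", ("PTD-P", 175, 384), ("ZeRO-3", 175, 384)),
    ("530B, fewest GPUs", ("PTD-P", 530, 560), ("ZeRO-3", 530, 640)),
    ("175B, doubled GPUs", ("PTD-P", 175, 768), ("ZeRO-3", 175, 768)),
    ("530B, doubled GPUs", ("PTD-P", 530, 1120), ("ZeRO-3", 530, 1120)),
]
for label, ptd_key, zero_key in comparisons:
    pct = speedup_percent(throughput[ptd_key], throughput[zero_key])
    print(f"{label}: PTD-P is {pct:.0f}% faster per GPU")
```

Note that the 530B ZeRO-3 baseline with the fewest GPUs uses the starred row (640 GPUs, batch size 2560), so that comparison is only approximate.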





and a microbatch size of 1. As we increase the number of pipeline stages, we also increase the
size of the model by proportionally increasing the number of layers in the model, e.g., with a
pipeline-parallel size of 1, we use a model with 3 transformer layers and 15 billion parameters,
and with a pipeline-parallel size of 8, we use a model with 24 transformer layers and 121 billion
parameters. We use a tensor-parallel size of 8 for all configurations, and vary the total number
of A100 GPUs used from 8 to 64. Figure 11 shows throughput per GPU for two different batch sizes
to illustrate the impact of the pipeline bubble, which behaves as (𝑝 − 1)/𝑚 (§2.2.1). As expected,
the higher batch size scales better since the pipeline bubble is amortized over more
microbatches.

Figure 11: Throughput per GPU of pipeline parallelism using two different batch sizes in a
weak-scaling experiment setup (model size increases with the pipeline-parallel size).

5.3.2 Interleaved versus Non-Interleaved Schedule. Figure 12 shows the per-GPU throughput for
interleaved and non-interleaved schedules on the GPT-3 [11] model with 175 billion parameters (96
layers, 96 attention heads, hidden size of 12288). The interleaved schedule with the
scatter/gather communication optimization has higher computational performance than the
non-interleaved (default) schedule. This gap closes as the batch size increases due to two
reasons: (a) as the batch size increases, the bubble size in the default schedule decreases, and
(b) the amount of point-to-point communication within the pipeline is proportional to the batch
size, and consequently the non-interleaved schedule catches up as the amount of communication
increases (the interleaved schedule features more communication per sample). Without the
scatter/gather optimization, the default schedule performs better than the interleaved schedule
at larger batch sizes (not shown).

[Plot omitted; x-axis: batch size; series: non-interleaved and interleaved.]
Figure 12: Throughput per GPU of interleaved and non-interleaved schedules for a GPT model (175
billion parameters) on 96 GPUs.

Figure 13: Throughput per GPU of various parallel configurations that combine pipeline and tensor
model parallelism using a GPT model with 162.2 billion parameters and 64 A100 GPUs.

5.4 Comparison of Parallel Configurations
In this sub-section, we show the various tradeoffs associated with combining different
parallelization dimensions. In particular, we show the performance for parallel configurations
using the same number of GPUs for a given model and multiple batch sizes.

5.4.1 Tensor versus Pipeline Parallelism. We evaluate the impact of pipeline and tensor model
parallelism on performance for a given model and batch size. The empirical results in Figure 13
show the
importance of using both tensor and pipeline model parallelism in conjunction to train a
161-billion-parameter GPT model (32 transformer layers to support pipeline-parallel size of 32,
128 attention heads, hidden size of 20480) with low communication overhead and high compute
resource utilization. We observe that tensor model parallelism is best within a node (DGX A100
server) due to its expensive all-reduce communication. Pipeline model parallelism, on the other
hand, uses much cheaper point-to-point communication that can be performed across nodes without
bottlenecking the entire computation. However, with pipeline parallelism, significant time can be
spent in the pipeline bubble: the total number of pipeline stages should thus be limited so that
the number of microbatches in the pipeline is a reasonable multiple of the number of pipeline
stages. Consequently, we see peak performance when the tensor-parallel size is equal to the number
of GPUs in a single node (8 with DGX A100 nodes). This result indicates that neither tensor model
parallelism (used by Megatron [40]) nor pipeline model parallelism (used by PipeDream [30] and
others) in isolation can match the performance of using both techniques in conjunction.

5.4.2 Pipeline versus Data Parallelism. We evaluate the impact of data and pipeline model
parallelism on performance for a GPT model with 5.9 billion parameters (32 transformer layers, 32
attention heads, hidden size of 3840) in Figure 14. We use a smaller model than before since we
want to show performance for models that fit when the model-parallel size is only 2. For
simplicity, we keep the microbatch size equal to 1 in these experiments. We see that for each
batch size, the throughput decreases as the pipeline-parallel size increases, matching our
analytical model from §3.3. Pipeline model parallelism should be used primarily to support the
training of large models that do not fit on a single worker, and data parallelism should be used
to scale up training.

[Plot omitted; x-axis: (pipeline-parallel size, data-parallel size); one series per batch size.]
Figure 14: Throughput per GPU of various parallel configurations that combine data and pipeline
model parallelism using a GPT model with 5.9 billion parameters, three different batch sizes,
microbatch size of 1, and 64 A100 GPUs.

5.4.3 Tensor versus Data Parallelism. We also evaluate the impact of data and tensor model
parallelism on performance for the same GPT model with 5.9 billion parameters in Figure 15
(smaller model used for the same reason as above). As before, we keep the microbatch size equal to
1 initially. With larger batch sizes and a microbatch size of 1, data-parallel communication is
infrequent; the all-to-all communication required in tensor model parallelism needs to be
performed for every microbatch in a batch. This all-to-all communication with tensor model
parallelism dominates end-to-end training time, especially when communication needs to be
performed across multi-GPU nodes. Additionally, as the tensor-model-parallel size increases, we
perform smaller matrix multiplications on every GPU, decreasing utilization on each GPU.
We should note that although data parallelism can lead to efficient scaling, we cannot use data
parallelism in isolation for very large models with a limited training batch size because of a)
insufficient memory capacity, and b) scaling limitations of data parallelism (e.g., GPT-3 was
trained to convergence with a batch size of 1536. Data parallelism thus supports parallelization
to only 1536 GPUs; however, roughly 10,000 GPUs were used to train this model in a reasonable
amount of time).

[Plot omitted; x-axis: (tensor-parallel size, data-parallel size); one series per batch size.]
Figure 15: Throughput per GPU of various parallel configurations that combine data and tensor
model parallelism using a GPT model with 5.9 billion parameters, three different batch sizes,
microbatch size of 1, and 64 A100 GPUs.

[Plot omitted; x-axis: microbatch size; one series per batch size.]
Figure 16: Throughput per GPU of a (𝑡, 𝑝) = (8, 8) parallel configuration for different
microbatch sizes on a GPT model with 91 billion parameters, for two different batch sizes using 64
A100 GPUs.

5.5 Microbatch Size
We evaluate the impact of the microbatch size on the performance of parallel configurations that
combine pipeline and tensor model parallelism in Figure 16 for a model with 91 billion parameters
((𝑡, 𝑝) = (8, 8)). We see that the best microbatch size is 2 for this model; the optimal
microbatch size is model-dependent and differs for other models (not shown in the figure). For a
given batch size, increasing the microbatch size decreases the number of microbatches in the
pipeline (𝑚), leading to a larger pipeline bubble; however, increasing the microbatch size can
also improve GPU utilization by increasing the arithmetic intensity of executed kernels. These two
factors are at odds with each other, which makes the choice of optimal microbatch size
challenging. Our analytical model from §3.3 reasonably approximates true performance, and can be
used as a proxy to determine how to pick this hyperparameter value for various training
configurations and models.
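To illustrate how an analytical model can serve as a proxy when picking the microbatch size, the sketch below evaluates the compute-time expression of equation (1), (𝑏′/𝑏 + 𝑝 − 1) · (𝑡_𝑓(𝑏) + 𝑡_𝑏(𝑏)), for a few candidate microbatch sizes. The timing functions here are hypothetical (a fixed per-microbatch overhead plus a per-sample cost, with the backward pass assumed to cost twice the forward pass); in practice 𝑡_𝑓 and 𝑡_𝑏 would be measured on the target hardware. Communication cost is ignored, as in equation (1).

```python
from typing import Callable, Iterable


def batch_compute_time(b: int, batch_size: int, d: int, p: int,
                       t_f: Callable[[int], float],
                       t_b: Callable[[int], float]) -> float:
    """Equation (1): (b'/b + p - 1) * (t_f(b) + t_b(b)), with b' = B / d."""
    b_prime = batch_size / d
    return (b_prime / b + p - 1) * (t_f(b) + t_b(b))


def best_microbatch_size(candidates: Iterable[int], batch_size: int, d: int, p: int,
                         t_f: Callable[[int], float],
                         t_b: Callable[[int], float]) -> int:
    """Return the candidate microbatch size with the smallest modeled compute time."""
    return min(candidates, key=lambda b: batch_compute_time(b, batch_size, d, p, t_f, t_b))


# Hypothetical timing model (arbitrary time units), NOT measured values.
def forward_time(b: int) -> float:
    return 0.2 + 1.0 * b          # fixed overhead + per-sample cost


def backward_time(b: int) -> float:
    return 2.0 * forward_time(b)  # assumption: backward costs twice the forward pass


# Example setting: batch size 128, no data parallelism (d = 1), 8 pipeline stages.
candidates = [1, 2, 4, 8]
for b in candidates:
    t = batch_compute_time(b, batch_size=128, d=1, p=8, t_f=forward_time, t_b=backward_time)
    print(f"microbatch size {b}: modeled compute time {t:.1f}")
print("best:", best_microbatch_size(candidates, 128, 1, 8, forward_time, backward_time))
```

With this toy timing model, a larger microbatch amortizes the fixed overhead but enlarges the pipeline bubble, so the modeled minimum falls at an intermediate microbatch size.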

5.6 Activation Recomputation
Figure 17 shows throughput with and without activation recomputation for a GPT model with 145
billion parameters (80 transformer layers, 96 attention heads, hidden size of 12288) using 128
A100 GPUs, (𝑡, 𝑝) = (8, 16), and a range of batch sizes. For small batch sizes, activation
recomputation leads to up to 33% lower throughput (in sequences per second) due to the extra
forward pass that needs to be executed during the backward pass. However, activation
recomputation is needed to support larger batch sizes. Throughput at large batch sizes with
activation recomputation is up to 2× higher than the best throughput achieved without activation
recomputation (for a smaller batch size) due to a smaller pipeline bubble.

[Plot omitted; x-axis: batch size.]
Figure 17: Throughput (in sequences per second) with and without activation recomputation for a
GPT model with 145 billion parameters using 128 A100 GPUs ((𝑡, 𝑝) = (8, 16)).

5.7 Scatter-Gather Optimization
Figure 18 shows per-GPU throughput with and without (unoptimized) the scatter/gather
communication optimization for the GPT-3 model with 175 billion parameters. We see an improvement
of up to 11% in throughput for communication-intensive schedules (large batch size with
interleaving) by reducing the amount of communication over cross-node links.

[Plot omitted; x-axis: batch size; series: unoptimized and scatter/gather optimization.]
Figure 18: Throughput per GPU with and without the scatter/gather optimization for a GPT model
with 175 billion parameters using 96 A100 GPUs and the interleaved schedule.

5.8 Fused Operators
We also evaluate the performance impact of operator fusion described in §4.2. For the GPT-3 model
(175 billion parameters), throughput increased by 19% with fusion (113 teraFLOP/s per GPU to 135
teraFLOP/s per GPU). For the larger GPT model with 530 billion parameters (model configuration in
Table 1), throughput increased by 11% (133 teraFLOP/s per GPU to 148 teraFLOP/s per GPU).

5.9 Inter-Node Communication Bandwidth
Our strong results are a byproduct of using an optimized software and hardware stack together. In
particular, we take advantage of the high-bandwidth communication links between GPUs on the same
server and across servers. On the trillion-parameter model with 3072 GPUs, we observed that the
effective bisection bandwidth of point-to-point communication among pipeline stages is 892 GB/s,
while the effective bisection bandwidth of all-reduce operations among data-parallel replicas is
12.9 TB/s. A less-optimized partitioning of operators across devices would lead to more inter-node
communication, hampering scaling performance.

5.10 Checkpoint Loading and Saving
An important practical consideration for the training of large models is loading and saving model
checkpoints, which are especially large for the models considered in this paper. For example, the
trillion-parameter model has a checkpoint of size 13.8 terabytes. The initial load of checkpoints
for the trillion-parameter model by all 384 nodes (3072 GPUs) reaches a peak read bandwidth of
1 TB/s, the maximum read throughput possible from the parallel filesystem. Checkpoint saves reach
40% of peak write bandwidth (273 GB/s).

6 RELATED WORK
In this section, we discuss other techniques to train models at scale.

Parallelism for Large Models. Pipeline model parallelism is a common technique used to train large
models. Pipeline parallelism comes in a few flavors: the mode discussed in this paper uses flushes
to ensure strict optimizer semantics. TeraPipe [26] exposes fine-grained pipeline parallelism
across tokens in a single training sequence for auto-regressive models like GPT. PipeTransformer
[19] elastically adjusts the degree of pipelining and data parallelism by freezing layers with
“stable” weights, and instead dedicates resources to train the remaining “active” layers. HetPipe
[31] uses a combination of pipeline and data parallelism on a set of heterogeneous accelerators.
Pipeline parallelism can also be implemented with relaxed semantics: PipeDream-2BW [30] maintains
two weight versions and guarantees 1-stale weight updates without expensive flushes, while
PipeMare [45] and Kosson et al. [23] use asynchronous pipeline parallelism. These techniques have
improved throughput compared to the techniques with pipeline flushes considered in this paper, but
potentially at the cost of convergence rate or final accuracy. Moreover, pipeline parallelism in
isolation can still only scale to a number of devices equal to the number of layers in the model,
which is limiting for certain model architectures.
PipeDream [29] combined pipeline parallelism and data parallelism in a principled way to reduce
cross-device communication. DeepSpeed [2] combined pipeline parallelism with tensor and data
parallelism to train models with up to a trillion parameters, but with lower throughput than what
was shown in this paper (52% vs. 36% of peak) for a few reasons: operator fusion to keep most of
the operator graph compute-bound, a more-efficient pipeline parallelism schedule to minimize the
pipeline bubble size, fast hardware (A100 vs. V100 GPUs and high-bandwidth links between GPUs on
the same and different servers), and scaling to more GPUs. We want to emphasize that this higher
throughput makes estimated
training times much more practical (about 3 months); an aggregate throughput of 37.6 petaFLOP/s
would take about 40 months to train an equivalently-sized model. We can scale to larger models as
well, but would need more GPUs to keep training time practical.
Mesh-TensorFlow [39] proposes a language for easily specifying parallelization strategies that
combine data and model parallelism. Switch Transformers [15] used Mesh-TensorFlow to train a
sparsely activated expert-based model with 1.6 trillion parameters, with improved pre-training
speed over the T5-11B model [35].

Sharded Data Parallelism. As part of performance optimizations for MLPerf 0.6 [28], sharded data
parallelism [24, 44], where optimizer state is sharded over data-parallel workers, was introduced.
This method has two advantages: (a) it does not introduce extra communication over vanilla data
parallelism, and (b) it divides the optimizer’s computation and memory cost across the
data-parallel partitions. ZeRO [36, 37] extends this idea: weight parameters and gradients are
sharded across data-parallel workers as well, and workers fetch relevant state from their “owning”
workers before performing computations. This adds additional communication, which can be partially
hidden by carefully overlapping computation and communication. However, this can become harder if
tensor parallelism is not used or the batch size is not large enough to hide the extra
communication overhead (Figure 10). ZeRO-Infinity [37] uses NVMe to efficiently swap parameters,
enabling the training of very large models on a small number of GPUs. We note that using a small
number of GPUs for training a very large model results in unrealistic training times (e.g.,
thousands of years to converge).

Automatic Partitioning. FlexFlow [22], PipeDream [29], DAPPLE [14], and Tarnawski et al. [41] all
auto-partition model training graphs over multiple devices with the help of cost models. However,
none of these considers all the parallelism dimensions considered in this paper: pipeline and
tensor model parallelism, data parallelism, microbatch size, and the effect of memory-saving
optimizations like activation recomputation on the training of models larger than the memory
capacity of an accelerator. These added dimensions increase the search space that needs to be
explored. Gholami et al. [16] show how communication costs for combinations of data and model
parallelism can be modeled.

HPC for Model Training. Goyal et al. [17] and You et al. [47] both demonstrate the use of High
Performance Computing techniques to train highly-accurate ImageNet models in minutes. However, the
image classification models considered fit comfortably on a single accelerator, rendering model
parallelism unnecessary, support very large batch sizes (> 32k) that allow scaling data
parallelism to large worker counts with infrequent communication, and are composed of compact
convolutional layers that are inherently amenable to data-parallel communication.

7 DISCUSSION AND CONCLUSION
In this paper, we have shown how PTD-P (inter-node pipeline parallelism, intra-node tensor
parallelism, and data parallelism) can be composed to achieve high aggregate throughput (502
petaFLOP/s) while training large models with a trillion parameters. This facilitates end-to-end
training in reasonable times (estimated time of around 3 months for a trillion-parameter model).
We discussed the various tradeoffs associated with each of these types of parallelism, and how the
interactions between them need to be considered carefully when combined.
Even though the implementation and evaluation in this paper is GPU-centric, many of these ideas
translate to other types of accelerators as well. Concretely, the following are ideas that are
accelerator-agnostic: a) the idea of smartly partitioning the model training graph to minimize the
amount of communication while still keeping devices active, b) minimizing the number of
memory-bound kernels with operator fusion and careful data layout, c) other domain-specific
optimizations (e.g., scatter-gather optimization).

ACKNOWLEDGEMENTS
We thank the anonymous reviewers, Seonmyeong Bak, Keshav Santhanam, Trevor Gale, Dimitrios
Vytiniotis, and Siddharth Karamcheti for their help and feedback that improved this work. This
research was supported in part by NSF Graduate Research Fellowship grant DGE-1656518 and NSF
CAREER grant CNS-1651570. Any opinions, findings, and conclusions or recommendations expressed in
this material are those of the authors alone.

APPENDIX: FLOATING-POINT OPERATIONS
In this section, we describe how we calculate the number of floating-point operations (FLOPs) in a
model. We consider a language model with 𝑙 transformer layers, hidden size ℎ, sequence length 𝑠,
vocabulary size 𝑉, and training batch size 𝐵.
A $A_{m \times k} \times X_{k \times n}$ matrix multiplication requires 2𝑚 × 𝑘 × 𝑛 FLOPs (factor
of 2 needed to account for multiplies and adds).
A transformer layer consists of an attention block followed by a 2-layer feed-forward network. For
the attention block, the main FLOP contributors are the key, query, and value transformation
($6Bsh^2$ operations), attention matrix computation ($2Bs^2h$ operations), attention over values
($2Bs^2h$ operations), and post-attention linear projection ($2Bsh^2$ operations). The
feed-forward network increases the hidden size to 4ℎ and then reduces it back to ℎ; this requires
$16Bsh^2$ FLOPs. Summing these together, each transformer layer results in $24Bsh^2 + 4Bs^2h$
FLOPs for the forward pass. The backward pass requires double the number of FLOPs since we need to
calculate the gradients with respect to both input and weight tensors. In addition, we are using
activation recomputation, which requires an additional forward pass before the backward pass. As a
result, the total number of FLOPs per transformer layer is

$$ 4 \times \left(24Bsh^2 + 4Bs^2h\right) = 96Bsh^2\left(1 + \frac{s}{6h}\right). $$

The other main contributor to the FLOP count is the logit layer in the language model head, which
transforms features of dimension ℎ to the vocabulary dimension 𝑉. The required FLOPs for this
operation is $2BshV$ in the forward pass and $4BshV$ in the backward pass, resulting in $6BshV$
FLOPs in total.
Thus, for a transformer model with 𝑙 transformer layers, the total number of floating-point
operations is:

$$ 96Bslh^2\left(1 + \frac{s}{6h} + \frac{V}{16lh}\right). $$
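The FLOP accounting in the appendix can be checked numerically. The sketch below sums the individual contributions described there, compares the result against the closed-form expression of equation (3), and then estimates training time both from that FLOP count and from the approximation 8𝑇𝑃/(𝑛𝑋) of equation (4). Function names are illustrative. The GPT-3 settings (96 layers, hidden size 12288, batch size 1536, 1024 GPUs, 140 teraFLOP/s per GPU, 300 billion tokens) are taken from the text; using the vocabulary size of 51,200 for this model is an assumption.

```python
SECONDS_PER_DAY = 86_400


def flops_from_components(B, s, l, h, V):
    """Sum the per-iteration FLOP contributions listed in the appendix."""
    attention = (6 * B * s * h**2      # key, query, and value transformations
                 + 2 * B * s**2 * h    # attention matrix computation
                 + 2 * B * s**2 * h    # attention over values
                 + 2 * B * s * h**2)   # post-attention linear projection
    feed_forward = 16 * B * s * h**2   # h -> 4h -> h
    layer_forward = attention + feed_forward        # 24Bsh^2 + 4Bs^2h
    # Forward + backward (2x forward) + recomputed forward = 4x forward.
    layer_total = 4 * layer_forward
    logit_total = 6 * B * s * h * V                 # 2BshV forward + 4BshV backward
    return l * layer_total + logit_total


def flops_closed_form(B, s, l, h, V):
    """Equation (3): F = 96 B s l h^2 (1 + s/(6h) + V/(16 l h))."""
    return 96 * B * s * l * h**2 * (1 + s / (6 * h) + V / (16 * l * h))


def training_days_from_flops(T, B, s, l, h, V, n, X):
    """Training time as iterations * F / (n * X), with X in FLOP/s per GPU."""
    iterations = T / (B * s)
    return iterations * flops_closed_form(B, s, l, h, V) / (n * X) / SECONDS_PER_DAY


def training_days_approx(T, P, n, X):
    """Equation (4): training time ~ 8 T P / (n X)."""
    return 8 * T * P / (n * X) / SECONDS_PER_DAY


# GPT-3 example; V = 51,200 is assumed to match the models in Table 1.
gpt3 = dict(B=1536, s=2048, l=96, h=12288, V=51_200)
days_detailed = training_days_from_flops(T=300e9, n=1024, X=140e12, **gpt3)
days_approx = training_days_approx(T=300e9, P=175e9, n=1024, X=140e12)
print(f"from equation (3): {days_detailed:.1f} days; from equation (4): {days_approx:.1f} days")
```

Both estimates land close to the 34 days quoted in §5.1; the approximation is slightly lower because it drops the 𝑠/(6ℎ) and 𝑉/(16𝑙ℎ) terms.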

