
Sequence Parallelism: Long Sequence Training from System Perspective

Shenggui Li 1 Fuzhao Xue 1 Chaitanya Baranwal 1 Yongbin Li 2 Yang You 1

1 School of Computing, National University of Singapore; 2 School of Computer Science and Engineering, Nanyang Technological University. Correspondence to: Yang You <[email protected]>.

Work presented at the ES-FoMo Workshop at ICML 2023, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

Transformer achieves promising results on various tasks. However, self-attention suffers from quadratic memory requirements with respect to the sequence length. Existing work focuses on reducing time and space complexity from an algorithm perspective. In this work, we propose sequence parallelism, a memory-efficient parallelism that solves this issue from a system perspective instead. With sequence parallelism, we no longer require a single device to hold the whole sequence. Moreover, using efficient attention with linear complexity, our sequence parallelism enables us to train Transformer with infinitely long sequences. Experiments show that sequence parallelism performs well when scaling with batch size and sequence length. Compared with tensor parallelism, our approach achieved 13.7× and 3.0× maximum batch size and sequence length respectively when scaling up to 64 NVIDIA P100 GPUs. With efficient attention, sequence parallelism can handle sequences with over 114K tokens, which is over 27× longer than existing efficient attention works that hold the whole sequence on a single device.

1. Introduction

Transformer-based language models (Radford et al., 2019; Brown et al., 2020; Devlin et al., 2018) have achieved impressive performance on various natural language understanding and generation tasks (e.g., Q&A (Qu et al., 2019; Yang et al., 2020), relation extraction (Xue et al., 2020b;a; Zhou et al., 2020) and dialogue systems (Ni et al., 2021)). Recently, Transformer has also achieved promising results on computer vision tasks (Dosovitskiy et al., 2020; Zhang et al., 2020; 2021) and even on bioinformatics tasks (Elnaggar et al., 2020; Wang et al., 2021). These Transformer-based models learn powerful context-aware representations by applying self-attention to all pairs of tokens from the input sequence. This mechanism captures long-term dependencies at the token level for sequence modeling. However, self-attention suffers from quadratic memory requirements with respect to sequence length. Existing works on long sequence modeling address this problem from the algorithm perspective; that is, they mainly try to reduce the time and space complexity of attention. In this paper, we focus on solving the long sequence training problem from the system perspective. Existing systems require us to hold the whole sequence on one GPU, which limits the length of the input sequence. Unfortunately, long sequences are common in real-world applications. For instance, when we train Transformer for medical image classification, each image is much larger than usual (e.g., 512×512×512 vs 256×256×3). Each medical image then has many more tokens (i.e., over 512×), and each input sequence is much longer than usual. In this case, it is challenging to hold the whole sequence within a single GPU.

In this paper, we designed and implemented sequence parallelism, which aims at breaking the limitation that we must store the whole sequence on one GPU. The proposed system can train Transformer-based models with longer sequences and a larger batch size. Specifically, we first split the input sequence into multiple chunks along the sequence dimension and feed each sub-sequence chunk to one corresponding GPU. Each GPU thus only holds a part of the full sequence, i.e., a sub-sequence. To apply self-attention to the tokens from different chunks, the main challenge is to compute attention scores and outputs across GPUs efficiently. To tackle this problem, we proposed Ring Self-Attention (RSA), which circulates key and value embeddings across GPUs in a ring manner. In this case, each device only needs to keep the attention embeddings corresponding to its own sub-sequence. As a result, our sequence parallelism is memory-efficient, especially for long input sequences.

To model long sequences, existing works mainly focus on efficient attention (e.g., (Zaheer et al., 2020)) with linear instead of quadratic space complexity. In this paper, we aim to solve the long sequence modeling problem from the distributed system perspective.


[Figure 1: Ring Self-Attention. (a) Transmitting key embeddings among devices to calculate attention scores. (b) Transmitting value embeddings among devices to calculate the output of attention layers.]

We evaluated our system on vanilla attention to verify that it is a general solution, and on an efficient attention setting to show the upper bound of sequence length. Existing pipeline parallelism (Huang et al., 2018) and tensor parallelism (Shoeybi et al., 2019) are designed to cope with a larger model size instead of longer sequences. However, when the sequence is long, the challenge is that existing parallelisms must keep the whole sequence on one single device. Even if splitting the model along the hidden and attention-head dimensions (i.e., tensor parallelism) or the depth dimension (i.e., pipeline parallelism) can still process longer sequences to some extent, the number of attention heads and the depth are much smaller than the sequence length (e.g., 12 vs 512), which limits the training scalability and the maximum length of the input sequence. In contrast, our approach splits the whole sequence across multiple devices, enabling it to fit longer input data.

Contributions (1) Our system breaks the length limitation of Transformer model training. Sequence parallelism splits long sequences into multiple chunks and feeds them into different devices. With linear space complexity attention, sequence parallelism can help us train attention models with infinitely long sequences. (2) To the best of our knowledge, our work is the first to propose using a distributed system to handle long sequence training for attention-based models. Our implementation is fully based on PyTorch and is compatible with data parallelism, pipeline parallelism, and tensor parallelism without any extra compiler or library. This makes it possible to integrate sequence parallelism with data parallelism, pipeline parallelism and tensor parallelism into 4D parallelism, and paves the way to train large-scale models with long sequences.

2. Sequence parallelism

We propose sequence parallelism for training Transformer with longer sequences. Input sequences are split into multiple chunks and the sub-sequences are fed to different corresponding devices. All devices hold the same trainable parameters but different sub-sequence input chunks. We introduce and analyze sequence parallelism in detail below. We use the following notation in this section: (1) B: batch size; (2) L: sequence length; (3) H: hidden size of linear layers; (4) A: attention head size; (5) Z: number of attention heads; (6) N: number of GPUs.
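As a concrete illustration of this setup, the minimal sketch below (assuming PyTorch; the sizes are hypothetical, not the paper's training configuration) splits an input batch of shape (B, L, H) along the sequence dimension into N sub-sequence chunks of shape (B, L/N, H), one per device.

```python
import torch

# Illustrative sizes only (not the paper's training configuration).
B, L, H, N = 2, 512, 768, 4          # batch, sequence length, hidden size, number of devices
assert L % N == 0                    # the sequence length must be divisible by the parallel size

x = torch.randn(B, L, H)             # the full input sequence, shown here only for illustration

# Split along the sequence dimension: chunk n would be sent to device n.
chunks = torch.chunk(x, N, dim=1)    # N tensors of shape (B, L/N, H)

for rank, sub_seq in enumerate(chunks):
    print(f"device {rank}: sub-sequence shape {tuple(sub_seq.shape)}")  # (B, L/N, H)
```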

2
Sequence Parallelism: Long Sequence Training from System Perspective

To distribute sub-sequences to multiple devices, the main challenge is calculating attention scores across devices. Therefore, we propose Ring Self-Attention (RSA) to compute the attention output in a distributed setting. There are two steps in RSA to obtain the final output. Please note that we only consider bidirectional self-attention here to introduce RSA succinctly, and we treat all heads equally so that it extends to multi-head attention directly.

Given query embeddings {q_1^1, q_2^1, ..., q_L^N}, key embeddings {k_1^1, k_2^1, ..., k_L^N} and value embeddings {v_1^1, v_2^1, ..., v_L^N}, where q_s^n represents the query embedding of the s-th token in the sequence, held on the n-th device, we define all key embeddings on the n-th device as K^n. In RSA, the n-th device holds the corresponding query embeddings Q^n, key embeddings K^n and value embeddings V^n. The embeddings on the n-th device correspond to the n-th chunk, whose sub-sequence length is L/N. Our goal is to obtain Attention^n(Q^n, K, V), which is the self-attention layer output on the n-th device. To this end, as shown in Figure 1(a), we first transmit the key embeddings among devices to calculate the attention scores QK^T in a circular fashion. Such communication needs to be conducted N - 1 times so that the query embeddings of each sub-sequence can multiply all the key embeddings. To be more specific, each device first computes the partial attention scores based on its local query and key embeddings. Then, in each ring-style communication step, it receives different key embeddings from the previous device and calculates the partial attention scores with respect to the new key embeddings. As a result, all query embeddings {Q^1, Q^2, ..., Q^N} have collected their corresponding attention scores {S^1, S^2, ..., S^N} on their own devices.

In the second stage of RSA, we calculate the self-attention layer output {O^1, O^2, ..., O^N} based on {S^1, S^2, ..., S^N} and {V^1, V^2, ..., V^N}. Since computing O^n requires S^n and all value embeddings, as described in Figure 1(b), we transmit all value embeddings instead of key embeddings in a similar way. For O^n, we calculate O^n = S^n V = Σ_{i=1}^{N} S_i^n V^i, where S_i^n is S^n after column-wise splitting, which means S_i^n ∈ R^{(L/N)×(L/N)} while S^n ∈ R^{(L/N)×L}.
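The following single-process sketch simulates the two RSA stages in plain PyTorch (one attention head, illustrative sizes, and the standard softmax scaling added for completeness) and checks the result against vanilla attention. It indexes chunk lists in the same order a ring exchange would deliver them; it stands in for, but is not, the paper's distributed implementation, which performs these exchanges with ring-style P2P communication across GPUs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative sizes: B = batch, L = sequence length, A = head size, N = devices.
B, L, A, N = 2, 16, 8, 4
C = L // N                                   # sub-sequence length per device

Q = torch.randn(B, L, A)
K = torch.randn(B, L, A)
V = torch.randn(B, L, A)

# Each simulated "device" holds only its own chunk of Q, K and V.
Qs = list(torch.chunk(Q, N, dim=1))
Ks = list(torch.chunk(K, N, dim=1))
Vs = list(torch.chunk(V, N, dim=1))

outputs = []
for rank in range(N):
    # Stage 1: ring-pass key chunks so the local queries see every key.
    # At step s, device `rank` would receive the keys originally held by device (rank - s) mod N.
    scores = torch.empty(B, C, L)
    for step in range(N):
        src = (rank - step) % N
        scores[:, :, src * C:(src + 1) * C] = Qs[rank] @ Ks[src].transpose(1, 2)

    probs = F.softmax(scores / A ** 0.5, dim=-1)   # full score rows are local, so softmax is exact

    # Stage 2: ring-pass value chunks; O^n = sum_i S_i^n V^i with S_i^n the i-th column block.
    out = torch.zeros(B, C, A)
    for step in range(N):
        src = (rank - step) % N
        out += probs[:, :, src * C:(src + 1) * C] @ Vs[src]
    outputs.append(out)

rsa_out = torch.cat(outputs, dim=1)

# Reference: standard single-device attention; the two results should match.
ref = F.softmax(Q @ K.transpose(1, 2) / A ** 0.5, dim=-1) @ V
print(torch.allclose(rsa_out, ref, atol=1e-5))   # True
```

In a real multi-GPU run, each `src` lookup would be replaced by receiving the corresponding chunk from the neighboring rank.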
3. Experiments

3.1. Experimental setup

We conducted our experiments on the Piz Daint supercomputer provided by the Swiss National Supercomputing Centre (CSCS). The Piz Daint supercomputer provides one P100 GPU (16GB GPU RAM) per compute node, and the compute nodes are connected by a high-bandwidth network. We chose two bidirectional language models, namely BERT Base and BERT Large, to evaluate our sequence parallelism. We also verified the convergence performance of sequence parallelism (see Appendix E). Since we use the original model but different systems, the accuracy should be the same; the slight differences come from randomness.

3.2. Maximum batch size

Since our sequence parallelism is memory-efficient enough to handle larger batch sizes, we first investigated the maximum batch size we can reach with sequence parallelism. In this section, for a comprehensive comparison, we scaled with tensor or sequence parallelism on BERT Base and BERT Large. We also fixed the tensor or sequence parallel size and then scaled with pipeline parallelism to verify the compatibility with pipeline parallelism. We used tokens per second as the metric for throughput. To this end, we trained BERT Base and BERT Large for 150 iterations in total, and then calculated the mean tokens processed per second within the last 100 iterations.

Scaling with sequence/tensor parallelism We fixed all hyper-parameters except the batch size and the tensor or sequence parallelism size. We trained the model with a sequence length of 512 and no pipeline parallelism. The tensor parallelism size in Megatron is limited by the number of attention heads and the hidden size, because these two hyper-parameters are required to be divisible by the tensor parallelism size. Among them, the number of attention heads is small, so it limits tensor parallelism. Thus, the tensor parallelism size is at most 12 for the BERT Base model in Megatron. In contrast, for our sequence parallelism, only the sequence length is required to be divisible by the sequence parallelism size, so we can scale sequence parallelism to a larger size since the sequence length is a much larger hyper-parameter than the number of attention heads.

[Figure 2: Scaling with sequence/tensor parallelism. (a) Maximum batch size of BERT Base scaling along tensor or sequence parallel size. (b) Throughput of BERT Base scaling along tensor or sequence parallel size.]

For BERT Base, our sequence parallelism outperforms tensor parallelism in terms of memory consumption. Figure 2(a) shows that our system on 64 GPUs can achieve a 13.7× larger batch size than Megatron on 12 GPUs. Even if we combine data parallelism and tensor parallelism to scale up to 64 GPUs for Megatron, our system would still support a larger batch size. In Figure 2(b), we can observe that sequence parallelism achieved comparable throughput with the same parallel size, and our system can extend to a larger parallel size to achieve better performance. For the results on BERT Large, please see Appendix F.

Scaling with pipeline parallelism To verify the compatibility with pipeline parallelism, we fixed the tensor parallelism and sequence parallelism size as 4 and scaled the pipeline parallel size.


[Figure 3: Scaling with pipeline parallelism. (a) Maximum batch size of BERT Base scaling along pipeline parallel size. (b) Throughput of BERT Base scaling along pipeline parallel size.]

For BERT Base, we can observe that sequence parallelism outperforms tensor parallelism on the maximum batch size in Figure 3(a). Sequence parallelism also achieved higher throughput when using more pipeline stages, as shown in Figure 3(b). This is because Megatron incurs extra communication costs between pipeline stages. Megatron holds the activation for the full sequence on each device. Thus, it needs to split the activation, transmit the partial activation to the next device, and gather back the partial activation when sending the activation between pipeline stages. This incurs less communication overhead than transmitting the whole activation between stages, but it still brings more communication cost than ours, as no split and all-gather operation is required for our sub-sequence intermediate activations. Therefore, our sequence parallelism achieved better throughput when scaling along the pipeline parallel size.

3.3. Maximum sequence length

Sequence parallelism is designed for training Transformer-based models with longer input sequences, so we investigated the maximum sequence length it can handle. Similarly, we still compared against tensor parallelism without pipeline parallelism.

Compared with tensor parallelism We fixed the batch size as 64 for BERT Base and no pipeline parallelism was used. We show the maximum sequence length in Figure 4(a). If we scale up to 64 GPUs, we can achieve around 3× maximum sequence length on BERT Base. Another observation is that splitting along the number of attention heads limits the input sequence length of tensor parallelism in Megatron, but our sequence parallelism can scale easily by splitting a sequence into multiple chunks. When using the same 16 GPUs, our sequence parallelism can still achieve a 1.4× larger sequence length than tensor parallelism. The gap is expected to widen if we use 32GB GPUs instead of 16GB GPUs.

[Figure 4: Scaling with sequence length. (a) Maximum sequence length on BERT Base. (b) Sequence length upper bound (full attention vs. sparse attention vs. ideal scaling).]

Sequence length upper bound To investigate the maximum sequence length our system can handle on the cluster with 32 P100 GPUs, we set both the data and pipeline parallel size as 1 and the global batch size as 4. As efficient attention is widely used in long sequence training, we adopt Linformer (Wang et al., 2020), a low-rank attention algorithm with linear time and space complexity. Our sequence parallelism is compatible with such efficient attention. More importantly, as shown in Table 3, for the memory usage of the efficient attention block, all terms including the sequence length L are divided by the number of devices N, which means we can scale the sequence length indefinitely if we use efficient attention with linear complexity. To investigate the sequence length upper bound under the efficient attention setting, we conduct experiments with both efficient and full attention.


As shown in Figure 4(b), if we use efficient attention with sequence parallelism, we can almost achieve ideal scaling. With 32 P100 GPUs, our sequence parallelism with efficient attention can handle sequences with 114K tokens, which is over 27× longer than recent sparse attention papers that hold the whole sequence on a single device (Zaheer et al., 2020; Wang et al., 2020).

4. Conclusion

In this paper, we proposed sequence parallelism for training Transformer with longer sequences. Sequence parallelism is designed to break the limitation of sequence length on a single device. We have shown that sequence parallelism can handle longer sequences and is more memory-efficient than the SoTA. In particular, sequence parallelism achieves 3.0× maximum sequence length and 13.7× maximum batch size compared with tensor parallelism when scaling up to 64 GPUs. Unlike both tensor and pipeline parallelism, sequence parallelism is not limited by the smaller hyper-parameters (e.g., number of attention heads, number of layers). Therefore, our sequence parallelism can be adopted as long as the sequence length is divisible by the sequence parallel size. With efficient attention, sequence parallelism can handle sequences with over 114K tokens, which is over 27× longer than existing efficient attention works that hold the whole sequence on a single device. We used a language model (i.e., BERT) to evaluate our system, but it can also be adapted to vision tasks. This work paves the way to process large images (Hou et al., 2019) with ViT (Dosovitskiy et al., 2020), as a larger image means more patches and longer sequences.

References

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Elnaggar, A., Heinzinger, M., Dallago, C., Rihawi, G., Wang, Y., Jones, L., Gibbs, T., Feher, T., Angerer, C., Steinegger, M., et al. ProtTrans: Towards cracking the language of life's code through self-supervised deep learning and high performance computing. arXiv preprint arXiv:2007.06225, 2020.

Hou, L., Cheng, Y., Shazeer, N., Parmar, N., Li, Y., Korfiatis, P., Drucker, T. M., Blezek, D. J., and Song, X. High resolution medical image analysis with spatial partitioning. arXiv preprint arXiv:1909.03108, 2019.

Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, M. X., Chen, D., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., et al. GPipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965, 2018.

Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.

Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.

Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V., Vainbrand, D., Kashinkunti, P., Bernauer, J., Catanzaro, B., et al. Efficient large-scale language model training on GPU clusters. arXiv preprint arXiv:2104.04473, 2021.

Ni, J., Young, T., Pandelea, V., Xue, F., Adiga, V., and Cambria, E. Recent advances in deep learning based dialogue systems: A systematic survey. arXiv preprint arXiv:2105.04387, 2021.

Qu, C., Yang, L., Qiu, M., Croft, W. B., Zhang, Y., and Iyyer, M. BERT with history answer embedding for conversational question answering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1133–1136, 2019.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2019.

Rajbhandari, S., Ruwase, O., Rasley, J., Smith, S., and He, Y. ZeRO-Infinity: Breaking the GPU memory wall for extreme scale deep learning. arXiv preprint arXiv:2104.07857, 2021.

Rasley, J., Rajbhandari, S., Ruwase, O., and He, Y. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3505–3506, 2020.

Ren, J., Rajbhandari, S., Aminabadi, R. Y., Ruwase, O., Yang, S., Zhang, M., Li, D., and He, Y. ZeRO-Offload: Democratizing billion-scale model training, 2021.


Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.

Wang, Q., Wang, B., Xu, Z., Wu, J., Zhao, P., Li, Z., Wang, S., Huang, J., and Cui, S. PSSM-Distil: Protein secondary structure prediction (PSSP) on low-quality PSSM by knowledge distillation with contrastive learning. 2021.

Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.

Xu, Y., Lee, H., Chen, D., Hechtman, B., Huang, Y., Joshi, R., Krikun, M., Lepikhin, D., Ly, A., Maggioni, M., et al. GSPMD: General and scalable parallelization for ML computation graphs. arXiv preprint arXiv:2105.04663, 2021.

Xue, F., Sun, A., Zhang, H., and Chng, E. S. An embarrassingly simple model for dialogue relation extraction. arXiv preprint arXiv:2012.13873, 2020a.

Xue, F., Sun, A., Zhang, H., and Chng, E. S. GDPNet: Refining latent multi-view graph for relation extraction. arXiv preprint arXiv:2012.06780, 2020b.

Yang, Z., Garcia, N., Chu, C., Otani, M., Nakashima, Y., and Takemura, H. BERT representations for video question answering. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1556–1565, 2020.

Zaheer, M., Guruganesh, G., Dubey, K. A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., et al. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33, 2020.

Zhang, H., Sun, A., Jing, W., and Zhou, J. T. Span-based localizing network for natural language video localization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6543–6554, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.585. URL https://www.aclweb.org/anthology/2020.acl-main.585.

Zhang, H., Sun, A., Jing, W., Zhen, L., Zhou, J. T., and Goh, R. S. M. Natural language video localization: A revisit in span-based question answering framework. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. doi: 10.1109/TPAMI.2021.3060449.

Zhou, W., Huang, K., Ma, T., and Huang, J. Document-level relation extraction with adaptive thresholding and localized context pooling. arXiv preprint arXiv:2010.11304, 2020.

A. Limitations

In order to perform communication between sub-sequences during training, the use of sequence parallelism can result in increased communication costs, which in turn can slow down the training process. However, by combining sequence parallelism with pipeline parallelism, this issue can be alleviated and the communication cost can be made comparable to advanced forms of model parallelism such as tensor parallelism. Nonetheless, sequence parallelism still incurs higher communication costs than vanilla data parallelism.

While sequence parallelism is effective for training unidirectional attention models as well as for training and inference of bidirectional attention models, it poses a challenge for inference of unidirectional attention models due to the autoregressive decoding process. In that case, different devices cannot compute in parallel, resulting in reduced throughput and decreased GPU utilization.

B. Discussion

Although there are other related works, including DeepSpeed (Rasley et al., 2020), GShard (Lepikhin et al., 2020), and GSPMD (Xu et al., 2021), they are not our direct baselines in experiments. DeepSpeed is an efficient method to optimize the memory footprint in data parallel training by using the ZeRO optimizer (Rajbhandari et al., 2021) and ZeRO-Offload (Ren et al., 2021). DeepSpeed and our method optimize training in different dimensions and are actually compatible with each other. Our method is orthogonal to DeepSpeed, just as DeepSpeed can be integrated with Megatron. Thus, Megatron should be our baseline.

GShard and GSPMD are two libraries built for the TensorFlow community to partition model parameters in distributed training; GSPMD is developed based on GShard. These two methods rely on the static computation graph of TensorFlow to train larger models, while we provide a plug-and-play tool based on PyTorch's dynamic computation graph to train on longer sequences. The difference in computation paradigms makes them unsuitable as our baseline.

We also highlight again that, although sequence parallelism performs well on large model training, a more important use case is training mid-scale models on very long sequences. One example is AlphaFold (Jumper et al., 2021), which uses only 86M parameters but is required to be trained with very long sequences (from 1K to 4K).

C. Multi-head attention

Multi-head attention is designed to jointly consider the information from different subspaces of the embedding.


Compared with self-attention, multi-head attention has h query, key and value embeddings instead of a single one, where h denotes the number of heads. We obtain these embeddings, which have identical shapes, by linear transformations. Multi-head attention can be described as:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O    (1)

where head_i = Attention(Q_i, K_i, V_i) and W denotes the linear transformations. All heads are concatenated and further projected by the linear transformation W^O.
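To make Equation (1) and the (B, Z, L, A) head layout used in the memory analysis below concrete, here is a minimal multi-head attention forward pass; it is an illustrative sketch with hypothetical sizes in plain PyTorch, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes: B = batch, L = sequence length, H = hidden size, Z = heads, A = head size (H = Z * A).
B, L, H, Z = 2, 16, 64, 4
A = H // Z

x = torch.randn(B, L, H)
w_q, w_k, w_v, w_o = (torch.randn(H, H) for _ in range(4))   # the linear transformations W

def split_heads(t):
    # (B, L, H) -> (B, Z, L, A)
    return t.view(B, L, Z, A).transpose(1, 2)

Q, K, V = (split_heads(x @ w) for w in (w_q, w_k, w_v))

scores = Q @ K.transpose(-2, -1) / A ** 0.5                  # (B, Z, L, L)
heads = F.softmax(scores, dim=-1) @ V                        # (B, Z, L, A), one output per head

# Concat(head_1, ..., head_h) W^O from Equation (1).
out = heads.transpose(1, 2).reshape(B, L, H) @ w_o           # (B, L, H)
print(out.shape)
```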
D. Modeling

We analyzed and compared our sequence parallelism with tensor parallelism in both theoretical modeling and experiments, although tensor parallelism is not our direct baseline. To the best of our knowledge, sequence parallelism is the first system designed to break the length limitation of the sequence, so there is actually no direct baseline for sequence parallelism. Therefore, as a distributed training system designed for attention-based models, we compare it with a SoTA model parallelism. Tensor parallelism (Narayanan et al., 2021) is compatible with data parallelism and pipeline parallelism, and our sequence parallelism is also compatible with them. We expect our system to outperform tensor parallelism with and without pipeline parallelism. We leave integrating sequence parallelism with data parallelism, pipeline parallelism and tensor parallelism into 4D parallelism as future work. Here, we mainly focus on the memory usage and communication cost of tensor parallelism and our sequence parallelism.

D.1. Memory usage

For memory usage, according to the architecture of Transformer, the comparison is divided into two parts: the MLP block and the attention block. In this part, we consider multi-head attention instead of self-attention for a fair and accurate comparison. We assume the optimizer is Adam, as used in Megatron.

MLP block As shown in Table 1, for the MLP blocks, tensor parallelism stores the matrices after row- or column-style splitting for the whole sequence, while our sequence parallelism stores the matrices without row- or column-style splitting but for only a single sub-sequence on each GPU. If we assume that our sequence parallelism is more memory-efficient:

32H^2/N + 4BLH/N + BLH > 32H^2 + 5BLH/N    (2)

we can find that, in MLP blocks, sequence parallelism is more memory-efficient when BL > 32H.

Multi-head attention block We compared the memory usage of the multi-head attention block in Table 2. Tensor parallelism splits the attention heads here, but our sequence parallelism still splits the length dimension of the sequence data. By comparing the memory usages of the multi-head attention block under the two parallelisms, we can find that sequence parallelism is more memory-efficient if BL > 16AZ. As for communication, tensor parallelism needs an all-reduce operation in both the forward pass and the backward pass when calculating the attention output. In our RSA, to facilitate tensor exchange between devices, our communication is equivalent to 2 all-reduce operations in the forward pass and 4 all-reduce operations in the backward pass. The extra communication cost of RSA can be offset by the lack of communication cost in the MLP block.

In both the MLP block and the multi-head attention block, sequence parallelism is more memory-efficient when we train Transformer with a longer sequence and a larger batch size.

D.2. Communication cost

Megatron-LM uses all-reduce in its MLP layer and self-attention layer, while the communication overhead in sequence parallelism mainly lies in the self-attention layer. Using the same notation as given above, we can calculate the amount of data transferred in sequence parallelism and tensor parallelism.

In sequence parallelism, there is no communication in the MLP layer and communication only occurs in the self-attention module. There are two ring-style P2P communications in the forward pass, for calculating the attention scores and the attention output respectively. In the backward pass, there are two all-reduce collective communications and two ring-style P2P communications. The amount of data transferred is 2(N - 1) * B * Z * (L/N) * A in the forward pass and 6(N - 1) * B * Z * (L/N) * A in the backward pass. The combined amount of data transferred in calculating QK^T and AV is thus 8(N - 1) * B * Z * (L/N) * A.

In the tensor parallelism of Megatron-LM, the amount of data transferred in the forward pass and the backward pass is the same, given by 2(N - 1) * B * Z * (L/N) * A. Since there are 4 collective communications in the forward and backward passes of the MLP layer and self-attention layer, the total communication cost is 8(N - 1) * B * Z * (L/N) * A. Thus, sequence parallelism has the same communication overhead as tensor parallelism in Megatron-LM. However, please note that sequence parallelism has better compatibility with pipeline parallelism, which further reduces its communication cost.
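These break-even conditions (BL > 32H for the MLP block, BL > 16AZ for the attention block) and the per-layer communication volume 8(N - 1) * B * Z * (L/N) * A can be sanity-checked numerically; the snippet below plugs in hypothetical BERT-Base-like values, which are assumptions for illustration rather than the paper's exact experimental configuration.

```python
# Back-of-envelope check of the memory and communication expressions above.
# Hypothetical BERT-Base-like values; adjust to your own configuration.
B, L, H, A, Z, N = 64, 512, 768, 64, 12, 4   # batch, seq len, hidden, head size, heads, devices

# Memory break-even conditions derived above.
print("MLP block favors sequence parallelism:      ", B * L > 32 * H)      # BL > 32H
print("Attention block favors sequence parallelism:", B * L > 16 * A * Z)  # BL > 16AZ

# Elements exchanged per Transformer layer (forward + backward), the same for both schemes.
comm_elements = 8 * (N - 1) * B * Z * (L // N) * A
print(f"Communication volume per layer: {comm_elements:,} elements "
      f"({comm_elements * 2 / 1e9:.2f} GB assuming 2-byte fp16 elements)")
```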


Table 1: MLP block memory usage comparison. M1 is the matrix before the linear layer, and M2 is the trainable matrix of the linear layer.

Tensor parallelism:
  1st linear: M1 = (B, L, H),      M2 = (H, 4H/N),    output = (B, L, 4H/N)
  2nd linear: M1 = (B, L, 4H/N),   M2 = (4H/N, H),    output = (B, L, H)
  Memory: 32H^2/N + 4BLH/N + BLH

Sequence parallelism:
  1st linear: M1 = (B, L/N, H),    M2 = (H, 4H),      output = (B, L/N, 4H)
  2nd linear: M1 = (B, L/N, 4H),   M2 = (4H, H),      output = (B, L/N, H)
  Memory: 32H^2 + 5BLH/N
Table 2: Multi-head attention block memory usage comparison.

Tensor parallelism:
  Q/K/V:  M1 = (B, L, H),       M2 = (H, ZA/N),       output = (B, Z/N, L, A)
  QK^T:   M1 = (B, Z/N, L, A),  M2 = (B, Z/N, L, A),  output = (B, Z/N, L, L)
  AV:     M1 = (B, Z/N, L, L),  M2 = (B, Z/N, L, A),  output = (B, Z/N, L, A)
  Linear: M1 = (B, Z/N, L, A),  M2 = (AZ/N, H),       output = (B, L, H)
  Memory: 16AZH/N + 4BLZA/N + BZL^2/N + BLH

Sequence parallelism:
  Q/K/V:      M1 = (B, L/N, H),     M2 = (H, AZ),         output = (B, Z, L/N, A)
  Ring-QK^T:  M1 = (B, Z, L/N, A),  M2 = (B, Z, L/N, A),  output = (B, Z, L/N, L)
  Ring-AV:    M1 = (B, Z, L/N, L),  M2 = (B, Z, L/N, A),  output = (B, Z, L/N, A)
  Linear:     M1 = (B, Z, L/N, A),  M2 = (AZ, H),         output = (B, L/N, H)
  Memory: 16AZH + 4BZLA/N + BZL^2/N + BLH/N
In tensor parallelism, to save communication bandwidth between pipeline stages, which are often on different nodes, the tensor is split before being transmitted to the next stage and all-gathered after transmission. As the tensor has already been split along the sequence dimension in sequence parallelism, there is no need to split and all-gather between pipeline stages. Thus, sequence parallelism has one less all-gather operation per pipeline stage.

D.3. Memory Usage of Efficient Attention

The memory usage of the efficient attention block under sequence parallelism is summarized in Table 3.

E. Convergence performance

We verified the convergence performance of sequence parallelism. Since sequence parallelism is just a distributed implementation of long sequence training, there is no change in the model architecture, so we expect sequence parallelism to achieve the same accuracy and convergence performance as training without sequence parallelism. We used the Wikipedia dataset (Devlin et al., 2018) and evaluated Megatron and our model on the development set every 1k iterations. We trained the BERT Large model for 50k iterations with the default hyper-parameters used by Megatron. Our goal here is to verify the correctness of our implementation, so we trained the model for fewer steps. We set the parallel size as 4 for tensor parallelism in Megatron and for sequence parallelism in our model. No pipeline parallelism was used for either model.

[Figure 5: Convergence performance comparison between tensor parallelism and ours. (a) Convergence performance of MLM loss. (b) Convergence performance of SOP loss.]


Table 3: Efficient attention block memory usage (sequence parallelism with Linformer). K is the projection dimension in Linformer (Wang et al., 2020).

  Q/K/V:      M1 = (B, L/N, H),     M2 = (H, AZ),       output = (B, Z, L/N, A)
  Projection: M1 = (B, Z, L/N, A),  M2 = (L/N, K),      output = (B, Z, K, A)
  Ring-QK^T:  M1 = (B, Z, L/N, A),  M2 = (B, Z, K, A),  output = (B, Z, L/N, K)
  Ring-AV:    M1 = (B, Z, L/N, K),  M2 = (B, Z, K, A),  output = (B, Z, L/N, A)
  Linear:     M1 = (B, Z, L/N, A),  M2 = (AZ, H),       output = (B, L/N, H)
  Memory: 2AZH + 2BZLA/N + BZLK/N + BLH/N + 2BZKA
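To make the shapes in Table 3 concrete, the following single-device sketch applies a Linformer-style low-rank projection to a local sub-sequence chunk; all sizes are hypothetical, and only the per-chunk projection is shown, not the distributed Ring-QK^T/Ring-AV exchange.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes: B = batch, L = full seq len, N = devices, Z = heads, A = head size,
# K = Linformer projection dimension (all hypothetical).
B, L, N, Z, A, K = 2, 1024, 4, 4, 16, 64
C = L // N                                     # local sub-sequence length L/N

# Local chunks already shaped as in Table 3: (B, Z, L/N, A).
q = torch.randn(B, Z, C, A)
k = torch.randn(B, Z, C, A)
v = torch.randn(B, Z, C, A)

# Low-rank projection along the (local) sequence dimension: (L/N, K).
E = torch.randn(C, K)
k_proj = torch.einsum('bzca,ck->bzka', k, E)   # (B, Z, K, A)
v_proj = torch.einsum('bzca,ck->bzka', v, E)   # (B, Z, K, A)

scores = q @ k_proj.transpose(-2, -1) / A ** 0.5   # (B, Z, L/N, K), as in the Ring-QK^T row
out = F.softmax(scores, dim=-1) @ v_proj           # (B, Z, L/N, A), as in the Ring-AV row
print(out.shape)
```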

In Figure 5, our sequence parallelism shows good convergence on both the masked language modeling (MLM) loss and the sentence order prediction (SOP) loss. Compared with Megatron, sequence parallelism has a similar trend in convergence and achieved lower values for both the MLM loss and the SOP loss over 50k iterations.

F. Scaling with sequence/tensor parallelism

[Figure 6: Scaling with sequence/tensor parallelism. (a) Maximum batch size of BERT Large scaling along tensor or sequence parallel size. (b) Throughput of BERT Large scaling along tensor or sequence parallel size.]

Compared with the BERT Base setting, the only difference is that the tensor parallel size is at most 16 for the BERT Large model in Megatron-LM. In Figure 6(a), our method achieved a 2.7× larger batch size for BERT Large on 16 GPUs, and the batch size of sequence parallelism on 64 GPUs is 10.2× larger than that of tensor parallelism on 16 GPUs. In Figure 6(b), we observe that our sequence parallelism achieved comparable throughput with the same parallel size, and more importantly, our system can extend to a larger parallel size to achieve better performance.

G. Scaling with pipeline parallelism

[Figure 7: Scaling with pipeline parallelism. (a) Maximum batch size of BERT Large scaling along pipeline parallel size. (b) Throughput of BERT Large scaling along pipeline parallel size.]

For BERT Large, sequence parallelism achieved a higher maximum batch size than tensor parallelism in Figure 7(a).


Sequence parallelism also performs better on throughput when using more pipeline stages, as shown in Figure 7(b).

H. Maximum sequence length

[Figure 8: Maximum sequence length on BERT Large (tensor parallelism vs. sequence parallelism).]

BERT Large Similarly, we compared against tensor parallelism without pipeline parallelism, fixing the batch size as 16 for BERT Large. As shown in Figure 8, when we scale up to 64 GPUs, we can achieve around 2× maximum sequence length and scale better by splitting a sequence into multiple chunks on BERT Large.

I. Weak scaling
Strong scaling limits the upper bound of batch size and se-
quence length within a single device, so we mainly discuss
weak scaling in this section. We scale the batch size and
sequence length separately when increasing the number of
nodes. We fixed the pipeline parallelism size as 8. In Table 4,
sequence parallelism achieved almost constant memory us-
age when scaling along with the global batch size, which
outperforms tensor parallelism by a large margin. As for
weak scaling along the sequence length, our method still
uses much less memory with comparable throughput.


Table 4: Weak scaling results. Parallel size is the tensor or sequence parallel size; Batch size denotes the global batch size. Memory and Token/sec denote the maximum allocated memory (MB) and the tokens processed per second. OOM means that CUDA out of memory occurred.

Columns: Parallel size | Batch size | Sequence length | Tensor parallelism (Memory, Token/sec) | Sequence parallelism (Memory, Token/sec)
1 64 512 8477.28 9946.15 8477.53 9261.04
2 128 512 9520.47 15510.19 8478.76 13938.22
4 256 512 12232.52 20701.96 8481.26 21269.91
8 512 512 OOM OOM 8490.75 26401.64
1 64 256 3707.39 9752.61 3707.01 9340.13
2 64 512 4993.43 14195.17 4670.64 13144.16
4 64 1024 8175.93 19879.27 6601.88 18243.82
8 64 2048 14862.09 22330.5 10536.38 21625.51

