Figure 1: Ring Self-Attention. (b) Transmitting value embeddings among devices to calculate the output of attention layers.
sure the query embeddings of each sub-sequence can multiply all the key embeddings. To be more specific, each device transmits its value embeddings instead of key embeddings in a similar way. For O^n, we calculate S^n V by O^n = S^n V = Σ_{i=1}^{N} S_i^n V_i, where V_i = V^i is the value embedding held on device i and S_i^n is S^n after column splitting, which means S_i^n ∈ R^{(L/N)×(L/N)}.
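To make this second ring-communication stage concrete, below is a minimal single-process sketch (an illustration, not the paper's implementation) that simulates N devices with plain tensors: each simulated device n holds its score block S^n and its local value block, and accumulates O^n = Σ_i S^n_i V_i as the value blocks travel around the ring.

```python
import torch

# Minimal single-process simulation of the ring pass for the attention output.
# Hypothetical sizes: N simulated devices, sub-sequence length L/N, head dimension D.
N, sub_len, D = 4, 8, 16
L = N * sub_len

S = [torch.randn(sub_len, L) for _ in range(N)]   # S^n: local attention scores, (L/N, L)
V = [torch.randn(sub_len, D) for _ in range(N)]   # V_i: value block held on device i, (L/N, D)
O = [torch.zeros(sub_len, D) for _ in range(N)]   # O^n: local attention output, (L/N, D)

for step in range(N):
    for n in range(N):
        i = (n - step) % N                              # value block device n holds at this step
        S_ni = S[n][:, i * sub_len:(i + 1) * sub_len]   # S^n_i, the (L/N, L/N) column block
        O[n] += S_ni @ V[i]                             # in a real run, V[i] arrives via ring send/recv

# Sanity check against the undistributed computation S^n @ V.
V_full = torch.cat(V, dim=0)
for n in range(N):
    assert torch.allclose(O[n], S[n] @ V_full, atol=1e-4)
```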
Figure 2: Scaling with sequence/tensor parallelism. (b) Throughput of BERT Base scaling along tensor or sequence parallel size.

3. Experiments

3.1. Experimental setup

We conducted our experiments on the Piz Daint supercomputer provided by the Swiss National Supercomputing Centre (CSCS). Piz Daint provides one P100 GPU (16GB GPU RAM) per compute node, and the compute nodes are connected by a high-bandwidth network. We chose two bidirectional language models, namely BERT Base and BERT Large, to evaluate our sequence parallelism. We also verified the convergence performance of sequence parallelism (see Appendix E). Since we use the original model but different systems, the accuracy should be the same; the slight differences come from randomness.

3.2. Maximum batch size

Since our sequence parallelism is memory-efficient and can handle larger batch sizes, we first investigated the maximum batch size we can reach with sequence parallelism. In this section, for a comprehensive comparison, we scaled with tensor or sequence parallelism on BERT Base and BERT Large. We also fixed the tensor or sequence parallel size and then scaled with pipeline parallelism to verify the compatibility with pipeline parallelism. We used tokens per second as the throughput metric. To this end, we trained BERT Base and BERT Large for 150 iterations in total and then calculated the mean number of tokens processed per second over the last 100 iterations.

Scaling with sequence/tensor parallelism  We fixed all hyper-parameters except the batch size and the tensor or sequence parallelism size. We trained the model with a sequence length of 512, and no pipeline parallelism was used. The tensor parallelism size in Megatron is limited by the number of attention heads and the hidden size, because both hyper-parameters must be divisible by the tensor parallelism size. Of the two, the number of attention heads is the smaller, so it is the limiting factor; thus, the tensor parallelism size is at most 12 for the BERT Base model in Megatron. In contrast, our sequence parallelism only requires the sequence length to be divisible by the sequence parallelism size, so we can scale to a much larger parallel size, since the sequence length is a much larger hyper-parameter than the number of attention heads.

For BERT Base, our sequence parallelism outperforms tensor parallelism in terms of memory consumption. Figure 2(a) shows that our system on 64 GPUs can achieve a 13.7× larger batch size than Megatron on 12 GPUs. Even if we combined data parallelism and tensor parallelism to scale Megatron up to 64 GPUs, our system would still support a larger batch size. In Figure 2(b), we observe that sequence parallelism achieves comparable throughput at the same parallel size, and our system can extend to a larger parallel size to achieve better performance. For the results on BERT Large, please see Appendix F.
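The divisibility constraint above is easy to see numerically. The sketch below (assuming the standard BERT Base hyper-parameters, 12 attention heads and hidden size 768, and the sequence length of 512 used in this section) lists the feasible parallel sizes up to 64 GPUs for both approaches.

```python
# Feasible parallel sizes under the divisibility constraints discussed above.
num_heads, hidden_size, seq_len = 12, 768, 512   # BERT Base with sequence length 512
max_gpus = 64

# Tensor parallelism: the head count and the hidden size must both be divisible.
tensor_sizes = [d for d in range(1, max_gpus + 1)
                if num_heads % d == 0 and hidden_size % d == 0]
# Sequence parallelism: only the sequence length must be divisible.
sequence_sizes = [d for d in range(1, max_gpus + 1) if seq_len % d == 0]

print("tensor parallel sizes  :", tensor_sizes)    # [1, 2, 3, 4, 6, 12] -- capped by the 12 heads
print("sequence parallel sizes:", sequence_sizes)  # [1, 2, 4, 8, 16, 32, 64]
```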
Figure 3: Scaling with pipeline parallelism. (a) Maximum batch size of BERT Base scaling along pipeline parallel size. (b) Throughput of BERT Base scaling along pipeline parallel size.

Figure 4: Scaling with sequence length. (a) Maximum sequence length on BERT Base. (b) Sequence length upper bound.
Scaling with pipeline parallelism  To verify the compatibility with pipeline parallelism, we fixed the tensor parallelism and sequence parallelism size at 4 and scaled the pipeline parallel size. For BERT Base, we observe that sequence parallelism outperforms tensor parallelism on the maximum batch size in Figure 3(a). Sequence parallelism also achieved higher throughput when using more pipeline stages, as shown in Figure 3(b). This is because Megatron incurs extra communication costs between pipeline stages. Megatron holds the activation for the full sequence on each device, so it splits the activation, transmits the partial activation to the next device, and gathers the partial activation back when sending activations between pipeline stages. This incurs less communication overhead than transmitting the whole activation between pipeline stages, but it still brings more communication cost than ours, since no splitting or all-gather operation is required for our sub-sequence intermediate activations. Therefore, our sequence parallelism achieved better throughput when scaling along the pipeline parallel size.

3.3. Maximum sequence length

Sequence parallelism is designed for training Transformer-based models with longer input sequences, so we investigated the maximum sequence length it can handle. As before, we compared against tensor parallelism without pipeline parallelism.

Compared with tensor parallelism  We fixed the batch size at 64 for BERT Base, and no pipeline parallelism was used. We show the maximum sequence length in Figure 4(a). If we scale up to 64 GPUs, we can achieve around a 3× larger maximum sequence length on BERT Base. Another observation is that splitting along the number of attention heads limits the input sequence length of tensor parallelism in Megatron, whereas our sequence parallelism can scale easily by splitting a sequence into multiple chunks. When using the same 16 GPUs, our sequence parallelism can still achieve a 1.4× larger sequence length than tensor parallelism, and the gap is expected to widen if we use 32GB GPUs instead of 16GB GPUs.

Sequence length upper bound  To investigate the maximum sequence length our system can handle on the cluster with 32 P100 GPUs, we set both the data and pipeline parallel size to 1 and the global batch size to 4. As efficient attention is widely used in long sequence training, we adopt Linformer (Wang et al., 2020), a low-rank attention algorithm with linear time and space complexity. Our sequence parallelism is compatible with such efficient attention. More importantly, as shown in Table 3, for the memory usage of the efficient attention block, every term involving the sequence length L is divided by the number of devices N, which means the sequence length can in principle be scaled arbitrarily if we use efficient attention with linear complexity.
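To illustrate why every sequence-length term shrinks with N in this setting, the following single-device sketch (an illustration only, not the paper's kernel) applies Linformer-style low-rank attention to one sub-sequence chunk: with the keys and values already projected down to a hypothetical dimension k, every activation has shape (L/N) × k or (L/N) × D rather than L × L.

```python
import torch

# Hypothetical sizes: full length L split over N devices, head dimension D,
# Linformer projection dimension k (constant with respect to L).
L, N, D, k = 4096, 8, 64, 256
chunk = L // N

q = torch.randn(chunk, D)       # local queries for one sub-sequence: (L/N, D)
k_proj = torch.randn(k, D)      # keys after Linformer projection: (k, D)
v_proj = torch.randn(k, D)      # values after Linformer projection: (k, D)

scores = torch.softmax(q @ k_proj.T / D ** 0.5, dim=-1)   # (L/N, k): no L x L matrix is formed
out = scores @ v_proj                                      # (L/N, D)
print(scores.shape, out.shape)
```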
To investigate the sequence length upper bound in the efficient attention setting, we conduct experiments with both efficient and full attention. As shown in Figure 4(b), if we use efficient attention with sequence parallelism, we can almost achieve ideal scaling. With 32 P100 GPUs, our sequence parallelism with efficient attention can handle sequences with 114K tokens, which is over 27× longer than recent sparse attention papers holding the whole sequence on a single device (Zaheer et al., 2020; Wang et al., 2020).

4. Conclusion

In this paper, we proposed sequence parallelism for training Transformers with longer sequences. Sequence parallelism is designed to break the limitation of sequence length on a single device. We have shown that sequence parallelism can handle longer sequences and is more memory-efficient than the state of the art. In particular, sequence parallelism achieves a 3.0× larger maximum sequence length and a 13.7× larger maximum batch size than tensor parallelism when scaling up to 64 GPUs. Unlike tensor and pipeline parallelism, sequence parallelism is not limited by smaller hyper-parameters (e.g., the number of attention heads or the number of layers). Therefore, our sequence parallelism can be applied as long as the sequence length is divisible by the sequence parallel size. With efficient attention, sequence parallelism can handle sequences with over 114K tokens, which is over 27× longer than existing efficient attention works holding the whole sequence on a single device. We used a language model (i.e., BERT) to evaluate our system, but it can also be adapted to vision tasks. This work paves the way to processing large images (Hou et al., 2019) with ViT (Dosovitskiy et al., 2020), as a larger image means more patches and thus longer sequences.

References

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Elnaggar, A., Heinzinger, M., Dallago, C., Rihawi, G., Wang, Y., Jones, L., Gibbs, T., Feher, T., Angerer, C., Steinegger, M., et al. ProtTrans: Towards cracking the language of life's code through self-supervised deep learning and high performance computing. arXiv preprint arXiv:2007.06225, 2020.

Hou, L., Cheng, Y., Shazeer, N., Parmar, N., Li, Y., Korfiatis, P., Drucker, T. M., Blezek, D. J., and Song, X. High resolution medical image analysis with spatial partitioning. arXiv preprint arXiv:1909.03108, 2019.

Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, M. X., Chen, D., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., et al. GPipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965, 2018.

Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.

Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.

Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V., Vainbrand, D., Kashinkunti, P., Bernauer, J., Catanzaro, B., et al. Efficient large-scale language model training on GPU clusters. arXiv preprint arXiv:2104.04473, 2021.

Ni, J., Young, T., Pandelea, V., Xue, F., Adiga, V., and Cambria, E. Recent advances in deep learning based dialogue systems: A systematic survey. arXiv preprint arXiv:2105.04387, 2021.

Qu, C., Yang, L., Qiu, M., Croft, W. B., Zhang, Y., and Iyyer, M. BERT with history answer embedding for conversational question answering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1133–1136, 2019.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2019.

Rajbhandari, S., Ruwase, O., Rasley, J., Smith, S., and He, Y. ZeRO-Infinity: Breaking the GPU memory wall for extreme scale deep learning. arXiv preprint arXiv:2104.07857, 2021.

Rasley, J., Rajbhandari, S., Ruwase, O., and He, Y. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3505–3506, 2020.

Ren, J., Rajbhandari, S., Aminabadi, R. Y., Ruwase, O., Yang, S., Zhang, M., Li, D., and He, Y. ZeRO-Offload: Democratizing billion-scale model training. 2021.
Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.

Wang, Q., Wang, B., Xu, Z., Wu, J., Zhao, P., Li, Z., Wang, S., Huang, J., and Cui, S. PSSM-Distil: Protein secondary structure prediction (PSSP) on low-quality PSSM by knowledge distillation with contrastive learning. 2021.

Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.

Xu, Y., Lee, H., Chen, D., Hechtman, B., Huang, Y., Joshi, R., Krikun, M., Lepikhin, D., Ly, A., Maggioni, M., et al. GSPMD: General and scalable parallelization for ML computation graphs. arXiv preprint arXiv:2105.04663, 2021.

Xue, F., Sun, A., Zhang, H., and Chng, E. S. An embarrassingly simple model for dialogue relation extraction. arXiv preprint arXiv:2012.13873, 2020a.

Xue, F., Sun, A., Zhang, H., and Chng, E. S. GDPNet: Refining latent multi-view graph for relation extraction. arXiv preprint arXiv:2012.06780, 2020b.

Yang, Z., Garcia, N., Chu, C., Otani, M., Nakashima, Y., and Takemura, H. BERT representations for video question answering. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1556–1565, 2020.

Zaheer, M., Guruganesh, G., Dubey, K. A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., et al. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33, 2020.

Zhang, H., Sun, A., Jing, W., and Zhou, J. T. Span-based localizing network for natural language video localization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6543–6554, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.585. URL https://www.aclweb.org/anthology/2020.acl-main.585.

Zhang, H., Sun, A., Jing, W., Zhen, L., Zhou, J. T., and Goh, R. S. M. Natural language video localization: A revisit in span-based question answering framework. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. doi: 10.1109/TPAMI.2021.3060449.

Zhou, W., Huang, K., Ma, T., and Huang, J. Document-level relation extraction with adaptive thresholding and localized context pooling. arXiv preprint arXiv:2010.11304, 2020.

A. Limitations

In order to perform communication between sub-sequences during training, sequence parallelism can incur increased communication costs, which in turn can slow down training. However, by combining sequence parallelism with pipeline parallelism, this issue can be alleviated, and the communication cost can be made comparable to advanced forms of model parallelism such as tensor parallelism. Nonetheless, sequence parallelism still incurs higher communication costs than vanilla data parallelism.

While sequence parallelism is effective for training unidirectional attention models as well as for both training and inference of bidirectional attention models, it poses a challenge for inference of unidirectional attention models due to the autoregressive decoding process: different devices cannot compute in parallel, resulting in reduced throughput and decreased GPU utilization.

B. Discussion

Although there are other related works, including DeepSpeed (Rasley et al., 2020), GShard (Lepikhin et al., 2020), and GSPMD (Xu et al., 2021), they are not our direct baselines in experiments. DeepSpeed is an efficient method to optimize the memory footprint in data-parallel training by using the ZeRO optimizer (Rajbhandari et al., 2021) and ZeRO-Offload (Ren et al., 2021). DeepSpeed and our method optimize training in different dimensions, and they are actually compatible with each other. Our method is orthogonal to DeepSpeed, just as DeepSpeed can be integrated with Megatron. Thus, Megatron should be our baseline.

GShard and GSPMD are two libraries built for the TensorFlow community to partition model parameters in distributed training; GSPMD is developed based on GShard. These two methods rely on the static computation graph of TensorFlow to train larger models, while we provide a plug-and-play tool based on PyTorch's dynamic computation graph to train on longer sequences. The difference in computation paradigms makes them unsuitable as our baselines.

We also highlight again that, although sequence parallelism can perform decently on large model training, a more important use case is training mid-scale models on very long sequences. One example is AlphaFold (Jumper et al., 2021), which uses only 86M parameters but must be trained with very long sequences (from 1K to 4K).

C. Multi-head attention

Multi-head attention is designed to jointly consider the information from different subspaces of the embedding. Compared
with self-attention below, multi-head attention has h query, key, and value embeddings instead of a single one, where h denotes the number of heads. We obtain these embeddings with identical shapes by linear transformations. Multi-head attention can be described as:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O    (1)

where head_i = Attention(Q_i, K_i, V_i) and W denotes the linear transformations. All heads are concatenated and further projected by the linear transformation W^O.

D. Modeling

We analyzed and compared our sequence parallelism with tensor parallelism in both theoretical modeling and experiments, although tensor parallelism is not our direct baseline. To the best of our knowledge, sequence parallelism is the first system designed to break the sequence length limitation, so there is actually no direct baseline for it. Therefore, as a distributed training system designed for attention-based models, we compare it with a state-of-the-art model parallelism. Tensor parallelism (Narayanan et al., 2021) is compatible with data parallelism and pipeline parallelism, and our sequence parallelism is also compatible with them. We expect our system to outperform tensor parallelism both with and without pipeline parallelism. We leave integrating sequence parallelism with data, pipeline, and tensor parallelism into 4D parallelism as future work. Here, we mainly focus on the memory usage and communication cost of tensor parallelism and our sequence parallelism.

D.1. Memory usage

For memory usage, according to the architecture of the Transformer, the comparison is divided into two parts: the MLP block and the attention block. In this part, we consider multi-head attention instead of self-attention for a fair and accurate comparison. We assume the optimizer is Adam, as used in Megatron.

MLP block  As shown in Table 1, for the MLP blocks, tensor parallelism stores the matrices after row- or column-style splitting for the whole sequence, while our sequence parallelism stores the matrices without row- or column-style splitting but for only a single sub-sequence on each GPU. If we assume that our sequence parallelism is more memory-efficient:

32H^2/N + 4BLH/N + BLH > 32H^2 + 5BLH/N    (2)

Cancelling the common terms, this simplifies to (1 − 1/N)(BLH − 32H^2) > 0, so in MLP blocks sequence parallelism is more memory-efficient when BL > 32H.

Multi-head attention block  We compared the memory usage of the multi-head attention block in Table 2. Tensor parallelism splits the attention heads here, while our sequence parallelism still splits the length dimension of the sequence data. By comparing the memory usages of the multi-head attention block under the two parallelisms, we find that sequence parallelism is more memory-efficient if BL > 16AZ. As for communication, tensor parallelism needs an all-reduce operation in both the forward pass and the backward pass when calculating the attention output. In our RSA, to facilitate tensor exchange between devices, the communication is equivalent to 2 all-reduce operations in the forward pass and 4 all-reduce operations in the backward pass. The extra communication cost of RSA can be offset by the absence of communication in the MLP block.

In both the MLP block and the multi-head attention block, sequence parallelism is more memory-efficient when we train the Transformer with a longer sequence and a larger batch size.

D.2. Communication cost

Megatron-LM uses all-reduce in its MLP layer and self-attention layer, while the communication overhead in sequence parallelism mainly lies in the self-attention layer. Using the same notation as given above, we can calculate the amount of data transferred in sequence parallelism and tensor parallelism.

In sequence parallelism, there is no communication in the MLP layer; communication only occurs in the self-attention module. There are two ring-style P2P communications in the forward pass, for calculating the attention score and the attention output respectively. In the backward pass, there are two all-reduce collective communications and two ring-style P2P communications. The amount of data transferred is 2(N − 1) ∗ B ∗ Z ∗ (L/N) ∗ A in the forward pass and 6(N − 1) ∗ B ∗ Z ∗ (L/N) ∗ A in the backward pass. The combined amount of data transferred in calculating QK^T and AV is thus 8(N − 1) ∗ B ∗ Z ∗ (L/N) ∗ A.

In the tensor parallelism of Megatron-LM, the amount of data transferred per communication in the forward and backward passes is likewise 2(N − 1) ∗ B ∗ Z ∗ (L/N) ∗ A. Since there are 4 collective communications across the forward and backward passes of the MLP layer and the self-attention layer, the total communication cost is 8(N − 1) ∗ B ∗ Z ∗ (L/N) ∗ A. Thus, sequence parallelism has the same communication overhead as tensor parallelism in Megatron-LM. However, note that sequence parallelism has better compatibility with pipeline parallelism, which further reduces its communication budget. In tensor parallelism, to save communication bandwidth between pipeline stages, which are often on different nodes, the tensor is split before transmitting to the next stage and
gathered back after the transmission between pipeline stages. Since our sub-sequence activations are already split along the sequence dimension, sequence parallelism can have one less all-gather operation.
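The volumes above can be compared directly. The helper below (an illustrative sketch using the symbols as they appear in this appendix, counting elements rather than bytes) confirms that the two totals coincide.

```python
# Communication volumes from Appendix D.2 (symbols as used there; element counts, not bytes).
def sequence_parallel_volume(n, b, z, l, a):
    forward = 2 * (n - 1) * b * z * (l // n) * a    # two ring-style P2P passes
    backward = 6 * (n - 1) * b * z * (l // n) * a   # two all-reduces plus two ring-style passes
    return forward + backward                        # 8(N-1) * B * Z * (L/N) * A in total

def tensor_parallel_volume(n, b, z, l, a):
    # Four collective communications over the MLP and self-attention layers,
    # each transferring 2(N-1) * B * Z * (L/N) * A elements.
    return 4 * 2 * (n - 1) * b * z * (l // n) * a

args = dict(n=8, b=16, z=12, l=512, a=64)            # illustrative values only
assert sequence_parallel_volume(**args) == tensor_parallel_volume(**args)
```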
Table 1: MLP block memory usage comparison. M1 is the matrix before the linear layer, and M2 is the trainable matrix of the linear layer.
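As a quick numeric illustration of the MLP-block condition BL > 32H from Appendix D.1 (the batch sizes below are illustrative; H = 768 is the BERT Base hidden size and L = 512 the sequence length used in Section 3):

```python
# Check the MLP-block condition BL > 32H for a few illustrative batch sizes.
H, L = 768, 512
for B in (16, 32, 64, 128):
    print(f"B={B:3d}: BL={B * L:6d}, 32H={32 * H:6d}, "
          f"sequence parallelism more memory-efficient: {B * L > 32 * H}")
```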
E. Convergence performance

(Figure: MLM loss and SOP loss convergence curves, Megatron vs. ours.)

Our system showed the same convergence and achieved lower values for both MLM loss and SOP loss over 50k iterations.

Table 3: Efficient attention block memory usage. K is the projection dimension in Linformer (Wang et al., 2020).

F. Scaling with sequence/tensor parallelism

Figure 6: Scaling with sequence/tensor parallelism. (b) Throughput of BERT Large scaling along tensor or sequence parallel size.

Compared with the BERT Base setting, the only difference is that the tensor parallel size is at most 16 for the BERT Large model in Megatron-LM. In Figure 6(a), our method achieved a 2.7× larger batch size for BERT Large on 16 GPUs, and the batch size of sequence parallelism on 64 GPUs is 10.2× larger than that of tensor parallelism on 16 GPUs. In Figure 6(b), we observe that our sequence parallelism achieved comparable throughput at the same parallel size and, more importantly, our system can extend to a larger parallel size to achieve better performance.

G. Scaling with pipeline parallelism

Figure 7: Scaling with pipeline parallelism. (b) Throughput of BERT Large scaling along pipeline parallel size.

For BERT Large, sequence parallelism achieved a higher maximum batch size than tensor parallelism in Figure 7(a).
I. Weak scaling
Strong scaling is bounded by the batch size and sequence length that a single device can hold, so we mainly discuss weak scaling in this section. We scale the batch size and the sequence length separately when increasing the number of nodes, and we fixed the pipeline parallelism size at 8. In Table 4, sequence parallelism achieves almost constant memory usage when scaling along the global batch size, which outperforms tensor parallelism by a large margin. For weak scaling along the sequence length, our method still uses much less memory with comparable throughput.
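A minimal sketch of how the weak-scaling configurations described above could be generated (the base batch size and sequence length below are hypothetical; the text only fixes the pipeline parallel size at 8 and grows either the global batch size or the sequence length with the parallel size):

```python
# Weak scaling: grow the global batch size or the sequence length in proportion
# to the tensor/sequence parallel size (base values here are hypothetical).
PIPELINE_PARALLEL_SIZE = 8
base_batch, base_seq_len = 16, 512

for parallel_size in (1, 2, 4, 8):
    scale_batch = {"parallel": parallel_size, "batch": base_batch * parallel_size,
                   "seq_len": base_seq_len, "pipeline": PIPELINE_PARALLEL_SIZE}
    scale_seq = {"parallel": parallel_size, "batch": base_batch,
                 "seq_len": base_seq_len * parallel_size, "pipeline": PIPELINE_PARALLEL_SIZE}
    print(scale_batch, scale_seq)
```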
Table 4: Weak scaling results. Parallel size is the tensor or sequence parallel size, and batch size denotes the global batch size. Memory and Tokens/sec denote the maximum allocated memory (MB) and the number of tokens processed per second, respectively. OOM means that a CUDA out-of-memory error occurred.