Accelerating Transformer Inference for Translation via Parallel Decoding
Seidel while having mathematical guarantees on the quality of the translation. A high-level description of the method is available in Fig. 1. Our contributions can be summarized as follows:

• We reframe the standard greedy autoregressive decoding procedure in MT with a parallel formulation, introducing three parallel decoding algorithms (PJ, PGJ, HGJ) and a stopping condition that preserves translation quality.

• We perform extensive experiments with different transformer sizes (base and large) and datasets, showing speedups of up to 38% in time and a nearly 2× speedup when scaling the model on parallel resources, while preserving quality. To the best of our knowledge, this is one of the first studies to introduce a speedup in multilingual machine translation.

• We introduce a decoding dependency graph visualizer (DDGviz) to inspect the learned tokens' conditional dependencies and when parallel decoding is effective.

All the code is publicly released.³

³ https://github.com/teelinsan/parallel-decoding

2 Related Work

Gu et al. (2018) first introduced Non-Autoregressive Translation models (NAT) as ad-hoc trained models capable of producing the translation all at once, in parallel. With NATs, it is possible to consistently reduce latency and speed up translation, at the expense of slightly worse translation quality due to the multimodality problem (i.e., we lose the dependency between tokens in the target output). Finding a tradeoff between translation quality and speed is an active research direction, with current methods trying to close the gap in terms of translation quality (Geng et al., 2021; Savinov et al., 2022). Nevertheless, all proposed NAT models are learning-based and require different tricks to reach the quality of autoregressive models (Gu and Kong, 2021). The most common is sequence-level knowledge distillation of large autoregressive models into parallel models (Kim and Rush, 2016). Other approaches include defining alternative training objectives (Ghazvininejad et al., 2020a; Saharia et al., 2020; Du et al., 2021; Huang et al., 2021), architectures that model dependencies between output sentence tokens (Ghazvininejad et al., 2019; Qian et al., 2021; Song et al., 2021a; Gu and Kong, 2021; Song et al., 2022), or multi-iteration methods (Ghazvininejad et al., 2020b; Kasai et al., 2020; Hao et al., 2021; Geng et al., 2021; Savinov et al., 2022; Huang et al., 2022; Xia et al., 2022) that apply iterative refinements to a translation, trading some speed for greater quality. In our approach, we also employ iterative refinement of solutions to non-linear equations, but we do not perform any training or modification of the model. Other works that require retraining or modifications to the model add additional decoding heads (Stern et al., 2018) or use shallow decoders (Kasai et al., 2021). We refer the reader to Xiao et al. (2022) for a thorough survey on NAT methods. Further orthogonal approaches use specialized hardware (TPU) with low-precision calculations (Wu et al., 2016) or software optimizations (Kim et al., 2019). In the context of Grammatical Error Correction, Sun et al. (2021) recently proposed aggressive parallel decoding, assuming that the model output is similar to the input.
Figure 2: Parallel Decoding algorithms (panels: PJ, PGJ, HGJ). PJ resolves the whole sequence in parallel iteratively. PGJ resolves blocks in parallel; once a block is finished, it moves on to the next one and decodes it, again in parallel (in the figure, b = 3). HGJ decodes the sentence in parallel as PGJ up to a certain length h; afterwards, it proceeds autoregressively until the [EOS] token is generated. Decoding actually happens over sub-word tokens (not depicted here).
More recently, inspiring our work, Song et al. (2021b) showed that it is possible to parallelize feedforward computations by viewing them as a system of non-linear equations. They parallelized the backpropagation of RNNs, feedforward layers, and autoregressive generative models on images. We extend the approach, defined there for dense pixel prediction, to discrete conditional token generation in MT. While this work was under submission and in its anonymity period, Leviathan et al. (2022), Chen et al. (2023) and Kim et al. (2023) concurrently proposed decoding approaches that speed up inference of a large transformer model by using another, smaller model to draft tokens. Compared to these approaches, our method requires just an existing autoregressive model (no matter the size) and mathematically guarantees the output quality. In the next section we describe the method.

3 Method

In this section, we introduce notation, develop the theory behind Parallel Decoding, present three algorithms (Fig. 2), and discuss the initialization and stopping conditions for the proposed approaches.

3.1 Notation

The goal of MT is to translate a sentence x in a source language (e.g., Italian) into its translation y in the target language (e.g., English). Source and target sentences are generally tokenized into words or subwords (Kudo and Richardson, 2018; Schuster and Nakajima, 2012; Sennrich et al., 2016; Kudo, 2018); here, we use the subscript notation x = (x1, . . . , xn) and y = (y1, . . . , ym) to indicate specific tokens in the sequence. We also use the notation x1:n to indicate a slice of a sequence as a shorthand for x = (x1, . . . , xn). From a probabilistic perspective, an MT model estimates pθ(y | x). Once an MT model has been trained, the inference phase is traditionally performed by sampling tokens from the model probability conditioned on the input sequence x and the previously generated tokens (y1, . . . , yi−1):

pθ(yi | y1, . . . , yi−1, x).    (1)

Different sampling strategies are employed (e.g., Greedy, Top-K, Top-p (Kool et al., 2020; Holtzman et al., 2020)) alongside search strategies that estimate the total conditional probability (e.g., Greedy search, Beam search (Reddy, 1977)). The most straightforward strategy, Greedy Search, selects the element yi of a sequence with:

yi = arg max pθ(yi | y1:i−1, x).    (2)

Given the formalization above, a standard autoregressive setting runs m inference steps sequentially to generate an output sequence of m elements.

Parallel Decoding. Given Equation (2), it is possible to write the greedy decoding procedure over all tokens as:

y1 = arg max pθ(y1 | x)
y2 = arg max pθ(y2 | y1, x)
...
ym = arg max pθ(ym | y1:m−1, x)    (3)

Defining f(yi, y1:i−1, x) = yi − arg max pθ(yi | y1:i−1, x), we can rewrite the system of Equations (3) as:

f(y1, x) = 0
f(y2, y1, x) = 0
...
f(ym, y1:m−1, x) = 0    (4)

This system has m non-linear equations (each equation involves a neural network) with m variables.

3.2 Parallel Decoding Algorithms

Autoregressive decoding implicitly solves the system of Equations (4) by substitution, i.e., given the [BOS] token and the input sentence x, it solves the equations from first to last, progressively substituting the resolved variables. In this paper, we rely on Jacobi and Gauss-Seidel (GS) fixed-point iteration methods (Ortega and Rheinboldt, 1970) to solve system (4) in parallel until a stopping condition is reached. This formulation is particularly flexible and has several advantages: Firstly, it is completely
agnostic to the underlying MT model used; Secondly, it can be analyzed with analytical tools and has guarantees of convergence to the exact solution of system (4); Thirdly, it can potentially be extended by drawing on the numerical-methods literature for solving non-linear equations (Saad, 2003). We will see that, with the proper stopping condition, it is possible to have quality guarantees over the output. We present here three algorithms (PJ, PGJ, HGJ) that leverage these fixed-point iteration methods to speed up decoding in MT.

Algorithm 1 Parallel Jacobi Decoding
Input: x = (x1, . . . , xn), pθ
Output: y = (y1, . . . , ym)
 1: y ← INITT(x)
 2: m ← len(y)
 3: for i = 1 to m do
 4:   o ← copy(y1:m)
 5:   y1:m ← arg max(pθ(y1:m | y1:m, x))
 6:   stop ← STOPC(o, y1:m)
 7:   if stop then
 8:     break
 9:   end if
10: end for
11: return y
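To make the fixed-point view concrete, the sketch below runs a Jacobi-style greedy update with an off-the-shelf Hugging Face seq2seq model: the decoder scores all draft positions in one forward pass (thanks to causal masking) and every token in the current block is refreshed with its greedy arg max until the block stops changing. The function and parameter names (jacobi_decode, block_size), the pad-token initialization, the fixed maximum length, and the convergence check are our illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of Jacobi / GS-Jacobi greedy decoding with a Hugging Face
# seq2seq model. All names and the simplified length/stopping handling are
# illustrative assumptions, not the paper's released code.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

@torch.no_grad()
def jacobi_decode(model, tokenizer, src, max_len=64, block_size=None):
    enc = tokenizer(src, return_tensors="pt")
    start = model.config.decoder_start_token_id
    pad = tokenizer.pad_token_id
    # INITT: initialize the whole draft translation (here with pad tokens).
    y = torch.full((1, max_len), pad, dtype=torch.long)
    b = block_size or max_len            # block_size = max_len recovers plain PJ
    i = 0
    while i < max_len:
        for _ in range(b):               # a block of size b converges in at most b iterations
            dec_in = torch.cat([torch.tensor([[start]]), y[:, :-1]], dim=1)
            logits = model(**enc, decoder_input_ids=dec_in).logits
            new_block = logits[:, i:i + b].argmax(-1)      # refresh y_{i+1:i+b} in parallel
            stop = torch.equal(new_block, y[:, i:i + b])   # STOPC: block reached a fixed point
            y[:, i:i + b] = new_block
            if stop:
                break
        i += b
    out = y[0].tolist()
    if tokenizer.eos_token_id in out:                      # CHECKEOS: truncate at [EOS]
        out = out[:out.index(tokenizer.eos_token_id)]
    return tokenizer.decode(out, skip_special_tokens=True)

if __name__ == "__main__":
    name = "Helsinki-NLP/opus-mt-en-de"   # Opus tag pattern used in the experiments
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name).eval()
    print(jacobi_decode(model, tok, "Parallel decoding is fast.", block_size=3))
```

Setting block_size to the draft length mimics PJ, while a small block (e.g., b = 3) mimics the GS-Jacobi behaviour of PGJ; a production implementation would also cache the encoder states and past keys/values instead of re-running the full forward pass at every iteration, and HGJ would additionally handle the target length without a fixed max_len.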
Table 1: Comparison of parallel decoding algorithms (highlighted in grey) with sequential decoding using Opus
(CPU) and MBart50 (GPU) on WMT14 and WMT16. Speed is measured in time w.r.t. the autoregressive baseline.
Table 2: Comparison over different languages in terms of speedup and iterations on MBart50. Arrows indicate the direction of translation. Qualitative results and BLEU scores are available in Appendix D.
equations with m variables, this characterization allows us to state an important property.

Proposition 1. Algorithms 1, 2, 3 converge and yield the same results as greedy autoregressive decoding in at most m parallel iterations, for any initialization and provided stopping condition (5) is used.

We refer the reader to Song et al. (2021b) for a formal proof. Intuitively, within m steps the algorithm has used the same number of iterations as autoregressive decoding, hence the final solution is the same regardless of the initialization. In this worst case the wall-clock time is the same, but in general the algorithm reaches the stopping condition earlier, with a lower wall-clock time and an overall speedup.

3.5 DDGviz

Equation 1 models the dependency between tokens in the decoding phase. In the classical autoregressive mode, each token depends on all the previous ones for its generation. However, it is possible to show that this dependency is actually relaxed (i.e., not all tokens depend on all the previous ones), thus it is interesting to visualize the actual distribution pθ(yi | ·, x) learned by an existing MT model. To this end, we build the Decoding Dependency Graph visualizer (DDGviz) to visualize the dependency graph of tokens in the decoding phase. In standard autoregressive decoding this graph is a fully-connected chain where the i-th token is connected to all the previous tokens, starting from the encoding x: to decode yi you need to first decode y1, . . . , yi−1. Instead, we show that there are skipping connections between independent tokens that can be visualized with DDGviz. We detail DDGviz with an example in Section 4.3.

4 Experiments

4.1 Experimental Settings

Datasets. We evaluate our approach using standard evaluation datasets proposed for parallel MT (Gu et al., 2018): WMT14 English-German [En-De] and WMT16 English-Romanian [En-Ro] (Bojar et al., 2014, 2016). Additionally, we tested our method on different language pairs with varying (low-medium) resources: IWSLT15 (English-Vietnamese [En-Vi]) (Tran et al., 2015), IITB (English-Hindi [En-Hi]) (Kunchukuttan et al., 2018), WMT17 (English-Finnish [En-Fi]) (Bojar et al., 2017), and FLORES-101 (English-Italian [En-It]; English-French [En-Fr]) (Goyal et al., 2022). All the datasets are evaluated in both directions.

Evaluation. All the evaluations are performed using the official test split for each dataset, downloaded using the Huggingface dataset library (Lhoest et al., 2021). No training or hyperparameter tuning is performed. We use SacreBLEU to evaluate translation quality (Papineni et al., 2002; Post, 2018). We measure speedup in wall-clock time and iterations w.r.t. the same autoregressive model. GPU times are measured after calling torch.cuda.synchronize(). All the experiments were performed by caching the past Keys and Values of the transformer to further speed up the computation (Ramachandran et al., 2017), and in the online inference setting with batch size equal to 1. For the Jacobi and GS-Jacobi algorithms, we assume the length m of the target to be known beforehand and measure the speedup in this ideal condition. For the Hybrid GS-Jacobi algorithm, we set h to the maximum (i.e., the stopping condition is triggered within a parallel block) to decouple the effective speedup from the length produced by the initialization function (see Section 3.2). We remark that HGJ does not assume to know the target length beforehand and is applicable to real MT translation scenarios.
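As a concrete reference for this evaluation setup, the snippet below loads an official test split with the Huggingface dataset library and scores hypotheses with SacreBLEU. The dataset configuration name and the translate callable are illustrative assumptions rather than the exact evaluation script.

```python
# Minimal sketch of the evaluation loop: official test split + SacreBLEU.
# The dataset config and the `translate` callable are illustrative assumptions.
from datasets import load_dataset
import sacrebleu

def evaluate(translate, src_lang="en", tgt_lang="de"):
    test = load_dataset("wmt14", "de-en", split="test")     # official WMT14 test split
    sources = [ex["translation"][src_lang] for ex in test]
    references = [ex["translation"][tgt_lang] for ex in test]
    hypotheses = [translate(s) for s in sources]             # e.g., jacobi_decode(...)
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    print(f"BLEU = {bleu.score:.2f}")
```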
Method | Requirements (Arch / Loss / seq-KD) | WMT14 (Speed ↑ / BLEU ↑) | Efficiency (Train FLOPs ↓ / Total FLOPs ↓ / FLOPs per Speed ↓)
Parallel Decoding - HGJ (Ours) | No / No / No | 1.34× / 28.24 | 0 / 2.53e+13 / 1.89e+13
SUNDAE † (Savinov et al., 2022) | Yes / No / No | 1.4× / 28.46 | 5.27e+21 / 5.27e+21 / 3.77e+21
ShallowDec (12-1) (Kasai et al., 2021) | Yes / No / No | 1.4× / 26.90 | 1.02e+19 / 1.02e+19 / 7.30e+18
Semi-NAT (Wang et al., 2018) | Yes / No / Yes | 1.5× / 26.90 | 1.55e+17 / 1.55e+17 / 1.03e+17
DisCo (Kasai et al., 2020) | Yes / Yes / Yes, Big | 3.5× / 27.34 | 4.06e+19 / 4.06e+19 / 1.16e+19
DSLP (Huang et al., 2021) | Yes / Yes / Yes | 14.8× / 27.02 | 1.93e+19 / 1.93e+19 / 1.31e+18
F-VAE (Gu and Kong, 2021) | Yes / Yes / Yes, Big | 16.5× / 27.49 | 4.06e+19 / 4.06e+19 / 2.46e+18

Table 3: Comparison of different methods for parallel MT on WMT14 En-De. Results are ordered by speed; the two highest BLEU scores are highlighted in green; † indicates diffusion models. Existing methods require training, architecture modifications, additional losses to force parallel translation, and distillation from an additional MT transformer model ("Big" indicates the size). Details on the FLOPs computation are available in Appendix C.
Model Configuration. We tested transformer models in the two standard configurations: base (512 model dimension, 6 attention layers for both encoder and decoder) and big (1024 model dimension, 12 attention layers for both encoder and decoder). We used pretrained models of Opus (Tiedemann and Thottingal, 2020) for the former and MBart50 (Tang et al., 2020) for the latter. Opus is a transformer base model (74M parameters) trained on language pairs from the homonymous dataset (Zhang et al., 2020). MBart50 is a large multilingual transformer model fine-tuned for translation on 50 languages (610M parameters). We tested the models on CPU, since this is the default environment for MT models in production, except for MBart50, which runs on GPU. We run the experiments on a standard 16-core machine, except for the scaling experiments. Additional specifications are available in Appendix B.

4.2 Algorithms Comparison

In Table 1 we compare the proposed parallel decoding algorithms with the standard sequential autoregressive decoding baselines. As we can observe, the fastest algorithms are PGJ Decoding (b=3) and HGJ Decoding (b=3), which are up to 34% and 38% faster on Opus and up to 5% and 8% faster on MBart50, depending on the language pair. We also note that the results empirically show that all the parallel decoding algorithms guarantee the same quality as greedy autoregressive decoding, as evidenced by the unchanged BLEU scores. This is an experimental verification of the formal Proposition 1. The table also shows that the Beam Search algorithm with a beam size of 5 generally performs better in terms of BLEU score, although at a cost in speed. This difference in BLEU is expected, as beam search is a heuristic search strategy, while our method is a decoding algorithm. We discuss this aspect further in the "Beam Search" paragraph. Nevertheless, beam search is ∼30% slower than greedy autoregressive decoding and 63% to 68% slower than PGJ, depending on the model and language pair. This means that the proposed parallel algorithms allow trading a little translation quality (e.g., on en→ro the difference in BLEU between beam search and the parallel decoding algorithms is just 0.20 points) for greater decoding speed.

Another aspect to note is that the algorithms PJ and PGJ (b=5) are sometimes slower than greedy autoregressive decoding. There are several factors that can influence the actual wall-clock time, such as how the underlying hardware schedules and executes the various operations, which may vary according to the architecture and the workload. In particular, longer sequences (e.g., the whole sentence in PJ or blocks of 5 tokens in PGJ) may require more memory to store, and the CPU/GPU may have to perform more memory accesses, which can slow down the computation (although theoretically it should happen in parallel). In the end, these computational overheads slow down the actual execution. This is also the case for the difference in speedups between MBart50 and Opus. We investigate this aspect further in the "Computational Scaling" section and report in the appendix results on a different architecture, together with results in terms of iteration speedups, which are architecture agnostic.
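Because the wall-clock comparisons above are sensitive to asynchronous GPU execution, a timing is meaningful only if the device is synchronized before reading the clock, as mentioned in the evaluation setup. The helper below is a minimal sketch of such a measurement; the helper name timed, the use of time.perf_counter, and the commented usage with hypothetical greedy_decode/jacobi_decode calls are our choices, not the paper's benchmarking code.

```python
import time
import torch

def timed(fn, *args, **kwargs):
    """Wall-clock a decoding call, synchronizing the GPU before and after."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start

# Speedup of a parallel algorithm w.r.t. the greedy autoregressive baseline:
# _, t_greedy = timed(greedy_decode, model, tok, sentence)
# _, t_pgj = timed(jacobi_decode, model, tok, sentence, block_size=3)
# print(f"speedup: {t_greedy / t_pgj:.2f}x")
```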
Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhattacharyya. 2018. The IIT Bombay English-Hindi parallel corpus. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2022. Fast inference from transformers via speculative decoding.

Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario Šaško, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, Angelina McMillan-Major, Philipp Schmid, Sylvain Gugger, Clément Delangue, Théo Matussière, Lysandre Debut, Stas Bekman, Pierric Cistac, Thibault Goehringer, Victor Mustar, François Lagunas, Alexander Rush, and Thomas Wolf. 2021. Datasets: A community library for natural language processing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 175–184, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Xiaodong Liu, Kevin Duh, Liyuan Liu, and Jianfeng Gao. 2020. Very deep transformers for neural machine translation. arXiv preprint arXiv:2008.07772.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

J.M. Ortega and W.C. Rheinboldt. 1970. Iterative Solution of Nonlinear Equations in Several Variables. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Belgium, Brussels. Association for Computational Linguistics.

Emilian Postolache, Giorgio Mariani, Michele Mancusi, Andrea Santilli, Cosmo Luca, Emanuele Rodola, et al. 2023. Latent autoregressive source separation. In Proceedings of the AAAI Conference on Artificial Intelligence.

Lihua Qian, Hao Zhou, Yu Bao, Mingxuan Wang, Lin Qiu, Weinan Zhang, Yong Yu, and Lei Li. 2021. Glancing transformer for non-autoregressive neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1993–2003, Online. Association for Computational Linguistics.

Prajit Ramachandran, Tom Le Paine, Pooya Khorrami, Mohammad Babaeizadeh, Shiyu Chang, Yang Zhang, Mark A. Hasegawa-Johnson, Roy H. Campbell, and Thomas S. Huang. 2017. Fast generation for convolutional autoregressive models. CoRR, abs/1704.06001.

Raj Reddy. 1977. Speech understanding systems: A summary of results of the five-year research effort. Carnegie Mellon University.

Yousef Saad. 2003. Iterative Methods for Sparse Linear Systems. SIAM.

Chitwan Saharia, William Chan, Saurabh Saxena, and Mohammad Norouzi. 2020. Non-autoregressive machine translation with latent alignments. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1098–1108, Online. Association for Computational Linguistics.

Nikolay Savinov, Junyoung Chung, Mikolaj Binkowski, Erich Elsen, and Aaron van den Oord. 2022. Step-unrolled denoising autoencoders for text generation. In International Conference on Learning Representations.

Beatrice Savoldi, Marco Gaido, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2021. Gender bias in machine translation. Transactions of the Association for Computational Linguistics, 9:845–874.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2012, Kyoto, Japan, March 25-30, 2012, pages 5149–5152. IEEE.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E Gonzalez, et al. 2023. High-throughput generative inference of large language models with a single gpu. arXiv preprint arXiv:2303.06865.

Jongyoon Song, Sungwon Kim, and Sungroh Yoon. 2021a. AligNART: Non-autoregressive neural machine translation by jointly learning to estimate alignment and translate. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1–14, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Yang Song, Chenlin Meng, Renjie Liao, and Stefano Ermon. 2021b. Accelerating feedforward computation via parallel nonlinear equation solving. In International Conference on Machine Learning, pages 9791–9800. PMLR.

Zhenqiao Song, Hao Zhou, Lihua Qian, Jingjing Xu, Shanbo Cheng, Mingxuan Wang, and Lei Li. 2022. switch-GLAT: Multilingual parallel machine translation via code-switch decoder. In International Conference on Learning Representations.

Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. 2018. Blockwise parallel decoding for deep autoregressive models. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, Florence, Italy. Association for Computational Linguistics.

Xin Sun, Tao Ge, Furu Wei, and Houfeng Wang. 2021. Instantaneous grammatical error correction with shallow aggressive decoding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5937–5947, Online. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27.

Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2020. Multilingual translation with extensible multilingual pretraining and finetuning. CoRR, abs/2008.00401.

Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT — Building open translation services for the World. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT), Lisbon, Portugal.

Viet Hong Tran, Huyen Vu Thong, Nguyen Van-Vinh, and Trung Le Tien. 2015. The English-Vietnamese machine translation system for IWSLT 2015. In Proceedings of the 12th International Workshop on Spoken Language Translation: Evaluation Campaign, pages 80–83, Da Nang, Vietnam.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Chunqi Wang, Ji Zhang, and Haiqing Chen. 2018. Semi-autoregressive neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 479–488, Brussels, Belgium. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Heming Xia, Tao Ge, Furu Wei, and Zhifang Sui. 2022. Lossless speedup of autoregressive translation with generalized aggressive decoding.

Yisheng Xiao, Lijun Wu, Junliang Guo, Juntao Li, Min Zhang, Tao Qin, and Tie-yan Liu. 2022. A survey on non-autoregressive generation for neural machine translation and beyond. arXiv preprint arXiv:2204.09269.

Biao Zhang, Philip Williams, Ivan Titov, and Rico Sennrich. 2020. Improving massively multilingual neural machine translation and zero-shot translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1628–1639, Online. Association for Computational Linguistics.
A Algorithms details

We propose here the pseudocode of Algorithms 2 and 3, omitted from the main body of the paper due to space limitations.

The function copy(yi:i+b) creates a copy of the input tensor detached from the source. This is done in practice to avoid overwriting pointers to the same memory location. Function CHECKEOS(yi:i+b) returns the index of the token EOS in the block if present, else −1. Function ISEOS(yi) returns True if the token is exactly the token EOS, else False. The function arg max selects from the model distribution over the vocabulary the index (token) with maximum probability. This procedure is done for all the tokens in parallel, in the case of parallel decoding, or for just a single token in the case of autoregressive decoding. Generally, the output is the prediction for the next token; hence it should be shifted left before the reassignment to a variable. We omitted this implementation detail for clarity.

Algorithm 2 Parallel GS-Jacobi Decoding
Input: x = (x1, . . . , xn), pθ, b
Output: y = (y1, . . . , ym)
 1: y ← INITT(x)
 2: m ← len(y)
 3: i ← 1
 4: while i ≤ m do
 5:   o ← copy(yi:i+b)
 6:   yi:i+b ← arg max(pθ(yi:i+b | y1:i+b, x))
 7:   stop ← STOPC(o, yi:i+b)
 8:   if stop then
 9:     i ← i + b
10:     break
11:   end if
12: end while
13: return y

B Additional implementation details

We run the Opus experiments of Table 1 on an AMD EPYC Milan with 16 cores at 2.45 GHz and 64GB of RAM (accessible on Google Cloud as c2d-standard-16). For the scalability experiment in Figure 3, we also used Google Cloud instances with an increasing number of cores (referred to as c2d-standard-XX, where XX is the number of used cores). Experiments with MBart50 in Tables 1, 2 and 6 are performed on a desktop machine with Ubuntu 20.04.4 LTS, an AMD Ryzen 9 3900X 12-Core Processor, 32GB of RAM, and a Palit Nvidia 3090 GPU. Additional experiments with Opus in Table 6 are also performed on this machine. Models are implemented in Pytorch 1.11.0 (Paszke et al., 2019) and the Huggingface Transformers library (Wolf et al., 2020). We used python 3.8 and NVIDIA-SMI drivers 510.73.05 with CUDA version 11.6. For Opus we used the Huggingface models available on the hub under the tag Helsinki-NLP/opus-mt-{src}-{tgt}, except for the language pair Ro-En, where we used the model Helsinki-NLP/opus-mt-roa-en, and the pair En-De, where we used the checkpoint opus-2021-02-22⁴. For the model MBart50, we used the facebook pretrained model available on the hub with the tag mbart-large-50-many-to-many-mmt. Since this is a multilingual model, we prepend the source and target language tags corresponding to the language pair to be translated. We report results for a single run over the test dataset, since we found low variance in the estimates over multiple runs, which can be run by simply varying the corresponding parameter in the config.yaml file. For each dataset, we used the official test split via the Huggingface dataset library (Lhoest et al., 2021). Dataset statistics are reported in Table 4.

⁴ https://object.pouta.csc.fi/Tatoeba-MT-models/eng-deu/opus-2021-02-22.zip

Dataset | # Test
WMT14 De-En (Bojar et al., 2014) | 3003
WMT16 Ro-En (Bojar et al., 2016) | 1999
WMT17 Fi-En (Bojar et al., 2017) | 3002
IWSLT15 En-Vi (Tran et al., 2015) | 1046
IITB En-Hi (Kunchukuttan et al., 2018) | 2507
FLORES-101 En-It (Goyal et al., 2022) | 1012
FLORES-101 En-Fr (Goyal et al., 2022) | 1012

Table 4: Dataset statistics.
C FLOPs calculation details

We measured computational complexity using floating point operations (FLOPs), which, as the name implies, counts the number of floating point operations performed by a model. This is a standard metric used in the literature to measure hardware-agnostic complexity.

⁵ https://github.com/google-research/electra/blob/master/flops_computation.py

Algorithm 3 Hybrid GS-Jacobi Decoding
Input: x = (x1, . . . , xn), pθ, b
Output: y = (y1, . . . , ym)
 1: y ← INITT(x)
 2: h ← len(y)
 3: i ← 1
 4: eos_cond ← False
 5: while i ≤ h do
 6:   o ← copy(yi:i+b)
 7:   yi:i+b ← arg max(pθ(yi:i+b | y1:i+b, x))
 8:   stop ← STOPC(o, yi:i+b)
 9:   eos_ind ← CHECKEOS(yi:i+b)
10:   if stop and eos_ind > −1 then
11:     y ← y1:eos_ind
12:     eos_cond ← True
13:     break
14:   end if
15:   if stop then
16:     i ← i + b
17:     break
18:   end if
19: end while
20: while eos_cond != True do
21:   yi ← arg max(pθ(yi | yi−1, x))
22:   i ← i + 1
23:   eos_cond ← ISEOS(yi)
24: end while
25: return y

Model | Train FLOPs | Infer. FLOPs | Total FLOPs
Semi-NAT | 1.55e17 | 2.08e13 | 1.55e17
Shallow Dec. | 1.02e19 | 1.15e13 | 1.02e19
DSLP | 1.93e19 | 1.58e13 | 1.93e19
F-VAE | 4.06e19 | 1.58e13 | 4.06e19
DisCo | 4.06e19 | 1.58e13 | 4.06e19
SUNDAE | 5.27e21 | 1.58e14 | 5.27e21
BERT base | 6.43e19 | - | -
BERT large | 1.92e20 | - | -
RoBERTa | 3.19e21 | - | -

Table 5: FLOPs comparison with other models.

D Additional results

We propose here additional results for the experiments in the paper that were omitted due to space limitations. Table 6 shows the same experiments as Table 1 in the main paper, proposed here on a standard desktop CPU, with the speedup also reported in terms of iterations. It is possible to observe that in the case of MBart50 and PGJ there is a speedup of 8–11% in terms of iterations, compared to a time speedup of 3–8%. This means that there is room for improvement for our algorithm. Furthermore, the results show that the time speedups are consistent also on standard desktop hardware. Table 7 shows the BLEU scores for the cross-lingual experiment. It is possible to observe that the parallel decoding algorithms guarantee quality compared to greedy autoregressive decoding and are not so distant from beam search. We also show here in Table 5 some qualitative results for the experiments in Table 2. Finally, we propose additional visualizations using DDGviz in Figure 6.
Decoding Algorithm | en→de (Time / Iters) | de→en (Time / Iters) | en→ro (Time / Iters) | ro→en (Time / Iters)

Opus
Greedy Autoregressive | 1.00× / 1.00× | 1.00× / 1.00× | 1.00× / 1.00× | 1.00× / 1.00×
Beam Search (beam = 5) | 0.71× / 1.00× | 0.71× / 1.00× | 0.70× / 1.00× | 0.72× / 1.00×
PJ Decoding | 0.72× / 1.03× | 0.74× / 1.04× | 0.69× / 1.04× | 0.67× / 1.03×
PGJ Decoding (b = 3) | 1.16× / 1.04× | 1.19× / 1.07× | 1.17× / 1.05× | 1.17× / 1.03×
HGJ Decoding (b = 3) | 1.16× / 1.04× | 1.19× / 1.06× | 1.17× / 1.05× | 1.17× / 1.03×

MBart50
Greedy Autoregressive | 1.00× / 1.00× | 1.00× / 1.00× | 1.00× / 1.00× | 1.00× / 1.00×
Beam Search (beam = 5) | 0.76× / 1.00× | 0.77× / 1.00× | 0.77× / 1.00× | 0.76× / 1.00×
PJ Decoding | 0.88× / 1.03× | 0.88× / 1.03× | 0.86× / 1.04× | 0.85× / 1.03×
PGJ Decoding (b = 3) | 1.06× / 1.10× | 1.08× / 1.11× | 1.03× / 1.08× | 1.04× / 1.11×
HGJ Decoding (b = 3) | 1.05× / 1.07× | 1.07× / 1.01× | 1.01× / 1.02× | 1.02× / 1.08×

Table 6: Comparison of parallel decoding algorithms (highlighted in grey) with sequential decoding using Opus (CPU) and MBart50 (GPU) on WMT14 and WMT16. Speed is shown here both in Time and Iterations w.r.t. the greedy autoregressive baseline.
Table 7: Translation examples generated with the autoregressive (A) and the different decoding algorithms proposed (PJ, PGJ, HGJ) on Opus (WMT datasets) and MBart50. The decoding time is shown in seconds.

(a) En-De: "Lack of Scots title race bores Dutch - de Boer" → "Fehlende Schottentitelrennen bohrt Niederlandisch - de Boer"

(b) De-En: "Private Fachgeschäfte und auch den Großhandel gibt es fast nicht mehr." → "Private specialist shops and wholesale trade are almost no longer available."

(c) Ro-En: "Un prim contract de lucrări a fost reziliat în aprilie 2012, după ce se efectuaseră lucrări de 4,5 milioane lei." → "A first contract of employment was terminated in April 2012, after a work of 4.5 million lei."

(d) En-Ro: "'Shot in Joburg': Homeless youth trained as photographers" → "'Fotografii in Joburg': Tineri fără adăpost formaţi ca fotografi"

(e) De-En: "Einige sind nach der Installation auf Probleme gestoßen, da sie eine Fehlermeldung erhalten, die mitteilt, dass die 'Software-Aktualisierung fehlgeschlagen' ist." → "Some have encountered problems after installation, as they receive an error message that tells us that 'software update has failed'."

(f) Ro-En: "Se pare că va fi acuzat de fugă de la locul accidentului, neoferirea primului ajutor și alte infracțiuni rutiere." → "Apparently he'll be charged with running from the scene of the accident, the first aid and other road crimes."