Accelerating Transformer Inference for Translation via Parallel Decoding
Seidel while having mathematical guarantees on the quality of the translation. A high-level description of the method is available in Fig. 1. Our contributions can be summarized as follows:

• We reframe the standard greedy autoregressive decoding procedure in MT with a parallel formulation, introducing three parallel decoding algorithms (PJ, PGJ, HGJ) and a stopping condition that preserves translation quality.

• We perform extensive experiments with different transformer sizes (base and large) and datasets, showing speedups of up to 38% in time and a nearly 2× speedup when scaling the model on parallel resources, while preserving quality. To the best of our knowledge, this is one of the first studies to introduce a speedup in multilingual machine translation.

• We introduce a decoding dependency graph visualizer (DDGviz) to inspect the learned tokens' conditional dependencies and when parallel decoding is effective.

All the code is publicly released.³

³ https://github.com/teelinsan/parallel-decoding

2 Related Work

Gu et al. (2018) first introduced Non-Autoregressive Translation models (NAT) as ad-hoc trained models capable of producing the translation all at once, in parallel. With NATs, it is possible to consistently reduce latency and speed up translation, at the expense of slightly worse translation quality due to the multimodality problem (i.e., we lose the dependency between tokens in the target output). Finding a tradeoff between translation quality and speed is an active research direction, with current methods trying to close the gap in terms of translation quality (Geng et al., 2021; Savinov et al., 2022). Nevertheless, all proposed NAT models are learning-based and require different tricks to reach the quality of autoregressive models (Gu and Kong, 2021). The most common is sequence-level knowledge distillation of large autoregressive models into parallel models (Kim and Rush, 2016). Other approaches include defining alternative training objectives (Ghazvininejad et al., 2020a; Saharia et al., 2020; Du et al., 2021; Huang et al., 2021), architectures that model dependencies between output sentence tokens (Ghazvininejad et al., 2019; Qian et al., 2021; Song et al., 2021a; Gu and Kong, 2021; Song et al., 2022), or multi-iteration methods (Ghazvininejad et al., 2020b; Kasai et al., 2020; Hao et al., 2021; Geng et al., 2021; Savinov et al., 2022; Huang et al., 2022; Xia et al., 2022) that apply iterative refinements to a translation, trading some speed for greater quality. In our approach, we also employ iterative refinement of solutions to non-linear equations, but we do not perform any training or modification of the model. Other works that require retraining or modifications to the model add additional decoding heads (Stern et al., 2018) or use shallow decoders (Kasai et al., 2021). We refer the reader to Xiao et al. (2022) for a thorough survey on NAT methods. Further orthogonal approaches use specialized hardware (TPU) with low-precision calculations (Wu et al., 2016) or software optimizations (Kim et al., 2019). In the context of Grammatical Error Correction, Sun et al. (2021) recently proposed aggressive parallel decoding, assuming that the model output is similar to the input.
Figure 2: Parallel Decoding algorithms (panels: PJ, PGJ, HGJ). PJ resolves the whole sequence in parallel iteratively. PGJ resolves blocks in parallel; once a block is finished, it moves on to the next one and decodes it, again in parallel (in the figure, b = 3). HGJ decodes the sentence in parallel as PGJ up to a certain length h; afterwards, it proceeds autoregressively until the [EOS] token is generated. Decoding actually happens over sub-word tokens (not depicted here).
More recently, inspiring our work, Song et al. (2021b) showed that it is possible to parallelize feedforward computations by viewing them as a system of non-linear equations. They parallelized the backpropagation of RNNs, feedforward layers, and autoregressive generative models on images. We extend the approach, defined there for dense pixel prediction, to discrete conditional token generation in MT. While this work was under submission and in its anonymity period, Leviathan et al. (2022), Chen et al. (2023) and Kim et al. (2023) concurrently proposed decoding approaches that speed up inference of a large transformer model by using another, smaller model to draft tokens. Compared to these approaches, our method requires just an existing autoregressive model (no matter the size) and mathematically guarantees the output quality. In the next section we describe the method.

3 Method

In this section, we introduce notation, develop the theory behind Parallel Decoding, present three algorithms (Fig. 2), and discuss the initialization and stopping conditions for the proposed approaches.

3.1 Notation

The goal of MT is to translate a sentence x in a source language (e.g., Italian) into its translation y in the target language (e.g., English). Source and target sentences are generally tokenized into words or subwords (Kudo and Richardson, 2018; Schuster and Nakajima, 2012; Sennrich et al., 2016; Kudo, 2018); here, we use the subscript notation x = (x1, . . . , xn) and y = (y1, . . . , ym) to indicate specific tokens in the sequence. We also use the notation x1:n to indicate a slice of a sequence as a shorthand for x = (x1, . . . , xn). From a probabilistic perspective, an MT model estimates pθ(y | x). Once an MT model has been trained, the inference phase is traditionally performed by sampling tokens from the model probability conditioned on the input sequence x and the previously generated tokens (y1, . . . , yi−1):

pθ(yi | y1, . . . , yi−1, x).    (1)

Different sampling strategies are employed (e.g., Greedy, Top-K, Top-p (Kool et al., 2020; Holtzman et al., 2020)) alongside search strategies that estimate the total conditional probability (e.g., Greedy search, Beam search (Reddy, 1977)). The most straightforward strategy, Greedy Search, selects the element yi of a sequence with:

yi = arg max pθ(yi | y1:i−1, x).    (2)

Given the formalization above, a standard autoregressive setting runs m inference steps sequentially to generate an output sequence of m elements.

Parallel Decoding. Given Equation (2), it is possible to write the greedy decoding procedure over all tokens as:

y1 = arg max pθ(y1 | x)
y2 = arg max pθ(y2 | y1, x)
...
ym = arg max pθ(ym | y1:m−1, x)    (3)

Defining f(yi, y1:i−1, x) = yi − arg max pθ(yi | y1:i−1, x), we can rewrite the system of Equations (3) as:

f(y1, x) = 0
f(y2, y1, x) = 0
...
f(ym, y1:m−1, x) = 0    (4)

This system has m non-linear equations (each equation involves a neural network) with m variables.

3.2 Parallel Decoding Algorithms

Autoregressive decoding implicitly solves the system of Equations (4) by substitution, i.e., given the [BOS] token and the input sentence x, it solves the equations from first to last, progressively substituting the resolved variables. In this paper, we rely on Jacobi and Gauss-Seidel (GS) fixed-point iteration methods (Ortega and Rheinboldt, 1970) to solve system (4) in parallel until a stopping condition is reached. This formulation is particularly flexible and has several advantages: Firstly, it is completely
agnostic to the underlying MT model used; Secondly, it can be analyzed with analytical tools and has guarantees of convergence to the exact solution of system (4); Thirdly, it can potentially be extended by drawing on the numerical-methods literature for solving non-linear equations (Saad, 2003). We will see that, with the proper stopping condition, it is possible to have quality guarantees over the output. We present here three algorithms (PJ, PGJ, HGJ) that leverage these fixed-point iteration methods to speed up decoding in MT.

Algorithm 1 Parallel Jacobi Decoding
Input: x = (x1, . . . , xn), pθ
Output: y = (y1, . . . , ym)
 1: y ← INITT(x)
 2: m ← len(y)
 3: for i = 1 to m do
 4:   o ← copy(y1:m)
 5:   y1:m ← arg max(pθ(y1:m | y1:m, x))
 6:   stop ← STOPC(o, y1:m)
 7:   if stop then
 8:     break
 9:   end if
10: end for
11: return y
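To make the fixed-point view concrete, the sketch below runs a Jacobi-style greedy update with an off-the-shelf Hugging Face seq2seq model: the decoder scores all draft positions in one forward pass (thanks to causal masking) and every token in the current block is refreshed with its greedy arg max until the block stops changing. The function and parameter names (jacobi_decode, block_size), the pad-token initialization, the fixed maximum length, and the convergence check are our illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of Jacobi / GS-Jacobi greedy decoding with a Hugging Face
# seq2seq model. All names and the simplified length/stopping handling are
# illustrative assumptions, not the paper's released code.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

@torch.no_grad()
def jacobi_decode(model, tokenizer, src, max_len=64, block_size=None):
    enc = tokenizer(src, return_tensors="pt")
    start = model.config.decoder_start_token_id
    pad = tokenizer.pad_token_id
    # INITT: initialize the whole draft translation (here with pad tokens).
    y = torch.full((1, max_len), pad, dtype=torch.long)
    b = block_size or max_len            # block_size = max_len recovers plain PJ
    i = 0
    while i < max_len:
        for _ in range(b):               # a block of size b converges in at most b iterations
            dec_in = torch.cat([torch.tensor([[start]]), y[:, :-1]], dim=1)
            logits = model(**enc, decoder_input_ids=dec_in).logits
            new_block = logits[:, i:i + b].argmax(-1)      # refresh y_{i+1:i+b} in parallel
            stop = torch.equal(new_block, y[:, i:i + b])   # STOPC: block reached a fixed point
            y[:, i:i + b] = new_block
            if stop:
                break
        i += b
    out = y[0].tolist()
    if tokenizer.eos_token_id in out:                      # CHECKEOS: truncate at [EOS]
        out = out[:out.index(tokenizer.eos_token_id)]
    return tokenizer.decode(out, skip_special_tokens=True)

if __name__ == "__main__":
    name = "Helsinki-NLP/opus-mt-en-de"   # Opus tag pattern used in the experiments
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name).eval()
    print(jacobi_decode(model, tok, "Parallel decoding is fast.", block_size=3))
```

Setting block_size to the draft length mimics PJ, while a small block (e.g., b = 3) mimics the GS-Jacobi behaviour of PGJ; a production implementation would also cache the encoder states and past keys/values instead of re-running the full forward pass at every iteration, and HGJ would additionally handle the target length without a fixed max_len.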
Table 1: Comparison of parallel decoding algorithms (highlighted in grey) with sequential decoding using Opus
(CPU) and MBart50 (GPU) on WMT14 and WMT16. Speed is measured in time w.r.t. the autoregressive baseline.
Table 2: Comparison over different languages in terms of speedup and iterations on MBart50. Arrows indicate the direction of translation. Qualitative results and BLEU scores are available in Appendix D.
equations with m variables, this characterization allows us to state an important property.

Proposition 1. Algorithms 1, 2, 3 converge and yield the same results as greedy autoregressive decoding in at most m parallel iterations, for any initialization and provided stopping condition (5) is used.

We refer the reader to Song et al. (2021b) for a formal proof. Intuitively, within m steps the algorithm has used the same number of iterations as autoregressive decoding, hence the final solution is the same regardless of the initialization. In this worst case the wall-clock time is the same, but in general the algorithm reaches the stopping condition earlier, with a lower wall-clock time and an overall speedup.

3.5 DDGviz

Equation 1 models the dependency between tokens in the decoding phase. In the classical autoregressive mode, each token depends on all the previous ones for its generation. However, it is possible to show that this dependency is actually relaxed (i.e., not all tokens depend on all the previous ones), thus it is interesting to visualize the actual distribution pθ(yi | ·, x) learned by an existing MT model. To this end, we build the Decoding Dependency Graph visualizer (DDGviz) to visualize the dependency graph of tokens in the decoding phase. In standard autoregressive decoding this graph is a fully-connected chain where the i-th token is connected to all the previous tokens, starting from the encoding x: to decode yi you need to first decode y1, . . . , yi−1. Instead, we show that there are skipping connections between independent tokens that can be visualized with DDGviz. We detail DDGviz with an example in Section 4.3.

4 Experiments

4.1 Experimental Settings

Datasets. We evaluate our approach using standard evaluation datasets proposed for parallel MT (Gu et al., 2018): WMT14 English-German [En-De] and WMT16 English-Romanian [En-Ro] (Bojar et al., 2014, 2016). Additionally, we tested our method on different language pairs with varying (low-medium) resources: IWSLT15 (English-Vietnamese [En-Vi]) (Tran et al., 2015), IITB (English-Hindi [En-Hi]) (Kunchukuttan et al., 2018), WMT17 (English-Finnish [En-Fi]) (Bojar et al., 2017), and FLORES-101 (English-Italian [En-It]; English-French [En-Fr]) (Goyal et al., 2022). All the datasets are evaluated in both directions.

Evaluation. All the evaluations are performed using the official test split for each dataset, downloaded using the Huggingface dataset library (Lhoest et al., 2021). No training or hyperparameter tuning is performed. We use SacreBLEU to evaluate translation quality (Papineni et al., 2002; Post, 2018). We measure speedup in wall-clock time and iterations w.r.t. the same autoregressive model. GPU times are measured after calling torch.cuda.synchronize(). All the experiments were performed by caching the past Keys and Values of the transformer to further speed up the computation (Ramachandran et al., 2017), and in the online inference setting with batch size equal to 1. For the Jacobi and GS-Jacobi algorithms, we assume the length m of the target to be known beforehand and measure the speedup in this ideal condition. For the Hybrid GS-Jacobi algorithm, we set h to the maximum (i.e., the stopping condition is triggered within a parallel block) to decouple the effective speedup from the length produced by the initialization function (see Section 3.2). We remark that HGJ does not assume to know the target length beforehand and is applicable to real MT translation scenarios.
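As a concrete reference for this evaluation setup, the snippet below loads an official test split with the Huggingface dataset library and scores hypotheses with SacreBLEU. The dataset configuration name and the translate callable are illustrative assumptions rather than the exact evaluation script.

```python
# Minimal sketch of the evaluation loop: official test split + SacreBLEU.
# The dataset config and the `translate` callable are illustrative assumptions.
from datasets import load_dataset
import sacrebleu

def evaluate(translate, src_lang="en", tgt_lang="de"):
    test = load_dataset("wmt14", "de-en", split="test")     # official WMT14 test split
    sources = [ex["translation"][src_lang] for ex in test]
    references = [ex["translation"][tgt_lang] for ex in test]
    hypotheses = [translate(s) for s in sources]             # e.g., jacobi_decode(...)
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    print(f"BLEU = {bleu.score:.2f}")
```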
Method | Requirements (Arch / Loss / seq-KD) | WMT14 (Speed ↑ / BLEU ↑) | Efficiency (Train FLOPs ↓ / Total FLOPs ↓ / FLOPs per Speed ↓)
Parallel Decoding - HGJ (Ours) | No / No / No | 1.34× / 28.24 | 0 / 2.53e+13 / 1.89e+13
SUNDAE † (Savinov et al., 2022) | Yes / No / No | 1.4× / 28.46 | 5.27e+21 / 5.27e+21 / 3.77e+21
ShallowDec (12-1) (Kasai et al., 2021) | Yes / No / No | 1.4× / 26.90 | 1.02e+19 / 1.02e+19 / 7.30e+18
Semi-NAT (Wang et al., 2018) | Yes / No / Yes | 1.5× / 26.90 | 1.55e+17 / 1.55e+17 / 1.03e+17
DisCo (Kasai et al., 2020) | Yes / Yes / Yes, Big | 3.5× / 27.34 | 4.06e+19 / 4.06e+19 / 1.16e+19
DSLP (Huang et al., 2021) | Yes / Yes / Yes | 14.8× / 27.02 | 1.93e+19 / 1.93e+19 / 1.31e+18
F-VAE (Gu and Kong, 2021) | Yes / Yes / Yes, Big | 16.5× / 27.49 | 4.06e+19 / 4.06e+19 / 2.46e+18

Table 3: Comparison of different methods for parallel MT on WMT14 En-De. Results are ordered by speed; the two highest BLEU scores are highlighted in green; † indicates diffusion models. Existing methods require training, architecture modifications, additional losses to force parallel translation, and distillation from an additional MT transformer model ("Big" indicates the size). Details on the FLOPs computation are available in Appendix C.
Model Configuration. We tested transformer models in the two standard configurations: base (512 model dimension, 6 attention layers for both encoder and decoder) and big (1024 model dimension, 12 attention layers for both encoder and decoder). We used pretrained models of Opus (Tiedemann and Thottingal, 2020) for the former and MBart50 (Tang et al., 2020) for the latter. Opus is a transformer base model (74M parameters) trained on language pairs from the homonymous dataset (Zhang et al., 2020). MBart50 is a large multilingual transformer model fine-tuned for translation on 50 languages (610M parameters). We tested the models on CPU, since this is the default environment for MT models in production, except for MBart50, which runs on GPU. We run the experiments on a standard 16-core machine, except for the scaling experiments. Additional specifications are available in Appendix B.

4.2 Algorithms Comparison

In Table 1 we compare the proposed parallel decoding algorithms with the standard sequential autoregressive decoding baselines. As we can observe, the fastest algorithms are PGJ Decoding (b=3) and HGJ Decoding (b=3), which are up to 34% and 38% faster on Opus and up to 5% and 8% faster on MBart50, depending on the language pair. We also note that the results empirically show that all the parallel decoding algorithms guarantee the same quality as greedy autoregressive decoding, as evidenced by the unchanged BLEU scores. This is an experimental verification of the formal Proposition 1. The table also shows that the Beam Search algorithm with a beam size of 5 generally performs better in terms of BLEU score, although at a cost in speed. This difference in BLEU is expected, as beam search is a heuristic search strategy, while our method is a decoding algorithm. We discuss this aspect further in the "Beam Search" paragraph. Nevertheless, beam search is ∼30% slower than greedy autoregressive decoding and 63% to 68% slower than PGJ, depending on the model and language pair. This means that the proposed parallel algorithms allow trading a little translation quality (e.g., on en→ro the difference in BLEU between beam search and the parallel decoding algorithms is just 0.20 points) for greater decoding speed.

Another aspect to note is that the algorithms PJ and PGJ (b=5) are sometimes slower than greedy autoregressive decoding. There are several factors that can influence the actual wall-clock time, such as how the underlying hardware schedules and executes the various operations, which may vary according to the architecture and the workload. In particular, longer sequences (e.g., the whole sentence in PJ or blocks of 5 tokens in PGJ) may require more memory to store, and the CPU/GPU may have to perform more memory accesses, which can slow down the computation (although theoretically it should happen in parallel). In the end, these computational overheads slow down the actual execution. This is also the case for the difference in speedups between MBart50 and Opus. We investigate this aspect further in the "Computational Scaling" section and report in the appendix results on a different architecture, together with results in terms of iteration speedups, which are architecture agnostic.
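Because the wall-clock comparisons above are sensitive to asynchronous GPU execution, a timing is meaningful only if the device is synchronized before reading the clock, as mentioned in the evaluation setup. The helper below is a minimal sketch of such a measurement; the helper name timed, the use of time.perf_counter, and the commented usage with hypothetical greedy_decode/jacobi_decode calls are our choices, not the paper's benchmarking code.

```python
import time
import torch

def timed(fn, *args, **kwargs):
    """Wall-clock a decoding call, synchronizing the GPU before and after."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start

# Speedup of a parallel algorithm w.r.t. the greedy autoregressive baseline:
# _, t_greedy = timed(greedy_decode, model, tok, sentence)
# _, t_pgj = timed(jacobi_decode, model, tok, sentence, block_size=3)
# print(f"speedup: {t_greedy / t_pgj:.2f}x")
```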
Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhattacharyya. 2018. The IIT Bombay English-Hindi parallel corpus. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2022. Fast inference from transformers via speculative decoding.

Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario Šaško, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, Angelina McMillan-Major, Philipp Schmid, Sylvain Gugger, Clément Delangue, Théo Matussière, Lysandre Debut, Stas Bekman, Pierric Cistac, Thibault Goehringer, Victor Mustar, François Lagunas, Alexander Rush, and Thomas Wolf. 2021. Datasets: A community library for natural language processing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 175–184, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Xiaodong Liu, Kevin Duh, Liyuan Liu, and Jianfeng Gao. 2020. Very deep transformers for neural machine translation. arXiv preprint arXiv:2008.07772.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

J.M. Ortega and W.C. Rheinboldt. 1970. Iterative Solution of Nonlinear Equations in Several Variables. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Belgium, Brussels. Association for Computational Linguistics.

Emilian Postolache, Giorgio Mariani, Michele Mancusi, Andrea Santilli, Cosmo Luca, Emanuele Rodola, et al. 2023. Latent autoregressive source separation. In Proceedings of the AAAI Conference on Artificial Intelligence.

Lihua Qian, Hao Zhou, Yu Bao, Mingxuan Wang, Lin Qiu, Weinan Zhang, Yong Yu, and Lei Li. 2021. Glancing transformer for non-autoregressive neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1993–2003, Online. Association for Computational Linguistics.

Prajit Ramachandran, Tom Le Paine, Pooya Khorrami, Mohammad Babaeizadeh, Shiyu Chang, Yang Zhang, Mark A. Hasegawa-Johnson, Roy H. Campbell, and Thomas S. Huang. 2017. Fast generation for convolutional autoregressive models. CoRR, abs/1704.06001.

Raj Reddy. 1977. Speech understanding systems: A summary of results of the five-year research effort. Carnegie Mellon University.

Yousef Saad. 2003. Iterative Methods for Sparse Linear Systems. SIAM.

Chitwan Saharia, William Chan, Saurabh Saxena, and Mohammad Norouzi. 2020. Non-autoregressive machine translation with latent alignments. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1098–1108, Online. Association for Computational Linguistics.

Nikolay Savinov, Junyoung Chung, Mikolaj Binkowski, Erich Elsen, and Aaron van den Oord. 2022. Step-unrolled denoising autoencoders for text generation. In International Conference on Learning Representations.

Beatrice Savoldi, Marco Gaido, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2021. Gender bias in machine translation. Transactions of the Association for Computational Linguistics, 9:845–874.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2012, Kyoto, Japan, March 25-30, 2012, pages 5149–5152. IEEE.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E Gonzalez, et al. 2023. High-throughput generative inference of large language models with a single gpu. arXiv preprint arXiv:2303.06865.

Jongyoon Song, Sungwon Kim, and Sungroh Yoon. 2021a. AligNART: Non-autoregressive neural machine translation by jointly learning to estimate alignment and translate. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1–14, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Yang Song, Chenlin Meng, Renjie Liao, and Stefano Ermon. 2021b. Accelerating feedforward computation via parallel nonlinear equation solving. In International Conference on Machine Learning, pages 9791–9800. PMLR.

Zhenqiao Song, Hao Zhou, Lihua Qian, Jingjing Xu, Shanbo Cheng, Mingxuan Wang, and Lei Li. 2022. switch-GLAT: Multilingual parallel machine translation via code-switch decoder. In International Conference on Learning Representations.

Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. 2018. Blockwise parallel decoding for deep autoregressive models. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, Florence, Italy. Association for Computational Linguistics.

Xin Sun, Tao Ge, Furu Wei, and Houfeng Wang. 2021. Instantaneous grammatical error correction with shallow aggressive decoding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5937–5947, Online. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27.

Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2020. Multilingual translation with extensible multilingual pretraining and finetuning. CoRR, abs/2008.00401.

Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT — Building open translation services for the World. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT), Lisbon, Portugal.

Viet Hong Tran, Huyen Vu Thong, Nguyen Van-Vinh, and Trung Le Tien. 2015. The English-Vietnamese machine translation system for IWSLT 2015. In Proceedings of the 12th International Workshop on Spoken Language Translation: Evaluation Campaign, pages 80–83, Da Nang, Vietnam.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Chunqi Wang, Ji Zhang, and Haiqing Chen. 2018. Semi-autoregressive neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 479–488, Brussels, Belgium. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Heming Xia, Tao Ge, Furu Wei, and Zhifang Sui. 2022. Lossless speedup of autoregressive translation with generalized aggressive decoding.

Yisheng Xiao, Lijun Wu, Junliang Guo, Juntao Li, Min Zhang, Tao Qin, and Tie-yan Liu. 2022. A survey on non-autoregressive generation for neural machine translation and beyond. arXiv preprint arXiv:2204.09269.

Biao Zhang, Philip Williams, Ivan Titov, and Rico Sennrich. 2020. Improving massively multilingual neural machine translation and zero-shot translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1628–1639, Online. Association for Computational Linguistics.
A Algorithms details

We propose here the pseudocode of Algorithms 2 and 3, omitted from the main body of the paper due to space limitations.

The function copy(yi:i+b) creates a copy of the input tensor detached from the source. This is done in practice to avoid overwriting pointers to the same memory location. Function CHECKEOS(yi:i+b) returns the index of the token EOS in the block if present, else −1. Function ISEOS(yi) returns True if the token is exactly the token EOS, else False. The function arg max selects from the model distribution over the vocabulary the index (token) with maximum probability. This procedure is done for all the tokens in parallel, in the case of parallel decoding, or for just a single token in the case of autoregressive decoding. Generally, the output is the prediction for the next token; hence it should be shifted left before the reassignment to a variable. We omitted this implementation detail for clarity.

Algorithm 2 Parallel GS-Jacobi Decoding
Input: x = (x1, . . . , xn), pθ, b
Output: y = (y1, . . . , ym)
 1: y ← INITT(x)
 2: m ← len(y)
 3: i ← 1
 4: while i ≤ m do
 5:   o ← copy(yi:i+b)
 6:   yi:i+b ← arg max(pθ(yi:i+b | y1:i+b, x))
 7:   stop ← STOPC(o, yi:i+b)
 8:   if stop then
 9:     i ← i + b
10:     break
11:   end if
12: end while
13: return y

B Additional implementation details

We run the Opus experiments of Table 1 on an AMD EPYC Milan with 16 cores at 2.45 GHz and 64GB of RAM (accessible on Google Cloud as c2d-standard-16). For the scalability experiment in Figure 3, we also used Google Cloud instances with an increasing number of cores (referred to as c2d-standard-XX, where XX is the number of used cores). Experiments with MBart50 in Tables 1, 2 and 6 are performed on a desktop machine with Ubuntu 20.04.4 LTS, an AMD Ryzen 9 3900X 12-Core Processor, 32GB of RAM, and a Palit Nvidia 3090 GPU. Additional experiments with Opus in Table 6 are also performed on this machine. Models are implemented in Pytorch 1.11.0 (Paszke et al., 2019) and the Huggingface Transformers library (Wolf et al., 2020). We used python 3.8 and NVIDIA-SMI drivers 510.73.05 with CUDA version 11.6. For Opus we used the Huggingface models available on the hub under the tag Helsinki-NLP/opus-mt-{src}-{tgt}, except for the language pair Ro-En, where we used the model Helsinki-NLP/opus-mt-roa-en, and the pair En-De, where we used the checkpoint opus-2021-02-22⁴. For the model MBart50, we used the facebook pretrained model available on the hub with the tag mbart-large-50-many-to-many-mmt. Since this is a multilingual model, we prepend the source and target language tags corresponding to the language pair to be translated. We report results for a single run over the test dataset, since we found low variance in the estimates over multiple runs, which can be run by simply varying the corresponding parameter in the config.yaml file. For each dataset, we used the official test split via the Huggingface dataset library (Lhoest et al., 2021). Dataset statistics are reported in Table 4.

⁴ https://object.pouta.csc.fi/Tatoeba-MT-models/eng-deu/opus-2021-02-22.zip

Dataset | # Test
WMT14 De-En (Bojar et al., 2014) | 3003
WMT16 Ro-En (Bojar et al., 2016) | 1999
WMT17 Fi-En (Bojar et al., 2017) | 3002
IWSLT15 En-Vi (Tran et al., 2015) | 1046
IITB En-Hi (Kunchukuttan et al., 2018) | 2507
FLORES-101 En-It (Goyal et al., 2022) | 1012
FLORES-101 En-Fr (Goyal et al., 2022) | 1012

Table 4: Dataset statistics.
C FLOPs calculation details

We measured computational complexity using floating point operations (FLOPs), which, as the name implies, counts the number of floating point operations performed by a model. This is a standard metric used in the literature to measure hardware-agnostic complexity.

⁵ https://github.com/google-research/electra/blob/master/flops_computation.py

Algorithm 3 Hybrid GS-Jacobi Decoding
Input: x = (x1, . . . , xn), pθ, b
Output: y = (y1, . . . , ym)
 1: y ← INITT(x)
 2: h ← len(y)
 3: i ← 1
 4: eos_cond ← False
 5: while i ≤ h do
 6:   o ← copy(yi:i+b)
 7:   yi:i+b ← arg max(pθ(yi:i+b | y1:i+b, x))
 8:   stop ← STOPC(o, yi:i+b)
 9:   eos_ind ← CHECKEOS(yi:i+b)
10:   if stop and eos_ind > −1 then
11:     y ← y1:eos_ind
12:     eos_cond ← True
13:     break
14:   end if
15:   if stop then
16:     i ← i + b
17:     break
18:   end if
19: end while
20: while eos_cond != True do
21:   yi ← arg max(pθ(yi | yi−1, x))
22:   i ← i + 1
23:   eos_cond ← ISEOS(yi)
24: end while
25: return y

Model | Train FLOPs | Infer. FLOPs | Total FLOPs
Semi-NAT | 1.55e17 | 2.08e13 | 1.55e17
Shallow Dec. | 1.02e19 | 1.15e13 | 1.02e19
DSLP | 1.93e19 | 1.58e13 | 1.93e19
F-VAE | 4.06e19 | 1.58e13 | 4.06e19
DisCo | 4.06e19 | 1.58e13 | 4.06e19
SUNDAE | 5.27e21 | 1.58e14 | 5.27e21
BERT base | 6.43e19 | - | -
BERT large | 1.92e20 | - | -
RoBERTa | 3.19e21 | - | -

Table 5: FLOPs comparison with other models.

D Additional results

We propose here additional results for the experiments in the paper that were omitted due to space limitations. Table 6 shows the same experiments as Table 1 in the main paper, proposed here on a standard desktop CPU, with the speedup also reported in terms of iterations. It is possible to observe that in the case of MBart50 and PGJ there is a speedup of 8–11% in terms of iterations, compared to a time speedup of 3–8%. This means that there is room for improvement for our algorithm. Furthermore, the results show that the time speedups are consistent also on standard desktop hardware. Table 7 shows the BLEU scores for the cross-lingual experiment. It is possible to observe that the parallel decoding algorithms guarantee quality compared to greedy autoregressive decoding and are not so distant from beam search. We also show here in Table 5 some qualitative results for the experiments in Table 2. Finally, we propose additional visualizations using DDGviz in Figure 6.
Decoding Algorithm | en→de (Time / Iters) | de→en (Time / Iters) | en→ro (Time / Iters) | ro→en (Time / Iters)

Opus
Greedy Autoregressive | 1.00× / 1.00× | 1.00× / 1.00× | 1.00× / 1.00× | 1.00× / 1.00×
Beam Search (beam = 5) | 0.71× / 1.00× | 0.71× / 1.00× | 0.70× / 1.00× | 0.72× / 1.00×
PJ Decoding | 0.72× / 1.03× | 0.74× / 1.04× | 0.69× / 1.04× | 0.67× / 1.03×
PGJ Decoding (b = 3) | 1.16× / 1.04× | 1.19× / 1.07× | 1.17× / 1.05× | 1.17× / 1.03×
HGJ Decoding (b = 3) | 1.16× / 1.04× | 1.19× / 1.06× | 1.17× / 1.05× | 1.17× / 1.03×

MBart50
Greedy Autoregressive | 1.00× / 1.00× | 1.00× / 1.00× | 1.00× / 1.00× | 1.00× / 1.00×
Beam Search (beam = 5) | 0.76× / 1.00× | 0.77× / 1.00× | 0.77× / 1.00× | 0.76× / 1.00×
PJ Decoding | 0.88× / 1.03× | 0.88× / 1.03× | 0.86× / 1.04× | 0.85× / 1.03×
PGJ Decoding (b = 3) | 1.06× / 1.10× | 1.08× / 1.11× | 1.03× / 1.08× | 1.04× / 1.11×
HGJ Decoding (b = 3) | 1.05× / 1.07× | 1.07× / 1.01× | 1.01× / 1.02× | 1.02× / 1.08×

Table 6: Comparison of parallel decoding algorithms (highlighted in grey) with sequential decoding using Opus (CPU) and MBart50 (GPU) on WMT14 and WMT16. Speed is shown here both in Time and Iterations w.r.t. the greedy autoregressive baseline.
Table 7: Translation examples generated with the autoregressive (A) and the different decoding algorithms proposed (PJ, PGJ, HGJ) on Opus (WMT datasets) and MBart50. The decoding time is shown in seconds.

(a) En-De: "Lack of Scots title race bores Dutch - de Boer" → "Fehlende Schottentitelrennen bohrt Niederlandisch - de Boer"

(b) De-En: "Private Fachgeschäfte und auch den Großhandel gibt es fast nicht mehr." → "Private specialist shops and wholesale trade are almost no longer available."

(c) Ro-En: "Un prim contract de lucrări a fost reziliat în aprilie 2012, după ce se efectuaseră lucrări de 4,5 milioane lei." → "A first contract of employment was terminated in April 2012, after a work of 4.5 million lei."

(d) En-Ro: "'Shot in Joburg': Homeless youth trained as photographers" → "'Fotografii in Joburg': Tineri fără adăpost formaţi ca fotografi"

(e) De-En: "Einige sind nach der Installation auf Probleme gestoßen, da sie eine Fehlermeldung erhalten, die mitteilt, dass die 'Software-Aktualisierung fehlgeschlagen' ist." → "Some have encountered problems after installation, as they receive an error message that tells us that 'software update has failed'."

(f) Ro-En: "Se pare că va fi acuzat de fugă de la locul accidentului, neoferirea primului ajutor și alte infracțiuni rutiere." → "Apparently he'll be charged with running from the scene of the accident, the first aid and other road crimes."