
Received 27 November 2023, accepted 16 December 2023, date of publication 25 December 2023, date of current version 3 January 2024.


Digital Object Identifier 10.1109/ACCESS.2023.3346952

BERT-NAR-BERT: A Non-Autoregressive
Pre-Trained Sequence-to-Sequence Model
Leveraging BERT Checkpoints
MOHAMMAD GOLAM SOHRAB1, MASAKI ASADA1, MATĪSS RIKTERS1, AND MAKOTO MIWA1,2
1 Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology, Tokyo 135-0064, Japan
2 Toyota Technological Institute, Nagoya 468-8511, Japan
Corresponding author: Mohammad Golam Sohrab ([email protected])
This work was supported by the New Energy and Industrial Technology Development Organization (NEDO) under Project JPNP20006.

ABSTRACT We introduce BERT-NAR-BERT (BnB), a pre-trained non-autoregressive sequence-to-sequence model, which employs BERT as the backbone of the encoder and decoder for natural language understanding and generation tasks. During pre-training and fine-tuning with BERT-NAR-BERT, two challenging aspects are considered by adopting the length classification and connectionist temporal classification models to control the output length of BnB. We evaluate it using a standard natural language understanding benchmark, GLUE, and three generation tasks: abstractive summarization, question generation, and machine translation. Our results show substantial improvements in inference speed (on average 10x faster) with only a small loss in output quality compared to our direct autoregressive baseline, the BERT2BERT model. Our code is publicly released on GitHub (https://github.com/aistairc/BERT-NAR-BERT) under the Apache 2.0 License.

INDEX TERMS Abstractive summarisation, language understanding, machine translation, natural language
processing, non-autoregressive modelling.

I. INTRODUCTION
Sequence-to-sequence (S2S) models have recently been widely used for natural language processing problems. The S2S architecture was first introduced in the field of machine translation (MT) [1] and later used for pre-trained generative language models (LM), such as BART [2], T5 [3], Optimus [4], and BERT2BERT [5]. These models usually adopt an autoregressive (AR) decoding strategy to generate texts from left to right, token by token. AR decoding can perform high-quality inference, but it has the limitation that it cannot decode tokens in parallel and requires more time and computational cost for inference. Given the growing demand for real-time applications of S2S models, it is important to investigate approaches that can speed up inference.

For fast inference, non-autoregressive (NAR) decoding has been widely studied since it can predict target tokens in all positions simultaneously and independently during inference [6]. NAR decoding thus increases the inference speed compared to AR decoding, but it can lead to degradation of generation performance. Several S2S-based NAR methods [7], [8] have been proposed to approach the performance of AR methods.

In AR models, Rothe et al. [5] proposed BERT2BERT (B2B), which leveraged publicly available Pre-trained Language Model (PLM) parameter checkpoints such as BERT [9] and GPT-2 [10] and showed state-of-the-art performance on several generation tasks by utilizing released parameters for an S2S AR model without a new large-scale pre-training. In NAR models, however, existing studies performed pre-training from scratch, and the effectiveness of using PLM checkpoints for NAR decoding has not been investigated.

In this paper, we propose a novel S2S NAR model based on existing Transformer [11] architectures to allow parameter initialization from publicly available PLM checkpoints. Specifically, we extend the B2B model to build a NAR S2S model using BERT as the backbone for both the encoder and decoder models. The NAR modeling allows fast decoding and generation of longer texts. Our model architecture is similar to that of BioNART [12], which is targeted towards the biomedical domain.

Using BERT as the backbone, we can start training with reliable parameters by loading the pre-trained BERT checkpoints. In addition, unlike the B2B model, we perform one epoch of additional pre-training starting from the BERT checkpoints and investigate the effectiveness of the additional pre-training.1 We adopt the Length Classification (LC) [13] or Connectionist Temporal Classification (CTC) [14] models to control the output length of BnB. We call our NAR model BERT-NAR-BERT, or BnB, where the NAR (n) part specifies that the model is non-autoregressive and BERT on each side of '-' represents that the model is sequence-to-sequence.

1 In this paper, we use the term additional pre-training for pre-training our BnB model. We load the pre-trained PLM checkpoints and do not pre-train the PLMs from scratch.

We fine-tune and evaluate the BnB model on a standard natural language understanding benchmark, GLUE, and three generation tasks: abstractive summarization, question generation, and machine translation. We compare its performance with several AR models and recent state-of-the-art NAR models.

Our contributions are summarized as follows:
• We propose a novel NAR S2S method with encoders and decoders, each compatible with publicly available PLMs, with additional pre-training and modeling of the output length.
• Our S2S BnB model is 10.7x faster than the AR counterpart, i.e., the direct baseline BERT2BERT model. On average, our model allows 17x faster decoding than autoregressive models and 7.7x faster decoding than semi-autoregressive models.
• We show that leveraging BERT checkpoints for our NAR model is effective in improving performance. It improves the Rouge-L scores by 1.9 and 4.8 points on the summarization and question generation tasks, respectively.

II. RELATED WORK
Pre-trained Language Models such as GPT-2 [10], XLNet [15], and XLM [16] are neural networks trained on large-scale datasets that can be fine-tuned on problem-specific data. They became widely adopted after BERT [9], which reported SOTA results for 11 NLP tasks. Our PLM BnB is also trained on a large-scale dataset and further fine-tuned on 12 task-specific NLP tasks, 10 of which overlap with BERT.

Li et al. [4] proposed the first large-scale variational autoencoder (VAE) language model, Optimus. They connect a BERT encoder and a GPT-2 decoder using a universal latent embedding space. The model is first pre-trained on a large text corpus and then fine-tuned for various language generation and understanding tasks. It achieves SOTA on VAE language modeling benchmarks. While the general idea of our work is similar, there are several core differences from this paper. Our model does not have a VAE, and instead of the GPT-2 decoder we use the same BERT as in the encoder. Optimus is also autoregressive, while ours is non-autoregressive. Besides, Optimus also reported that the inconsistency in models results in different tokenisation between input and output, as the model encoder is based on BERT and the decoder is GPT-2. In contrast, the input and output tokenisation become consistent by using the same BERT architecture in our model.

Rothe et al. [5] developed Transformer-based sequence-to-sequence models by describing several combinations of model initialization that include BERT2BERT, a BERT-initialized encoder paired with a BERT-initialized AR decoder. Our implementation of BnB is similar, except for the main differences of having a length prediction model, a latent representation from the encoder output layer, and a NAR decoder. The NAR decoder can decode tokens in parallel, which drastically reduces the inference computational cost.

NAR models have recently been investigated in NLP tasks due to their efficiency. For example, Gu et al. [6] introduced a NAR model for MT based on the Transformer [11]. This reduced latency by 90% and achieved competitive output quality with only a slight decrease in translation performance compared to similar-sized AR models. Gu and Kong [17] further minimized the gap with AR models, achieving SOTA performance on several MT benchmarks. Li et al. [18] introduced ELMER by leveraging the early exit technique, which enables token generation at different layers: if there is sufficient confidence to generate a token at a lower layer, the model is allowed to exit and make the prediction without feeding the token to upper layers. Experimental results show that ELMER outperforms other NAR models and narrows the performance gap with AR models. The major limitation of ELMER is that the model needs to define an appropriate early exit strategy. During pre-training, the model utilizes layer permutation language modeling, but in fine-tuning the model needs to design a different early-exit strategy to decide at which layer the model will exit. During pre-training and fine-tuning, our model follows the general architecture where early exit is not an option but the token is passed through all layers in the BnB.

Sun et al. [19] proposed a grafting approach, where they stack multiple sets of BERT encoder layers and GPT decoder layers that follow the autoregressive manner, and freeze selections of encoder or decoder parameters during fine-tuning. This approach leads to strong gains in output quality; however, they do not compare model sizes, training time, or inference time directly with previous work. While freezing some parameters during fine-tuning can optimize resource usage at that time, during inference the model would still be twice as slow due to double the parameter size. Compared to this work, our model follows non-autoregressive generation, which opposes the autoregressive manner, and maintains strong output quality while vastly improving inference speed.

III. BERT-NAR-BERT FRAMEWORK
In this section, we propose BERT-NAR-BERT (BnB), a non-autoregressive S2S model. The overview of BnB is shown in Figure 1. We first briefly introduce the BERT2BERT model and then describe the model architecture of BnB, how we leverage BERT checkpoints for our NAR model, and our additional pre-training scheme for NAR language modeling.


FIGURE 1. The S2S BERT-NAR-BERT (BnB) architecture. The + sign to the right of the encoder box indicates that the input embeddings are the sum of the token, position, and type embeddings, while for the decoder box it indicates the sum of the position, type, and latent embeddings.

A. BERT2BERT MODEL
The BERT2BERT (B2B) [5] model is an autoregressive S2S model using PLM checkpoints. All weights of the encoder and decoder are initialized from the public BERT checkpoints except the cross-attention, which is randomly initialized. Since the model decodes in an autoregressive manner, a task-specific beam size is set during generation. Our model extends this model to a NAR model with additional pre-training and modeling of the output length. We consider B2B as our direct AR baseline model.

B. MODEL ARCHITECTURE
The BnB model is composed of a multi-layer Transformer-based encoder and decoder, in which the embedding layer and the stack of transformer layers are initialized with BERT [9]. To leverage the expressive power of existing pre-trained BERT models, we initialize our encoder and decoder parts with the pre-trained BERT parameters. We denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A.

1) BNB ENCODER
The encoder part of BnB has the same architecture as the BERT model.2 The BnB model first feeds the source input sequence X = x_1, . . . , x_n to the BnB encoder layer. In the BnB embedding layers, the input representation is constructed by summing the corresponding token (X), position (P), and type (T) embeddings. The embeddings are fed into the BERT self-attention and feed-forward layers. The hidden representation of the final layer, h, is passed to the subsequent layer for obtaining latent representations.

2 We share the same encoder for both BERT2BERT and BERT-NAR-BERT, whereas we modify the decoder part by leveraging the latent representations and length classification for NAR generation.

2) LATENT REPRESENTATIONS
We construct the latent representation z = W_E h + b based on the token-level representation from the encoder hidden state h, where z ∈ R^P is a P-dimensional vector and W_E ∈ R^{P×H} is the weight matrix. A visualization of the BERT-NAR-BERT encoder and the latent representations can be seen in the left part of Figure 1.
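As a rough illustration, this token-level projection amounts to a single linear layer applied to every encoder position. The following is a minimal sketch, assuming the BERT-base hidden size (H = 768) and the latent size of 8 used in our experiments; the class name is illustrative and not taken from the released implementation.

```python
import torch
import torch.nn as nn

class LatentProjection(nn.Module):
    """Token-wise projection z = W_E h + b from hidden size H to latent size P."""
    def __init__(self, hidden_size: int = 768, latent_size: int = 8):
        super().__init__()
        self.proj = nn.Linear(hidden_size, latent_size)  # W_E in R^{P x H}, bias b in R^P

    def forward(self, encoder_hidden: torch.Tensor) -> torch.Tensor:
        # encoder_hidden: (batch, src_len, H) -> z: (batch, src_len, P)
        return self.proj(encoder_hidden)

# Example with a batch of 2 sequences of 16 tokens each
z = LatentProjection()(torch.randn(2, 16, 768))
print(z.shape)  # torch.Size([2, 16, 8])
```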
3) BNB DECODER
The decoder part is also based on the BERT architecture, and we can directly initialize the decoder with the pre-trained BERT model. Following our direct baseline BERT2BERT model, the cross-attention mechanism is adopted, and the encoder hidden representation of the final layer h is used for cross-attention. Our model differs from the BERT2BERT model in the input representation and attention masks to enable NAR decoding. In AR decoding, all target tokens are fed into the decoder with customized attention masks that prevent the decoder from seeing future tokens during training. Then, in inference, the predicted token is fed to the decoder autoregressively. In our BnB decoder, the input representation is constructed without providing any target tokens.


The input representation is constructed by summing the corresponding position (P) and type (T) embeddings and the latent embedding z from the encoder. The attention masks are normal masks that give access to all future tokens. The resulting decoder output representations of the final layer are fed to the subsequent generation layer. A visualization of the BERT-NAR-BERT decoder can be seen in the right part of Figure 1. Figure 2 shows the architectural difference between the decoder of our non-autoregressive model and that of the direct autoregressive baseline.

FIGURE 2. The decoder architecture of the non-autoregressive (NAR) BERT-NAR-BERT (BnB) model vs. our direct baseline autoregressive (AR) BERT2BERT (B2B) model. Inference speedup is computed both on CPU and GPU over the 2,000 sentences from the WMT16 test data.

C. ADDITIONAL PRE-TRAINING
This section describes the two unsupervised pre-training objectives of BERT-NAR-BERT. Figure 3 shows the pre-training objectives of BnB.

FIGURE 3. Masked and permutation LM objective functions of BnB.

1) MASKED LANGUAGE MODELING
This is the self-supervised learning strategy adopted in BERT. We randomly mask the tokens in each sequence, and all masked tokens are predicted in a non-autoregressive manner.

2) PERMUTATION LANGUAGE MODELING
We randomly permute tokens in a sequence and predict the original order of the tokens, which is inspired by the idea in XLNet [15].
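The two objectives can be illustrated with the simple corruption functions below. This is only a sketch of one possible reading of the masking and permutation probabilities reported in Section IV (15% and 50%, respectively); the helper names are hypothetical and not taken from the released code.

```python
import random

def mask_tokens(token_ids, mask_id, prob=0.15):
    """Masked LM corruption: replace each token with [MASK] with probability `prob`;
    all masked positions are then predicted non-autoregressively."""
    return [mask_id if random.random() < prob else t for t in token_ids]

def permute_tokens(token_ids, prob=0.5):
    """Permutation LM corruption: with probability `prob`, shuffle the sequence;
    the model is trained to recover the original token order."""
    if random.random() < prob:
        token_ids = token_ids[:]          # copy so the caller's list is untouched
        random.shuffle(token_ids)
    return token_ids
```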
D. MODELING OUTPUT LENGTH
To control the generation output length of BnB, we implement two length prediction models, namely: 1) Length Classification [13], and 2) CTC [14], which implicitly determines the target length from the token alignment.

1) LENGTH CLASSIFICATION
The length classification formulates the length prediction as a classification task and utilizes the latent representations to predict the target length:

p_θ(y | z) = Σ_l p_θ(y, l | z) = p_θ(y, l_y | z) = p_θ(y | z, l_y) p_θ(l_y | z),   (1)

where l_y denotes the length of y, which is the gold length in training, and p_θ(l_y | z) is the length predictor that predicts the length of the target sentence y. Once the sentence length is predicted, the model predicts each output token with a token-level categorical cross-entropy loss.

2) CONNECTIONIST TEMPORAL CLASSIFICATION
In contrast to LC, we also adopt CTC [14], considering its superior performance and flexibility for latent alignment. The CTC loss is independently computed after the decoder by replacing the length classification and CE loss.
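The two length-prediction options can be sketched as follows. The module and hyper-parameter names here are illustrative assumptions (e.g., a maximum length bucket of 512 and BERT's vocabulary size), not the exact released implementation.

```python
import torch
import torch.nn as nn

class LengthClassifier(nn.Module):
    """LC: treat target-length prediction as classification over 1..max_len,
    using the mean-pooled latent representation z."""
    def __init__(self, latent_size: int = 8, max_len: int = 512):
        super().__init__()
        self.classifier = nn.Linear(latent_size, max_len)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, src_len, latent_size) -> length logits: (batch, max_len)
        return self.classifier(z.mean(dim=1))

# LC is trained with a cross-entropy loss against the gold length l_y,
# while CTC drops the explicit length model and aligns decoder outputs to the reference.
length_loss = nn.CrossEntropyLoss()
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

# CTC usage: log-probabilities over the vocabulary, shaped (T, batch, vocab)
log_probs = torch.randn(60, 2, 30522).log_softmax(dim=-1)   # decoder output length T = 60
targets = torch.randint(1, 30522, (2, 40))                   # reference tokens, length 40
loss = ctc_loss(log_probs, targets,
                input_lengths=torch.full((2,), 60),
                target_lengths=torch.full((2,), 40))
```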

E. POSSIBLE MODEL COMPONENT COMBINATIONS
Table 1 shows the possible model component combinations of the BERT-NAR-BERT model. Additional pre-training is performed with either of two different objectives: masked or permutation LM. For pre-training, BnB is initialized with random values or publicly available PLM checkpoints. For fine-tuning, model parameters are initialized with random values, public parameter checkpoints, or the resulting additionally pre-trained parameters. For both additional pre-training and fine-tuning, we choose LC or CTC for modeling the output lengths.

IV. EXPERIMENTS
A. PRE-TRAINING: DATA AND TASK SETTINGS
We pre-train our BERT-NAR-BERT, which we call additional pre-training, by loading the PLM checkpoints and training it as a sequence-to-sequence task.

1) ADDITIONAL PRE-TRAINING
The additional pre-training procedure follows the existing literature on PLM pre-training. The entire Wikipedia [20] data dump with the version 20220301.en from Huggingface datasets3 is used for pre-training without any data filtering scheme. As a result, 6,458,670 input texts that include multiple sequences are truncated with a maximum sequence length of 512, and 3.2B tokens are used for our additional pre-training.

3 https://huggingface.co/datasets/wikipedia
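The data preparation described above can be approximated with the Huggingface datasets library as in the sketch below; the exact preprocessing in the released code may differ, and the mapping function is only illustrative.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the 20220301.en English Wikipedia dump from Huggingface datasets
wiki = load_dataset("wikipedia", "20220301.en", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize(batch):
    # Use the document-level article text and truncate to BERT's 512-token limit
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = wiki.map(tokenize, batched=True, remove_columns=wiki.column_names)
print(len(tokenized))  # number of input texts available for additional pre-training
```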


TABLE 1. Possible model components for additional pre-training and fine-tuning on BERT-NAR-BERT. Bold font options are adopted as our default options.

B. FINE-TUNING: DATA AND TASK SETTINGS
For fine-tuning, we describe the data and task settings of the benchmark downstream tasks, including GLUE, abstractive summarization, question generation, and machine translation. We initialize all the downstream tasks with the additional pre-training checkpoints generated from BnB.

1) LANGUAGE UNDERSTANDING
We consider the General Language Understanding Evaluation (GLUE) benchmark [21], which consists of nine general language understanding tasks. We employ the Optimus evaluation script4 to evaluate the scores and select the best performances among different runs to report all the scores, following Optimus.

4 https://github.com/ChunyuanLI/Optimus

2) ABSTRACTIVE SUMMARIZATION
Abstractive text summarization aims to produce a short version of a document while preserving its salient information content. We evaluate the models based on the BBC extreme [22] (XSum) dataset. This is a news summarization dataset containing 227K news articles and single-sentence summary pairs. The evaluation metric is ROUGE [23], including ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L). We adopted the Google Research reimplementation of ROUGE.5

We also perform a latency comparison. We evaluate the time required to generate all the samples in the validation set with the same machine settings for our BnB and ELMER [18] and calculate the ratios of latency. We include the reported latency values of other existing models from ELMER.

5 https://github.com/google-research/google-research/tree/master/rouge

3) QUESTION GENERATION
SQuAD v1.1 [24] is a dataset created for machine reading comprehension. The dataset contains 98K triples of {passage, question, answer}. We use this data as a question generation dataset, in which a model receives an answer and a passage and generates the corresponding question. We follow the same train, validation, and test data split setting as Du et al. [25]. The evaluation metrics are ROUGE-L and BLEU-4 (B-4). We follow the same settings as the abstractive summarization task for the latency comparison.

4) MACHINE TRANSLATION
We evaluate our models using two popular benchmark data sets from the WMT shared tasks on news translation: English (EN)↔German (DE) data from WMT 2014 and English↔Romanian (RO) data from WMT 2016. We also experiment with distilled versions of these data sets, generated by vanilla transformer models [11] trained on the normal data. We load the WMT datasets from Huggingface datasets6,7 and use them directly to train the models without filtering, back-translation, or any other kinds of synthetic data generation. We evaluate the performance by computing BLEU [26] scores using sacreBLEU [27].

6 https://huggingface.co/datasets/wmt14
7 https://huggingface.co/datasets/wmt16
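A minimal sketch of the evaluation metrics mentioned above, using the Google Research ROUGE reimplementation and sacreBLEU (the example strings are purely illustrative):

```python
from rouge_score import rouge_scorer
import sacrebleu

# ROUGE-1/2/L for summarization and question generation
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score("the gold summary", "the generated summary")

# Corpus-level BLEU for machine translation via sacreBLEU
hypotheses = ["the cat sat on the mat"]
references = [["the cat sat on the mat"]]   # one reference stream
bleu = sacrebleu.corpus_bleu(hypotheses, references)

print(rouge["rougeL"].fmeasure, bleu.score)
```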
C. EXPERIMENTAL SETTINGS
Our implementation is based on the encoder-decoder model (i.e., EncoderDecoderModel) in Huggingface Transformers [28], in combination with their BERT model. We supplement the Huggingface Transformers framework with a non-autoregressive version of the encoder-decoder.

D. PRE-TRAINING BNB
For additional pre-training, we split the text from the text field and ignored the title field in the English Wikipedia. We consider a document-level rather than a shuffled sentence-level corpus by loading the data from Huggingface. During additional pre-training, we initialize BnB with bert-base-cased and update all the parameters of BnB's encoder and decoder using Wikipedia, where the checkpoints are saved from both encoder and decoder outputs. We set a 15% probability of masking for the masked LM and a 50% probability of permuting for the permutation LM.
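As a sketch of this starting point, both sides can be initialized from public BERT checkpoints with the Huggingface EncoderDecoderModel; the non-autoregressive decoder inputs and length prediction described in Section III are then added on top of this starting point and are not shown here.

```python
from transformers import BertTokenizerFast, EncoderDecoderModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

# Initialize both encoder and decoder from the public bert-base-cased checkpoint;
# the cross-attention weights are newly (randomly) initialized.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-cased", "bert-base-cased"
)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Total parameter count of the two BERT stacks plus the new cross-attention
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")
```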
E. FINE-TUNING BNB
For GLUE, following the fine-tuning settings in [4] and [9], we use the learning rates [2, 3, 4, 5] × 10^-5 with different seeds, following the Optimus setting. We train the model for three epochs on the MNLI, QQP, QNLI, and SST-2 datasets and ten epochs on the comparatively very small datasets: CoLA, STS-B, MRPC, and RTE.

For abstractive summarization, we set the number of training epochs to 100 and adopt early stopping. Our BnB is first initialized with the bert-base-cased pre-trained parameters, and all the parameters are fine-tuned on the labeled XSum data. Later, BnB is initialized with the additionally pre-trained parameters (bnb-base-cased) obtained with the masked and permutation LM objectives and fine-tuned in the same way.

For question generation, the input is formatted as {answer [SEP] passage}, and the model generates the question sequences.


TABLE 2. Comparison of OpenAI GPT, BERT, Optimus, and BnB on the validation set of GLUE. ACC and P.C stand for accuracy and Pearson correlation
coefficient, respectively. The OpenAI GPT scores are reported by Devlin et al. [9]. Bold and underlined scores denote the best and second-best results.

TABLE 3. Performance comparison on the XSum and SQuAD v1.1 datasets. R-1/2/L and B-4 stand for ROUGE-1/2/L and BLEU-4, respectively. Bold and
underlined scores denote the best and second-best results within NAR models. For the latency values of existing methods, we included the values reported
by Li et al. [18].

Our models, namely BnB initialized with bert-base-cased as the baseline model and BnB initialized with our additional pre-training parameters (bnb-base-cased), are compared against the top published works reported by Li et al. [18].

For machine translation, we compare the scores with autoregressive baselines: Transformer MT models and BERT2BERT, initialized with random parameters or pre-trained multilingual BERT. We compare random initialization of parameters with initialization from mBERT, as well as knowledge distillation [29], which has previously proven to be highly beneficial for NAR MT models [30].

V. RESULTS
We evaluate BnB on language understanding and generation tasks. We report the results of the language understanding and generation tasks using BnB with parallel decoding in comparison to other models in Tables 2, 3, and 4.

A. LANGUAGE UNDERSTANDING
To fine-tune the model on GLUE tasks, we apply our pre-trained BnB model to the nine downstream tasks. In Table 2, we compare our fully non-autoregressive BnB approach with the OpenAI GPT8 [31], encoder-based BERT, and VAE-based Optimus models. We cannot compare BERT-NAR-BERT with the BERT2BERT [5] model, as that work skips the language understanding task; moreover, the BERT2BERT pre-trained checkpoint is not publicly available, which further prevents us from conducting the language understanding tasks with the BERT2BERT model. In this table, we compare the BnB based on the permutation LM objective. Our best BnB model with the permutation LM objective function outperforms the OpenAI GPT [31] and BERT models on averaged scores, but shows comparable results with respect to Optimus. The Optimus model follows a sentence-level pre-training procedure, which allows training the model longer than document-level pre-training. The GLUE benchmark consists of nine datasets; because of the problematic nature of the WNLI dataset, the BERT [9] literature ignores this dataset. We evaluate the WNLI dataset and achieve 56.3% accuracy, similar to Optimus. We also notice the problematic nature of WNLI, where the score remains unchanged even though the dataset is fine-tuned with different learning rates and seeds.

8 The first GPT version, since its model parameters are comparable with the proposed BnB model.

B. LANGUAGE GENERATION
For language generation based on BnB, we consider three tasks: 1) abstractive summarization, 2) question generation, and 3) machine translation.


1) SUMMARIZATION AND QUESTION GENERATION
Table 3 shows the performance comparison of BnB on the XSum and SQuAD v1.1 datasets for the summarization and question generation tasks, respectively. Our best model outperforms all the semi-autoregressive (semi-AR) and most of the non-autoregressive (NAR) approaches, and closely competes with the AR approaches. Under the NAR setting, we achieve second-best results on the XSum and SQuAD v1.1 datasets. Our model outperforms ELMER-Hard [18] on the XSum and SQuAD v1.1 datasets, whereas ELMER-Soft [18] retains the state-of-the-art (SOTA) results. ELMER-Hard and ELMER-Soft denote fine-tuning ELMER [18] with hard and soft early exit strategies, respectively. Unlike our BnB, which needs to determine the length by integrating an extra length prediction model, ELMER dynamically adjusts the output length by emitting an end token (i.e., [EOS]) at any position with early exit strategies. Our BnB could also integrate an early exit strategy by adopting the ELMER-Soft idea.

2) MACHINE TRANSLATION
Results of the machine translation experiments are summarized in Table 4. The results show that our model can compete with the baseline Transformer and B2B models9 after knowledge distillation. This is expected, as the BnBs are compared with autoregressive baseline translation models.

9 Transformer and B2B scores are from our implementation of the models using Huggingface Transformers. Rothe et al. [5] reported uncased BLEU scores on WMT14 of 30.1 for EN-DE and 32.7 for DE-EN.

TABLE 4. Machine translation experiment results in BLEU scores of training a vanilla Transformer model, BERT2BERT (B2B), and BERT-NAR-BERT (BnB) initialized with multilingual BERT (mBERT); trained on original and distilled (dist.) data from WMT 2014 (German) and WMT 2016 (Romanian). Bold and underlined scores denote the best and second-best results.

C. INFERENCE SPEEDUP
We also compare the differences in inference speed between the models. While training speed is mostly similar for all, BnB can generate output 17x faster on average due to its non-autoregressive nature (tested on NVIDIA V100 and A100 GPUs), as shown in the Latency column of Table 3. For machine translation, translating 2,000 sentences in the WMT16 test data set with BnB takes 87 seconds on a GPU and 587 seconds on a CPU. The same took 234 seconds on a GPU and 1,234 seconds on a CPU for the equivalent direct B2B baseline model. Following the BERT architecture, our BERT-NAR-BERT consists of 110M parameters each for the encoder and decoder, which includes 12 layers (L=12), a hidden size of 768 (H=768), and 12 self-attention heads (A=12). Our direct baseline BERT2BERT model follows the same parameters, 220M in total for the encoder and decoder models.
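The latency comparison can be reproduced with a simple timing loop such as the sketch below; this is illustrative only, as the paper does not publish its exact timing script.

```python
import time
import torch

@torch.no_grad()
def measure_latency(generate_fn, batches):
    """Time a model's generation function over a list of pre-tokenized batches."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()          # make sure pending GPU work is finished
    start = time.perf_counter()
    for batch in batches:
        generate_fn(batch)                # e.g., model.generate(**batch) for the AR baseline
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.perf_counter() - start    # wall-clock seconds for the whole test set
```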

VI. ABLATION STUDY
In this section, we conduct ablation studies to investigate the different objectives, hyper-parameters, and additional pre-training strategies of BnB. Generally, training language models with different objectives and hyper-parameters is computationally expensive; therefore, we show the differences in the performance of BnB to compare the settings in two possible scenarios: 1) judging the effect of additional pre-training with the masked and permutation LM objectives, and 2) selecting the better model from the different length prediction models.

TABLE 5. Comparison of the parameter initialization for BnB. Our baseline model BnB is initialized with bert-base-cased (bBERT), whereas the models with autoencoder (AE), masked, and permutation (perm.) LM objectives are initialized with our additional pre-training (AP), i.e., bnb-base-cased. The scores in subscript denote how much higher our models perform than our baseline models.

TABLE 6. Comparison of the length prediction mechanisms for NAR decoding between Length Classification (LC) and Connectionist Temporal Classification (CTC).

A. EFFECT OF ADDITIONAL PRE-TRAINING
Table 5 shows the effect of additional pre-training of BnB on the summarization dataset. The results show that BnBs with additional pre-training along with the masked and permutation LM objectives outperform our baseline models.

B. ABLATION ON LENGTH PREDICTION
Fully NAR models need a target sequence length prediction model to generate all the tokens simultaneously. We adopt the Length Classification and Connectionist Temporal Classification models mentioned in Section III to facilitate BnB. In Table 6, we fine-tune BnB with the LC and CTC mechanisms on the XSum dataset. Table 6 shows that the CTC-based model outperforms the LC-based model.

C. EFFECT OF FEW-SHOT LEARNING
We also conduct a few-shot experiment by selecting 1%, 10%, 30%, 50%, and 70% of the data of each task in the GLUE dataset.


FIGURE 4. Few-shot learning over the GLUE dataset. 1%, 10%, 30%, 50%, and 70% of the data are selected to conduct the experiments. CoLA, STS-B, MRPC, and RTE are relatively very small datasets in comparison to MNLI, QQP, QNLI, and SST-2.

As shown in Figure 4, our pre-trained BnB model shows competitive results even with only 1% of the data from each task in the GLUE dataset. As the data scale increases to 30%, the performance gap with 100% of the data becomes negligible.

VII. ERROR ANALYSIS ON GENERATION TASK
A. ABSTRACTIVE SUMMARIZATION
Table 7 shows randomly selected generated examples from the XSum dataset. In the first example, our BnB successfully generated a summary meaning that Yorkshire Diamonds have appointed Paul Grayson as their head coach, but our model produced incorrect modifiers such as ''Super League champions'' and ''former''.

In the second example, our BnB successfully generated the text ''A man has been arrested on suspicion of murder after a man was'' as in the gold reference, but the rest of the sentence contains grammatical errors, and the model wrongly generated the word ''Wolverhampton'', which is not mentioned in the source text. On the other hand, the AR decoding baseline BART generated a sentence with almost the same meaning as the gold reference.

B. MACHINE TRANSLATION
We randomly sampled several examples to superficially inspect the types of errors or differences in the output of our model compared to our baselines. In the evaluation, we were mainly looking for repeating patterns of differences in the translations between the models and strong differences with the reference.

Table 8 shows that occasionally the BLEU score suffers due to paraphrasing of the same concepts, e.g., ''rocket attacks'' instead of ''missile strikes''. In this case, ''rocket attacks'' is a direct literal translation of the German compound word ''Raketenangriffe'', but not the correct English term for the given context. Person names like ''Volodymyr Zelenskyy'' are also a common struggle for any MT model, especially if not commonly found in the training data. Location names like ''Kyiv'' may have other spellings, such as ''Kiev'' in this example, where the former is the romanized official Ukrainian name for the city, and the latter used to be the traditional English name for the city up to 2014, when many Western media outlets switched to using the romanized Ukrainian name after the outbreak of the Russo-Ukrainian War.

Table 9 shows an example of English to German translation with variations in how the words ''ticket checker'' and ''tickets'' are translated. Both models seem to have not chosen the correct terms for the context; however, the meaning remains generally the same.

The final example in Table 10 exhibits some of the most common problems that we noticed in BnB output during the evaluation. While both models translate the German word ''Blatt'' very literally into ''sheet'' or ''leaf'' instead of the figuratively correct translation ''magazine'', BnB seems to struggle with overall longer inputs and sometimes generates shortened or even cut-off translations. The example also exhibits the occasional hiccup of repeated words or phrases.


TABLE 7. An example of abstractive summarization output comparing the baseline BART model output to our BnB. The main differences between the model
outputs are underlined.

TABLE 8. An example of German to English translation output comparing the baseline Transformer model output to our BnB. The main differences between
the model outputs are underlined.

TABLE 9. An example of English to German translation output comparing the baseline Transformer model output to our BnB. The main differences between
the model outputs are underlined.

TABLE 10. An example of German to English translation output where the BnB model struggles with some concepts and generates shortened output. The
main differences between the model outputs are underlined.

VIII. DISCUSSION
We present BERT-NAR-BERT, a novel, simple, and easy-to-implement large-scale non-autoregressive pre-trained S2S model that leverages BERT as the backbone of both the encoder and decoder models. We demonstrate strong performance in language understanding on the nine GLUE tasks and on three language generation tasks, including summarization, question generation, and machine translation.

BnB allows much faster decoding without a noticeable sacrifice in comparison to the fully autoregressive approaches, and exhibits faster decoding with better performance compared to semi-AR approaches. It is also noticeable that BnB even demonstrates faster decoding than existing NAR models, except ELMER [18]. This is expected because: 1) BnB (220M total parameters) is a relatively larger model than ELMER (140M total parameters);


2) BnB simultaneously predicts the tokens only at the last layer like most NAR models, whereas ELMER predicts at different layers with its early exit technique. ELMER [18] reported that defining an early exit technique is an important concern: in pre-training, ELMER follows layer permutation language modeling, but during fine-tuning the model needs to design an effective and suitable method to decide at which layer the model will exit and predict tokens. Our BnB does not rely on the early exit technique but predicts the tokens from the final output layer. In future work, we will adopt the early exit technique to observe the performance on long-form text generation along with a length classification model.

Furthermore, training and evaluation of language models with all possible components stated in Table 1 is computationally very expensive. Therefore, our ablation studies on 1) the effect of additional pre-training using the masked and permutation LM objectives and 2) different length prediction methods help us choose the best learning parameters for BnB while considering a smaller number of training steps.

IX. CONCLUSION
This paper introduces an efficient non-autoregressive S2S model, BERT-NAR-BERT, that outperforms baselines on most of the summarization and question generation tasks. It also remains competitive in output quality when evaluated on machine translation tasks, and our model trained on distilled data shows improvement over the baseline approaches. Furthermore, we find that using pre-trained BERT models as the encoder and decoder, along with CTC for length prediction and knowledge distillation for machine translation, helps improve the performance of language generation tasks. We also separately evaluate BnB on GLUE tasks and show strong language understanding performance on the nine GLUE tasks.

Despite the overall good performance, our error analysis did highlight some recurring flaws in the output, which in most cases seemed to be difficulties in dealing with less common named entities (NE). This could potentially be addressed by incorporating dedicated NE processing either in the data preparation pipeline or in model training.

In future work, we plan to experiment with replacing the BERT models with other pre-trained language models that can be used as encoders/decoders, as well as running broader evaluations on other S2S NLP tasks. Besides, large language models (LLMs) are computationally very expensive, as they require many more dedicated GPUs and more processing power than standard deep learning models. We will address LLMs or decoder-only models in our future work.

LIMITATIONS
There are several limitations of the current BnB model. First, since BERT is the backbone for the encoder and the decoder of BERT-NAR-BERT, the input sequence is limited to 512 tokens in length. In this work, we considered using a document-level corpus by loading the dataset directly from Huggingface. Therefore, during pre-training with Wikipedia and fine-tuning on summarization datasets, longer sequences are truncated, as BERT limits the maximum input to 512 tokens. One may further train the models for a longer period with a sentence-level corpus of Wikipedia to achieve better context representations.

Second, the BERT-NAR-BERT model introduces a token-level latent space that connects the encoder and decoder of the model. Since hyper-parameter tuning on large language models is computationally very costly, we chose a latent size of 8 without conducting any empirical studies on selecting the latent size.

Furthermore, the usual limitations of neural S2S generation models also apply to ours: they tend to struggle with generating less common words and phrases, such as the ones highlighted in our error analysis.

ETHICS STATEMENT
We use only publicly available datasets and relatively low compute amounts while conducting our experiments to enable reproducibility. We do not perform any studies on other humans or animals in this research.

ACKNOWLEDGMENT
The authors thank all the anonymous reviewers for their insightful comments.

REFERENCES
[1] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, ''Learning phrase representations using RNN encoder–decoder for statistical machine translation,'' in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), Doha, Qatar, 2014, pp. 1724–1734.
[2] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, ''BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,'' in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics, 2020, pp. 7871–7880.
[3] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, ''Exploring the limits of transfer learning with a unified text-to-text transformer,'' J. Mach. Learn. Res., vol. 21, no. 1, pp. 5485–5551, 2020.
[4] C. Li, X. Gao, Y. Li, B. Peng, X. Li, Y. Zhang, and J. Gao, ''Optimus: Organizing sentences via pre-trained modeling of a latent space,'' in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2020, pp. 4678–4699.
[5] S. Rothe, S. Narayan, and A. Severyn, ''Leveraging pre-trained checkpoints for sequence generation tasks,'' Trans. Assoc. Comput. Linguistics, vol. 8, pp. 264–280, Dec. 2020.
[6] J. Gu, J. Bradbury, C. Xiong, V. O. K. Li, and R. Socher, ''Non-autoregressive neural machine translation,'' in Proc. Int. Conf. Learn. Represent., 2018, pp. 1–13.
[7] J. Lee, E. Mansimov, and K. Cho, ''Deterministic non-autoregressive neural sequence modeling by iterative refinement,'' in Proc. Conf. Empirical Methods Natural Lang. Process., Brussels, Belgium, 2018, pp. 1173–1182.
[8] W. Qi, Y. Gong, J. Jiao, Y. Yan, W. Chen, D. Liu, K. Tang, H. Li, J. Chen, R. Zhang, M. Zhou, and N. Duan, ''BANG: Bridging autoregressive and non-autoregressive generation with large scale pretraining,'' in Proc. 38th Int. Conf. Mach. Learn., 2021, pp. 8630–8639.
[9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, ''BERT: Pre-training of deep bidirectional transformers for language understanding,'' in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol., vol. 1, Minneapolis, MN, USA, Jun. 2019, pp. 4171–4186.
[10] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, ''Language models are unsupervised multitask learners,'' OpenAI Blog, vol. 1, no. 8, p. 9, 2019.


[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, ''Attention is all you need,'' in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–11.
[12] M. Asada and M. Miwa, ''BioNART: A biomedical non-autoregressive transformer for natural language generation,'' in Proc. 22nd Workshop Biomed. Natural Lang. Process. BioNLP Shared Tasks, Toronto, ON, Canada, 2023, pp. 369–376.
[13] R. Shu, J. Lee, H. Nakayama, and K. Cho, ''Latent-variable non-autoregressive neural machine translation with deterministic inference using a delta posterior,'' in Proc. AAAI Conf. Artif. Intell., vol. 34, 2020, pp. 8846–8853.
[14] J. Libovický and J. Helcl, ''End-to-end non-autoregressive neural machine translation with connectionist temporal classification,'' in Proc. Conf. Empirical Methods Natural Lang. Process., Brussels, Belgium, 2018, pp. 3016–3021.
[15] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, ''XLNet: Generalized autoregressive pretraining for language understanding,'' in Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019, pp. 5753–5763.
[16] A. Conneau and G. Lample, ''Cross-lingual language model pretraining,'' in Proc. Neural Inf. Process. Syst., vol. 32, 2019, pp. 7059–7069.
[17] J. Gu and X. Kong, ''Fully non-autoregressive neural machine translation: Tricks of the trade,'' in Proc. IJCNLP, Aug. 2021, pp. 120–133.
[18] J. Li, T. Tang, W. X. Zhao, J.-Y. Nie, and J.-R. Wen, ''ELMER: A non-autoregressive pre-trained language model for efficient and effective text generation,'' in Proc. Conf. Empirical Methods Natural Lang. Process., Abu Dhabi, United Arab Emirates, 2022, pp. 1044–1058.
[19] Z. Sun, M. Wang, and L. Li, ''Multilingual translation via grafting pre-trained language models,'' in Proc. EMNLP, Punta Cana, Dominican Republic, Nov. 2021, pp. 2735–2747.
[20] Wikimedia Foundation. (2023). Wikimedia Downloads. Accessed: Sep. 19, 2023. [Online]. Available: https://dumps.wikimedia.org
[21] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, ''GLUE: A multi-task benchmark and analysis platform for natural language understanding,'' in Proc. EMNLP, Brussels, Belgium, Nov. 2018, pp. 353–355.
[22] S. Narayan, S. B. Cohen, and M. Lapata, ''Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization,'' in Proc. Conf. Empirical Methods Natural Lang. Process., Brussels, Belgium, 2018, pp. 1797–1807.
[23] C.-Y. Lin, ''ROUGE: A package for automatic evaluation of summaries,'' in Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, Jul. 2004, pp. 74–81.
[24] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, ''SQuAD: 100,000+ questions for machine comprehension of text,'' in Proc. Conf. Empirical Methods Natural Lang. Process., 2016, pp. 2383–2392.
[25] X. Du, J. Shao, and C. Cardie, ''Learning to ask: Neural question generation for reading comprehension,'' in Proc. 55th Annu. Meeting Assoc. Comput. Linguistics, Vancouver, BC, Canada, 2017, pp. 1342–1352.
[26] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, ''BLEU: A method for automatic evaluation of machine translation,'' in Proc. 40th Annu. Meeting Assoc. Comput. Linguistics, Philadelphia, PA, USA, 2001, pp. 311–318.
[27] M. Post, ''A call for clarity in reporting BLEU scores,'' in Proc. 3rd Conf. Mach. Transl., Res. Papers, Brussels, Belgium, 2018, pp. 186–191.
[28] T. Wolf et al., ''Transformers: State-of-the-art natural language processing,'' in Proc. Conf. Empirical Methods Natural Lang. Process., Syst. Demonstrations, Oct. 2020, pp. 38–45.
[29] G. Hinton, O. Vinyals, and J. Dean, ''Distilling the knowledge in a neural network,'' in Proc. NIPS Deep Learn. Workshop, 2014, pp. 1–9.
[30] Y. Kim and A. M. Rush, ''Sequence-level knowledge distillation,'' in Proc. Conf. Empirical Methods Natural Lang. Process., Austin, TX, USA, 2016, pp. 1317–1327.
[31] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, ''Improving language understanding by generative pre-training,'' OpenAI Blog, 2018.
[32] K. Song, X. Tan, T. Qin, J. Lu, and T.-Y. Liu, ''MASS: Masked sequence to sequence pre-training for language generation,'' in Proc. ICML, vol. 97, 2019, pp. 5926–5936.
[33] W. Qi, Y. Yan, Y. Gong, D. Liu, N. Duan, J. Chen, R. Zhang, and M. Zhou, ''ProphetNet: Predicting future n-gram for sequence-to-sequence pre-training,'' in Findings of the Association for Computational Linguistics: EMNLP 2020, Nov. 2020, pp. 2401–2410.
[34] M. Ghazvininejad, O. Levy, Y. Liu, and L. Zettlemoyer, ''Mask-predict: Parallel decoding of conditional masked language models,'' in Proc. Conf. Empirical Methods Natural Lang. Process. 9th Int. Joint Conf. Natural Lang. Process. (EMNLP-IJCNLP), 2019, pp. 6112–6121.
[35] J. Gu, C. Wang, and J. Zhao, ''Levenshtein transformer,'' in Advances in Neural Information Processing Systems, vol. 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. Red Hook, NY, USA: Curran Associates, 2019.

MOHAMMAD GOLAM SOHRAB received the Doctor of Engineering degree from the Department of Information Science and Intelligent Systems, The University of Tokushima, in 2013. He is currently a Senior Researcher with the Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Japan. Previously, he was a Postdoctoral Fellow with the Computational Intelligent Laboratory, Toyota Technological Institute (TTI). Before that, he was an E-Commerce Analyst and a Software Developer with subsidiary companies in the USA and Canada, respectively. He has participated in several biomedical shared tasks and data mining challenges, achieving significant ranks. His research interests include deep learning, machine learning, language models, data/text mining, information extraction, hierarchical text classification, big data, bio-informatics, extreme multi-label classification, and natural language processing. He received the Outstanding Paper Award from IEEE GCCE 2021, the Best Paper Award from IEEE Cloud Computing and Intelligent Systems, and the International Research Exchange Award from The University of Tokushima in recognition of outstanding research during his master's degree.

MASAKI ASADA received the Ph.D. degree from the Toyota Technological Institute, in 2022. He is currently a Researcher with the Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan. His research interests include deep learning, natural language processing on biomedical texts, and generative language models.

MATĪSS RIKTERS received the Ph.D. degree in hybrid methods for machine translation from the University of Latvia, in 2019. He is currently a Researcher with the Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan. Prior to AIST, he was a Postdoctoral Researcher with The University of Tokyo and a Researcher/Developer with Tilde. His main field of study is natural language processing (NLP) and machine translation (MT) in particular. His work has mainly focused on the improvement of MT for low-resource and morphologically rich languages. More recently, his focus has expanded towards context-aware MT, efficient models, and other areas of NLP, such as sentiment analysis and social network analysis.

MAKOTO MIWA received the Ph.D. degree from The University of Tokyo, in 2008. He is currently an Associate Professor with the Toyota Technological Institute, Nagoya, Japan, and an Invited Researcher with the National Institute of Advanced Industrial Science and Technology, Tokyo, Japan. His research interests include natural language processing, deep learning, and information extraction.
