Scheduled Sampling for Transformers

Tsvetomila Mihaylova
Instituto de Telecomunicações
Lisbon, Portugal
[email protected]

André F. T. Martins
Instituto de Telecomunicações & Unbabel
Lisbon, Portugal
[email protected]

Abstract

Scheduled sampling is a technique for avoiding one of the known problems in sequence-to-sequence generation: exposure bias. It consists of feeding the model a mix of the teacher-forced embeddings and the model predictions from the previous step at training time. The technique has been used to improve model performance with recurrent neural networks (RNNs). In the Transformer model, unlike the RNN, the generation of a new word attends to the full sentence generated so far, not only to the last word, and it is not straightforward to apply the scheduled sampling technique. We propose some structural changes to allow scheduled sampling to be applied to the Transformer architecture via a two-pass decoding strategy. Experiments on two language pairs achieve performance close to a teacher-forcing baseline and show that this technique is promising for further exploration.

1 Introduction

Recent work in Neural Machine Translation (NMT) relies on a sequence-to-sequence model with global attention (Sutskever et al., 2014; Bahdanau et al., 2014), trained with maximum likelihood estimation (MLE). These models are typically trained by teacher forcing, in which the model makes each decision conditioned on the gold history of the target sequence. This tends to lead to quick convergence but is dissimilar to the procedure used at decoding time, when the gold target sequence is not available and decisions are conditioned on previous model predictions.

Ranzato et al. (2015) point out that using teacher forcing means the model has never been trained on its own errors and may not be robust to them, a phenomenon called exposure bias. This has the potential to cause problems at translation time, when the model is exposed to its own (likely imperfect) predictions.

A common approach for addressing the problem of exposure bias is using a scheduled strategy for deciding when to use teacher forcing and when not to (Bengio et al., 2015). For a recurrent decoder, applying scheduled sampling is trivial: for the generation of each word, the model decides whether to condition on the gold embedding from the given target (teacher forcing) or on the model prediction from the previous step.

In the Transformer model (Vaswani et al., 2017), the decoding is still autoregressive, but unlike the RNN decoder, the generation of each word conditions on the whole prefix sequence and not only on the last word. This makes it non-trivial to apply scheduled sampling directly to this model. Since the Transformer achieves state-of-the-art results and has become a default choice for many natural language processing problems, it is interesting to adapt and explore the idea of scheduled sampling for it, and, to our knowledge, no way of doing this has been proposed so far.

Our contributions in this paper are:

• We propose a new strategy for using scheduled sampling in Transformer models by making two passes through the decoder at training time.

• We compare several approaches for conditioning on the model predictions when they are used instead of the gold target.

• We test scheduled sampling with Transformers on a machine translation task on two language pairs and achieve results close to a teacher forcing baseline (with a slight improvement of up to 1 BLEU point for some models).

2 Related Work
Bengio et al. (2015) proposed scheduled sampling for sequence-to-sequence RNN models: a method where the embedding used as the input to the decoder at time step t+1 is picked randomly between the gold target and the argmax of the model's output probabilities at step t. The Bernoulli probability of picking one or the other changes over training epochs according to a schedule that makes the probability of choosing the gold target decrease across training steps. The authors propose three different schedules: linear decay, exponential decay and inverse sigmoid decay.

Goyal et al. (2017) proposed an approach based on scheduled sampling which backpropagates through the model decisions. At each step, when the model decides to use model predictions, instead of the argmax they use a weighted average of all word embeddings, weighted by the prediction scores. They experimented with two options: a softmax with a temperature parameter, and a stochastic variant using the Gumbel Softmax (Jang et al., 2016) with temperature. With this technique, they achieve better results than standard scheduled sampling. Our work extends Bengio et al. (2015) and Goyal et al. (2017) by adapting their frameworks to the Transformer architecture.

Ranzato et al. (2015) took ideas from scheduled sampling and the REINFORCE algorithm (Williams, 1992) and combined teacher forcing training with optimization of a sequence-level loss. In the first epochs, the model is trained with teacher forcing; for the remaining epochs, they start with teacher forcing for the first t time steps and use REINFORCE (sampling from the model) until the end of the sequence. They decrease the teacher forcing time t as training continues, until the whole sequence is trained with REINFORCE in the final epochs. Besides the work of Ranzato et al. (2015), other methods focused on sequence-level training use, for example, actor-critic (Bahdanau et al., 2016) or beam search optimization (Wiseman and Rush, 2016). These methods directly optimize the metric used at test time (e.g., BLEU). Another proposed approach to avoid exposure bias is SEARN (Daumé et al., 2009). In SEARN, the model uses its own predictions at training time to produce a sequence of actions, then a search algorithm determines the optimal action at each step, and a policy is trained to predict that action. The main drawback of these approaches is that training becomes much slower. By contrast, in this paper we focus on methods whose training time is comparable to a force-decoding baseline.

3 Scheduled Sampling with Transformers

With recurrent neural networks (RNNs), training generates one word per time step, and the generation of this word is conditioned on the previous word from the gold target sequence. This sequential decoding makes it simple to apply scheduled sampling: at each time step, with some probability, instead of using the previous word in the gold sequence, we use the word predicted by the model at the previous step.

The Transformer model (Vaswani et al., 2017), which achieves state-of-the-art results on many natural language processing tasks, is also an autoregressive model. The generation of each word conditions on all previous words in the sequence, not only on the last generated word. The model is based on several self-attention layers, which directly model relationships between all words in the sentence, regardless of their respective positions. The order of the words is captured by position embeddings, which are summed with the corresponding word embeddings. Position masking in the decoder ensures that the generation of each word depends only on the previous words in the sequence and not on the following ones. Because the generation of a word in the Transformer conditions on all previous words in the sequence and not just on the last word, it is not trivial to apply scheduled sampling, where, at training time, we need to choose between using the gold target word or the model prediction. In order to allow the use of scheduled sampling with the Transformer model, we needed to make some changes to the Transformer architecture.
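To make the contrast concrete, the following is a minimal sketch of scheduled sampling for a recurrent decoder, where the gold/predicted choice is made one step at a time (Bengio et al., 2015). It uses generic PyTorch components (a GRU cell, an embedding table and an output projection) as stand-ins; the function and argument names are our own illustration, not code from the paper.

import torch
import torch.nn as nn

def rnn_scheduled_sampling_step(cell, embed, generator, gold, hidden, tf_prob):
    """One training sequence for an RNN decoder with scheduled sampling.
    gold: (T, B) gold target tokens; tf_prob: probability of teacher forcing."""
    inp, logits = gold[0], []                       # start from the first gold token (e.g. BOS)
    for t in range(1, gold.size(0)):
        hidden = cell(embed(inp), hidden)           # one decoder step
        step_logits = generator(hidden)             # scores over the vocabulary
        logits.append(step_logits)
        use_gold = torch.rand(()).item() < tf_prob  # per-word coin flip
        inp = gold[t] if use_gold else step_logits.argmax(-1)
    return torch.stack(logits)

# Example wiring with generic modules (sizes are placeholders):
vocab, d, h = 32000, 512, 512
cell, embed, generator = nn.GRUCell(d, h), nn.Embedding(vocab, d), nn.Linear(h, vocab)

Because each step depends on the previous one, the coin flip can be made inside the decoding loop. The two-pass procedure described next exists precisely because the Transformer decoder has no such loop at training time.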
[Figure 1 appears here: the encoder (left) and two copies of the decoder stack (centre and right), each with output embeddings, position encodings, masked multi-head self-attention, multi-head attention over the encoder, feed-forward layers, and a linear layer with a softmax generator on top producing output probabilities. The first decoder copy takes the gold target as input; the second takes the gold target mixed with the model predictions.]

Figure 1: Transformer model adapted for use with scheduled sampling. The two decoders in the figure share the same parameters. The first pass of the decoder conditions on the gold target sequence and returns the model predictions. The second pass conditions on a mix of the target sequence and the model predictions and returns the final result. The thicker lines show the path that is backpropagated in all experiments, i.e., we always backpropagate through the second decoder pass. The thin arrows are only backpropagated in a part of the experiments. (The image is based on the Transformer architecture from Vaswani et al. (2017).)

3.1 Two-decoder Transformer

The model we propose for applying scheduled sampling in Transformers makes two passes through the decoder. Its architecture is illustrated in Figure 1. We make no changes to the encoder of the model. The decoding of the scheduled transformer has the following steps:

1. First pass through the decoder: obtain the model predictions. In this step, the decoder conditions on the gold target sequence and predicts scores for each position, as in a standard Transformer model. Those scores are passed to the next step.

2. Mix the gold target sequence with the predicted sequence. After obtaining a sequence representing the model prediction for each position, we imitate scheduled sampling by mixing the target sequence with the model predictions: for each position in the sequence, we select with a given probability whether to use the gold token or the prediction from the model. The probability of using teacher forcing (i.e., the gold token) is a function of the training step and is calculated with a selected schedule. We pass this "new reference sequence" as the reference for the second decoder pass. The vectors used from the model predictions can be either the embedding of the highest-scored word or a mix of the embeddings according to their scores. Several variants of building the vector from the model predictions for each position are described below.

3. Second pass through the decoder: the final predictions. The second pass of the decoder uses as its target the mix of words from the gold sequence and the model predictions. The outputs of this decoder pass are the actual result of the model.

It is important to mention that the two decoders are identical and share the same parameters. We use the same decoder for the first pass, where we condition on the gold sequence, and for the second pass, where we condition on the mix between the gold sequence and the model predictions.
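The following is a minimal PyTorch-style sketch of this two-pass procedure, using torch.nn.TransformerDecoder as a stand-in for the OpenNMT-py decoder actually used in the experiments. The names, shapes, and shift handling are illustrative assumptions, and the mix shown here is the simple argmax variant; the alternatives are listed in Section 3.2 and the gradient choice in Section 3.3.

# Sketch of the two-pass scheduled sampling step (Section 3.1), assuming the
# encoder output `memory` is already computed. Generic PyTorch modules stand in
# for the paper's OpenNMT-py implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_model, pad = 32000, 512, 1
embed = nn.Embedding(vocab, d_model, padding_idx=pad)
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, nhead=8), num_layers=6)
generator = nn.Linear(d_model, vocab)   # decoder states -> vocabulary scores


def two_pass_step(memory, dec_input, labels, tf_prob):
    """memory: (S, B, d) encoder states; dec_input: (T, B) gold tokens shifted right
    (BOS first); labels: (T, B) gold tokens to predict; tf_prob: teacher forcing prob."""
    T = dec_input.size(0)
    causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

    # Pass 1: condition on the gold prefix, exactly like a standard Transformer decoder.
    scores = generator(decoder(embed(dec_input), memory, tgt_mask=causal))
    preds = scores.argmax(-1).detach()            # argmax variant; no gradient into pass 1

    # Align predictions with the decoder input positions (the prediction at step t
    # becomes the input at step t+1), keeping the BOS row from the gold input.
    preds_shifted = torch.cat([dec_input[:1], preds[:-1]], dim=0)

    # Mix: keep the gold token with probability tf_prob, else use the model's own word.
    keep_gold = torch.bernoulli(torch.full(dec_input.shape, tf_prob)).bool()
    mixed = torch.where(keep_gold, dec_input, preds_shifted)

    # Pass 2 with the same (shared) decoder, conditioned on the mixed sequence.
    logits = generator(decoder(embed(mixed), memory, tgt_mask=causal))

    # The loss is always computed against the gold labels.
    return F.cross_entropy(logits.reshape(-1, vocab), labels.reshape(-1), ignore_index=pad)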
3.2 Embedding Mix

For each position in the sequence, the first decoder pass gives a score for each vocabulary word. We explore several ways of using those scores when the model predictions are used instead of the gold tokens.

• The most obvious option is not to mix the embeddings at all and to pass the argmax of the model predictions, i.e., to use the embedding of the vocabulary word with the highest score from the decoder.

• We also experiment with mixing the top-k embeddings. In our experiments, we use the weighted average of the embeddings of the top-5 scored vocabulary words.

• Inspired by the work of Goyal et al. (2017), we experiment with passing a mix of the embeddings given by a softmax with a temperature parameter α. Using a higher value of α gives a closer approximation of the argmax:

    \bar{e}_{i-1} = \sum_{y} e(y)\, \frac{\exp(\alpha\, s_{i-1}(y))}{\sum_{y'} \exp(\alpha\, s_{i-1}(y'))},

where \bar{e}_{i-1} is the vector used at the current position, obtained as a sum of the embeddings of all vocabulary words, weighted by a softmax of the scores s_{i-1}.

• An alternative to using the argmax is sampling an embedding from the softmax distribution. Also based on the work of Goyal et al. (2017), we use the Gumbel-Softmax approximation (Maddison et al., 2016; Jang et al., 2016) to sample the embedding:

    \bar{e}_{i-1} = \sum_{y} e(y)\, \frac{\exp(\alpha\, (s_{i-1}(y) + G_y))}{\sum_{y'} \exp(\alpha\, (s_{i-1}(y') + G_{y'}))},

where U ∼ Uniform(0, 1) and G = −log(−log U).

• Finally, we experiment with passing a sparsemax mix of the embeddings (Martins and Astudillo, 2016).
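A compact sketch of these mixing variants is given below, operating on the pass-1 scores for a batch of positions and on the embedding matrix. The function name, the softmax normalization of the top-k weights, and the omission of sparsemax (which needs an external implementation such as the entmax package) are our assumptions rather than details taken from the paper.

# Embedding-mixing variants from Section 3.2. `s` holds pass-1 scores with shape
# (positions, vocab) and `E` is the embedding matrix with shape (vocab, d_model).
import torch
import torch.nn.functional as F

def mix_embeddings(s, E, mode="softmax", alpha=1.0, k=5):
    if mode == "argmax":                # embedding of the single highest-scored word
        return E[s.argmax(dim=-1)]
    if mode == "topk":                  # weighted average of the top-k embeddings
        vals, idx = s.topk(k, dim=-1)
        w = F.softmax(vals, dim=-1)     # weights from the top-k scores (normalization is our choice)
        return torch.einsum("pk,pkd->pd", w, E[idx])
    if mode == "softmax":               # softmax with scale alpha, as in the first equation above
        return F.softmax(alpha * s, dim=-1) @ E
    if mode == "gumbel":                # Gumbel-Softmax sample, as in the second equation above
        u = torch.rand_like(s).clamp_min(1e-9)
        g = -torch.log(-torch.log(u))
        return F.softmax(alpha * (s + g), dim=-1) @ E
    raise ValueError(f"unknown mode: {mode}")  # sparsemax would need e.g. the entmax package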
3.3 Weights Update

We calculate the cross-entropy loss based on the outputs of the second decoder pass. For the cases where all vocabulary words are summed (softmax, Gumbel-Softmax, sparsemax), we try two variants of updating the model weights:

• Only backpropagate through the decoder pass which makes the final predictions, based on the mix between the gold target and the model predictions.

• Backpropagate through the second as well as the first decoder pass, which predicts the model outputs. This setup resembles the differentiable scheduled sampling proposed by Goyal et al. (2017).
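As a sketch, the difference between the two variants amounts to whether the pass-1 scores are detached before the mix is built; `mix_embeddings` refers to the earlier sketch, and the function below is our illustration, not the authors' code.

# Build the input to the second decoder pass, with or without gradients flowing
# back into the first pass (the latter resembles differentiable scheduled
# sampling, Goyal et al., 2017).
def second_pass_input(first_pass_scores, E, backprop_through_decisions=False):
    if not backprop_through_decisions:
        # Variant 1: treat the pass-1 scores as constants; only the second
        # decoder pass (and the encoder) receives gradients from the loss.
        first_pass_scores = first_pass_scores.detach()
    # Variant 2 (no detach): the soft mix keeps the pass-1 computation graph,
    # so the loss also backpropagates through the model's own decisions.
    return mix_embeddings(first_pass_scores, E, mode="softmax", alpha=1.0)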
4 Experiments

We report experiments with scheduled sampling for Transformers on the task of machine translation. We run the experiments on two language pairs:

• IWSLT 2017 German−English (DE−EN, Cettolo et al. (2017));

• KFTT Japanese−English (JA−EN, Neubig (2011)).

We use byte pair encoding (BPE; Sennrich et al., 2016) with a joint segmentation with 32,000 merges for both language pairs.

The hyperparameters used across experiments are shown in Table 1. All models were implemented in a fork of OpenNMT-py (Klein et al., 2017). We compare our model to a teacher forcing baseline, i.e., a standard Transformer model without scheduled sampling, with the hyperparameters given in Table 1. We tuned hyperparameters by trying several different values for dropout and warmup steps and choosing the best BLEU score on the validation set for the baseline model.

Encoder model type          Transformer
Decoder model type          Transformer
# Enc. & dec. layers        6
Heads                       8
Hidden layer size           512
Word embedding size         512
Batch size                  32
Optimizer                   Adam
Learning rate               1.0
Warmup steps                20,000
Maximum training steps      300,000
Validation steps            10,000
Position encoding           True
Share embeddings            True
Share decoder embeddings    True
Dropout                     0.2 (DE−EN), 0.1 (JA−EN)

Table 1: Hyperparameters shared across models.

With the scheduled sampling method, the teacher forcing probability continuously decreases over the course of training according to a predefined function of the training steps. Among the decay strategies proposed for scheduled sampling, we found that linear decay is the one that works best for our data:

    t(i) = \max\{\epsilon, k - ci\},    (1)

where 0 ≤ ε < 1 is the minimum teacher forcing probability to be used in the model, and k and c provide the offset and slope of the decay. This function determines the teacher forcing ratio t for training step i, that is, the probability of doing teacher forcing at each position in the sequence.
Experiment                           DE−EN              JA−EN
                                    Dev    Test        Dev    Test
Teacher Forcing Baseline           35.05   29.62      18.00   19.46

No backprop
Argmax                             23.99   20.57      12.88   15.13
Top-k mix                          35.19   29.42      18.46   20.24
Softmax mix α = 1                  35.07   29.32      17.98   20.03
Softmax mix α = 10                 35.30   29.25      17.79   19.67
Gumbel Softmax mix α = 1           35.36   29.48      18.31   20.21
Gumbel Softmax mix α = 10          35.32   29.58      17.94   20.87
Sparsemax mix                      35.22   29.28      18.14   20.15

Backprop through model decisions
Softmax mix α = 1                  33.25   27.60      15.67   17.93
Softmax mix α = 10                 27.06   23.29      13.49   16.02
Gumbel Softmax mix α = 1           30.57   25.71      15.86   18.76
Gumbel Softmax mix α = 10          12.79   10.62      13.98   17.09
Sparsemax mix                      24.65   20.15      12.44   16.23

Table 2: Experiments with scheduled sampling for the Transformer. The table shows BLEU scores for the checkpoint with the best BLEU on the validation set. The first group of experiments does not backpropagate through the first decoder pass; the second group backpropagates through the second as well as the first decoder pass.

The results of our experiments are shown in Table 2. The scheduled sampling variant that uses only the highest-scored word predicted by the model does not perform very well. The models which use mixed embeddings (top-k, softmax, Gumbel softmax or sparsemax) and only backpropagate through the second decoder pass perform slightly better than the baseline on the validation set, and one of them is also slightly better on the test set. The differentiable scheduled sampling variants (where the model backpropagates through the first decoder pass) have much lower results. The performance of these models starts degrading too early, so we expect that using more training steps with teacher forcing at the beginning of training would lead to better performance; this setup still needs to be examined more carefully.

5 Discussion and Future Work

In this paper, we presented our approach to applying the scheduled sampling technique to Transformers. Because of the specifics of the decoding, applying scheduled sampling is not as straightforward as it is for RNNs, and it required some changes in the way the Transformer model is trained, namely a two-pass decoding. We experimented with several schedules and ways of mixing the embeddings in the case where the model predictions were used. We tested the models on machine translation on two language pairs. The experimental results showed that our scheduled sampling strategy gave better results on the validation set for both language pairs compared to a teacher forcing baseline and, for one of the tested language pairs (JA−EN), slightly better results on the test set.

One possible direction for future work is experimenting with more schedules. We noticed that when the schedule starts falling too fast, for example with the exponential or inverse sigmoid decay, the performance of the model degrades too fast. Therefore, we think it is worth exploring schedules where training does more pure teacher forcing at the beginning and then decays more slowly, for example an inverse sigmoid decay which starts decreasing after more epochs. We will also try the experiments on more language pairs.

Finally, we need to explore the poor performance of the differentiable scheduled sampling setup (with backpropagation through both decoder passes). In this case, the performance of the model starts decreasing earlier, and the reason for this needs to be examined carefully. We expect this setup to give better results after adjusting the decay schedule to allow more teacher forcing training before starting to use the model predictions.
Acknowledgments

This work was supported by the European Research Council (ERC StG DeepSPIN 758969), and by the Fundação para a Ciência e Tecnologia through contracts UID/EEA/50008/2019 and CMUPERI/TIC/0046/2014 (GoLocal). We would like to thank Gonçalo Correia and Ben Peters for their involvement in an earlier stage of this project.

References

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2016. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179.

Mauro Cettolo, Marcello Federico, Luisa Bentivogli, Jan Niehues, Sebastian Stüker, Katsuitho Sudoh, Koichiro Yoshino, and Christian Federmann. 2017. Overview of the IWSLT 2017 evaluation campaign. In International Workshop on Spoken Language Translation, pages 2–14.

Hal Daumé, John Langford, and Daniel Marcu. 2009. Search-based structured prediction. Machine Learning, 75(3):297–325.

Kartik Goyal, Chris Dyer, and Taylor Berg-Kirkpatrick. 2017. Differentiable scheduled sampling for credit assignment. arXiv preprint arXiv:1704.06970.

Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144.

G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. arXiv e-prints.

Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. 2016. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.

Andre Martins and Ramon Astudillo. 2016. From softmax to sparsemax: A sparse model of attention and multi-label classification. In International Conference on Machine Learning, pages 1614–1623.

Graham Neubig. 2011. The Kyoto free translation task.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1296–1306.