CS224n: Natural Language Processing with Deep Learning
Lecture Notes: Part VI
Neural Machine Translation, Seq2seq and Attention
Course Instructors: Christopher Manning, Richard Socher
Authors: Guillaume Genthial, Lucas Liu, Barak Oshri, Kushal Ranjan
Winter 2019
2 Attention Mechanism
The limitation of sequence-to-sequence models lies in using the final RNN hidden state as the single "context vector": often, different parts of an input have different levels of significance, and different parts of the output may even depend on different parts of the input.
1. Encoder
Let (h_1, . . . , h_n) be the hidden vectors representing the input sentence. These vectors are, for instance, the output of a bi-LSTM and capture the contextual representation of each word in the sentence.
2. Decoder
We want to compute the hidden states s_i of the decoder using a recursive formula of the form

s_i = f(s_{i−1}, y_{i−1}, c_i)

where s_{i−1} is the previous hidden state of the decoder, y_{i−1} is the word generated at the previous step, and c_i is a context vector capturing the part of the source sentence relevant to the i-th decoding step. For each encoder hidden vector h_j, we compute a score

e_{i,j} = a(s_{i−1}, h_j)

where a is any function with values in R (for instance, a single-layer fully-connected network), and we normalize the scores with a softmax to obtain the attention weights

α_{i,j} = exp(e_{i,j}) / ∑_{k=1}^{n} exp(e_{i,k})

The context vector c_i is then the weighted sum of the encoder hidden vectors, c_i = ∑_{j=1}^{n} α_{i,j} h_j.
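To make these equations concrete, here is a minimal NumPy sketch of a single attention step. The additive scoring function a(s, h) = v·tanh(W_s s + W_h h) and all parameter shapes are assumptions made for illustration; the notes only require that a map a pair of states to a real number.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D array of scores
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_step(s_prev, H, W_s, W_h, v):
    """One decoder attention step.

    s_prev : (d_dec,)   previous decoder state s_{i-1}
    H      : (n, d_enc) encoder hidden vectors h_1 .. h_n
    W_s, W_h, v : parameters of the (assumed) additive scoring function
    Returns the attention weights alpha_i and the context vector c_i.
    """
    # e_{i,j} = a(s_{i-1}, h_j) for every encoder position j
    scores = np.array([v @ np.tanh(W_s @ s_prev + W_h @ h) for h in H])
    alpha = softmax(scores)          # alpha_{i,j}
    context = alpha @ H              # c_i = sum_j alpha_{i,j} h_j
    return alpha, context

# toy usage with random parameters
rng = np.random.default_rng(0)
n, d_enc, d_dec, d_a = 5, 4, 6, 3
H = rng.normal(size=(n, d_enc))
s_prev = rng.normal(size=d_dec)
W_s = rng.normal(size=(d_a, d_dec))
W_h = rng.normal(size=(d_a, d_enc))
v = rng.normal(size=d_a)
alpha, c = attention_step(s_prev, H, W_s, W_h, v)
print(alpha.sum())  # the weights sum to 1
```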
3 Other Models
Now, for each decoder hidden state h̄_i, we compute an attention vector over the encoder hidden states. We can use one of the following scoring functions, each of which returns a scalar in R:

score(h_i, h̄_j) ∈ { h_i^T h̄_j,  h_i^T W h̄_j,  W[h_i, h̄_j] }
The scores are normalized with a softmax into weights α_{i,j}, exactly as before, and the context vector is

c_i = ∑_{j=1}^{n} α_{i,j} h_j
and we can use the context vector and the decoder hidden state h̄_i to compute a new vector for the i-th time step of the decoder:

h̃_i = f([h̄_i, c_i])
The final step is to use the h̃i to make the final prediction of the
decoder. To address the issue of coverage, Luong et al. also use
an input-feeding approach. The attentional vectors h̃i are fed as
input to the decoder, instead of the final prediction. This is similar
to Bahdanau et al., who use the context vectors to compute the
hidden vectors of the decoder.
The main takeaway of this discussion is that there are lots of ways of doing attention.
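To illustrate that flexibility, here is a small NumPy sketch of the three Luong-style scoring functions listed above; the weight shapes and the toy check are assumptions made for this sketch.

```python
import numpy as np

# Three Luong-style scoring functions, each mapping a pair of hidden
# states to a scalar score.

def score_dot(h, h_bar):
    # score(h, h̄) = hᵀ h̄   (requires equal dimensions)
    return h @ h_bar

def score_general(h, h_bar, W):
    # score(h, h̄) = hᵀ W h̄   with W of shape (dim(h), dim(h̄))
    return h @ W @ h_bar

def score_concat(h, h_bar, w):
    # score(h, h̄) = w [h ; h̄]   with w mapping the concatenation to a scalar
    return w @ np.concatenate([h, h_bar])

# toy check that all three produce a scalar
rng = np.random.default_rng(1)
h, h_bar = rng.normal(size=4), rng.normal(size=4)
W, w = rng.normal(size=(4, 4)), rng.normal(size=8)
print(score_dot(h, h_bar), score_general(h, h_bar, W), score_concat(h, h_bar, w))
```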
As the search space can be huge, we need to shrink its size. Here
is a list of sequence model decoders (both good ones and bad ones).
Ancestral sampling: at each time step, sample the next word from the conditional distribution,

x_t ∼ P(x_t | x_1, . . . , x_n)

Greedy search: at each time step, pick the most probable token,

x_t = argmax_{x̃_t} P(x̃_t | x_1, . . . , x_n)

Beam search: maintain K candidate sequences and, at each time step, extend every candidate with each possible next token,

H̃_{t+1} = ∪_{k=1}^{K} H̃^k_{t+1}

where H̃^k_{t+1} is the set of all single-token extensions of the k-th candidate; we then keep only the K most likely extended sequences, as sketched below.
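Since beam search is the decoder used in practice, here is a minimal Python sketch of it. The step_logprobs interface and the fake model are stand-ins assumed for illustration, not part of the notes.

```python
import numpy as np

def beam_search(step_logprobs, vocab_size, K=3, max_len=10, eos=0):
    """Minimal beam search.

    step_logprobs(prefix) -> np.ndarray of shape (vocab_size,)
        assumed model interface: log P(next token | prefix).
    Keeps the K highest-scoring candidate sequences at every step.
    """
    # each hypothesis is (token_list, cumulative_log_prob)
    beams = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            if tokens and tokens[-1] == eos:        # finished hypothesis
                candidates.append((tokens, logp))
                continue
            scores = step_logprobs(tokens)           # (vocab_size,)
            for v in range(vocab_size):              # expand with every token
                candidates.append((tokens + [v], logp + scores[v]))
        # keep only the K best expansions
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:K]
    return beams

# toy usage with a fake model that prefers low token ids
def fake_model(prefix, vocab_size=5):
    logits = -np.arange(vocab_size, dtype=float)
    return logits - np.log(np.exp(logits).sum())     # normalize to log-probs

print(beam_search(fake_model, vocab_size=5, K=2, max_len=3))
```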
The BLEU score looks for whether n-grams in the machine trans-
lation also appear in the reference translation. Color-coded below are
some examples of different size n-grams that are shared between the
reference and candidate translation.
Let p_n denote the precision score for the grams of length n. Finally, let w_n = 1/2^n be a geometric weighting for the precision of the n-th gram. Our brevity penalty is defined as

β = exp( min(0, 1 − len_ref / len_MT) )

where len_ref is the length of the reference translation and len_MT is the length of the machine translation. The BLEU score is then

BLEU = β ∏_{n=1}^{k} p_n^{w_n}
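As a concrete illustration of these formulas, here is a small Python sketch that computes clipped n-gram precisions, the brevity penalty, and the resulting score against a single reference, following the w_n = 1/2^n weighting used above. It is a simplified toy version, not a reference BLEU implementation.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, k=4):
    """Simplified BLEU against a single reference (toy illustration)."""
    # clipped n-gram precisions p_n for n = 1..k
    precisions = []
    for n in range(1, k + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(1, sum(cand.values())))
    if min(precisions) == 0:              # any zero precision zeroes BLEU
        return 0.0
    # geometric weights w_n = 1 / 2^n and brevity penalty beta
    weights = [1 / 2 ** n for n in range(1, k + 1)]
    beta = math.exp(min(0, 1 - len(reference) / len(candidate)))
    return beta * math.prod(p ** w for p, w in zip(precisions, weights))

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(bleu(cand, ref, k=2))
```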
The BLEU score has been reported to correlate well with human
judgment of good translations, and so remains a benchmark for all
evaluation metrics following it. However, it does have many limi-
tations. It only works well on the corpus level because any zeros in
precision scores will zero the entire BLEU score. Additionally, this
BLEU score as presented suffers from only comparing a candidate
translation against a single reference, which is surely a noisy repre-
sentation of the relevant n-grams that need to be matched. Variants
of BLEU have modified the algorithm to compare the candidate with
multiple reference examples. Additionally, BLEU scores may only be
Despite the success of modern NMT systems, they have a hard time dealing with large vocabularies. Specifically, these Seq2Seq models predict the next word in the sequence by computing a target probability distribution over the entire vocabulary using a softmax. The softmax can be quite expensive to compute with a large vocabulary, since its complexity scales proportionally to the vocabulary size. We will now examine a number of approaches to address this issue.
2. Hierarchical Softmax
Morin et al. (2005, Hierarchical Probabilistic Neural Network Language Model) introduced a binary tree structure to compute the softmax more efficiently. Each probability in the target distribution is computed along a path from the root to a leaf of the tree, which takes only O(log |V|) steps rather than a sum over all |V| words.
One limitation of both methods is that they only save computation during training (when the target word is known). At test time,
one still has to compute the probability of all words in the vocabulary
in order to make predictions.
At test time, one can similarly predict the target word out of a selected subset of the entire vocabulary, called the candidate list. The challenge is that the correct target word is unknown and we have to "guess" what the target word might be. In the paper, the authors proposed to construct a candidate list for each source sentence using the K most frequent words (based on unigram probability) and the K′ most likely target words for each source word in the sentence. In Figure 8, an example is shown with K′ = 3, and the candidate list consists of all the words in the purple boxes. In practice, one can choose K = 15k, 30k, 50k and K′ = 10, 20.
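A rough sketch of how such a candidate list could be assembled; the frequency-ranked vocabulary and the per-source-word translation table below are hypothetical stand-ins for the lexicon used in the paper.

```python
def candidate_list(source_tokens, freq_ranked_vocab, translation_table, K, K_prime):
    """Union of the K most frequent target words and, for every source
    word, its K' most likely target translations."""
    candidates = set(freq_ranked_vocab[:K])
    for w in source_tokens:
        candidates.update(translation_table.get(w, [])[:K_prime])
    return candidates

# toy usage with a hypothetical lexicon
freq_ranked_vocab = ["the", "a", "of", "house", "cat", "dog"]
translation_table = {"chat": ["cat", "kitty", "feline"], "maison": ["house", "home"]}
print(candidate_list(["le", "chat"], freq_ranked_vocab, translation_table, K=3, K_prime=2))
```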
e_w = W_f H_f + W_b H_b + b, where H_f and H_b are the final hidden states of a forward and a backward character-level RNN over the word.
Luong and Manning proposed a hybrid word-character model to deal with unknown words and achieve open-vocabulary NMT (Luong & Manning 2016, Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models). The system translates mostly at the word level and consults the character components for rare words. On a high level, character-level recurrent neural networks compute source word representations and recover unknown target words when needed. The twofold advantage of such a hybrid approach is that it is much faster and easier to train than purely character-based models; at the same time, it never produces unknown words, as word-based models do.
Word-based Translation as a Backbone The core of the hybrid
NMT is a deep LSTM encoder-decoder that translates at the word
level. We maintain a vocabulary of size |V| per language and use <unk> to represent out-of-vocabulary words.
Source Character-based Representation In regular word-based
NMT, a universal embedding for <unk> is used to represent all out-
of-vocabulary words. This is problematic because it discards valuable
information about the source words. Instead, we learn a deep LSTM model over the characters of rare words, and use the final hidden state of the LSTM as the representation for the rare word (Figure 11: Hybrid NMT).
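A sketch of this idea in PyTorch: embed the characters of a rare word, run them through a deep LSTM, and use the final hidden state as the word's representation. The layer sizes and character ids are made up for the example.

```python
import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    """Builds a representation for a rare word from its characters."""
    def __init__(self, n_chars, char_dim=16, hidden_dim=32, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim)
        self.lstm = nn.LSTM(char_dim, hidden_dim, num_layers, batch_first=True)

    def forward(self, char_ids):                 # char_ids: (batch, word_len)
        x = self.embed(char_ids)                 # (batch, word_len, char_dim)
        _, (h_n, _) = self.lstm(x)
        return h_n[-1]                           # final hidden state of the top layer

# toy usage: encode a single out-of-vocabulary word given as character ids
encoder = CharWordEncoder(n_chars=30)
word = torch.tensor([[3, 7, 12, 5]])             # hypothetical character ids
print(encoder(word).shape)                       # torch.Size([1, 32])
```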
Target Character-level Generation General word-based NMT
allows generation of <unk> in the target output. Instead, the goal
here is to create a coherent framework that handles an unlimited
output vocabulary. The solution is to have a separate deep LSTM that "translates" at the character level given the current word-level state.