Christopher Manning
Lecture 6: LSTM RNNs and Neural Machine Translation
Long Short-Term Memory RNNs (LSTMs)
• A type of RNN proposed by Hochreiter and Schmidhuber in 1997 as a solution to the problem of
vanishing gradients
• Everyone cites that paper, but a crucial part of the modern LSTM is actually from Gers et al. (2000)!
• Only started to be recognized as promising through the work of Schmidhuber's student Alex Graves c. 2006
• Work in which he also invented CTC (connectionist temporal classification) for speech recognition
Long Short-Term Memory (LSTM)
We have a sequence of inputs x(t), and we will compute a sequence of hidden states h(t) and cell states c(t). On timestep t:
• Forget gate: controls what is kept vs. forgotten from the previous cell state
  f(t) = σ(W_f h(t−1) + U_f x(t) + b_f)
• Input gate: controls what parts of the new cell content are written to the cell
  i(t) = σ(W_i h(t−1) + U_i x(t) + b_i)
• Output gate: controls what parts of the cell are output to the hidden state
  o(t) = σ(W_o h(t−1) + U_o x(t) + b_o)
• New cell content: c̃(t) = tanh(W_c h(t−1) + U_c x(t) + b_c)
• Cell state: forget some content from the previous cell state and write some new cell content
  c(t) = f(t) ⊙ c(t−1) + i(t) ⊙ c̃(t)
• Hidden state: output some content from the cell
  h(t) = o(t) ⊙ tanh(c(t))
Sigmoid function: all gate values are between 0 and 1.
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
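A minimal NumPy sketch of one LSTM timestep, following the equations above (parameter names like W_f, U_f, b_f are placeholders; in practice you would use a library implementation such as torch.nn.LSTM):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM timestep. params holds weight matrices (W_*, U_*) and biases b_*."""
    W_f, U_f, b_f = params["f"]   # forget gate parameters
    W_i, U_i, b_i = params["i"]   # input gate parameters
    W_o, U_o, b_o = params["o"]   # output gate parameters
    W_c, U_c, b_c = params["c"]   # new-cell-content parameters

    f_t = sigmoid(W_f @ h_prev + U_f @ x_t + b_f)      # forget gate
    i_t = sigmoid(W_i @ h_prev + U_i @ x_t + b_i)      # input gate
    o_t = sigmoid(W_o @ h_prev + U_o @ x_t + b_o)      # output gate
    c_tilde = np.tanh(W_c @ h_prev + U_c @ x_t + b_c)  # new cell content

    c_t = f_t * c_prev + i_t * c_tilde   # the "+" sign is the secret!
    h_t = o_t * np.tanh(c_t)             # hidden state read out of the cell
    return h_t, c_t
```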
Long Short-Term Memory (LSTM)
You can think of the LSTM equations visually like this:
[Figure: the cell state flows along the top of the cell from c_{t−1} to c_t. Compute the forget gate (and the other gates) from h_{t−1} and x_t. The forget gate f_t forgets some cell content; the input gate i_t writes some new cell content c̃_t (the + sign is the secret!); the output gate o_t outputs some cell content to the hidden state, taking h_{t−1} to h_t.]
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
How does LSTM solve vanishing gradients?
• The LSTM architecture makes it much easier for an RNN to
preserve information over many timesteps
• e.g., if the forget gate is set to 1 for a cell dimension and the input gate
set to 0, then the information of that cell is preserved indefinitely (see the toy illustration after this list).
• In contrast, it’s harder for a vanilla RNN to learn a recurrent weight
matrix Wh that preserves info in the hidden state
• In practice, you get about 100 timesteps rather than about 7
• However, there are alternative ways of creating more direct and linear
pass-through connections in models for long distance dependencies
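A toy numerical illustration of the preservation argument (not a full LSTM, just the cell-state update equation with the forget gate pinned to 1 and the input gate to 0):

```python
import numpy as np

c = np.array([0.7, -1.2, 3.0])        # some cell state we want to preserve
f, i = 1.0, 0.0                        # forget gate "on", input gate "off"

for _ in range(100):                   # 100 timesteps later...
    c_tilde = np.random.randn(3)       # whatever new content the network proposes
    c = f * c + i * c_tilde            # cell update: c is copied through unchanged

print(c)                               # still [0.7, -1.2, 3.0]
```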
Is vanishing/exploding gradient just an RNN problem?
• No! It can be a problem for all neural architectures (including feed-forward and
convolutional neural networks), especially very deep ones.
• Due to chain rule / choice of nonlinearity function, gradient can become vanishingly small as it
backpropagates
• Thus, lower layers are learned very slowly (i.e., are hard to train)
• Another solution: lots of new deep feedforward/convolutional architectures add more
direct connections (thus allowing the gradient to flow)
For example:
• Residual connections, aka “ResNet” (also known as skip-connections)
• The identity connection preserves information by default
• This makes deep networks much easier to train (see the sketch below)
"Deep Residual Learning for Image Recognition", He et al, 2015. https://arxiv.org/pdf/1512.03385.pdf
Is vanishing/exploding gradient just an RNN problem?
Other methods:
• Dense connections, aka “DenseNet”
  • Directly connect each layer to all future layers!
• Highway connections, aka “HighwayNet”
  • Similar to residual connections, but the identity connection vs. the transformation layer is controlled by a dynamic gate
  • Inspired by LSTMs, but applied to deep feedforward/convolutional networks
  • (See the highway-layer sketch below.)
• Conclusion: Though vanishing/exploding gradients are a general problem, RNNs are particularly unstable
due to the repeated multiplication by the same weight matrix [Bengio et al, 1994]
”Densely Connected Convolutional Networks", Huang et al, 2017. https://arxiv.org/pdf/1608.06993.pdf ”Highway Networks", Srivastava et al, 2015. https://arxiv.org/pdf/1505.00387.pdf
”Learning Long-Term Dependencies with Gradient Descent is Difficult", Bengio et al. 1994, http://ai.dinfo.unifi.it/paolo//ps/tnn-94-gradient.pdf
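A minimal PyTorch sketch of a highway layer in the spirit of Srivastava et al. (2015) (dimension handling and nonlinearity are illustrative): a gate T(x), computed from the input, interpolates between the transformed output H(x) and the identity path x.

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """y = T(x) * H(x) + (1 - T(x)) * x, where the gate T is computed from x."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)   # H(x)
        self.gate = nn.Linear(dim, dim)        # T(x): a dynamic gate with values in [0, 1]

    def forward(self, x):
        h = torch.relu(self.transform(x))
        t = torch.sigmoid(self.gate(x))
        return t * h + (1.0 - t) * x   # per-dimension choice of transform vs. identity
```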
3. Other RNN uses: RNNs can be used for sequence tagging
e.g., part-of-speech tagging, named entity recognition
[Figure: an RNN predicts a tag (DT JJ NN VBN IN DT NN) for each word of the input sentence.]
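A minimal PyTorch sketch of an RNN sequence tagger of this kind (vocabulary size, tag set, and dimensions are placeholders): one hidden state per input token, each mapped to scores over tags.

```python
import torch.nn as nn

class RNNTagger(nn.Module):
    """Predict one tag (e.g., a POS tag) per input token."""
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_tags)

    def forward(self, token_ids):               # token_ids: (batch, seq_len)
        h, _ = self.rnn(self.embed(token_ids))  # h: (batch, seq_len, hidden_dim)
        return self.out(h)                      # per-token tag scores
```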
RNNs can be used as a sentence encoder model
e.g., for sentiment classification (predicting “positive” or “negative”)
[Figure: the RNN’s hidden states over the input sentence are combined into a single sentence encoding, which is fed to the classifier. How to compute the sentence encoding? e.g., use the final hidden state, or an element-wise max or mean over all the hidden states.]
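A minimal PyTorch sketch of an RNN sentence encoder for sentiment classification (dimensions are placeholders; mean pooling is just one option for computing the sentence encoding):

```python
import torch.nn as nn

class RNNSentenceClassifier(nn.Module):
    """Encode a sentence with an RNN, pool the hidden states, then classify."""
    def __init__(self, vocab_size, num_classes=2, emb_dim=100, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.classify = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):                    # (batch, seq_len)
        h, _ = self.rnn(self.embed(token_ids))       # (batch, seq_len, hidden_dim)
        sentence_encoding = h.mean(dim=1)            # element-wise mean over timesteps
        return self.classify(sentence_encoding)      # e.g., positive vs. negative
```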
RNN-LMs can be used to generate text based on other information
e.g., speech recognition, machine translation, summarization
[Figure: an RNN-LM whose generation is conditioned on other input, producing “the movie was terribly exciting !”. In this example, “exciting” is in the right context and this modifies the meaning of “terribly” (from negative to positive).]
Bidirectional RNNs
[Figure: a Forward RNN and a Backward RNN run over the input; their hidden states are concatenated at each position. This contextual representation of “terribly” has both left and right context!]
• If you do have the entire input sequence (e.g., any kind of encoding), bidirectionality is powerful (you should use it by default).
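A quick sketch of bidirectionality using PyTorch’s built-in LSTM (dimensions are placeholders):

```python
import torch
import torch.nn as nn

emb_dim, hidden_dim = 100, 200
birnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

x = torch.randn(8, 20, emb_dim)   # (batch, seq_len, emb_dim)
h, _ = birnn(x)                   # h: (batch, seq_len, 2 * hidden_dim)
# h[:, t, :hidden_dim] is the forward state at position t and
# h[:, t, hidden_dim:] is the backward state; concatenated, they give a
# contextual representation with both left and right context.
```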
Multi-layer RNNs
• RNNs are already “deep” on one dimension (they unroll over many timesteps)
• We can also make them “deep” in another dimension by applying multiple RNNs: this is a multi-layer (stacked) RNN
Multi-layer RNNs
The hidden states from RNN layer i are the inputs to RNN layer i+1.
[Figure: RNN layer 1, RNN layer 2, and RNN layer 3 stacked over the same input sequence.]
Source: "Findings of the 2016 Conference on Machine Translation (WMT16)", Bojar et al. 2016, http://www.statmt.org/wmt16/pdf/W16-2301.pdf
Source: "Findings of the 2018 Conference on Machine Translation (WMT18)", Bojar et al. 2018, http://www.statmt.org/wmt18/pdf/WMT028.pdf
Source: "Findings of the 2019 Conference on Machine Translation (WMT19)", Barrault et al. 2019, http://www.statmt.org/wmt18/pdf/WMT028.pdf
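A quick sketch of a multi-layer (stacked) RNN with PyTorch’s built-in LSTM (dimensions are placeholders):

```python
import torch
import torch.nn as nn

emb_dim, hidden_dim = 100, 200
stacked_rnn = nn.LSTM(emb_dim, hidden_dim, num_layers=3, batch_first=True)

x = torch.randn(8, 20, emb_dim)   # (batch, seq_len, emb_dim)
h, _ = stacked_rnn(x)             # h holds the top layer's hidden states:
                                  # layer 1's states feed layer 2, whose states feed layer 3
```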
5. Machine Translation
Machine Translation (MT) is the task of translating a sentence x from one language (the
source language) to a sentence y in another language (the target language).
1990s-2010s: Statistical Machine Translation
• Core idea: Learn a probabilistic model from data
• Suppose we’re translating French → English.
• We want to find the best English sentence y, given French sentence x: argmax_y P(y|x)
• Use Bayes’ Rule to break this down into two components to be learned separately:
  argmax_y P(x|y) P(y)
  where P(x|y) is the translation model (learned from parallel data) and P(y) is the language model (learned from monolingual English data)
2014: Neural Machine Translation hits MT research
(dramatic reenactment)
6. What is Neural Machine Translation?
• Neural Machine Translation (NMT) is a way to do Machine Translation with a single
end-to-end neural network
Neural Machine Translation (NMT)
The sequence-to-sequence model
Target sentence (output): “he hit me with a pie <END>”
Source sentence (input): “il a m’ entarté”
[Figure: the Encoder RNN reads the source sentence; the encoding of the source sentence provides the initial hidden state for the Decoder RNN. The Decoder RNN then generates the target sentence one word at a time, taking the argmax at each step and feeding each predicted word back in as the next input, starting from <START>.]
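A minimal PyTorch sketch of such a seq2seq NMT model (class structure, dimensions, and greedy argmax decoding are illustrative; no attention yet): the encoder’s final state initializes the decoder, which generates one target word per step.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the source; its final (h, c) state conditions the decoder.
        _, state = self.encoder(self.src_embed(src_ids))
        dec_h, _ = self.decoder(self.tgt_embed(tgt_ids), state)
        return self.out(dec_h)              # scores over the target vocabulary at each step

    @torch.no_grad()
    def greedy_decode(self, src_ids, start_id, end_id, max_len=50):
        _, state = self.encoder(self.src_embed(src_ids))
        word = torch.full((src_ids.size(0), 1), start_id, dtype=torch.long)
        output = []
        for _ in range(max_len):
            dec_h, state = self.decoder(self.tgt_embed(word), state)
            word = self.out(dec_h).argmax(-1)   # take the argmax at each step
            output.append(word)
            if (word == end_id).all():
                break
        return torch.cat(output, dim=1)
```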
Sequence-to-sequence is versatile!
• The general notion here is an encoder-decoder model
• One neural network takes input and produces a neural representation
• Another network produces output based on that neural representation
• If the input and output are sequences, we call it a seq2seq model
NMT: the first big success story of NLP Deep Learning
Neural Machine Translation went from a fringe research attempt in 2014 to the leading
standard method in 2016
• 2016: Google Translate switches from SMT to NMT – and by 2018 everyone had
• https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html
[Figure: a neural encoder-decoder MT system. The Encoder builds up the sentence meaning of the source “Die Proteste waren am Wochenende eskaliert <EOS>”; the Decoder generates the translation “The protests escalated over the weekend”, feeding in the last generated word at each step. Conditioning on a single encoding = bottleneck.]
Christopher Manning
Lecture 7: Attention and Final Projects; Practical Tips
1. Multi-layer deep encoder-decoder machine translation net
[Sutskever et al. 2014; Luong et al. 2015]
The hidden states from RNN layer i are the inputs to RNN layer i+1.
[Figure: a multi-layer (stacked) encoder-decoder MT network. The Encoder builds up the sentence meaning of the source “Die Proteste waren am Wochenende eskaliert <EOS>”; the Decoder generates the translation “The protests escalated over the weekend <EOS>”, feeding in the last generated word at each step. Conditioning on a single encoding = bottleneck.]
How do we evaluate Machine Translation?
Commonest way: BLEU (Bilingual Evaluation Understudy). You’ll see BLEU in detail in Assignment 3!
BLEU compares the machine-written translation to one or several human-written reference translations and computes a similarity score based on n-gram precision, plus a penalty for too-short system translations.
Source: "BLEU: a Method for Automatic Evaluation of Machine Translation", Papineni et al., 2002. http://aclweb.org/anthology/P02-1040
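A quick usage sketch with NLTK’s implementation (assuming NLTK is installed; sacrebleu is the more standard tool for reporting real MT results):

```python
from nltk.translate.bleu_score import sentence_bleu

reference = [["the", "protests", "escalated", "over", "the", "weekend"]]
candidate = ["the", "protests", "escalated", "over", "the", "weekend"]
print(sentence_bleu(reference, candidate))  # 1.0 for an exact match; lower as n-gram overlap drops
```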
MT progress over time
[Edinburgh En-De WMT newstest2013 Cased BLEU; NMT 2015 from U. Montréal; NMT 2019 FAIR on newstest2019]
[Chart: Cased BLEU (y-axis, 0 to 45) over the years 2013 to 2019 for three system types: Phrase-based SMT, Syntax-based SMT, and Neural MT.]
Sources: http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf & http://matrix.statmt.org/
2. Why attention? Sequence-to-sequence: the bottleneck problem
[Figure: the Encoder RNN reads the source sentence “il a m’ entarté” and produces a single encoding of the source sentence. This needs to capture all information about the source sentence before the Decoder RNN generates the target sentence “he hit me with a pie <END>”. Information bottleneck!]
Attention
• Attention provides a solution to the bottleneck problem.
• Core idea: on each step of the decoder, use direct connection to the encoder to focus
on a particular part of the source sequence
• First, we will show via diagram (no equations), then we will show with equations
Sequence-to-sequence with attention
Core idea: on each step of the decoder, use direct connection to the encoder to focus on a particular part of the source sequence.
[Figure sequence: at each decoder step, take the dot product of the decoder hidden state with each encoder hidden state of the source sentence “il a m’ entarté” to get attention scores; take a softmax over the scores to get the attention distribution; use the attention distribution to take a weighted sum of the encoder hidden states, giving the attention output; concatenate the attention output with the decoder hidden state, then use it to compute the next output word. Repeating this step by step generates “he hit me with a pie”, starting from <START> and feeding each generated word back in.]
Attention: in equations
• We have encoder hidden states h_1, ..., h_N ∈ R^h
• On timestep t, we have decoder hidden state s_t ∈ R^h
• We get the attention scores e^t for this step:
  e^t = [s_t^T h_1, ..., s_t^T h_N] ∈ R^N
• We take softmax to get the attention distribution α^t for this step (this is a probability distribution and sums to 1):
  α^t = softmax(e^t) ∈ R^N
• We use α^t to take a weighted sum of the encoder hidden states to get the attention output a_t:
  a_t = Σ_{i=1}^{N} α_i^t h_i ∈ R^h
• Finally, we concatenate the attention output with the decoder hidden state, [a_t; s_t] ∈ R^{2h}, and proceed as in the non-attention seq2seq model (see the code sketch below)
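A minimal sketch of these equations for a single decoder step with basic dot-product attention (batch dimension omitted; assumes the encoder and decoder hidden states have the same size):

```python
import torch
import torch.nn.functional as F

def attention_step(s_t, H):
    """s_t: decoder hidden state, shape (h,). H: encoder hidden states, shape (N, h)."""
    scores = H @ s_t                      # e^t: one dot-product score per source position, (N,)
    alpha = F.softmax(scores, dim=0)      # attention distribution: sums to 1
    a_t = alpha @ H                       # weighted sum of encoder hidden states, (h,)
    return torch.cat([a_t, s_t]), alpha   # concatenate with the decoder state, as in the equations
```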
Attention is great!
• Attention significantly improves NMT performance
• It’s very useful to allow decoder to focus on certain parts of the source
• Attention provides a more “human-like” model of the MT process
• You can look back at the source sentence while translating, rather than needing to remember it all
• Attention solves the bottleneck problem
• Attention allows decoder to look directly at source; bypass bottleneck
• Attention helps with the vanishing gradient problem
• Provides shortcut to faraway states
• Attention provides some interpretability
  • By inspecting the attention distribution, we see what the decoder was focusing on
  • We get (soft) alignment for free!
  • This is cool because we never explicitly trained an alignment system
  • The network just learned alignment by itself
[Figure: attention-based soft alignment between the source words “il a m’ entarté” and the target words “he hit me with a pie”.]
There are several attention variants
• We have some values h_1, ..., h_N ∈ R^{d_1} and a query s ∈ R^{d_2}
• Attention always involves:
  1. Computing the attention scores e ∈ R^N (there are multiple ways to do this)
  2. Taking softmax to get the attention distribution α = softmax(e) ∈ R^N
  3. Using the attention distribution to take a weighted sum of the values, thus obtaining the attention output a (sometimes called the context vector): a = Σ_i α_i h_i ∈ R^{d_1}
Attention variants (you’ll think about the relative advantages/disadvantages of these in Assignment 3!)
There are several ways to compute the scores e ∈ R^N from the values h_1, ..., h_N ∈ R^{d_1} and the query s ∈ R^{d_2}:
• Basic dot-product attention: e_i = s^T h_i (assumes d_1 = d_2; this is the version shown earlier)
• Multiplicative attention: e_i = s^T W h_i, where W ∈ R^{d_2×d_1} is a learned weight matrix
• Reduced-rank multiplicative attention: e_i = s^T (U^T V) h_i = (U s)^T (V h_i), for low-rank matrices U ∈ R^{k×d_2}, V ∈ R^{k×d_1}, k ≪ d_1, d_2
• Additive attention: e_i = v^T tanh(W_1 h_i + W_2 s), where W_1, W_2 are learned weight matrices and v is a learned weight vector
Attention is a general Deep Learning technique
• More general definition of attention:
• Given a set of vector values, and a vector query, attention is a technique to compute
a weighted sum of the values, dependent on the query.
Intuition:
• The weighted sum is a selective summary of the information contained in the values,
where the query determines which values to focus on.
• Attention is a way to obtain a fixed-size representation of an arbitrary set of
representations (the values), dependent on some other representation (the query).
Upshot:
• Attention has become the powerful, flexible, general way to do pointer and memory
manipulation in all deep learning models. A new idea from after 2010! From NMT!