
Natural Language Processing

with Deep Learning


CS224N/Ling284

Christopher Manning
Lecture 6: LSTM RNNs and Neural Machine Translation
Long Short-Term Memory RNNs (LSTMs)
• A type of RNN proposed by Hochreiter and Schmidhuber in 1997 as a solution to the problem of vanishing gradients
• Everyone cites that paper, but a crucial part of the modern LSTM actually comes from Gers et al. (2000)!

• Only started to be recognized as promising through the work of Schmidhuber's student Alex Graves c. 2006
  • Work in which he also invented CTC (connectionist temporal classification) for speech recognition

• Became really well known after Hinton brought it to Google in 2013, following Graves's postdoc with Hinton

Hochreiter and Schmidhuber, 1997. Long Short-Term Memory. https://www.bioinf.jku.at/publications/older/2604.pdf
Gers, Schmidhuber, and Cummins, 2000. Learning to Forget: Continual Prediction with LSTM. https://dl.acm.org/doi/10.1162/089976600300015015
Graves, Fernández, Gomez, and Schmidhuber, 2006. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. https://www.cs.toronto.edu/~graves/icml_2006.pdf
20
Long Short-Term Memory RNNs (LSTMs)
• On step t, there is a hidden state $h^{(t)}$ and a cell state $c^{(t)}$
  • Both are vectors of length n
  • The cell stores long-term information
  • The LSTM can read, erase, and write information from the cell
  • The cell becomes conceptually rather like RAM in a computer

• The selection of which information is erased/written/read is controlled by three corresponding gates (gates are computed vectors whose element values lie between 0 and 1)
  • The gates are also vectors of length n
  • On each timestep, each element of the gates can be open (1), closed (0), or somewhere in between
  • The gates are dynamic: their values are computed based on the current context

21
Long Short-Term Memory (LSTM)
We have a sequence of inputs $x^{(t)}$, and we will compute a sequence of hidden states $h^{(t)}$ and cell states $c^{(t)}$. On timestep t:

• Forget gate: controls what is kept vs forgotten, from the previous cell state
• Input gate: controls what parts of the new cell content are written to the cell
• Output gate: controls what parts of the cell are output to the hidden state
• New cell content: this is the new content to be written to the cell
• Cell state: erase ("forget") some content from the last cell state, and write ("input") some new cell content
• Hidden state: read ("output") some content from the cell

Sigmoid function: all gate values are between 0 and 1. All of these are vectors of the same length n. Gates are applied using the element-wise (or Hadamard) product: ⊙. (The equations are reconstructed below.)
22
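The equations themselves appear on the slide only as an image; the standard LSTM formulation they describe (the parameter names $W_\ast$, $U_\ast$, $b_\ast$ are conventional notation, not taken from the slide) is:

$$f^{(t)} = \sigma\big(W_f h^{(t-1)} + U_f x^{(t)} + b_f\big) \qquad \text{(forget gate)}$$
$$i^{(t)} = \sigma\big(W_i h^{(t-1)} + U_i x^{(t)} + b_i\big) \qquad \text{(input gate)}$$
$$o^{(t)} = \sigma\big(W_o h^{(t-1)} + U_o x^{(t)} + b_o\big) \qquad \text{(output gate)}$$
$$\tilde{c}^{(t)} = \tanh\big(W_c h^{(t-1)} + U_c x^{(t)} + b_c\big) \qquad \text{(new cell content)}$$
$$c^{(t)} = f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot \tilde{c}^{(t)} \qquad \text{(cell state)}$$
$$h^{(t)} = o^{(t)} \odot \tanh\big(c^{(t)}\big) \qquad \text{(hidden state)}$$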
Long Short-Term Memory (LSTM)
You can think of the LSTM equations visually like this:

23  Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Long Short-Term Memory (LSTM)
You can think of the LSTM equations visually like this:

[Diagram: the cell state flows along the top of the cell from $c^{(t-1)}$ to $c^{(t)}$. Compute the forget gate $f_t$ to forget some cell content; compute the input gate $i_t$ and the new cell content $\tilde{c}_t$ to write some new cell content (the + sign is the secret!); compute the output gate $o_t$ to output some cell content to the hidden state, going from $h^{(t-1)}$ to $h^{(t)}$.]

Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
24
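To make the diagram concrete, here is a minimal NumPy sketch of one LSTM step following the gate structure above; the parameter names and sizes are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM timestep, following the gate structure described above.

    x_t:    input vector at time t, shape (d,)
    h_prev: previous hidden state, shape (n,)
    c_prev: previous cell state, shape (n,)
    params: dict of weight matrices W_* (n, n), U_* (n, d) and biases b_* (n,)
    """
    f = sigmoid(params["W_f"] @ h_prev + params["U_f"] @ x_t + params["b_f"])         # forget gate
    i = sigmoid(params["W_i"] @ h_prev + params["U_i"] @ x_t + params["b_i"])         # input gate
    o = sigmoid(params["W_o"] @ h_prev + params["U_o"] @ x_t + params["b_o"])         # output gate
    c_tilde = np.tanh(params["W_c"] @ h_prev + params["U_c"] @ x_t + params["b_c"])   # new cell content
    c = f * c_prev + i * c_tilde      # erase some old content, write some new content
    h = o * np.tanh(c)                # read some content from the cell
    return h, c

# Tiny usage example with random parameters (d = 4 input dims, n = 3 hidden units)
rng = np.random.default_rng(0)
d, n = 4, 3
params = {}
for gate in "fioc":
    params[f"W_{gate}"] = rng.normal(size=(n, n))
    params[f"U_{gate}"] = rng.normal(size=(n, d))
    params[f"b_{gate}"] = np.zeros(n)
h, c = np.zeros(n), np.zeros(n)
for x_t in rng.normal(size=(5, d)):   # run 5 timesteps
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)  # (3,) (3,)
```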
How does LSTM solve vanishing gradients?
• The LSTM architecture makes it much easier for an RNN to
preserve information over many timesteps
• e.g., if the forget gate is set to 1 for a cell dimension and the input gate
set to 0, then the information of that cell is preserved indefinitely.
• In contrast, it’s harder for a vanilla RNN to learn a recurrent weight
matrix Wh that preserves info in the hidden state
• In practice, you get about 100 timesteps rather than about 7

• However, there are alternative ways of creating more direct and linear
pass-through connections in models for long distance dependencies

25
Is vanishing/exploding gradient just an RNN problem?
• No! It can be a problem for all neural architectures (including feed-forward and
convolutional neural networks), especially very deep ones.
• Due to chain rule / choice of nonlinearity function, gradient can become vanishingly small as it
backpropagates
• Thus, lower layers are learned very slowly (i.e., are hard to train)
• Another solution: lots of new deep feedforward/convolutional architectures add more
direct connections (thus allowing the gradient to flow)

For example:
• Residual connections aka “ResNet”
• Also known as skip-connections
• The identity connection
preserves information by default
• This makes deep networks much
easier to train
"Deep Residual Learning for Image Recognition", He et al, 2015. https://arxiv.org/pdf/1512.03385.pdf
26
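As an illustration of the skip-connection idea described above, here is a minimal sketch of a residual block (not He et al.'s actual convolutional architecture; the layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x: the identity (skip) connection preserves information by default
    and lets gradients flow directly through the addition."""
    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.transform(x)   # identity path plus learned transformation

block = ResidualBlock(64)
print(block(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```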
Is vanishing/exploding gradient just an RNN problem?
Other methods:
• Dense connections aka "DenseNet"
  • Directly connect each layer to all future layers!
• Highway connections aka "HighwayNet"
  • Similar to residual connections, but the balance between the identity connection and the transformation layer is controlled by a dynamic gate
  • Inspired by LSTMs, but applied to deep feedforward/convolutional networks (a sketch follows after this slide)

• Conclusion: Though vanishing/exploding gradients are a general problem, RNNs are particularly unstable due to the repeated multiplication by the same weight matrix [Bengio et al., 1994]

"Densely Connected Convolutional Networks", Huang et al., 2017. https://arxiv.org/pdf/1608.06993.pdf
"Highway Networks", Srivastava et al., 2015. https://arxiv.org/pdf/1505.00387.pdf
"Learning Long-Term Dependencies with Gradient Descent is Difficult", Bengio et al., 1994. http://ai.dinfo.unifi.it/paolo//ps/tnn-94-gradient.pdf
27
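A minimal sketch of the highway idea under the same assumptions (illustrative layer sizes, not Srivastava et al.'s exact setup): a dynamic gate decides between the transformed path and the identity path, LSTM-style.

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """y = g * H(x) + (1 - g) * x, where the gate g is computed from x."""
    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Linear(dim, dim)   # H(x), the transformation layer
        self.gate = nn.Linear(dim, dim)        # produces the dynamic gate g in (0, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.transform(x))
        g = torch.sigmoid(self.gate(x))
        return g * h + (1 - g) * x             # gate decides: transformed vs identity

layer = HighwayLayer(64)
print(layer(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```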
3. Other RNN uses: RNNs can be used for sequence tagging
e.g., part-of-speech tagging, named entity recognition

DT JJ NN VBN IN DT NN

the startled cat knocked over the vase

28
RNNs can be used as a sentence encoder model
e.g., for sentiment classification
[Figure: an RNN runs over the sentence "overall I enjoyed the movie a lot"; a sentence encoding is used to predict the label "positive".]
How to compute the sentence encoding?

29
RNNs can be used as a sentence encoder model
e.g., for sentiment classification
[Figure: the same sentiment classifier over "overall I enjoyed the movie a lot", predicting "positive".]
How to compute the sentence encoding?
Basic way: use the final hidden state (the sentence encoding equals the final hidden state).

30
RNNs can be used as a sentence encoder model
e.g., for sentiment classification
[Figure: the same sentiment classifier over "overall I enjoyed the movie a lot", predicting "positive".]
How to compute the sentence encoding?
Usually better: take an element-wise max or mean of all hidden states.

31
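A minimal sketch of the two pooling choices, assuming the RNN's hidden states are already computed (the tensor values here are random placeholders standing in for real RNN outputs):

```python
import torch

# Hidden states for a 7-word sentence, hidden size 64 (placeholder values)
hidden_states = torch.randn(7, 64)           # (seq_len, n)

# Basic way: use the final hidden state as the sentence encoding
enc_final = hidden_states[-1]                # (64,)

# Usually better: element-wise max or mean over all hidden states
enc_max = hidden_states.max(dim=0).values    # (64,)
enc_mean = hidden_states.mean(dim=0)         # (64,)
print(enc_final.shape, enc_max.shape, enc_mean.shape)
```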
RNN-LMs can be used to generate text based on other information
e.g., speech recognition, machine translation, summarization
[Figure: an input (audio) is encoded and used as conditioning for an RNN-LM, which generates "what's the weather" one word at a time, starting from "<START> what's the ...".]

This is an example of a conditional language model.
We'll see Machine Translation as an example in more detail.
32
4. Bidirectional and Multi-layer RNNs: motivation
Task: Sentiment Classification
[Figure: a sentiment classifier over "the movie was terribly exciting !", predicting "positive"; the sentence encoding is an element-wise mean/max of the hidden states.]

We can regard the hidden state over "terribly" as a representation of the word "terribly" in the context of this sentence. We call this a contextual representation.

These contextual representations only contain information about the left context (e.g. "the movie was"). What about the right context? In this example, "exciting" is in the right context, and it modifies the meaning of "terribly" (from negative to positive).

33
Bidirectional RNNs
This contextual representation of "terribly" has both left and right context!

Concatenated
hidden states

Backward RNN

Forward RNN

the movie was terribly exciting !


34
Bidirectional RNNs
On timestep t:
• Forward RNN: $\overrightarrow{h}^{(t)} = \mathrm{RNN}_{\mathrm{FW}}\big(\overrightarrow{h}^{(t-1)}, x^{(t)}\big)$
• Backward RNN: $\overleftarrow{h}^{(t)} = \mathrm{RNN}_{\mathrm{BW}}\big(\overleftarrow{h}^{(t+1)}, x^{(t)}\big)$
• Concatenated hidden states: $h^{(t)} = \big[\overrightarrow{h}^{(t)}; \overleftarrow{h}^{(t)}\big]$

• Here $\mathrm{RNN}_{\mathrm{FW}}$ and $\mathrm{RNN}_{\mathrm{BW}}$ are general notation meaning "compute one forward step of the RNN" – it could be a simple RNN or LSTM computation. Generally, these two RNNs have separate weights.
• We regard the concatenation $h^{(t)}$ as "the hidden state" of a bidirectional RNN. This is what we pass on to the next parts of the network.

35
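A minimal PyTorch sketch of this (sizes are illustrative; PyTorch's nn.LSTM with bidirectional=True runs the forward and backward RNNs and concatenates their hidden states at each position):

```python
import torch
import torch.nn as nn

# A batch of 2 sentences of length 6, with 50-dim word vectors and hidden size 64
embeddings = torch.randn(2, 6, 50)                        # (batch, seq_len, d)
birnn = nn.LSTM(input_size=50, hidden_size=64,
                batch_first=True, bidirectional=True)

outputs, _ = birnn(embeddings)
# For each position, the forward and backward hidden states are concatenated,
# so each "hidden state" of the bidirectional RNN has size 2 * 64 = 128.
print(outputs.shape)  # torch.Size([2, 6, 128])
```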
Bidirectional RNNs
• Note: bidirectional RNNs are only applicable if you have access to the entire input
sequence
• They are not applicable to Language Modeling, because in LM you only have left
context available.

• If you do have entire input sequence (e.g., any kind of encoding), bidirectionality is
powerful (you should use it by default).

• For example, BERT (Bidirectional Encoder Representations from Transformers) is a


powerful pretrained contextual representation system built on bidirectionality.
• You will learn more about transformers, including BERT, in a couple of weeks!

37
Multi-layer RNNs
• RNNs are already “deep” on one dimension (they unroll over many timesteps)

• We can also make them “deep” in another dimension by


applying multiple RNNs – this is a multi-layer RNN.

• This allows the network to compute more complex representations


• The lower RNNs should compute lower-level features and the higher RNNs should
compute higher-level features.

• Multi-layer RNNs are also called stacked RNNs.

38
Multi-layer RNNs
The hidden states from RNN layer i are the inputs to RNN layer i+1

RNN layer 3

RNN layer 2

RNN layer 1

the movie was terribly exciting !


39
Multi-layer RNNs in practice
• Multi-layer or stacked RNNs allow a network to compute more complex representations
– they work better than just having one layer of high-dimensional encodings!
• The lower RNNs should compute lower-level features and the higher RNNs should
compute higher-level features.
• High-performing RNNs are usually multi-layer (but aren’t as deep as convolutional or
feed-forward networks)
• For example: In a 2017 paper, Britz et al. find that for Neural Machine Translation, 2 to 4
layers is best for the encoder RNN, and 4 layers is best for the decoder RNN
• Often 2 layers is a lot better than 1, and 3 might be a little better than 2
• Usually, skip-connections/dense-connections are needed to train deeper RNNs
(e.g., 8 layers)
• Transformer-based networks (e.g., BERT) are usually deeper, like 12 or 24 layers.
• You will learn about Transformers later; they have a lot of skipping-like connections
40  "Massive Exploration of Neural Machine Translation Architectures", Britz et al., 2017. https://arxiv.org/pdf/1703.03906.pdf
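A minimal sketch of a stacked (multi-layer) RNN using PyTorch's num_layers argument (sizes are illustrative): the hidden states of layer i are fed as the inputs to layer i+1.

```python
import torch
import torch.nn as nn

embeddings = torch.randn(2, 6, 50)                 # (batch, seq_len, d)
stacked = nn.LSTM(input_size=50, hidden_size=64,
                  num_layers=3, batch_first=True)  # a 3-layer stacked LSTM

outputs, (h_n, c_n) = stacked(embeddings)
print(outputs.shape)  # torch.Size([2, 6, 64])  -- top-layer hidden states
print(h_n.shape)      # torch.Size([3, 2, 64])  -- final hidden state of each of the 3 layers
```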
LSTMs: real-world success
• In 2013–2015, LSTMs started achieving state-of-the-art results
• Successful tasks include handwriting recognition, speech recognition, machine
translation, parsing, and image captioning, as well as language models
• LSTMs became the dominant approach for most NLP tasks

• Now (2019–2024), Transformers have become dominant for all tasks


• For example, in WMT (a Machine Translation conference + competition):
• In WMT 2014, there were 0 neural machine translation systems (!)
• In WMT 2016, the summary report contains “RNN” 44 times (and these systems won)
• In WMT 2019: “RNN” 7 times, ”Transformer” 105 times

Source: "Findings of the 2016 Conference on Machine Translation (WMT16)", Bojar et al. 2016, http://www.statmt.org/wmt16/pdf/W16-2301.pdf
Source: "Findings of the 2018 Conference on Machine Translation (WMT18)", Bojar et al. 2018, http://www.statmt.org/wmt18/pdf/WMT028.pdf
Source: "Findings of the 2019 Conference on Machine Translation (WMT19)", Barrault et al. 2019, http://www.statmt.org/wmt18/pdf/WMT028.pdf
41
5. Machine Translation
Machine Translation (MT) is the task of translating a sentence x from one language (the
source language) to a sentence y in another language (the target language).

x: L'homme est né libre, et partout il est dans les fers

y: Man is born free, but everywhere he is in chains

– Rousseau
42
1990s-2010s: Statistical Machine Translation
• Core idea: Learn a probabilistic model from data
• Suppose we’re translating French → English.
• We want to find best English sentence y, given French sentence x

• Use Bayes Rule to break this down into two components to be learned
separately:

Translation Model: models how words and phrases should be translated (fidelity). Learned from parallel data.
Language Model: models how to write good English (fluency). Learned from monolingual data.
45
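Written out, the decomposition the slide refers to is the standard SMT objective:

$$\operatorname*{argmax}_{y} P(y \mid x) \;=\; \operatorname*{argmax}_{y}\; \underbrace{P(x \mid y)}_{\text{Translation Model}}\;\underbrace{P(y)}_{\text{Language Model}}$$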
1990s–2010s: Statistical Machine Translation
• SMT was a huge research field
• The best systems were extremely complex
• Hundreds of important details
• Systems had many separately-designed subcomponents
• Lots of feature engineering
• Need to design features to capture particular language phenomena
• Required compiling and maintaining extra resources
• Like tables of equivalent phrases
• Lots of human effort to maintain
• Repeated effort for each language pair!

47
2014: Neural Machine Translation

[Image: "Neural Machine Translation" looming over "MT research" (dramatic reenactment).]

48
6. What is Neural Machine Translation?
• Neural Machine Translation (NMT) is a way to do Machine Translation with a single
end-to-end neural network

• The neural network architecture is called a sequence-to-sequence model (aka seq2seq)


and it involves two RNNs

49
Neural Machine Translation (NMT)
The sequence-to-sequence model

[Figure: the Encoder RNN reads the source sentence (input) "il m' a entarté" and produces an encoding of the source sentence, which provides the initial hidden state for the Decoder RNN. The Decoder RNN is a Language Model that generates the target sentence (output) "he hit me with a pie <END>", conditioned on the encoding; at each step the argmax word is taken. Note: this diagram shows test-time behavior: the decoder output is fed in as the next step's input, starting from <START>.]

50
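A minimal sketch of such a seq2seq system with greedy (argmax) decoding at test time; the vocabulary sizes, dimensions, and <START>/<END> token ids are illustrative assumptions, and the model here is untrained.

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1200, 64, 128   # illustrative sizes
START, END = 1, 2                                      # assumed special-token ids

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(SRC_VOCAB, EMB)
        self.tgt_emb = nn.Embedding(TGT_VOCAB, EMB)
        self.encoder = nn.LSTM(EMB, HID, batch_first=True)
        self.decoder = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)   # scores over the target vocabulary

    @torch.no_grad()
    def translate(self, src_ids, max_len=20):
        # Encoder: its final (hidden, cell) state is the encoding of the source
        # sentence and initializes the decoder (the conditioning).
        _, state = self.encoder(self.src_emb(src_ids))
        word = torch.tensor([[START]])
        output = []
        for _ in range(max_len):
            dec_out, state = self.decoder(self.tgt_emb(word), state)
            word = self.out(dec_out[:, -1]).argmax(dim=-1, keepdim=True)  # greedy argmax
            if word.item() == END:
                break
            output.append(word.item())          # decoder output fed back in next step
        return output

model = Seq2Seq()
print(model.translate(torch.tensor([[5, 42, 7, 99]])))  # arbitrary source token ids
```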
Sequence-to-sequence is versatile!
• The general notion here is an encoder-decoder model
• One neural network takes input and produces a neural representation
• Another network produces output based on that neural representation
• If the input and output are sequences, we call it a seq2seq model

• Sequence-to-sequence is useful for more than just MT


• Many NLP tasks can be phrased as sequence-to-sequence:
• Summarization (long text → short text)
• Dialogue (previous utterances → next utterance)
• Parsing (input text → output parse as sequence)
• Code generation (natural language → Python code)

51
NMT: the first big success story of NLP Deep Learning
Neural Machine Translation went from a fringe research attempt in 2014 to the leading
standard method in 2016

• 2014: First seq2seq paper published [Sutskever et al. 2014]

• 2016: Google Translate switches from SMT to NMT – and by 2018 everyone had
• https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html

• This was amazing!


• SMT systems, built by hundreds of engineers over many years, were outperformed
by NMT systems trained by small groups of engineers in a few months
55
In summary
Lots of new information today! What are some of the practical takeaways?

1. LSTMs are powerful
2. Clip your gradients
3. Use bidirectionality when possible
4. Encoder-decoder Neural Machine Translation systems work very well

[Figure: an encoder-decoder NMT system. The encoder builds up the meaning of the source sentence "Die Proteste waren am Wochenende eskaliert <EOS>" (its hidden-state vectors are shown as columns of numbers); the decoder generates the translation "The protests escalated over the weekend <EOS>", feeding in the last generated word at each step. Conditioning = bottleneck.]

56
Natural Language Processing
with Deep Learning
CS224N/Ling284

Christopher Manning
Lecture 7: Attention and Final Projects; Practical Tips
1. Multi-layer deep encoder-decoder machine translation net
[Sutskever et al. 2014; Luong et al. 2015]

The hidden states from RNN layer i are the inputs to RNN layer i + 1

[Figure: a multi-layer encoder-decoder NMT system. The encoder builds up the meaning of the source sentence "Die Proteste waren am Wochenende eskaliert <EOS>" across several stacked RNN layers (hidden-state vectors shown as columns of numbers); the decoder generates the translation "The protests escalated over the weekend <EOS>", feeding in the last generated word at each step. Conditioning = bottleneck.]

3
How do we evaluate Machine Translation?
The most common way: BLEU (Bilingual Evaluation Understudy). You'll see BLEU in detail in Assignment 3!

• BLEU compares the machine-written translation to one or several human-written translation(s), and computes a similarity score based on:
  • Geometric mean of n-gram precisions (usually for 1-, 2-, 3- and 4-grams)
  • Plus a penalty for too-short system translations

• BLEU is useful but imperfect
  • There are many valid ways to translate a sentence
  • Therefore, a good translation can get a poor BLEU score because it has low n-gram overlap with the human translation ☹

4  Source: "BLEU: a Method for Automatic Evaluation of Machine Translation", Papineni et al., 2002. http://aclweb.org/anthology/P02-1040
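For reference, the standard BLEU definition from Papineni et al. (consistent with the description above, though not spelled out on the slide):

$$\mathrm{BLEU} = \mathrm{BP}\cdot \exp\!\Big(\sum_{n=1}^{4} \tfrac{1}{4}\log p_n\Big), \qquad \mathrm{BP} = \begin{cases} 1 & \text{if } c > r\\ e^{\,1-r/c} & \text{if } c \le r\end{cases}$$

where $p_n$ is the modified n-gram precision, $c$ is the length of the system translation, and $r$ is the effective reference length.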
MT progress over time
[Edinburgh En-De WMT newstest2013 Cased BLEU; NMT 2015 from U. Montréal; NMT 2019 FAIR on newstest2019]

[Chart: Cased BLEU (y-axis, 0–45) by year (2013–2019) for three system types: Phrase-based SMT, Syntax-based SMT, and Neural MT.]
Sources: http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf & http://matrix.statmt.org/
6
2. Why attention? Sequence-to-sequence: the bottleneck problem

[Figure: the encoder RNN reads the source sentence (input) "il a m' entarté" and produces an encoding of the source sentence; the decoder RNN generates the target sentence (output) "he hit me with a pie <END>" from "<START> he hit me with a pie".]

Problems with this architecture?

7
Why attention? Sequence-to-sequence: the bottleneck problem

The single encoding of the source sentence needs to capture all information about the source sentence. Information bottleneck!

8
Attention
• Attention provides a solution to the bottleneck problem.

• Core idea: on each step of the decoder, use direct connection to the encoder to focus
on a particular part of the source sequence

• First, we will show via diagram (no equations), then we will show with equations

9
Sequence-to-sequence with attention
Core idea: on each step of the decoder, use direct connection to the encoder to focus on a
particular part of the source sequence
[Figure: the encoder RNN encodes the source sentence (input) "il a m' entarté"; the decoder RNN starts from <START>. Attention scores are computed as the dot product of the decoder hidden state with each encoder hidden state.]

10
[Slides 11–13 repeat the same diagram while the attention scores (dot products of the decoder hidden state with the encoder hidden states) are computed.]
Sequence-to-sequence with attention

Take the softmax of the attention scores to turn them into a probability distribution (the attention distribution). On this decoder timestep, we're mostly focusing on the first encoder hidden state ("he").

14
Sequence-to-sequence with attention

Use the attention distribution to take a weighted sum of the encoder hidden states. The attention output mostly contains information from the hidden states that received high attention.

15
Sequence-to-sequence with attention

Concatenate the attention output with the decoder hidden state, then use it to compute $\hat{y}_1$ as before; here the decoder generates "he".

16
Sequence-to-sequence with attention

The same procedure on the next decoder step computes $\hat{y}_2$ and generates "hit". Sometimes we take the attention output from the previous step, and also feed it into the decoder (along with the usual decoder input). We do this in Assignment 4.

17
[Slides 18–21 repeat the process on the remaining decoder steps, computing $\hat{y}_3, \dots, \hat{y}_6$ and generating "me", "with", "a", "pie".]
Attention: in equations
• We have encoder hidden states $h_1, \dots, h_N$
• On timestep t, we have decoder hidden state $s_t$
• We get the attention scores $e^t$ for this step
• We take the softmax to get the attention distribution $\alpha^t$ for this step (this is a probability distribution and sums to 1)
• We use $\alpha^t$ to take a weighted sum of the encoder hidden states to get the attention output $a_t$
• Finally, we concatenate the attention output $a_t$ with the decoder hidden state $s_t$ and proceed as in the non-attention seq2seq model

(The full equations are reconstructed below.)
22
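A reconstruction of those equations, using the basic dot-product scoring shown in the diagrams (encoder hidden states $h_1, \dots, h_N \in \mathbb{R}^h$, decoder hidden state $s_t \in \mathbb{R}^h$; the dimension names are conventional rather than taken from the extracted text):

$$e^t = [\,s_t^\top h_1, \; \dots, \; s_t^\top h_N\,] \in \mathbb{R}^N$$
$$\alpha^t = \mathrm{softmax}(e^t) \in \mathbb{R}^N$$
$$a_t = \sum_{i=1}^{N} \alpha^t_i\, h_i \in \mathbb{R}^h$$

and finally the concatenation $[a_t; s_t] \in \mathbb{R}^{2h}$ is used as in the non-attention seq2seq model.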
Attention is great!
• Attention significantly improves NMT performance
• It’s very useful to allow decoder to focus on certain parts of the source
• Attention provides a more “human-like” model of the MT process
• You can look back at the source sentence while translating, rather than needing to remember it all
• Attention solves the bottleneck problem
• Attention allows decoder to look directly at source; bypass bottleneck
• Attention helps with the vanishing gradient problem
• Provides shortcut to faraway states
• Attention provides some interpretability
  • By inspecting the attention distribution, we see what the decoder was focusing on
  • We get (soft) alignment for free!
  • This is cool because we never explicitly trained an alignment system
  • The network just learned alignment by itself

[Figure: attention (alignment) matrix between the source words "il a m' entarté" and the target words "he hit me with a pie".]

23
There are several attention variants
• We have some values $h_1, \dots, h_N \in \mathbb{R}^{d_1}$ and a query $s \in \mathbb{R}^{d_2}$

• Attention always involves:
  1. Computing the attention scores $e \in \mathbb{R}^N$ (there are multiple ways to do this)
  2. Taking the softmax to get the attention distribution $\alpha = \mathrm{softmax}(e) \in \mathbb{R}^N$
  3. Using the attention distribution to take a weighted sum of the values: $a = \sum_{i=1}^{N} \alpha_i h_i \in \mathbb{R}^{d_1}$,
     thus obtaining the attention output $a$ (sometimes called the context vector)
24
Attention variants
(You'll think about the relative advantages/disadvantages of these in Assignment 3!)

There are several ways you can compute the scores $e \in \mathbb{R}^N$ from the values $h_1, \dots, h_N \in \mathbb{R}^{d_1}$ and the query $s \in \mathbb{R}^{d_2}$:

• Basic dot-product attention: $e_i = s^\top h_i \in \mathbb{R}$
  • This assumes $d_1 = d_2$. This is the version we saw earlier.

• Multiplicative attention: $e_i = s^\top W h_i \in \mathbb{R}$   [Luong, Pham, and Manning 2015]
  • Where $W \in \mathbb{R}^{d_2 \times d_1}$ is a weight matrix. Perhaps better called "bilinear attention".

25
Attention variants
(You'll think about the relative advantages/disadvantages of these in Assignment 3!)

• Reduced-rank multiplicative attention: $e_i = s^\top U^\top V h_i = (U s)^\top (V h_i)$
  • For low-rank matrices $U \in \mathbb{R}^{k \times d_2}$, $V \in \mathbb{R}^{k \times d_1}$, with $k \ll d_1, d_2$

• Additive attention: $e_i = v^\top \tanh(W_1 h_i + W_2 s) \in \mathbb{R}$   [Bahdanau, Cho, and Bengio 2014]
  • Where $W_1$, $W_2$ are weight matrices and $v$ is a weight vector
  • $d_3$ (the attention dimensionality) is a hyperparameter
  • "Additive" is a weird/bad name. It's really using a feed-forward neural net layer.
  • Remember this when we look at Transformers next week!

More information: "Deep Learning for NLP Best Practices", Ruder, 2017. http://ruder.io/deep-learning-nlp-best-practices/index.html#attention
"Massive Exploration of Neural Machine Translation Architectures", Britz et al., 2017. https://arxiv.org/pdf/1703.03906.pdf

26
Attention is a general Deep Learning technique
• We’ve seen that attention is a great way to improve the sequence-to-sequence model
for Machine Translation.
• However: You can use attention in many architectures
(not just seq2seq) and many tasks (not just MT)

• More general definition of attention:


• Given a set of vector values, and a vector query, attention is a technique to compute
a weighted sum of the values, dependent on the query.

• We sometimes say that the query attends to the values.


• For example, in the seq2seq + attention model, each decoder hidden state (query)
attends to all the encoder hidden states (values).

27
Attention is a general Deep Learning technique
• More general definition of attention:
• Given a set of vector values, and a vector query, attention is a technique to compute
a weighted sum of the values, dependent on the query.

Intuition:
• The weighted sum is a selective summary of the information contained in the values,
where the query determines which values to focus on.
• Attention is a way to obtain a fixed-size representation of an arbitrary set of
representations (the values), dependent on some other representation (the query).

Upshot:
• Attention has become the powerful, flexible, general way to do pointer and memory
manipulation in all deep learning models. A new idea from after 2010! From NMT!
28
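A minimal NumPy sketch of this general definition, using basic dot-product scoring (the shapes and the scoring choice are illustrative assumptions):

```python
import numpy as np

def attention(query, values):
    """Dot-product attention: a weighted sum of the values, dependent on the query.

    query:  vector of shape (d,)
    values: matrix of shape (N, d) -- e.g. the N encoder hidden states
    Returns the attention output (d,) and the attention distribution (N,).
    """
    scores = values @ query                          # attention scores e, shape (N,)
    scores = scores - scores.max()                   # subtract max for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()    # softmax -> attention distribution
    output = alpha @ values                          # weighted sum of values, shape (d,)
    return output, alpha

# Tiny example: a query attends to 5 value vectors of dimension 4
rng = np.random.default_rng(0)
out, alpha = attention(rng.normal(size=4), rng.normal(size=(5, 4)))
print(out.shape, alpha.sum())  # (4,) 1.0
```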
