
Natural Language Processing

with Deep Learning


CS224N/Ling284

Christopher Manning
Lecture 6: LSTM RNNs and Neural Machine Translation
Long Short-Term Memory RNNs (LSTMs)
• A type of RNN proposed by Hochreiter and Schmidhuber in 1997 as a solution to the problem of vanishing gradients
• Everyone cites that paper, but a crucial part of the modern LSTM actually comes from Gers et al. (2000)!

• Only started to be recognized as promising through the work of Schmidhuber's student Alex Graves c. 2006
  • Work in which he also invented CTC (connectionist temporal classification) for speech recognition

• Became really well known after Hinton brought it to Google in 2013, following Graves's postdoc with Hinton

Hochreiter and Schmidhuber, 1997. Long Short-Term Memory. https://www.bioinf.jku.at/publications/older/2604.pdf
Gers, Schmidhuber, and Cummins, 2000. Learning to Forget: Continual Prediction with LSTM. https://dl.acm.org/doi/10.1162/089976600300015015
Graves, Fernández, Gomez, and Schmidhuber, 2006. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. https://www.cs.toronto.edu/~graves/icml_2006.pdf
20
Long Short-Term Memory RNNs (LSTMs)
• On step t, there is a hidden state $h^{(t)}$ and a cell state $c^{(t)}$
  • Both are vectors of length n
  • The cell stores long-term information
  • The LSTM can read, erase, and write information from the cell
  • The cell becomes conceptually rather like RAM in a computer

• The selection of which information is erased/written/read is controlled by three corresponding gates (gates are computed vectors whose element values lie between 0 and 1)
  • The gates are also vectors of length n
  • On each timestep, each element of the gates can be open (1), closed (0), or somewhere in between
  • The gates are dynamic: their values are computed based on the current context

21
Long Short-Term Memory (LSTM)
We have a sequence of inputs $x^{(t)}$, and we will compute a sequence of hidden states $h^{(t)}$ and cell states $c^{(t)}$. On timestep t:

• Forget gate: controls what is kept vs forgotten, from the previous cell state
• Input gate: controls what parts of the new cell content are written to the cell
• Output gate: controls what parts of the cell are output to the hidden state
• New cell content: this is the new content to be written to the cell
• Cell state: erase ("forget") some content from the last cell state, and write ("input") some new cell content
• Hidden state: read ("output") some content from the cell

Sigmoid function: all gate values are between 0 and 1. All of these are vectors of the same length n. Gates are applied using the element-wise (or Hadamard) product: ⊙. (The equations are reconstructed below.)
22
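The equations themselves appear on the slide only as an image; the standard LSTM formulation they describe (the parameter names $W_\ast$, $U_\ast$, $b_\ast$ are conventional notation, not taken from the slide) is:

$$f^{(t)} = \sigma\big(W_f h^{(t-1)} + U_f x^{(t)} + b_f\big) \qquad \text{(forget gate)}$$
$$i^{(t)} = \sigma\big(W_i h^{(t-1)} + U_i x^{(t)} + b_i\big) \qquad \text{(input gate)}$$
$$o^{(t)} = \sigma\big(W_o h^{(t-1)} + U_o x^{(t)} + b_o\big) \qquad \text{(output gate)}$$
$$\tilde{c}^{(t)} = \tanh\big(W_c h^{(t-1)} + U_c x^{(t)} + b_c\big) \qquad \text{(new cell content)}$$
$$c^{(t)} = f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot \tilde{c}^{(t)} \qquad \text{(cell state)}$$
$$h^{(t)} = o^{(t)} \odot \tanh\big(c^{(t)}\big) \qquad \text{(hidden state)}$$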
Long Short-Term Memory (LSTM)
You can think of the LSTM equations visually like this:

23  Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Long Short-Term Memory (LSTM)
You can think of the LSTM equations visually like this:

[Diagram: the cell state flows along the top of the cell from $c^{(t-1)}$ to $c^{(t)}$. Compute the forget gate $f_t$ to forget some cell content; compute the input gate $i_t$ and the new cell content $\tilde{c}_t$ to write some new cell content (the + sign is the secret!); compute the output gate $o_t$ to output some cell content to the hidden state, going from $h^{(t-1)}$ to $h^{(t)}$.]

Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
24
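To make the diagram concrete, here is a minimal NumPy sketch of one LSTM step following the gate structure above; the parameter names and sizes are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM timestep, following the gate structure described above.

    x_t:    input vector at time t, shape (d,)
    h_prev: previous hidden state, shape (n,)
    c_prev: previous cell state, shape (n,)
    params: dict of weight matrices W_* (n, n), U_* (n, d) and biases b_* (n,)
    """
    f = sigmoid(params["W_f"] @ h_prev + params["U_f"] @ x_t + params["b_f"])         # forget gate
    i = sigmoid(params["W_i"] @ h_prev + params["U_i"] @ x_t + params["b_i"])         # input gate
    o = sigmoid(params["W_o"] @ h_prev + params["U_o"] @ x_t + params["b_o"])         # output gate
    c_tilde = np.tanh(params["W_c"] @ h_prev + params["U_c"] @ x_t + params["b_c"])   # new cell content
    c = f * c_prev + i * c_tilde      # erase some old content, write some new content
    h = o * np.tanh(c)                # read some content from the cell
    return h, c

# Tiny usage example with random parameters (d = 4 input dims, n = 3 hidden units)
rng = np.random.default_rng(0)
d, n = 4, 3
params = {}
for gate in "fioc":
    params[f"W_{gate}"] = rng.normal(size=(n, n))
    params[f"U_{gate}"] = rng.normal(size=(n, d))
    params[f"b_{gate}"] = np.zeros(n)
h, c = np.zeros(n), np.zeros(n)
for x_t in rng.normal(size=(5, d)):   # run 5 timesteps
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)  # (3,) (3,)
```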
How does LSTM solve vanishing gradients?
• The LSTM architecture makes it much easier for an RNN to
preserve information over many timesteps
• e.g., if the forget gate is set to 1 for a cell dimension and the input gate
set to 0, then the information of that cell is preserved indefinitely.
• In contrast, it’s harder for a vanilla RNN to learn a recurrent weight
matrix Wh that preserves info in the hidden state
• In practice, you get about 100 timesteps rather than about 7

• However, there are alternative ways of creating more direct and linear
pass-through connections in models for long distance dependencies

25
Is vanishing/exploding gradient just an RNN problem?
• No! It can be a problem for all neural architectures (including feed-forward and
convolutional neural networks), especially very deep ones.
• Due to chain rule / choice of nonlinearity function, gradient can become vanishingly small as it
backpropagates
• Thus, lower layers are learned very slowly (i.e., are hard to train)
• Another solution: lots of new deep feedforward/convolutional architectures add more
direct connections (thus allowing the gradient to flow)

For example:
• Residual connections aka “ResNet”
• Also known as skip-connections
• The identity connection
preserves information by default
• This makes deep networks much
easier to train
"Deep Residual Learning for Image Recognition", He et al, 2015. https://arxiv.org/pdf/1512.03385.pdf
26
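As an illustration of the skip-connection idea described above, here is a minimal sketch of a residual block (not He et al.'s actual convolutional architecture; the layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x: the identity (skip) connection preserves information by default
    and lets gradients flow directly through the addition."""
    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.transform(x)   # identity path plus learned transformation

block = ResidualBlock(64)
print(block(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```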
Is vanishing/exploding gradient just an RNN problem?
Other methods:
• Dense connections aka "DenseNet"
  • Directly connect each layer to all future layers!
• Highway connections aka "HighwayNet"
  • Similar to residual connections, but the balance between the identity connection and the transformation layer is controlled by a dynamic gate
  • Inspired by LSTMs, but applied to deep feedforward/convolutional networks (a sketch follows after this slide)

• Conclusion: Though vanishing/exploding gradients are a general problem, RNNs are particularly unstable due to the repeated multiplication by the same weight matrix [Bengio et al., 1994]

"Densely Connected Convolutional Networks", Huang et al., 2017. https://arxiv.org/pdf/1608.06993.pdf
"Highway Networks", Srivastava et al., 2015. https://arxiv.org/pdf/1505.00387.pdf
"Learning Long-Term Dependencies with Gradient Descent is Difficult", Bengio et al., 1994. http://ai.dinfo.unifi.it/paolo//ps/tnn-94-gradient.pdf
27
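A minimal sketch of the highway idea under the same assumptions (illustrative layer sizes, not Srivastava et al.'s exact setup): a dynamic gate decides between the transformed path and the identity path, LSTM-style.

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """y = g * H(x) + (1 - g) * x, where the gate g is computed from x."""
    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Linear(dim, dim)   # H(x), the transformation layer
        self.gate = nn.Linear(dim, dim)        # produces the dynamic gate g in (0, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.transform(x))
        g = torch.sigmoid(self.gate(x))
        return g * h + (1 - g) * x             # gate decides: transformed vs identity

layer = HighwayLayer(64)
print(layer(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```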
3. Other RNN uses: RNNs can be used for sequence tagging
e.g., part-of-speech tagging, named entity recognition

DT JJ NN VBN IN DT NN

the startled cat knocked over the vase

28
RNNs can be used as a sentence encoder model
e.g., for sentiment classification
[Figure: an RNN runs over the sentence "overall I enjoyed the movie a lot"; a sentence encoding is used to predict the label "positive".]
How to compute the sentence encoding?

29
RNNs can be used as a sentence encoder model
e.g., for sentiment classification
[Figure: the same sentiment classifier over "overall I enjoyed the movie a lot", predicting "positive".]
How to compute the sentence encoding?
Basic way: use the final hidden state (the sentence encoding equals the final hidden state).

30
RNNs can be used as a sentence encoder model
e.g., for sentiment classification
[Figure: the same sentiment classifier over "overall I enjoyed the movie a lot", predicting "positive".]
How to compute the sentence encoding?
Usually better: take an element-wise max or mean of all hidden states.

31
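A minimal sketch of the two pooling choices, assuming the RNN's hidden states are already computed (the tensor values here are random placeholders standing in for real RNN outputs):

```python
import torch

# Hidden states for a 7-word sentence, hidden size 64 (placeholder values)
hidden_states = torch.randn(7, 64)           # (seq_len, n)

# Basic way: use the final hidden state as the sentence encoding
enc_final = hidden_states[-1]                # (64,)

# Usually better: element-wise max or mean over all hidden states
enc_max = hidden_states.max(dim=0).values    # (64,)
enc_mean = hidden_states.mean(dim=0)         # (64,)
print(enc_final.shape, enc_max.shape, enc_mean.shape)
```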
RNN-LMs can be used to generate text based on other information
e.g., speech recognition, machine translation, summarization
[Figure: an input (audio) is encoded and used as conditioning for an RNN-LM, which generates "what's the weather" one word at a time, starting from "<START> what's the ...".]

This is an example of a conditional language model.
We'll see Machine Translation as an example in more detail.
32
4. Bidirectional and Multi-layer RNNs: motivation
Task: Sentiment Classification
[Figure: a sentiment classifier over "the movie was terribly exciting !", predicting "positive"; the sentence encoding is an element-wise mean/max of the hidden states.]

We can regard the hidden state over "terribly" as a representation of the word "terribly" in the context of this sentence. We call this a contextual representation.

These contextual representations only contain information about the left context (e.g. "the movie was"). What about the right context? In this example, "exciting" is in the right context, and it modifies the meaning of "terribly" (from negative to positive).

33
Bidirectional RNNs
This contextual representation of "terribly" has both left and right context!

Concatenated
hidden states

Backward RNN

Forward RNN

the movie was terribly exciting !


34
Bidirectional RNNs
On timestep t:
• Forward RNN: $\overrightarrow{h}^{(t)} = \mathrm{RNN}_{\mathrm{FW}}\big(\overrightarrow{h}^{(t-1)}, x^{(t)}\big)$
• Backward RNN: $\overleftarrow{h}^{(t)} = \mathrm{RNN}_{\mathrm{BW}}\big(\overleftarrow{h}^{(t+1)}, x^{(t)}\big)$
• Concatenated hidden states: $h^{(t)} = \big[\overrightarrow{h}^{(t)}; \overleftarrow{h}^{(t)}\big]$

• Here $\mathrm{RNN}_{\mathrm{FW}}$ and $\mathrm{RNN}_{\mathrm{BW}}$ are general notation meaning "compute one forward step of the RNN" – it could be a simple RNN or LSTM computation. Generally, these two RNNs have separate weights.
• We regard the concatenation $h^{(t)}$ as "the hidden state" of a bidirectional RNN. This is what we pass on to the next parts of the network.

35
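A minimal PyTorch sketch of this (sizes are illustrative; PyTorch's nn.LSTM with bidirectional=True runs the forward and backward RNNs and concatenates their hidden states at each position):

```python
import torch
import torch.nn as nn

# A batch of 2 sentences of length 6, with 50-dim word vectors and hidden size 64
embeddings = torch.randn(2, 6, 50)                        # (batch, seq_len, d)
birnn = nn.LSTM(input_size=50, hidden_size=64,
                batch_first=True, bidirectional=True)

outputs, _ = birnn(embeddings)
# For each position, the forward and backward hidden states are concatenated,
# so each "hidden state" of the bidirectional RNN has size 2 * 64 = 128.
print(outputs.shape)  # torch.Size([2, 6, 128])
```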
Bidirectional RNNs
• Note: bidirectional RNNs are only applicable if you have access to the entire input
sequence
• They are not applicable to Language Modeling, because in LM you only have left
context available.

• If you do have entire input sequence (e.g., any kind of encoding), bidirectionality is
powerful (you should use it by default).

• For example, BERT (Bidirectional Encoder Representations from Transformers) is a


powerful pretrained contextual representation system built on bidirectionality.
• You will learn more about transformers, including BERT, in a couple of weeks!

37
Multi-layer RNNs
• RNNs are already “deep” on one dimension (they unroll over many timesteps)

• We can also make them “deep” in another dimension by


applying multiple RNNs – this is a multi-layer RNN.

• This allows the network to compute more complex representations


• The lower RNNs should compute lower-level features and the higher RNNs should
compute higher-level features.

• Multi-layer RNNs are also called stacked RNNs.

38
Multi-layer RNNs
The hidden states from RNN layer i are the inputs to RNN layer i+1

RNN layer 3

RNN layer 2

RNN layer 1

the movie was terribly exciting !


39
Multi-layer RNNs in practice
• Multi-layer or stacked RNNs allow a network to compute more complex representations
– they work better than just having one layer of high-dimensional encodings!
• The lower RNNs should compute lower-level features and the higher RNNs should
compute higher-level features.
• High-performing RNNs are usually multi-layer (but aren’t as deep as convolutional or
feed-forward networks)
• For example: In a 2017 paper, Britz et al. find that for Neural Machine Translation, 2 to 4
layers is best for the encoder RNN, and 4 layers is best for the decoder RNN
• Often 2 layers is a lot better than 1, and 3 might be a little better than 2
• Usually, skip-connections/dense-connections are needed to train deeper RNNs
(e.g., 8 layers)
• Transformer-based networks (e.g., BERT) are usually deeper, like 12 or 24 layers.
• You will learn about Transformers later; they have a lot of skipping-like connections
40  "Massive Exploration of Neural Machine Translation Architectures", Britz et al., 2017. https://arxiv.org/pdf/1703.03906.pdf
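A minimal sketch of a stacked (multi-layer) RNN using PyTorch's num_layers argument (sizes are illustrative): the hidden states of layer i are fed as the inputs to layer i+1.

```python
import torch
import torch.nn as nn

embeddings = torch.randn(2, 6, 50)                 # (batch, seq_len, d)
stacked = nn.LSTM(input_size=50, hidden_size=64,
                  num_layers=3, batch_first=True)  # a 3-layer stacked LSTM

outputs, (h_n, c_n) = stacked(embeddings)
print(outputs.shape)  # torch.Size([2, 6, 64])  -- top-layer hidden states
print(h_n.shape)      # torch.Size([3, 2, 64])  -- final hidden state of each of the 3 layers
```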
LSTMs: real-world success
• In 2013–2015, LSTMs started achieving state-of-the-art results
• Successful tasks include handwriting recognition, speech recognition, machine
translation, parsing, and image captioning, as well as language models
• LSTMs became the dominant approach for most NLP tasks

• Now (2019–2024), Transformers have become dominant for all tasks


• For example, in WMT (a Machine Translation conference + competition):
• In WMT 2014, there were 0 neural machine translation systems (!)
• In WMT 2016, the summary report contains “RNN” 44 times (and these systems won)
• In WMT 2019: “RNN” 7 times, ”Transformer” 105 times

Source: "Findings of the 2016 Conference on Machine Translation (WMT16)", Bojar et al. 2016, http://www.statmt.org/wmt16/pdf/W16-2301.pdf
Source: "Findings of the 2018 Conference on Machine Translation (WMT18)", Bojar et al. 2018, http://www.statmt.org/wmt18/pdf/WMT028.pdf
Source: "Findings of the 2019 Conference on Machine Translation (WMT19)", Barrault et al. 2019, http://www.statmt.org/wmt18/pdf/WMT028.pdf
41
5. Machine Translation
Machine Translation (MT) is the task of translating a sentence x from one language (the
source language) to a sentence y in another language (the target language).

x: L'homme est né libre, et partout il est dans les fers

y: Man is born free, but everywhere he is in chains

– Rousseau
42
1990s-2010s: Statistical Machine Translation
• Core idea: Learn a probabilistic model from data
• Suppose we’re translating French → English.
• We want to find best English sentence y, given French sentence x

• Use Bayes Rule to break this down into two components to be learned
separately:

Translation Model: models how words and phrases should be translated (fidelity). Learned from parallel data.
Language Model: models how to write good English (fluency). Learned from monolingual data.
45
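Written out, the decomposition the slide refers to is the standard SMT objective:

$$\operatorname*{argmax}_{y} P(y \mid x) \;=\; \operatorname*{argmax}_{y}\; \underbrace{P(x \mid y)}_{\text{Translation Model}}\;\underbrace{P(y)}_{\text{Language Model}}$$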
1990s–2010s: Statistical Machine Translation
• SMT was a huge research field
• The best systems were extremely complex
• Hundreds of important details
• Systems had many separately-designed subcomponents
• Lots of feature engineering
• Need to design features to capture particular language phenomena
• Required compiling and maintaining extra resources
• Like tables of equivalent phrases
• Lots of human effort to maintain
• Repeated effort for each language pair!

47
2014: Neural Machine Translation

[Image: "Neural Machine Translation" looming over "MT research" (dramatic reenactment).]

48
6. What is Neural Machine Translation?
• Neural Machine Translation (NMT) is a way to do Machine Translation with a single
end-to-end neural network

• The neural network architecture is called a sequence-to-sequence model (aka seq2seq)


and it involves two RNNs

49
Neural Machine Translation (NMT)
The sequence-to-sequence model

[Figure: the Encoder RNN reads the source sentence (input) "il m' a entarté" and produces an encoding of the source sentence, which provides the initial hidden state for the Decoder RNN. The Decoder RNN is a Language Model that generates the target sentence (output) "he hit me with a pie <END>", conditioned on the encoding; at each step the argmax word is taken. Note: this diagram shows test-time behavior: the decoder output is fed in as the next step's input, starting from <START>.]

50
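A minimal sketch of such a seq2seq system with greedy (argmax) decoding at test time; the vocabulary sizes, dimensions, and <START>/<END> token ids are illustrative assumptions, and the model here is untrained.

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1200, 64, 128   # illustrative sizes
START, END = 1, 2                                      # assumed special-token ids

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(SRC_VOCAB, EMB)
        self.tgt_emb = nn.Embedding(TGT_VOCAB, EMB)
        self.encoder = nn.LSTM(EMB, HID, batch_first=True)
        self.decoder = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)   # scores over the target vocabulary

    @torch.no_grad()
    def translate(self, src_ids, max_len=20):
        # Encoder: its final (hidden, cell) state is the encoding of the source
        # sentence and initializes the decoder (the conditioning).
        _, state = self.encoder(self.src_emb(src_ids))
        word = torch.tensor([[START]])
        output = []
        for _ in range(max_len):
            dec_out, state = self.decoder(self.tgt_emb(word), state)
            word = self.out(dec_out[:, -1]).argmax(dim=-1, keepdim=True)  # greedy argmax
            if word.item() == END:
                break
            output.append(word.item())          # decoder output fed back in next step
        return output

model = Seq2Seq()
print(model.translate(torch.tensor([[5, 42, 7, 99]])))  # arbitrary source token ids
```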
Sequence-to-sequence is versatile!
• The general notion here is an encoder-decoder model
• One neural network takes input and produces a neural representation
• Another network produces output based on that neural representation
• If the input and output are sequences, we call it a seq2seq model

• Sequence-to-sequence is useful for more than just MT


• Many NLP tasks can be phrased as sequence-to-sequence:
• Summarization (long text → short text)
• Dialogue (previous utterances → next utterance)
• Parsing (input text → output parse as sequence)
• Code generation (natural language → Python code)

51
NMT: the first big success story of NLP Deep Learning
Neural Machine Translation went from a fringe research attempt in 2014 to the leading
standard method in 2016

• 2014: First seq2seq paper published [Sutskever et al. 2014]

• 2016: Google Translate switches from SMT to NMT – and by 2018 everyone had
• https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html

• This was amazing!


• SMT systems, built by hundreds of engineers over many years, were outperformed
by NMT systems trained by small groups of engineers in a few months
55
In summary
Lots of new information today! What are some of the practical takeaways?

1. LSTMs are powerful
2. Clip your gradients
3. Use bidirectionality when possible
4. Encoder-decoder Neural Machine Translation systems work very well

[Figure: an encoder-decoder NMT system. The encoder builds up the meaning of the source sentence "Die Proteste waren am Wochenende eskaliert <EOS>" (its hidden-state vectors are shown as columns of numbers); the decoder generates the translation "The protests escalated over the weekend <EOS>", feeding in the last generated word at each step. Conditioning = bottleneck.]

56
Natural Language Processing
with Deep Learning
CS224N/Ling284

Christopher Manning
Lecture 7: Attention and Final Projects; Practical Tips
1. Multi-layer deep encoder-decoder machine translation net
[Sutskever et al. 2014; Luong et al. 2015]

The hidden states from RNN layer i are the inputs to RNN layer i + 1

[Figure: a multi-layer encoder-decoder NMT system. The encoder builds up the meaning of the source sentence "Die Proteste waren am Wochenende eskaliert <EOS>" across several stacked RNN layers (hidden-state vectors shown as columns of numbers); the decoder generates the translation "The protests escalated over the weekend <EOS>", feeding in the last generated word at each step. Conditioning = bottleneck.]

3
How do we evaluate Machine Translation?
The most common way: BLEU (Bilingual Evaluation Understudy). You'll see BLEU in detail in Assignment 3!

• BLEU compares the machine-written translation to one or several human-written translation(s), and computes a similarity score based on:
  • Geometric mean of n-gram precisions (usually for 1-, 2-, 3- and 4-grams)
  • Plus a penalty for too-short system translations

• BLEU is useful but imperfect
  • There are many valid ways to translate a sentence
  • Therefore, a good translation can get a poor BLEU score because it has low n-gram overlap with the human translation ☹

4  Source: "BLEU: a Method for Automatic Evaluation of Machine Translation", Papineni et al., 2002. http://aclweb.org/anthology/P02-1040
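For reference, the standard BLEU definition from Papineni et al. (consistent with the description above, though not spelled out on the slide):

$$\mathrm{BLEU} = \mathrm{BP}\cdot \exp\!\Big(\sum_{n=1}^{4} \tfrac{1}{4}\log p_n\Big), \qquad \mathrm{BP} = \begin{cases} 1 & \text{if } c > r\\ e^{\,1-r/c} & \text{if } c \le r\end{cases}$$

where $p_n$ is the modified n-gram precision, $c$ is the length of the system translation, and $r$ is the effective reference length.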
MT progress over time
[Edinburgh En-De WMT newstest2013 Cased BLEU; NMT 2015 from U. Montréal; NMT 2019 FAIR on newstest2019]

[Chart: Cased BLEU (y-axis, 0–45) by year (2013–2019) for three system types: Phrase-based SMT, Syntax-based SMT, and Neural MT.]
Sources: http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf & http://matrix.statmt.org/
6
2. Why attention? Sequence-to-sequence: the bottleneck problem

[Figure: the encoder RNN reads the source sentence (input) "il a m' entarté" and produces an encoding of the source sentence; the decoder RNN generates the target sentence (output) "he hit me with a pie <END>" from "<START> he hit me with a pie".]

Problems with this architecture?

7
Why attention? Sequence-to-sequence: the bottleneck problem

The single encoding of the source sentence needs to capture all information about the source sentence. Information bottleneck!

8
Attention
• Attention provides a solution to the bottleneck problem.

• Core idea: on each step of the decoder, use direct connection to the encoder to focus
on a particular part of the source sequence

• First, we will show via diagram (no equations), then we will show with equations

9
Sequence-to-sequence with attention
Core idea: on each step of the decoder, use direct connection to the encoder to focus on a
particular part of the source sequence
[Figure: the encoder RNN encodes the source sentence (input) "il a m' entarté"; the decoder RNN starts from <START>. Attention scores are computed as the dot product of the decoder hidden state with each encoder hidden state.]

10
[Slides 11–13 repeat the same diagram while the attention scores (dot products of the decoder hidden state with the encoder hidden states) are computed.]
Sequence-to-sequence with attention

Take the softmax of the attention scores to turn them into a probability distribution (the attention distribution). On this decoder timestep, we're mostly focusing on the first encoder hidden state ("he").

14
Sequence-to-sequence with attention

Use the attention distribution to take a weighted sum of the encoder hidden states. The attention output mostly contains information from the hidden states that received high attention.

15
Sequence-to-sequence with attention

Concatenate the attention output with the decoder hidden state, then use it to compute $\hat{y}_1$ as before; here the decoder generates "he".

16
Sequence-to-sequence with attention

The same procedure on the next decoder step computes $\hat{y}_2$ and generates "hit". Sometimes we take the attention output from the previous step, and also feed it into the decoder (along with the usual decoder input). We do this in Assignment 4.

17
[Slides 18–21 repeat the process on the remaining decoder steps, computing $\hat{y}_3, \dots, \hat{y}_6$ and generating "me", "with", "a", "pie".]
Attention: in equations
• We have encoder hidden states $h_1, \dots, h_N$
• On timestep t, we have decoder hidden state $s_t$
• We get the attention scores $e^t$ for this step
• We take the softmax to get the attention distribution $\alpha^t$ for this step (this is a probability distribution and sums to 1)
• We use $\alpha^t$ to take a weighted sum of the encoder hidden states to get the attention output $a_t$
• Finally, we concatenate the attention output $a_t$ with the decoder hidden state $s_t$ and proceed as in the non-attention seq2seq model

(The full equations are reconstructed below.)
22
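A reconstruction of those equations, using the basic dot-product scoring shown in the diagrams (encoder hidden states $h_1, \dots, h_N \in \mathbb{R}^h$, decoder hidden state $s_t \in \mathbb{R}^h$; the dimension names are conventional rather than taken from the extracted text):

$$e^t = [\,s_t^\top h_1, \; \dots, \; s_t^\top h_N\,] \in \mathbb{R}^N$$
$$\alpha^t = \mathrm{softmax}(e^t) \in \mathbb{R}^N$$
$$a_t = \sum_{i=1}^{N} \alpha^t_i\, h_i \in \mathbb{R}^h$$

and finally the concatenation $[a_t; s_t] \in \mathbb{R}^{2h}$ is used as in the non-attention seq2seq model.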
Attention is great!
• Attention significantly improves NMT performance
• It’s very useful to allow decoder to focus on certain parts of the source
• Attention provides a more “human-like” model of the MT process
• You can look back at the source sentence while translating, rather than needing to remember it all
• Attention solves the bottleneck problem
• Attention allows decoder to look directly at source; bypass bottleneck
• Attention helps with the vanishing gradient problem
• Provides shortcut to faraway states
• Attention provides some interpretability
  • By inspecting the attention distribution, we see what the decoder was focusing on
  • We get (soft) alignment for free!
  • This is cool because we never explicitly trained an alignment system
  • The network just learned alignment by itself

[Figure: attention (alignment) matrix between the source words "il a m' entarté" and the target words "he hit me with a pie".]

23
There are several attention variants
• We have some values $h_1, \dots, h_N \in \mathbb{R}^{d_1}$ and a query $s \in \mathbb{R}^{d_2}$

• Attention always involves:
  1. Computing the attention scores $e \in \mathbb{R}^N$ (there are multiple ways to do this)
  2. Taking the softmax to get the attention distribution $\alpha = \mathrm{softmax}(e) \in \mathbb{R}^N$
  3. Using the attention distribution to take a weighted sum of the values: $a = \sum_{i=1}^{N} \alpha_i h_i \in \mathbb{R}^{d_1}$,
     thus obtaining the attention output $a$ (sometimes called the context vector)
24
Attention variants
(You'll think about the relative advantages/disadvantages of these in Assignment 3!)

There are several ways you can compute the scores $e \in \mathbb{R}^N$ from the values $h_1, \dots, h_N \in \mathbb{R}^{d_1}$ and the query $s \in \mathbb{R}^{d_2}$:

• Basic dot-product attention: $e_i = s^\top h_i \in \mathbb{R}$
  • This assumes $d_1 = d_2$. This is the version we saw earlier.

• Multiplicative attention: $e_i = s^\top W h_i \in \mathbb{R}$   [Luong, Pham, and Manning 2015]
  • Where $W \in \mathbb{R}^{d_2 \times d_1}$ is a weight matrix. Perhaps better called "bilinear attention".

25
Attention variants
(You'll think about the relative advantages/disadvantages of these in Assignment 3!)

• Reduced-rank multiplicative attention: $e_i = s^\top U^\top V h_i = (U s)^\top (V h_i)$
  • For low-rank matrices $U \in \mathbb{R}^{k \times d_2}$, $V \in \mathbb{R}^{k \times d_1}$, with $k \ll d_1, d_2$

• Additive attention: $e_i = v^\top \tanh(W_1 h_i + W_2 s) \in \mathbb{R}$   [Bahdanau, Cho, and Bengio 2014]
  • Where $W_1$, $W_2$ are weight matrices and $v$ is a weight vector
  • $d_3$ (the attention dimensionality) is a hyperparameter
  • "Additive" is a weird/bad name. It's really using a feed-forward neural net layer.
  • Remember this when we look at Transformers next week!

More information: "Deep Learning for NLP Best Practices", Ruder, 2017. http://ruder.io/deep-learning-nlp-best-practices/index.html#attention
"Massive Exploration of Neural Machine Translation Architectures", Britz et al., 2017. https://arxiv.org/pdf/1703.03906.pdf

26
Attention is a general Deep Learning technique
• We’ve seen that attention is a great way to improve the sequence-to-sequence model
for Machine Translation.
• However: You can use attention in many architectures
(not just seq2seq) and many tasks (not just MT)

• More general definition of attention:


• Given a set of vector values, and a vector query, attention is a technique to compute
a weighted sum of the values, dependent on the query.

• We sometimes say that the query attends to the values.


• For example, in the seq2seq + attention model, each decoder hidden state (query)
attends to all the encoder hidden states (values).

27
Attention is a general Deep Learning technique
• More general definition of attention:
• Given a set of vector values, and a vector query, attention is a technique to compute
a weighted sum of the values, dependent on the query.

Intuition:
• The weighted sum is a selective summary of the information contained in the values,
where the query determines which values to focus on.
• Attention is a way to obtain a fixed-size representation of an arbitrary set of
representations (the values), dependent on some other representation (the query).

Upshot:
• Attention has become the powerful, flexible, general way to do pointer and memory
manipulation in all deep learning models. A new idea from after 2010! From NMT!
28
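A minimal NumPy sketch of this general definition, using basic dot-product scoring (the shapes and the scoring choice are illustrative assumptions):

```python
import numpy as np

def attention(query, values):
    """Dot-product attention: a weighted sum of the values, dependent on the query.

    query:  vector of shape (d,)
    values: matrix of shape (N, d) -- e.g. the N encoder hidden states
    Returns the attention output (d,) and the attention distribution (N,).
    """
    scores = values @ query                          # attention scores e, shape (N,)
    scores = scores - scores.max()                   # subtract max for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()    # softmax -> attention distribution
    output = alpha @ values                          # weighted sum of values, shape (d,)
    return output, alpha

# Tiny example: a query attends to 5 value vectors of dimension 4
rng = np.random.default_rng(0)
out, alpha = attention(rng.normal(size=4), rng.normal(size=(5, 4)))
print(out.shape, alpha.sum())  # (4,) 1.0
```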
