XCS224N_Module4_Slides

The document discusses language models and recurrent neural networks (RNNs) in the context of natural language processing. It covers topics such as regularization techniques, the importance of parameter initialization, and the training of RNNs, including challenges like exploding and vanishing gradients. Additionally, it introduces LSTMs as a solution to these problems and outlines the structure and advantages of RNNs for language modeling.


Natural Language Processing

with Deep Learning


CS224N/Ling284

Christopher Manning
Lecture 5: Language Models and Recurrent Neural Networks
(oh, and finish neural dependency parsing 🙂)
2. A bit more about neural networks

16
We have models with many parameters! Regularization!
• A full loss function includes regularization over all parameters 𝜃, e.g., L2 regularization:
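  For example, the L2-regularized loss has the standard form J(θ) = (original data loss) + λ ∑_k θ_k², where λ controls the strength of the regularization.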

• Classic view: Regularization works to prevent overfitting when we have a lot of features
(or later a very powerful/deep model, etc.)
• Now: Regularization produces models that generalize well when we have a “big” model
• We do not care that our models overfit on the training data: even though they are hugely overfit, they still generalize well

[Figure: training error and test error vs. model “power”; training error keeps decreasing while test error rises again in the overfitting region]
17
Dropout (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov 2012/JMLR 2014)
Preventing Feature Co-adaptation = Good Regularization Method!
• Training time: at each instance of evaluation (in online SGD-training), randomly set
50% of the inputs to each neuron to 0
• Test time: halve the model weights (since now twice as many inputs are active)
• (Except usually only drop first layer inputs a little (~15%) or not at all)
• This prevents feature co-adaptation: a feature cannot be useful only in the presence of particular other features
• In a single layer: A kind of middle-ground between Naïve Bayes (where all feature
weights are set independently) and logistic regression models (where weights are
set in the context of all others)
• Can be thought of as a form of model bagging (i.e., like an ensemble model)
• Nowadays usually thought of as strong, feature-dependent regularizer
[Wager, Wang, & Liang 2013]
18
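A minimal numpy sketch of the idea, assuming the now-common “inverted dropout” formulation (rescale at training time instead of halving weights at test time); names and rates here are illustrative:

    import numpy as np

    def dropout(x, p_drop=0.5, train=True):
        # Training: randomly zero each input with probability p_drop, then rescale,
        # so that test time needs no weight rescaling at all.
        if not train or p_drop == 0.0:
            return x
        mask = (np.random.rand(*x.shape) >= p_drop) / (1.0 - p_drop)
        return x * mask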
“Vectorization”
• E.g., looping over word vectors versus concatenating them all into one large matrix
and then multiplying the softmax weights with that matrix:

• 1000 loops, best of 3: 639 µs per loop


10000 loops, best of 3: 53.8 µs per loop ← Now using a single C × N matrix
• Matrices are awesome!!! Always try to use vectors and matrices rather than for loops!
• The speed gain goes from 1 to 2 orders of magnitude with GPUs!
19
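A small numpy sketch of the comparison being timed above (sizes are illustrative assumptions):

    import numpy as np

    C, d, N = 5, 300, 10000                      # classes, vector dim, number of word vectors
    W = np.random.rand(C, d)                     # softmax weights
    wordvecs = [np.random.rand(d, 1) for _ in range(N)]

    # Slow: loop over word vectors, one matrix-vector product each
    slow = [W.dot(v) for v in wordvecs]

    # Fast: concatenate into one d x N matrix, then a single matrix-matrix product
    fast = W.dot(np.hstack(wordvecs))            # shape (C, N)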
Non-linearities, old and new
• logistic (“sigmoid”), tanh, hard tanh, ReLU (Rectified Linear Unit): rect(z) = max(z, 0)
• Newer variants: Leaky ReLU, Parametric ReLU, Swish [Ramachandran, Zoph & Le 2017]
• tanh is just a rescaled and shifted sigmoid (2× as steep, range [−1, 1]): tanh(z) = 2·logistic(2z) − 1
• Both logistic and tanh are still used in various places (e.g., to get a probability), but are no longer the defaults for making deep networks
• For building a deep network, the first thing you should try is ReLU — it trains quickly and performs well due to good gradient backflow
Parameter Initialization
• You normally must initialize weights to small random values (i.e., not zero matrices!)
• To avoid symmetries that prevent learning/specialization
• Initialize hidden layer biases to 0 and output (or reconstruction) biases to optimal value
if weights were 0 (e.g., mean target or inverse sigmoid of mean target)
• Initialize all other weights ~ Uniform(−r, r), with r chosen so numbers get neither too big nor too small [later, the need for this is removed with the use of layer normalization]
• Xavier initialization has variance inversely proportional to fan-in nin (previous layer size)
and fan-out nout (next layer size):
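  The Xavier/Glorot formula is Var(W_ij) = 2 / (n_in + n_out). A minimal numpy sketch for the uniform version (Uniform(−r, r) has variance r²/3, so r = sqrt(6 / (n_in + n_out))):

    import numpy as np

    def xavier_uniform(n_in, n_out):
        r = np.sqrt(6.0 / (n_in + n_out))        # gives Var(W_ij) = 2 / (n_in + n_out)
        return np.random.uniform(-r, r, size=(n_in, n_out))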
Optimizers
• Usually, plain SGD will work just fine!
• However, getting good results will often require hand-tuning the learning rate
• See next slide
• For more complex nets and situations, or just to avoid worry, you often do better with
one of a family of more sophisticated “adaptive” optimizers that scale the parameter
adjustment by an accumulated gradient.
• These models give differential per-parameter learning rates
• Adagrad
• RMSprop
• Adam ← A fairly good, safe place to begin in many cases
• SparseAdam
• …
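A minimal PyTorch sketch of swapping in such an optimizer (assuming a PyTorch model; the model here is just a placeholder):

    import torch

    model = torch.nn.Linear(100, 10)                               # placeholder model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)      # a safe place to begin

    # Inside the training loop:
    #   optimizer.zero_grad()
    #   loss.backward()
    #   optimizer.step()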
Learning Rates
• You can just use a constant learning rate. Start around lr = 0.001?
• It must be roughly the right order of magnitude – try powers of 10
• Too big: model may diverge or not converge
• Too small: your model may not have trained by the assignment deadline
• Better results can generally be obtained by allowing learning rates to decrease as you
train
• By hand: halve the learning rate every k epochs
• An epoch = a pass through the data (shuffled or sampled – not in same order each time)
• By a formula: lr = lr₀ · e^(−kt), for epoch t
• There are fancier methods like cyclic learning rates (q.v.)
• Fancier optimizers still use a learning rate but it may be an initial rate that the
optimizer shrinks – so you may want to start with a higher learning rate
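A small sketch of the exponential decay formula above, plus the equivalent built-in PyTorch scheduler (k is an illustrative decay constant; the optimizer is assumed to come from the earlier sketch):

    import math

    lr0, k = 1e-3, 0.1

    def lr_at_epoch(t):
        return lr0 * math.exp(-k * t)              # lr = lr0 * e^(-kt)

    # Equivalent with PyTorch, stepped once per epoch (gamma = e^(-k)):
    # scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=math.exp(-k))
    # scheduler.step()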
3. Language Modeling + RNNs

24
Language Modeling
• Language Modeling is the task of predicting what word comes next
the students opened their ______     (books? laptops? exams? minds?)

• More formally: given a sequence of words x^(1), x^(2), …, x^(t), compute the probability distribution of the next word x^(t+1):

  P(x^(t+1) | x^(t), …, x^(1))

  where x^(t+1) can be any word in the vocabulary V

• A system that does this is called a Language Model

25
Language Modeling
• You can also think of a Language Model as a system that
assigns probability to a piece of text

• For example, if we have some text x^(1), …, x^(T), then the probability of this text (according to the Language Model) is:

  P(x^(1), …, x^(T)) = P(x^(1)) × P(x^(2) | x^(1)) × … × P(x^(T) | x^(T−1), …, x^(1))
                     = ∏_{t=1}^{T} P(x^(t) | x^(t−1), …, x^(1))

  (each conditional term is what our LM provides)

26
You use Language Models every day!

27
You use Language Models every day!

28
n-gram Language Models
the students opened their ______

• Question: How to learn a Language Model?


• Answer (pre-Deep Learning): learn an n-gram Language Model!

• Definition: An n-gram is a chunk of n consecutive words.


• unigrams: “the”, “students”, “opened”, “their”
• bigrams: “the students”, “students opened”, “opened their”
• trigrams: “the students opened”, “students opened their”
• 4-grams: “the students opened their”

• Idea: Collect statistics about how frequent different n-grams are and use these to
predict next word.
29
n-gram Language Models
• First we make a Markov assumption: x^(t+1) depends only on the preceding n−1 words:

  P(x^(t+1) | x^(t), …, x^(1)) = P(x^(t+1) | x^(t), …, x^(t−n+2))                    (assumption)

  and by the definition of conditional probability, this is the prob of an n-gram divided by the prob of an (n−1)-gram:

  = P(x^(t+1), x^(t), …, x^(t−n+2)) / P(x^(t), …, x^(t−n+2))                         (definition of conditional prob)

• Question: How do we get these n-gram and (n−1)-gram probabilities?

• Answer: By counting them in some large corpus of text!

  ≈ count(x^(t+1), x^(t), …, x^(t−n+2)) / count(x^(t), …, x^(t−n+2))                 (statistical approximation)
30
n-gram Language Models: Example
Suppose we are learning a 4-gram Language Model.
as the proctor started the clock, the students opened their _____
(discard “as the proctor started the clock,”; condition on the last n−1 = 3 words, “students opened their”)

For example, suppose that in the corpus:
• “students opened their” occurred 1000 times
• “students opened their books” occurred 400 times
  → P(books | students opened their) = 0.4
• “students opened their exams” occurred 100 times
  → P(exams | students opened their) = 0.1

(Should we have discarded the “proctor” context?)
31
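A minimal count-based sketch of this kind of n-gram model (assuming a whitespace-tokenized corpus; the toy corpus and helper names are illustrative, not from the lecture):

    from collections import Counter, defaultdict

    def train_ngram_lm(tokens, n=4):
        """Count n-grams: P(next | previous n-1 words) = count(n-gram) / count((n-1)-gram)."""
        counts = defaultdict(Counter)
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            counts[context][tokens[i + n - 1]] += 1
        return counts

    def next_word_probs(counts, context):
        c = counts[tuple(context)]
        total = sum(c.values())
        return {w: k / total for w, k in c.items()} if total else {}

    tokens = "the students opened their books because the students opened their exams".split()
    lm = train_ngram_lm(tokens, n=4)
    print(next_word_probs(lm, ["students", "opened", "their"]))   # {'books': 0.5, 'exams': 0.5}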
Sparsity Problems with n-gram Language Models
Sparsity Problem 1
• Problem: What if “students opened their 𝑤” never occurred in the data? Then 𝑤 has probability 0!
• (Partial) Solution: Add a small 𝛿 to the count for every 𝑤 ∈ 𝑉. This is called smoothing.

Sparsity Problem 2
• Problem: What if “students opened their” never occurred in the data? Then we can’t calculate the probability for any 𝑤!
• (Partial) Solution: Just condition on “opened their” instead. This is called backoff.

Note: Increasing n makes sparsity problems worse.
Typically, we can’t have n bigger than 5.
32
Storage Problems with n-gram Language Models

Storage: Need to store counts for all n-grams you saw in the corpus.

Increasing n or increasing the corpus increases model size!

33
n-gram Language Models in practice
• You can build a simple trigram Language Model over a 1.7 million word corpus (Reuters: business and financial news) in a few seconds on your laptop*

today the _______   →  get probability distribution:
  company 0.153
  bank    0.153
  price   0.077
  italian 0.039
  emirate 0.039

Sparsity problem: not much granularity in the probability distribution.
Otherwise, seems reasonable!

* Try for yourself: https://nlpforhackers.io/language-models/


34
Generating text with a n-gram Language Model
You can also use a Language Model to generate text

today the _______   (condition on this → get probability distribution → sample)
  company 0.153
  bank    0.153
  price   0.077  ← sample
  italian 0.039
  emirate 0.039

35
Generating text with a n-gram Language Model
You can also use a Language Model to generate text

today the price _______   (condition on this → get probability distribution → sample)
  of   0.308  ← sample
  for  0.050
  it   0.046
  to   0.046
  is   0.031

36
Generating text with a n-gram Language Model
You can also use a Language Model to generate text

today the price of _______   (condition on this → get probability distribution → sample)
  the  0.072
  18   0.043
  oil  0.043
  its  0.036
  gold 0.018  ← sample

37
Generating text with a n-gram Language Model
You can also use a Language Model to generate text

today the price of gold per ton , while production of shoe


lasts and shoe industry , the bank intervened just after it
considered and rejected an imf demand to rebuild depleted
european stocks , sept 30 end primary 76 cts a share .

Surprisingly grammatical!

…but incoherent. We need to consider more than three words at a time if we want to model language well.

But increasing n worsens the sparsity problem and increases model size…
38
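Generation is then just repeated sampling from these distributions. A minimal sketch continuing the count-based example above (smoothing and backoff omitted):

    import random

    def generate(counts, context, n_words=20):
        out = list(context)
        for _ in range(n_words):
            probs = next_word_probs(counts, out[-len(context):])   # condition on last n-1 words
            if not probs:
                break                                              # backoff/smoothing would go here
            words, weights = zip(*probs.items())
            out.append(random.choices(words, weights=weights)[0])  # sample the next word
        return " ".join(out)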
How to build a neural Language Model?
• Recall the Language Modeling task:
• Input: sequence of words
• Output: prob dist of the next word

• How about a window-based neural model?


• We saw this applied to Named Entity Recognition in Lecture 3:
LOCATION

museums in Paris are amazing


39
A fixed-window neural Language Model

as the proctor started the clock the students opened their ______
(discard “as the proctor started the clock”; keep a fixed window: “the students opened their”)
40
A fixed-window neural Language Model
[Diagram: words / one-hot vectors (“the students opened their”) → concatenated word embeddings → hidden layer → output distribution over the vocabulary (books, laptops, …, a, zoo)]

41
A fixed-window neural Language Model
Approximately: Y. Bengio, et al. (2000/2003): A Neural Probabilistic Language Model
Improvements over n-gram LM:
• No sparsity problem
• Don’t need to store all observed n-grams

Remaining problems:
• Fixed window is too small
• Enlarging the window enlarges 𝑊
• Window can never be large enough!
• 𝑥^(1) and 𝑥^(2) are multiplied by completely different weights in 𝑊. No symmetry in how the inputs are processed.

We need a neural architecture that can process any length input
42
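A minimal PyTorch sketch of a fixed-window neural LM of this shape (sizes and names are illustrative assumptions, not the lecture’s exact model):

    import torch
    import torch.nn as nn

    class FixedWindowLM(nn.Module):
        def __init__(self, vocab_size, emb_dim=100, window=4, hidden=200):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.W = nn.Linear(window * emb_dim, hidden)   # note: W grows with the window size
            self.U = nn.Linear(hidden, vocab_size)

        def forward(self, window_ids):                     # (batch, window) word indices
            e = self.embed(window_ids).flatten(1)          # concatenate the window's embeddings
            h = torch.tanh(self.W(e))                      # hidden layer
            return self.U(h)                               # logits over the next word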
Recurrent Neural Networks (RNN): a family of neural architectures
Core idea: Apply the same weights 𝑊 repeatedly

[Diagram: an input sequence of any length → hidden states, each computed from the previous hidden state and the current input using the same weights 𝑊 → outputs (optional) at each step]
43
A Simple RNN Language Model

[Diagram: words / one-hot vectors (“the students opened their”) → word embeddings e^(t) → hidden states h^(t) = σ(W_h h^(t−1) + W_e e^(t) + b_1), where h^(0) is the initial hidden state → output distribution ŷ^(t) = softmax(U h^(t) + b_2) over the vocabulary (books, laptops, …, a, zoo)]

Note: this input sequence could be much longer now!
44
RNN Language Models

RNN Advantages:
• Can process any length input
• Computation for step t can (in theory) use information from many steps back
• Model size doesn’t increase for longer input context
• Same weights applied on every timestep, so there is symmetry in how inputs are processed

RNN Disadvantages:
• Recurrent computation is slow
• In practice, difficult to access information from many steps back
(More on these later in the course)
45
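A minimal PyTorch sketch of an RNN language model along these lines (sizes are illustrative; the built-in nn.RNN applies the same weights at every timestep):

    import torch
    import torch.nn as nn

    class RNNLM(nn.Module):
        def __init__(self, vocab_size, emb_dim=100, hidden=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.rnn = nn.RNN(emb_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, ids, h0=None):                   # ids: (batch, seq_len)
            h, h_last = self.rnn(self.embed(ids), h0)      # hidden states for every step
            return self.out(h), h_last                     # next-word logits at every step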
Training an RNN Language Model
• Get a big corpus of text, which is a sequence of words x^(1), …, x^(T)
• Feed it into the RNN-LM; compute the output distribution ŷ^(t) for every step t.
  • i.e. predict the probability dist of every word, given the words so far

• Loss function on step t is the cross-entropy between the predicted probability distribution ŷ^(t) and the true next word y^(t) (one-hot for x^(t+1)):

  J^(t)(θ) = CE(y^(t), ŷ^(t)) = −∑_{w∈V} y^(t)_w log ŷ^(t)_w = −log ŷ^(t)_{x_{t+1}}

• Average this to get the overall loss for the entire training set:

  J(θ) = (1/T) ∑_{t=1}^{T} J^(t)(θ)
46
Natural Language Processing
with Deep Learning
CS224N/Ling284

Christopher Manning
Lecture 6: Simple and LSTM Recurrent Neural Networks
Lecture Plan
1. RNN Language Models (25 mins)
2. Other uses of RNNs (8 mins)
3. Exploding and vanishing gradients (15 mins)
4. LSTMs (20 mins)
5. Bidirectional and multi-layer RNNs (12 mins)

• Projects
• Next Thursday: a lecture about choosing final projects
• It’s fine to delay thinking about projects until next week
• But if you’re already thinking about projects, you can view some info/inspiration on
the website. It’s still last year’s information at present!
• It’s great if you can line up your own mentor; we are also lining up some mentors
2
Overview
• Last lecture we learned:
• Language models, n-gram language models, and Recurrent Neural Networks (RNNs)

• Today we’ll learn how to get RNNs to work for you


• Training RNNs
• Uses of RNNs
• Problems with RNNs (exploding and vanishing gradients) and how to fix them
• These problems motivate a more sophisticated RNN architecture: LSTMs
• And other more complex RNN options: bidirectional RNNs and multi-layer RNNs

• Next lecture we’ll learn:


• How we can do Neural Machine Translation (NMT) using an RNN-based architecture
called sequence-to-sequence with attention
3
1. The Simple RNN Language Model

[Diagram: words / one-hot vectors (“the students opened their”) → word embeddings e^(t) → hidden states h^(t) = σ(W_h h^(t−1) + W_e e^(t) + b_1), where h^(0) is the initial hidden state → output distribution ŷ^(t) = softmax(U h^(t) + b_2) over the vocabulary (books, laptops, …, a, zoo)]

Note: this input sequence could be much longer now!
4
Training an RNN Language Model
• Get a big corpus of text, which is a sequence of words x^(1), …, x^(T)
• Feed it into the RNN-LM; compute the output distribution ŷ^(t) for every step t.
  • i.e., predict the probability dist of every word, given the words so far

• Loss function on step t is the cross-entropy between the predicted probability distribution ŷ^(t) and the true next word y^(t) (one-hot for x^(t+1)):

  J^(t)(θ) = CE(y^(t), ŷ^(t)) = −∑_{w∈V} y^(t)_w log ŷ^(t)_w = −log ŷ^(t)_{x_{t+1}}

• Average this to get the overall loss for the entire training set:

  J(θ) = (1/T) ∑_{t=1}^{T} J^(t)(θ)
5
Training an RNN Language Model

[Diagram, built up over several slides: the corpus “the students opened their exams …” is fed in one word at a time; at each step, the predicted prob dist ŷ^(t) is scored against the true next word:
  J^(1)(θ) = negative log prob of “students”
  J^(2)(θ) = negative log prob of “opened”
  J^(3)(θ) = negative log prob of “their”
  J^(4)(θ) = negative log prob of “exams”
  …
The per-step losses are added up and averaged to give the total loss: J(θ) = J^(1)(θ) + J^(2)(θ) + J^(3)(θ) + J^(4)(θ) + …
Feeding in the true words from the corpus, rather than the model’s own predictions, is called “Teacher forcing”.]
10
Training an RNN Language Model
• However: Computing loss and gradients across entire corpus is too
expensive!

• In practice, consider x^(1), …, x^(T) as a sentence (or a document)

• Recall: Stochastic Gradient Descent allows us to compute loss and gradients for small
chunk of data, and update.

• Compute loss for a sentence (actually, a batch of sentences), compute gradients


and update weights. Repeat.

11
Training the parameters of RNNs: Backpropagation for RNNs

… …

Question: What’s the derivative of J^(t)(θ) w.r.t. the repeated weight matrix W_h?

Answer: ∂J^(t)/∂W_h = ∑_{i=1}^{t} ∂J^(t)/∂W_h |_(i)

“The gradient w.r.t. a repeated weight is the sum of the gradient w.r.t. each time it appears”
Why?

12
Multivariable Chain Rule

Gradients sum at
outward branches!
(lecture 3)

Source:
https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/differentiating-vector-valued-functions/a/multivariable-chain-rule-simple-version

13
Backpropagation for RNNs: Proof sketch

In our example, apply the multivariable chain rule: each appearance of W_h contributes one term (the derivative of W_h with respect to its own appearance equals 1), and the terms along the different paths are summed:

  ∂J^(t)/∂W_h = ∑_{i=1}^{t} ∂J^(t)/∂W_h |_(i)

Source:
https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/differentiating-vector-valued-functions/a/multivariable-chain-rule-simple-version

14
Backpropagation for RNNs

… …

Answer: Backpropagate over timesteps


i=t,…,0, summing gradients as you go.
This algorithm is called “backpropagation
through time” [Werbos, P.G., 1988, Neural
Networks 1, and others]

Question: How do we calculate this?
15
Generating text with an RNN Language Model
Just like a n-gram Language Model, you can use an RNN Language Model to
generate text by repeated sampling. Sampled output becomes next step’s input.

[Diagram: start with “my”; at each step, sample from the output distribution and feed the sample back in as the next input: my favorite season is spring]
16


Generating text with an RNN Language Model
Let’s have some fun!
• You can train an RNN-LM on any kind of text, then generate text in that style.
• RNN-LM trained on Harry Potter:

Source: https://medium.com/deep-writing/harry-potter-written-by-artificial-intelligence-8a9431803da6
17
Generating text with an RNN Language Model
Let’s have some fun!
• You can train an RNN-LM on any kind of text, then generate text in that style.
• RNN-LM trained on recipes:

Source: https://gist.github.com/nylki/1efbaa36635956d35bcc
18
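A minimal sketch of this repeated-sampling loop, reusing the RNNLM sketch from earlier (the vocabulary mappings word2id / id2word are assumed helpers):

    import torch

    def sample_text(model, word2id, id2word, start="my", max_len=20):
        model.eval()
        ids, h = [word2id[start]], None
        with torch.no_grad():
            for _ in range(max_len):
                logits, h = model(torch.tensor([[ids[-1]]]), h)    # feed the sampled word back in
                probs = torch.softmax(logits[0, -1], dim=-1)
                ids.append(torch.multinomial(probs, 1).item())     # sample the next word
        return " ".join(id2word[i] for i in ids)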
Evaluating Language Models
• The standard evaluation metric for Language Models is perplexity:

  perplexity = ∏_{t=1}^{T} ( 1 / P_LM(x^(t+1) | x^(t), …, x^(1)) )^(1/T)

  (the inverse probability of the corpus, according to the Language Model, normalized by the number of words)

• This is equal to the exponential of the cross-entropy loss J(θ):

  = ∏_{t=1}^{T} ( 1 / ŷ^(t)_{x_{t+1}} )^(1/T) = exp( (1/T) ∑_{t=1}^{T} −log ŷ^(t)_{x_{t+1}} ) = exp(J(θ))

Lower perplexity is better!


19
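A quick numeric sketch of the relationship (the loss value is hypothetical, in nats):

    import math

    avg_cross_entropy = 4.2            # hypothetical average -log P(next word), in nats
    perplexity = math.exp(avg_cross_entropy)
    print(round(perplexity, 1))        # 66.7: on average, as uncertain as a uniform choice over ~67 words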
RNNs have greatly improved perplexity

[Table: perplexity on a billion-word benchmark for an n-gram model vs. increasingly complex RNNs; perplexity improves (lower is better)]

Source: https://research.fb.com/building-an-efficient-neural-language-model-over-a-billion-words/

20
Why should we care about Language Modeling?
• Language Modeling is a benchmark task that helps us
measure our progress on understanding language

• Language Modeling is a subcomponent of many NLP tasks, especially those involving


generating text or estimating the probability of text:
• Predictive typing
• Speech recognition
• Handwriting recognition
• Spelling/grammar correction
• Authorship identification
• Machine translation
• Summarization
• Dialogue
• etc.

21
Recap
• Language Model: A system that predicts the next word

• Recurrent Neural Network: A family of neural networks that:


• Take sequential input of any length
• Apply the same weights on each step
• Can optionally produce output on each step

• Recurrent Neural Network ≠ Language Model

• We’ve shown that RNNs are a great way to build a LM

• But RNNs are useful for much more!


22
2. Other RNN uses: RNNs can be used for sequence tagging
e.g., part-of-speech tagging, named entity recognition

DT JJ NN VBN IN DT NN

the startled cat knocked over the vase

23
RNNs can be used for sentence classification
e.g., sentiment classification
[Diagram: an RNN runs over “overall I enjoyed the movie a lot”; a sentence encoding is computed from the hidden states and fed to a classifier that outputs “positive”. How to compute the sentence encoding?]

24
RNNs can be used for sentence classification
e.g., sentiment classification
How to compute the sentence encoding?
Basic way: use the final hidden state as the sentence encoding.
[Diagram: RNN over “overall I enjoyed the movie a lot”; the final hidden state equals the sentence encoding, and the classifier outputs “positive”]

25
RNNs can be used for sentence classification
e.g., sentiment classification
How to compute the sentence encoding?
Usually better: take the element-wise max or mean of all hidden states.
[Diagram: RNN over “overall I enjoyed the movie a lot”; the element-wise max/mean of all hidden states gives the sentence encoding, and the classifier outputs “positive”]

26
RNNs can be used as a language encoder module
e.g., question answering, machine translation, many other tasks!
[Diagram (question answering):
Question: what nationality was Beethoven ?
Context: Ludwig van Beethoven was a German composer and pianist. A crucial figure …
Answer: German
Here the RNN acts as an encoder for the Question (the hidden states represent the Question). The encoder is part of a larger neural system (lots of neural architecture).]


27
RNN-LMs can be used to generate text
e.g., speech recognition, machine translation, summarization
[Diagram (speech recognition): Input (audio) → conditioning → RNN-LM generates “what’s the weather”, starting from <START>]

This is an example of a conditional language model.


We’ll see Machine Translation in much more detail next class.
28
3. Problems with Vanishing and Exploding Gradients

29
Vanishing gradient intuition

30
Vanishing gradient intuition

chain rule!

31
Vanishing gradient intuition

What happens if these gradients are small?

Vanishing gradient problem: When these are small, the gradient signal gets smaller and smaller as it backpropagates further.
34
Vanishing gradient proof sketch (linear case)
• Recall: h^(t) = σ(W_h h^(t−1) + W_x x^(t) + b_1)
• What if σ were the identity function, σ(x) = x?

  ∂h^(t)/∂h^(t−1) = diag(σ′(W_h h^(t−1) + W_x x^(t) + b_1)) · W_h = I · W_h = W_h          (chain rule)

• Consider the gradient of the loss J^(i)(θ) on step i, with respect to the hidden state h^(j) on some previous step j. Let ℓ = i − j:

  ∂J^(i)(θ)/∂h^(j) = ∂J^(i)(θ)/∂h^(i) · ∏_{j<t≤i} ∂h^(t)/∂h^(t−1)          (chain rule)
                   = ∂J^(i)(θ)/∂h^(i) · W_h^ℓ                              (value of ∏ ∂h^(t)/∂h^(t−1))

  If W_h is “small”, then this term gets exponentially problematic as ℓ becomes large

Source: “On the difficulty of training recurrent neural networks”, Pascanu et al, 2013. http://proceedings.mlr.press/v28/pascanu13.pdf
35 (and supplemental materials), at http://proceedings.mlr.press/v28/pascanu13-supp.pdf
Vanishing gradient proof sketch (linear case)
• What’s wrong with W_h^ℓ?
• Consider if the eigenvalues of W_h are all less than 1 (sufficient but not necessary):
  λ_1, λ_2, …, λ_n < 1, with eigenvectors q_1, q_2, …, q_n
• We can write ∂J^(i)(θ)/∂h^(i) · W_h^ℓ using the eigenvectors of W_h as a basis:
  ∂J^(i)(θ)/∂h^(i) · W_h^ℓ = ∑_{k=1}^{n} c_k λ_k^ℓ q_k
  The λ_k^ℓ approach 0 as ℓ grows, so the gradient vanishes
• What about nonlinear activations σ (i.e., what we use)?
  • Pretty much the same thing, except the proof requires λ_k < γ for some γ dependent on the dimensionality and σ
Source: “On the difficulty of training recurrent neural networks”, Pascanu et al, 2013. http://proceedings.mlr.press/v28/pascanu13.pdf
36 (and supplemental materials), at http://proceedings.mlr.press/v28/pascanu13-supp.pdf
Why is vanishing gradient a problem?

Gradient signal from far away is lost because it’s much smaller than gradient signal from close-by.

So, model weights are updated only with respect to near effects, not long-term effects.

37
Effect of vanishing gradient on RNN-LM
• LM task: When she tried to print her tickets, she found that the printer was out of toner.
She went to the stationery store to buy more toner. It was very overpriced. After
installing the toner into the printer, she finally printed her ________

• To learn from this training example, the RNN-LM needs to model the dependency
between “tickets” on the 7th step and the target word “tickets” at the end.

• But if gradient is small, the model can’t learn this dependency


• So, the model is unable to predict similar long-distance dependencies at test time

38
Why is exploding gradient a problem?
• If the gradient becomes too big, then the SGD update step becomes too big:

  θ^new = θ^old − α ∇_θ J(θ)     (α: learning rate; ∇_θ J(θ): gradient)

• This can cause bad updates: we take too large a step and reach a weird and bad
parameter configuration (with large loss)
• You think you’ve found a hill to climb, but suddenly you’re in Iowa

• In the worst case, this will result in Inf or NaN in your network
(then you have to restart training from an earlier checkpoint)

39
Gradient clipping: solution for exploding gradient
• Gradient clipping: if the norm of the gradient is greater than some threshold, scale it
down before applying SGD update

• Intuition: take a step in the same direction, but a smaller step

• In practice, remembering to clip gradients is important, but exploding gradients are an


easy problem to solve
Source: “On the difficulty of training recurrent neural networks”, Pascanu et al, 2013. http://proceedings.mlr.press/v28/pascanu13.pdf
40
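A minimal sketch of norm clipping, plus the one-line PyTorch equivalent (the threshold value is illustrative):

    import torch

    def clip_gradient(grad, threshold=5.0):
        norm = grad.norm()
        if norm > threshold:
            grad = grad * (threshold / norm)   # same direction, smaller step
        return grad

    # In a PyTorch training loop, after loss.backward() and before optimizer.step():
    # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)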
How to fix the vanishing gradient problem?
• The main problem is that it’s too difficult for the RNN to learn to preserve information
over many timesteps.

• In a vanilla RNN, the hidden state is constantly being rewritten

• How about an RNN with separate memory?

41
4. Long Short-Term Memory RNNs (LSTMs)
• A type of RNN proposed by Hochreiter and Schmidhuber in 1997 as a solution to the vanishing gradients
problem.
• Everyone cites that paper but really a crucial part of the modern LSTM is from Gers et al. (2000) 💜

• On step t, there is a hidden state h^(t) and a cell state c^(t)
  • Both are vectors of length n
• The cell stores long-term information
• The LSTM can read, erase, and write information from the cell
• The cell becomes conceptually rather like RAM in a computer

• The selection of which information is erased/written/read is controlled by three corresponding gates


• The gates are also vectors of length n
• On each timestep, each element of the gates can be open (1), closed (0), or somewhere in-between
• The gates are dynamic: their value is computed based on the current context
“Long short-term memory”, Hochreiter and Schmidhuber, 1997. https://www.bioinf.jku.at/publications/older/2604.pdf
“Learning to Forget: Continual Prediction with LSTM”, Gers, Schmidhuber, and Cummins, 2000. https://dl.acm.org/doi/10.1162/089976600300015015
42
Long Short-Term Memory (LSTM)
We have a sequence of inputs x^(t), and we will compute a sequence of hidden states h^(t) and cell states c^(t). On timestep t:

  Forget gate: controls what is kept vs forgotten, from the previous cell state:
    f^(t) = σ(W_f h^(t−1) + U_f x^(t) + b_f)
  Input gate: controls what parts of the new cell content are written to the cell:
    i^(t) = σ(W_i h^(t−1) + U_i x^(t) + b_i)
  Output gate: controls what parts of the cell are output to the hidden state:
    o^(t) = σ(W_o h^(t−1) + U_o x^(t) + b_o)
  New cell content: this is the new content to be written to the cell:
    c̃^(t) = tanh(W_c h^(t−1) + U_c x^(t) + b_c)
  Cell state: erase (“forget”) some content from the last cell state, and write (“input”) some new cell content:
    c^(t) = f^(t) ⊙ c^(t−1) + i^(t) ⊙ c̃^(t)
  Hidden state: read (“output”) some content from the cell:
    h^(t) = o^(t) ⊙ tanh(c^(t))

  Sigmoid function σ: all gate values are between 0 and 1. All these are vectors of the same length n.
  Gates are applied using the element-wise (or Hadamard) product: ⊙
43
Long Short-Term Memory (LSTM)
You can think of the LSTM equations visually like this:

44 Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Long Short-Term Memory (LSTM)
You can think of the LSTM equations visually like this:
[Diagram (following the source below): the previous cell state c_{t−1} flows along the top; compute the forget gate f_t to forget some cell content, compute the input gate i_t and the new cell content c̃_t to write some new cell content (the + sign that combines them is the secret!), producing c_t; compute the output gate o_t to output some cell content to the hidden state h_t, which is computed from h_{t−1} and the input]

Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
45
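A minimal PyTorch sketch using the built-in LSTM, which implements the standard LSTM equations above (sizes are illustrative):

    import torch
    import torch.nn as nn

    lstm = nn.LSTM(input_size=100, hidden_size=256, batch_first=True)
    x = torch.randn(8, 20, 100)                  # (batch, seq_len, input_size)
    h, (h_last, c_last) = lstm(x)                # all hidden states, plus final hidden and cell state
    print(h.shape, h_last.shape, c_last.shape)   # (8, 20, 256) (1, 8, 256) (1, 8, 256)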
How does LSTM solve vanishing gradients?
• The LSTM architecture makes it easier for the RNN to
preserve information over many timesteps
• e.g., if the forget gate is set to 1 for a cell dimension and the input
gate set to 0, then the information of that cell is preserved
indefinitely.
• In contrast, it’s harder for a vanilla RNN to learn a recurrent
weight matrix Wh that preserves info in the hidden state
• In practice, you get about 100 timesteps rather than about 7

• LSTM doesn’t guarantee that there is no vanishing/exploding


gradient, but it does provide an easier way for the model to learn
long-distance dependencies

46
LSTMs: real-world success
• In 2013–2015, LSTMs started achieving state-of-the-art results
• Successful tasks include handwriting recognition, speech recognition, machine
translation, parsing, and image captioning, as well as language models
• LSTMs became the dominant approach for most NLP tasks

• Now (2021), other approaches (e.g., Transformers) have become dominant for many
tasks
• For example, in WMT (a Machine Translation conference + competition):
• In WMT 2016, the summary report contains “RNN” 44 times
• In WMT 2019: “RNN” 7 times, ”Transformer” 105 times

Source: "Findings of the 2016 Conference on Machine Translation (WMT16)", Bojar et al. 2016, http://www.statmt.org/wmt16/pdf/W16-2301.pdf
Source: "Findings of the 2018 Conference on Machine Translation (WMT18)", Bojar et al. 2018, http://www.statmt.org/wmt18/pdf/WMT028.pdf
47 Source: "Findings of the 2019 Conference on Machine Translation (WMT19)", Barrault et al. 2019, http://www.statmt.org/wmt18/pdf/WMT028.pdf
Is vanishing/exploding gradient just a RNN problem?
• No! It can be a problem for all neural architectures (including feed-forward and
convolutional), especially very deep ones.
• Due to chain rule / choice of nonlinearity function, gradient can become vanishingly small as it
backpropagates
• Thus, lower layers are learned very slowly (hard to train)
• Solution: lots of new deep feedforward/convolutional architectures that add more
direct connections (thus allowing the gradient to flow)

For example:
• Residual connections aka “ResNet”
• Also known as skip-connections
• The identity connection
preserves information by default
• This makes deep networks much
easier to train
"Deep Residual Learning for Image Recognition", He et al, 2015. https://arxiv.org/pdf/1512.03385.pdf
48
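A minimal sketch of a residual (skip) connection in PyTorch (layer sizes are illustrative; the block’s input and output dims must match):

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, dim=256):
            super().__init__()
            self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

        def forward(self, x):
            return x + self.f(x)    # identity connection preserves information by default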
Is vanishing/exploding gradient just a RNN problem?
• Solution: lots of new deep feedforward/convolutional architectures that add more
direct connections (thus allowing the gradient to flow)

Other methods:
• Dense connections aka “DenseNet”: directly connect each layer to all future layers!
• Highway connections aka “HighwayNet”: similar to residual connections, but the balance between the identity connection and the transformation layer is controlled by a dynamic gate. Inspired by LSTMs, but applied to deep feedforward/convolutional networks.

”Densely Connected Convolutional Networks", Huang et al, 2017. https://arxiv.org/pdf/1608.06993.pdf ”Highway Networks", Srivastava et al, 2015. https://arxiv.org/pdf/1505.00387.pdf
49
Is vanishing/exploding gradient just a RNN problem?
• No! It can be a problem for all neural architectures (including feed-forward and
convolutional), especially very deep ones.
• Due to chain rule / choice of nonlinearity function, gradient can become vanishingly small as it
backpropagates
• Thus, lower layers are learned very slowly (hard to train)
• Solution: lots of new deep feedforward/convolutional architectures that add more
direct connections (thus allowing the gradient to flow)

• Conclusion: Though vanishing/exploding gradients are a general problem, RNNs are


particularly unstable due to the repeated multiplication by the same weight matrix
[Bengio et al, 1994]

”Learning Long-Term Dependencies with Gradient Descent is Difficult", Bengio et al. 1994, http://ai.dinfo.unifi.it/paolo//ps/tnn-94-gradient.pdf
50
5. Bidirectional and Multi-layer RNNs: motivation
Task: Sentiment Classification

[Diagram: RNN over “the movie was terribly exciting !”; the sentence encoding is the element-wise mean/max of the hidden states and is classified as “positive”]

We can regard this hidden state as a representation of the word “terribly” in the context of this sentence. We call this a contextual representation.

These contextual representations only contain information about the left context (e.g. “the movie was”). What about right context?

In this example, “exciting” is in the right context and this modifies the meaning of “terribly” (from negative to positive).
51
Bidirectional RNNs

[Diagram: a Forward RNN and a Backward RNN both run over “the movie was terribly exciting !”, and their hidden states are concatenated at each position. This contextual representation of “terribly” has both left and right context!]


52
Bidirectional RNNs
On timestep t:
  Forward RNN:  h_FW^(t) = RNN_FW(h_FW^(t−1), x^(t))
  Backward RNN: h_BW^(t) = RNN_BW(h_BW^(t+1), x^(t))
  Concatenated hidden states: h^(t) = [h_FW^(t); h_BW^(t)]

• RNN_FW is a general notation to mean “compute one forward step of the RNN” – it could be a vanilla, LSTM or GRU computation.
• Generally, these two RNNs have separate weights.
• We regard the concatenation h^(t) as “the hidden state” of a bidirectional RNN. This is what we pass on to the next parts of the network.
53
Bidirectional RNNs: simplified diagram

the movie was terribly exciting !

The two-way arrows indicate bidirectionality and


the depicted hidden states are assumed to be the
concatenated forwards+backwards states

54
Bidirectional RNNs
• Note: bidirectional RNNs are only applicable if you have access to the entire input
sequence
• They are not applicable to Language Modeling, because in LM you only have left
context available.

• If you do have entire input sequence (e.g., any kind of encoding), bidirectionality is
powerful (you should use it by default).

• For example, BERT (Bidirectional Encoder Representations from Transformers) is a


powerful pretrained contextual representation system built on bidirectionality.
• You will learn more about transformers, including BERT, in a couple of weeks!

55
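A minimal PyTorch sketch: the built-in recurrent layers take a bidirectional flag, and the per-step output is the concatenation of the forward and backward hidden states (sizes illustrative):

    import torch
    import torch.nn as nn

    birnn = nn.LSTM(input_size=100, hidden_size=256, batch_first=True, bidirectional=True)
    x = torch.randn(8, 20, 100)
    h, _ = birnn(x)
    print(h.shape)     # (8, 20, 512): forward and backward states concatenated at each position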
Multi-layer RNNs
• RNNs are already “deep” on one dimension (they unroll over many timesteps)

• We can also make them “deep” in another dimension by


applying multiple RNNs – this is a multi-layer RNN.

• This allows the network to compute more complex representations


• The lower RNNs should compute lower-level features and the higher RNNs should
compute higher-level features.

• Multi-layer RNNs are also called stacked RNNs.

56
Multi-layer RNNs
The hidden states from RNN layer i are the inputs to RNN layer i+1

RNN layer 3

RNN layer 2

RNN layer 1

the movie was terribly exciting !


57
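A minimal PyTorch sketch of stacking: num_layers feeds each layer’s hidden states in as the next layer’s inputs (sizes illustrative):

    import torch.nn as nn

    stacked = nn.LSTM(input_size=100, hidden_size=256, num_layers=3, batch_first=True)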
Multi-layer RNNs in practice
• High-performing RNNs are often multi-layer (but aren’t as deep as convolutional or
feed-forward networks)

• For example: In a 2017 paper, Britz et al find that for Neural Machine Translation, 2 to 4
layers is best for the encoder RNN, and 4 layers is best for the decoder RNN
• Usually, skip-connections/dense-connections are needed to train deeper RNNs (e.g., 8 layers)

• Transformer-based networks (e.g., BERT) are usually deeper, like 12 or 24 layers.


• You will learn about Transformers later; they have a lot of
skipping-like connections

“Massive Exploration of Neural Machine Translation Architectures”, Britz et al, 2017. https://arxiv.org/pdf/1703.03906.pdf
58
In summary
Lots of new information today! What are some of the practical takeaways?

1. LSTMs are powerful
2. Clip your gradients
3. Use bidirectionality when possible
4. Multi-layer RNNs are more powerful, but you might need skip connections if it’s deep
59
