
Lecture 10:

Recurrent Neural Networks

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 1 April 28, 2022
Administrative
- Project TA matchups out, see Ed for the link

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 2 April 28, 2022
Administrative
- A2 is due next Monday May 2nd, 11:59pm

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 3 April 28, 2022
Administrative
- Discussion section tomorrow 2:30-3:30PT

Object detection & RNNs Review

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 4 April 28, 2022
Last time: Detection and Segmentation
[Figure: four tasks compared side by side]
- Classification: CAT (no spatial extent)
- Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)
- Object Detection: DOG, DOG, CAT (multiple objects)
- Instance Segmentation: DOG, DOG, CAT (multiple objects)
This image is CC0 public domain

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 5 April 28, 2022
Training “Feedforward” Neural Networks

1. One-time setup: activation functions, preprocessing, weight initialization, regularization, gradient checking
2. Training dynamics: babysitting the learning process, parameter updates, hyperparameter optimization
3. Evaluation: model ensembles, test-time augmentation, transfer learning

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 6 April 28, 2022
Today: Recurrent Neural Networks

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 7 April 28, 2022
“Vanilla” Neural Network

Vanilla Neural Networks

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 8 April 28, 2022
Recurrent Neural Networks: Process Sequences

e.g. Image Captioning


image -> sequence of words

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 9 April 28, 2022
Recurrent Neural Networks: Process Sequences

e.g. action prediction


sequence of video frames -> action class

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 10 April 28, 2022
Recurrent Neural Networks: Process Sequences

E.g. Video Captioning


Sequence of video frames -> caption
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 11 April 28, 2022
Recurrent Neural Networks: Process Sequences

e.g. Video classification on frame level

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 12 April 28, 2022
Sequential Processing of Non-Sequence Data

Classify images by taking a series of “glimpses”

Ba, Mnih, and Kavukcuoglu, “Multiple Object Recognition with Visual Attention”, ICLR 2015.
Gregor et al, “DRAW: A Recurrent Neural Network For Image Generation”, ICML 2015
Figure copyright Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra,
2015. Reproduced with permission.

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 13 April 28, 2022
Sequential Processing of Non-Sequence Data
Generate images one piece at a time!

Gregor et al, “DRAW: A Recurrent Neural Network For Image Generation”, ICML 2015
Figure copyright Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra, 2015. Reproduced with
permission.

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 14 April 28, 2022
Recurrent Neural Network

RNN

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 15 April 28, 2022
Recurrent Neural Network

[Diagram: RNN block with output y and a recurrent loop]

Key idea: RNNs have an “internal state” that is updated as a sequence is processed

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 16 April 28, 2022
Unrolled RNN

y1 y2 y3 yt

RNN RNN RNN ... RNN

x1 x2 x3 xt

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 17 April 28, 2022
RNN hidden state update
We can process a sequence of vectors x by applying a recurrence formula at every time step:

h_t = f_W(h_{t-1}, x_t)

where h_t is the new state, h_{t-1} is the old state, x_t is the input vector at some time step, and f_W is some function with parameters W.

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 18 April 28, 2022
RNN output generation
We can process a sequence of vectors x by applying a recurrence formula at every time step, and read out an output from the new state:

y_t = f_{W_o}(h_t)

where y_t is the output, h_t is the new state, and f_{W_o} is another function with parameters W_o.

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 19 April 28, 2022
Recurrent Neural Network

y1 y2 y3 yt

h0 h1 h2 h3
RNN RNN RNN ... RNN

x1 x2 x3 xt

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 20 April 28, 2022
Recurrent Neural Network
We can process a sequence of vectors x by applying a recurrence formula at every time step:

h_t = f_W(h_{t-1}, x_t)

Notice: the same function and the same set of parameters are used at every time step.

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 21 April 28, 2022
(Vanilla) Recurrent Neural Network
The state consists of a single “hidden” vector h:

h_t = f_W(h_{t-1}, x_t)
h_t = tanh(W_hh h_{t-1} + W_xh x_t)
y_t = W_hy h_t
Sometimes called a “Vanilla RNN” or an
“Elman RNN” after Prof. Jeffrey Elman

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 22 April 28, 2022
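The same update can be written as a short numpy sketch (not the course's reference code); the weight names Wxh, Whh, Why, the sizes, and the toy random sequence below are illustrative assumptions:

import numpy as np

# Hypothetical sizes for illustration.
input_dim, hidden_dim, output_dim = 4, 8, 4
rng = np.random.default_rng(0)
Wxh = rng.standard_normal((hidden_dim, input_dim)) * 0.01   # input  -> hidden
Whh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.01  # hidden -> hidden
Why = rng.standard_normal((output_dim, hidden_dim)) * 0.01  # hidden -> output

def rnn_step(h_prev, x):
    """One step of a vanilla (Elman) RNN: h_t = tanh(Whh h_{t-1} + Wxh x_t)."""
    h = np.tanh(Whh @ h_prev + Wxh @ x)
    y = Why @ h                                 # y_t = Why h_t
    return h, y

# Process a toy sequence; the *same* weights are reused at every time step.
h = np.zeros(hidden_dim)
for x in rng.standard_normal((5, input_dim)):
    h, y = rnn_step(h, x)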
RNN: Computational Graph

h0 fW h1

x1

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 23 April 28, 2022
RNN: Computational Graph

h0 fW h1 fW h2

x1 x2

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 24 April 28, 2022
RNN: Computational Graph

h0 fW h1 fW h2 fW h3
… hT

x1 x2 x3

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 25 April 28, 2022
RNN: Computational Graph

Re-use the same weight matrix at every time-step

h0 fW h1 fW h2 fW h3
… hT

x1 x2 x3
W

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 26 April 28, 2022
RNN: Computational Graph: Many to Many

y1 y2 y3 yT

h0 fW h1 fW h2 fW h3
… hT

x1 x2 x3
W

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 27 April 28, 2022
RNN: Computational Graph: Many to Many

y1 L1 y2 L2 y3 L3 yT LT

h0 fW h1 fW h2 fW h3
… hT

x1 x2 x3
W

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 28 April 28, 2022
RNN: Computational Graph: Many to Many L

y1 L1 y2 L2 y3 L3 yT LT

h0 fW h1 fW h2 fW h3
… hT

x1 x2 x3
W

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 29 April 28, 2022
RNN: Computational Graph: Many to One

h0 fW h1 fW h2 fW h3
… hT

x1 x2 x3
W

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 30 April 28, 2022
RNN: Computational Graph: Many to One

h0 fW h1 fW h2 fW h3
… hT

x1 x2 x3
W

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 31 April 28, 2022
RNN: Computational Graph: One to Many

y1 y2 y3 yT

h0 fW h1 fW h2 fW h3
… hT

x
W

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 32 April 28, 2022
RNN: Computational Graph: One to Many

y1 y2 y3 yT

h0 fW h1 fW h2 fW h3
… hT

x ? ?
W ?

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 33 April 28, 2022
RNN: Computational Graph: One to Many

y1 y2 y3 yT

h0 fW h1 fW h2 fW h3
… hT

x 0 0
W 0

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 34 April 28, 2022
RNN: Computational Graph: One to Many

y1 y2 y3 yT

h0 fW h1 fW h2 fW h3
… hT

x y1 y2 yT-1

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 35 April 28, 2022
Sequence to Sequence: Many-to-one + one-to-many

Many to one: Encode input sequence in a single vector

h0 fW h1 fW h2 fW h3 … hT

x1 x2 x3
W1

Sutskever et al, “Sequence to Sequence Learning with Neural Networks”, NIPS 2014

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 36 April 28, 2022
Sequence to Sequence: Many-to-one +
one-to-many
One to many: Produce output
sequence from single input vector
Many to one: Encode input
sequence in a single vector
y1 y2

h0 fW h1 fW h2 fW h3 … hT
fW h1 fW h2 fW …

x1 x2 x3
W1 W2

Sutskever et al, “Sequence to Sequence Learning with Neural Networks”, NIPS 2014

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 37 April 28, 2022
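A minimal numpy sketch of this many-to-one + one-to-many pattern follows; the encoder/decoder weight names (We_*, Wd_*), the greedy argmax decoding, and the sizes are assumptions for illustration rather than Sutskever et al.'s actual setup:

import numpy as np

D_in, H, V = 4, 8, 4                            # hypothetical input dim, hidden dim, output vocab size
rng = np.random.default_rng(1)
We_xh = rng.standard_normal((H, D_in)) * 0.01   # encoder weights (the W1 of the slide)
We_hh = rng.standard_normal((H, H)) * 0.01
Wd_yh = rng.standard_normal((H, V)) * 0.01      # decoder weights (the W2 of the slide)
Wd_hh = rng.standard_normal((H, H)) * 0.01
Wd_hy = rng.standard_normal((V, H)) * 0.01

def encode(xs):
    """Many to one: compress the whole input sequence into a single vector h_T."""
    h = np.zeros(H)
    for x in xs:
        h = np.tanh(We_hh @ h + We_xh @ x)
    return h

def decode(h, steps):
    """One to many: unroll the decoder from h_T, feeding each output back in as input."""
    y_prev = np.zeros(V)                        # stands in for a <START> token
    out = []
    for _ in range(steps):
        h = np.tanh(Wd_hh @ h + Wd_yh @ y_prev)
        tok = int(np.argmax(Wd_hy @ h))         # greedy choice, for illustration only
        y_prev = np.eye(V)[tok]
        out.append(tok)
    return out

tokens = decode(encode(rng.standard_normal((6, D_in))), steps=3)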
Example:
Character-level
Language Model

Vocabulary:
[h,e,l,o]

Example training
sequence:
“hello”

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 38 April 28, 2022
Example:
Character-level
Language Model

Vocabulary:
[h,e,l,o]

Example training
sequence:
“hello”

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 39 April 28, 2022
Example:
Character-level
Language Model

Vocabulary:
[h,e,l,o]

Example training
sequence:
“hello”

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 40 April 28, 2022
Example: Character-level Language Model
[Figure: at each time step, the model's softmax distribution over the vocabulary [h,e,l,o] is sampled, producing the characters “e”, “l”, “l”, “o”, which are fed back as the next inputs]

Vocabulary:
[h,e,l,o]

At test-time sample characters one at a time, feed back to model

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 41 April 28, 2022
Example: Character-level Language Model
[Figure: sampling from the softmax over [h,e,l,o] at each time step produces “e”, “l”, “l”, “o”]

Vocabulary:
[h,e,l,o]

At test-time sample
characters one at a time, feed
back to model

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 42 April 28, 2022
Example: Character-level Language Model
[Figure: sampling from the softmax over [h,e,l,o] at each time step produces “e”, “l”, “l”, “o”]

Vocabulary:
[h,e,l,o]

At test-time sample
characters one at a time, feed
back to model

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 43 April 28, 2022
Example: Character-level Language Model
[Figure: sampling from the softmax over [h,e,l,o] at each time step produces “e”, “l”, “l”, “o”]

Vocabulary:
[h,e,l,o]

At test-time sample
characters one at a time, feed
back to model

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 44 April 28, 2022
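A test-time sampling loop of this kind can be sketched in numpy as follows; the weights here are untrained placeholders and the softmax helper is local to the sketch (a trained model would produce "ello" from the seed "h"):

import numpy as np

vocab = ['h', 'e', 'l', 'o']
V, H = len(vocab), 8
rng = np.random.default_rng(2)
Wxh = rng.standard_normal((H, V)) * 0.01
Whh = rng.standard_normal((H, H)) * 0.01
Why = rng.standard_normal((V, H)) * 0.01

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample(seed_ix, n):
    """Sample n characters one at a time, feeding each sample back as the next input."""
    x = np.eye(V)[seed_ix]                      # one-hot encoding of the seed character
    h = np.zeros(H)
    out = []
    for _ in range(n):
        h = np.tanh(Whh @ h + Wxh @ x)
        p = softmax(Why @ h)                    # distribution over [h, e, l, o]
        ix = int(rng.choice(V, p=p))            # sample (rather than argmax) the next character
        out.append(vocab[ix])
        x = np.eye(V)[ix]                       # feed the sampled character back in
    return ''.join(out)

print(sample(vocab.index('h'), 4))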
Example: Character-level
Language Model
Sampling

[Figure: multiplying the weight matrix by a one-hot input extracts a single column]

    [w11 w12 w13 w14]   [1]   [w11]
    [w21 w22 w23 w24] x [0] = [w21]
    [w31 w32 w33 w34]   [0]   [w31]
                        [0]

Embedding layer

Matrix multiply with a one-hot vector just extracts a column from the weight matrix.
We often put a separate embedding layer
between input and hidden layers.

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 45 April 28, 2022
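A two-line numpy check of this point (the matrix values here are arbitrary):

import numpy as np

rng = np.random.default_rng(3)
W_embed = rng.standard_normal((3, 4))        # 3-dim embedding for each of 4 characters

one_hot = np.array([1.0, 0.0, 0.0, 0.0])     # character 0 as a one-hot vector
via_matmul = W_embed @ one_hot               # matrix multiply with the one-hot vector...
via_lookup = W_embed[:, 0]                   # ...just extracts column 0 of the weight matrix
assert np.allclose(via_matmul, via_lookup)   # an embedding layer does this lookup directly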
Backpropagation through time
Forward through entire sequence to compute loss, then backward through entire sequence to compute gradient

Loss

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 46 April 28, 2022
Truncated Backpropagation through time
Loss

Run forward and backward through chunks of the sequence instead of the whole sequence

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 47 April 28, 2022
Truncated Backpropagation through time
Loss

Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 48 April 28, 2022
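One common way to implement this is sketched below in PyTorch (not the lecture's code; the nn.RNN model, chunk size, and toy data are assumptions): the hidden state is carried across chunks, while detach() cuts the computation graph so backprop never runs past a chunk boundary.

import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
readout = nn.Linear(8, 4)
opt = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=1e-2)

seq = torch.randn(1, 100, 4)           # one long toy sequence
targets = torch.randn(1, 100, 4)
chunk = 25                             # backprop is limited to 25 steps

h = torch.zeros(1, 1, 8)               # hidden state carried forward across all chunks
for t in range(0, seq.size(1), chunk):
    x, y = seq[:, t:t + chunk], targets[:, t:t + chunk]
    out, h = rnn(x, h)
    loss = ((readout(out) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()                    # gradients flow only within this chunk
    opt.step()
    h = h.detach()                     # keep the value, drop the graph: the truncation point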
Truncated Backpropagation through time
Loss

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 49 April 28, 2022
min-char-rnn.py gist: 112 lines of Python

(https://gist.github.com/karpathy/d4dee566867f8291f086)

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 50 April 28, 2022
y

RNN

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 51 April 28, 2022
at first:
train more

train more

train more

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 52 April 28, 2022
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 53 April 28, 2022
The Stacks Project: open source algebraic geometry textbook

LaTeX source: http://stacks.math.columbia.edu/


The stacks project is licensed under the GNU Free Documentation License

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 54 April 28, 2022
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 55 April 28, 2022
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 56 April 28, 2022
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 57 April 28, 2022
Generated
C code

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 58 April 28, 2022
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 59 April 28, 2022
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 60 April 28, 2022
https://openai.com/blog/openai-codex/

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 61 April 28, 2022
OpenAI GPT-2 generated text source

Input: In a shocking finding, scientist discovered a herd of unicorns living in a remote,


previously unexplored valley, in the Andes Mountains. Even more surprising to the
researchers was the fact that the unicorns spoke perfect English.

Output: The scientist named the population, after their distinctive horn, Ovid’s Unicorn.
These four-horned, silver-white unicorns were previously unknown to science.

Now, after almost two centuries, the mystery of what sparked this odd phenomenon is
finally solved.

Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several
companions, were exploring the Andes Mountains when they found a small valley, with no
other animals or humans. Pérez noticed that the valley had what appeared to be a natural
fountain, surrounded by two peaks of rock and silver snow.

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 62 April 28, 2022
Searching for interpretable cells

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 63 April 28, 2022
Searching for interpretable cells

Karpathy, Johnson, and Fei-Fei: Visualizing and Understanding Recurrent Networks, ICLR Workshop 2016
Figures copyright Karpathy, Johnson, and Fei-Fei, 2015; reproduced with permission

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 64 April 28, 2022
Searching for interpretable cells

quote detection cell


Karpathy, Johnson, and Fei-Fei: Visualizing and Understanding Recurrent Networks, ICLR Workshop 2016
Figures copyright Karpathy, Johnson, and Fei-Fei, 2015; reproduced with permission

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 65 April 28, 2022
Searching for interpretable cells

line length tracking cell


Karpathy, Johnson, and Fei-Fei: Visualizing and Understanding Recurrent Networks, ICLR Workshop 2016
Figures copyright Karpathy, Johnson, and Fei-Fei, 2015; reproduced with permission

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 66 April 28, 2022
Searching for interpretable cells

if statement cell
Karpathy, Johnson, and Fei-Fei: Visualizing and Understanding Recurrent Networks, ICLR Workshop 2016
Figures copyright Karpathy, Johnson, and Fei-Fei, 2015; reproduced with permission

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 67 April 28, 2022
Searching for interpretable cells

quote/comment cell
Karpathy, Johnson, and Fei-Fei: Visualizing and Understanding Recurrent Networks, ICLR Workshop 2016
Figures copyright Karpathy, Johnson, and Fei-Fei, 2015; reproduced with permission

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 68 April 28, 2022
Searching for interpretable cells

code depth cell

Karpathy, Johnson, and Fei-Fei: Visualizing and Understanding Recurrent Networks, ICLR Workshop 2016
Figures copyright Karpathy, Johnson, and Fei-Fei, 2015; reproduced with permission

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 69 April 28, 2022
RNN tradeoffs

RNN Advantages:
- Can process any length input
- Computation for step t can (in theory) use information from many steps back
- Model size doesn’t increase for longer input
- Same weights applied on every timestep, so there is symmetry in how inputs
are processed.
RNN Disadvantages:
- Recurrent computation is slow
- In practice, difficult to access information from many steps back

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 70 April 28, 2022
Image Captioning

Figure from Karpathy et al, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015; figure copyright IEEE, 2015. Reproduced for educational purposes.

Explain Images with Multimodal Recurrent Neural Networks, Mao et al.


Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei
Show and Tell: A Neural Image Caption Generator, Vinyals et al.
Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.
Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 71 April 28, 2022
Recurrent Neural Network

Convolutional Neural Network

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 72 April 28, 2022
test image

This image is CC0 public domain

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 73 April 28, 2022
test image

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 74 April 28, 2022
test image

X
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 75 April 28, 2022
test image

x0
<START>

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 76 April 28, 2022
test image

y0

before:
h = tanh(Wxh * x + Whh * h)
h0

Wih
now:
h = tanh(Wxh * x + Whh * h + Wih * v)
x0
<START>

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 77 April 28, 2022
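A numpy sketch of this image-conditioned recurrence follows; Wih and the "now" update come from the slide, while the sizes, the embedding matrix, the greedy word choice, and the <START>/<END> indices are illustrative assumptions:

import numpy as np

H, D_word, D_img, V = 8, 6, 10, 5       # hypothetical hidden, word-embedding, CNN-feature, vocab sizes
rng = np.random.default_rng(4)
Wxh = rng.standard_normal((H, D_word)) * 0.01
Whh = rng.standard_normal((H, H)) * 0.01
Wih = rng.standard_normal((H, D_img)) * 0.01    # extra term that injects the image feature v
Why = rng.standard_normal((V, H)) * 0.01
embed = rng.standard_normal((D_word, V)) * 0.01
START, END = 0, V - 1                   # assumed special-token indices

v = rng.standard_normal(D_img)          # CNN feature of the test image
h = np.zeros(H)
word, caption = START, []
for _ in range(20):                     # cap the caption length
    x = embed[:, word]                  # embedding of the previous word
    h = np.tanh(Wxh @ x + Whh @ h + Wih @ v)    # now: every step is conditioned on the image
    word = int(np.argmax(Why @ h))      # greedy sampling, for illustration
    if word == END:
        break                           # sampling <END> finishes the caption
    caption.append(word)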
test image

y0

sample!
h0

x0
straw
<START>

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 78 April 28, 2022
test image

y0 y1

h0 h1

x0
straw
<START>

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 79 April 28, 2022
test image

y0 y1

h0 h1 sample!

x0
straw hat
<START>

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 80 April 28, 2022
test image

y0 y1 y2

h0 h1 h2

x0
straw hat
<START>

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 81 April 28, 2022
test image

y0 y1 y2

sample
<END> token
h0 h1 h2 => finish.

x0
straw hat
<START>

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 82 April 28, 2022
Image Captioning: Example Results
Captions generated using neuraltalk2. All images are CC0 public domain: cat suitcase, cat tree, dog, bear, surfers, tennis, giraffe, motorcycle.

- A cat sitting on a suitcase on the floor
- A cat is sitting on a tree branch
- A dog is running in the grass with a frisbee
- A white teddy bear sitting in the grass
- Two people walking on the beach with surfboards
- A tennis player in action on the court
- Two giraffes standing in a grassy field
- A man riding a dirt bike on a dirt track

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 83 April 28, 2022
Image Captioning: Failure Cases
Captions generated using neuraltalk2. All images are CC0 public domain: fur coat, handstand, spider web, baseball.

- A bird is perched on a tree branch
- A woman is holding a cat in her hand
- A man in a baseball uniform throwing a ball
- A woman standing on a beach holding a surfboard
- A person holding a computer mouse on a desk

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 84 April 28, 2022
Visual Question Answering (VQA)

Agrawal et al, “VQA: Visual Question Answering”, ICCV 2015


Zhu et al, “Visual 7W: Grounded Question Answering in Images”, CVPR 2016
Figure from Zhu et al, copyright IEEE 2016. Reproduced for educational purposes.

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 85 April 28, 2022
Visual Question Answering (VQA)

Agrawal et al, “VQA: Visual Question Answering”, ICCV 2015
Figures from Agrawal et al, copyright IEEE 2015. Reproduced for educational purposes.

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 86 April 28, 2022
Visual Dialog: Conversations about images

Das et al, “Visual Dialog”, CVPR 2017


Figures from Das et al, copyright IEEE 2017. Reproduced with permission.

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 87 April 28, 2022
Visual Language Navigation: Go to the living room
Agent encodes instructions in
language and uses an RNN to
generate a series of movements as the
visual input changes after each move.

Wang et al, “Reinforced Cross-Modal Matching and Self-Supervised


Imitation Learning for Vision-Language Navigation”, CVPR 2018
Figures from Wang et al, copyright IEEE 2017. Reproduced with permission.

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 88 April 28, 2022
Visual Question Answering: Dataset Bias
All images are CC0 public domain: dog

[Figure: an image of a dog and the question “What is the dog playing with?” are fed to a model, which outputs the answer “Frisbee”; the diagram also labels a “Yes or No” output]

Jabri et al. “Revisiting Visual Question Answering Baselines” ECCV 2016

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 89 April 28, 2022
Multilayer RNNs

depth

time

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 90 April 28, 2022
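A numpy sketch of stacking RNN layers along the depth axis (sizes and weight names are hypothetical): the hidden states of layer l become the inputs of layer l+1, and each layer keeps its own weights and its own hidden state.

import numpy as np

T, D, H, L = 6, 4, 8, 3                 # time steps, input dim, hidden dim, number of layers
rng = np.random.default_rng(5)
# One (W_in, W_hh) pair per layer; layer 0 reads x, deeper layers read the layer below.
Ws = [(rng.standard_normal((H, D if l == 0 else H)) * 0.01,
       rng.standard_normal((H, H)) * 0.01) for l in range(L)]

xs = rng.standard_normal((T, D))
hs = [np.zeros(H) for _ in range(L)]    # one hidden state per layer
for t in range(T):                      # advance along the time axis...
    inp = xs[t]
    for l, (W_in, W_hh) in enumerate(Ws):   # ...and up the depth axis
        hs[l] = np.tanh(W_in @ inp + W_hh @ hs[l])
        inp = hs[l]                     # this layer's state is the next layer's input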
Long Short Term Memory (LSTM)

Vanilla RNN LSTM

Hochreiter and Schmidhuber, “Long Short Term Memory”, Neural Computation 1997

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 91 April 28, 2022
Vanilla RNN Gradient Flow
Bengio et al, “Learning long-term dependencies with gradient descent
is difficult”, IEEE Transactions on Neural Networks, 1994
Pascanu et al, “On the difficulty of training recurrent neural networks”,
ICML 2013

yt

W tanh

ht-1 stack ht

xt

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 92 April 28, 2022
Vanilla RNN Gradient Flow
Bengio et al, “Learning long-term dependencies with gradient descent
is difficult”, IEEE Transactions on Neural Networks, 1994
Pascanu et al, “On the difficulty of training recurrent neural networks”,
ICML 2013

Backpropagation from h_t to h_{t-1} multiplies by W (actually W_hh^T)

y_t

W tanh

ht-1 stack ht

xt

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 93 April 28, 2022
Vanilla RNN Gradient Flow
Bengio et al, “Learning long-term dependencies with gradient descent
is difficult”, IEEE Transactions on Neural Networks, 1994
Pascanu et al, “On the difficulty of training recurrent neural networks”,
ICML 2013

Backpropagation from h_t to h_{t-1} multiplies by W (actually W_hh^T)

y_t

W tanh

ht-1 stack ht

xt

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 94 April 28, 2022
Vanilla RNN Gradient Flow
Bengio et al, “Learning long-term dependencies with gradient descent
is difficult”, IEEE Transactions on Neural Networks, 1994
Pascanu et al, “On the difficulty of training recurrent neural networks”,
ICML 2013

y1 y2 y3 y4

h0 h1 h2 h3 h4

x1 x2 x3 x4

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 95 April 28, 2022
Vanilla RNN Gradient Flow
Bengio et al, “Learning long-term dependencies with gradient descent
is difficult”, IEEE Transactions on Neural Networks, 1994
Pascanu et al, “On the difficulty of training recurrent neural networks”,
ICML 2013
Gradients over multiple time steps:
y1 y2 y3 y4

h0 h1 h2 h3 h4

x1 x2 x3 x4

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 96 April 28, 2022
Vanilla RNN Gradient Flow
Bengio et al, “Learning long-term dependencies with gradient descent
is difficult”, IEEE Transactions on Neural Networks, 1994
Pascanu et al, “On the difficulty of training recurrent neural networks”,
ICML 2013
Gradients over multiple time steps:
y1 y2 y3 y4

h0 h1 h2 h3 h4

x1 x2 x3 x4

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 97 April 28, 2022
Vanilla RNN Gradient Flow
Bengio et al, “Learning long-term dependencies with gradient descent
is difficult”, IEEE Transactions on Neural Networks, 1994
Pascanu et al, “On the difficulty of training recurrent neural networks”,
ICML 2013
Gradients over multiple time steps:
y1 y2 y3 y4

h0 h1 h2 h3 h4

x1 x2 x3 x4

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 98 April 28, 2022
Vanilla RNN Gradient Flow
Bengio et al, “Learning long-term dependencies with gradient descent
is difficult”, IEEE Transactions on Neural Networks, 1994
Pascanu et al, “On the difficulty of training recurrent neural networks”,
ICML 2013
Gradients over multiple time steps:
y1 y2 y3 y4

h0 h1 h2 h3 h4

x1 x2 x3 x4

Almost always < 1 -> Vanishing gradients

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 99 April 28, 2022
Vanilla RNN Gradient Flow
Bengio et al, “Learning long-term dependencies with gradient descent
is difficult”, IEEE Transactions on Neural Networks, 1994
Pascanu et al, “On the difficulty of training recurrent neural networks”,
ICML 2013
Gradients over multiple time steps:
y1 y2 y3 y4

h0 h1 h2 h3 h4

x1 x2 x3 x4

What if we assumed no non-linearity?

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 100 April 28, 2022
Vanilla RNN Gradient Flow
Bengio et al, “Learning long-term dependencies with gradient descent
is difficult”, IEEE Transactions on Neural Networks, 1994
Pascanu et al, “On the difficulty of training recurrent neural networks”,
ICML 2013
Gradients over multiple time steps:
y1 y2 y3 y4

h0 h1 h2 h3 h4

x1 x2 x3 x4
What if we assumed no non-linearity?
Largest singular value > 1:
Exploding gradients

Largest singular value < 1: Vanishing gradients

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 101 April 28, 2022
Vanilla RNN Gradient Flow
Bengio et al, “Learning long-term dependencies with gradient descent
is difficult”, IEEE Transactions on Neural Networks, 1994
Pascanu et al, “On the difficulty of training recurrent neural networks”,
ICML 2013
Gradients over multiple time steps:
y1 y2 y3 y4

h0 h1 h2 h3 h4

x1 x2 x3 x4
What if we assumed no non-linearity?
Largest singular value > 1: Exploding gradients -> Gradient clipping: scale gradient if its norm is too big
Largest singular value < 1: Vanishing gradients

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 102 April 28, 2022
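Gradient clipping can be sketched in a few lines of numpy (the grads dict and the max_norm threshold below are illustrative choices, not values from the lecture):

import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale all gradients if their global L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        for g in grads.values():
            g *= scale                  # in-place rescale keeps the gradient direction
    return grads

# Toy usage with made-up gradients:
grads = {'Wxh': np.full((8, 4), 3.0), 'Whh': np.full((8, 8), 3.0)}
clip_gradients(grads)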
Vanilla RNN Gradient Flow
Bengio et al, “Learning long-term dependencies with gradient descent
is difficult”, IEEE Transactions on Neural Networks, 1994
Pascanu et al, “On the difficulty of training recurrent neural networks”,
ICML 2013
Gradients over multiple time steps:
y1 y2 y3 y4

h0 h1 h2 h3 h4

x1 x2 x3 x4
What if we assumed no non-linearity?
Largest singular value > 1:
Exploding gradients

Largest singular value < 1: Vanishing gradients -> Change RNN architecture

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 103 April 28, 2022
Long Short Term Memory (LSTM)

Vanilla RNN LSTM

Hochreiter and Schmidhuber, “Long Short Term Memory”, Neural Computation 1997

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 104 April 28, 2022
Long Short Term Memory (LSTM)

Vanilla RNN LSTM

Four gates

Cell state
Hidden state

Hochreiter and Schmidhuber, “Long Short Term Memory”, Neural Computation 1997

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 105 April 28, 2022
Long Short Term Memory (LSTM)
[Hochreiter et al., 1997]

vector from
below (x)
x sigmoid i

h sigmoid f
W
vector from sigmoid o
before (h)
tanh g

4h x 2h 4h 4*h

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 106 April 28, 2022
Long Short Term Memory (LSTM)
[Hochreiter et al., 1997] i: Input gate, whether to write to cell
f: Forget gate, Whether to erase cell
o: Output gate, How much to reveal cell
vector from g: Gate gate (?), How much to write to cell
below (x)
x sigmoid i

h sigmoid f
W
vector from sigmoid o
before (h)
tanh g

4h x 2h 4h 4*h

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 107 April 28, 2022
Long Short Term Memory (LSTM)
[Hochreiter et al., 1997] i: Input gate, whether to write to cell
f: Forget gate, Whether to erase cell
o: Output gate, How much to reveal cell
vector from g: Gate gate (?), How much to write to cell
below (x)
x sigmoid i

h sigmoid f
W
vector from sigmoid o
before (h)
tanh g

4h x 2h 4h 4*h

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 108 April 28, 2022
Long Short Term Memory (LSTM)
[Hochreiter et al., 1997] i: Input gate, whether to write to cell
f: Forget gate, Whether to erase cell
o: Output gate, How much to reveal cell
vector from g: Gate gate (?), How much to write to cell
below (x)
x sigmoid i

h sigmoid f
W
vector from sigmoid o
before (h)
tanh g

4h x 2h 4h 4*h

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 109 April 28, 2022
Long Short Term Memory (LSTM)
[Hochreiter et al., 1997] i: Input gate, whether to write to cell
f: Forget gate, Whether to erase cell
o: Output gate, How much to reveal cell
vector from g: Gate gate (?), How much to write to cell
below (x)
x sigmoid i

h sigmoid f
W
vector from sigmoid o
before (h)
tanh g

4h x 2h 4h 4*h

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 110 April 28, 2022
Long Short Term Memory (LSTM)
[Hochreiter et al., 1997]

ct-1 ☉ + ct

f
i
W ☉ tanh
g
ht-1 stack
o ☉ ht

xt

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 111 April 28, 2022
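The diagram above corresponds to the standard LSTM update, sketched here in numpy; following the slide's layout, a single weight matrix W of shape (4h x 2h) acts on the stacked [h_{t-1}; x_t] and is split into the i, f, o, g gates (the bias term and the toy sizes below are assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: W @ [h_{t-1}; x_t] is split into the four gate pre-activations."""
    hdim = h_prev.shape[0]
    a = W @ np.concatenate([h_prev, x]) + b
    i = sigmoid(a[0 * hdim:1 * hdim])           # input gate: whether to write to the cell
    f = sigmoid(a[1 * hdim:2 * hdim])           # forget gate: whether to erase the cell
    o = sigmoid(a[2 * hdim:3 * hdim])           # output gate: how much of the cell to reveal
    g = np.tanh(a[3 * hdim:4 * hdim])           # "gate gate": candidate values to write
    c = f * c_prev + i * g                      # c_t = f ⊙ c_{t-1} + i ⊙ g
    h = o * np.tanh(c)                          # h_t = o ⊙ tanh(c_t)
    return h, c

# Toy usage with hypothetical sizes:
xdim, hdim = 4, 8
rng = np.random.default_rng(6)
W = rng.standard_normal((4 * hdim, hdim + xdim)) * 0.01
b = np.zeros(4 * hdim)
h, c = np.zeros(hdim), np.zeros(hdim)
h, c = lstm_step(rng.standard_normal(xdim), h, c, W, b)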
Long Short Term Memory (LSTM): Gradient Flow
[Hochreiter et al., 1997]
Backpropagation from c_t to c_{t-1} is only an elementwise multiplication by f, with no matrix multiply by W

ct-1 ☉ + ct

f
i
W ☉ tanh
g
ht-1 stack
o ☉ ht

xt

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 112 April 28, 2022
Long Short Term Memory (LSTM): Gradient Flow
[Hochreiter et al., 1997]

Uninterrupted gradient flow!


c0 c1 c2 c3

Notice that the gradient contains the f gate’s vector of activations
- allows better control of gradient values, using suitable parameter updates of the forget gate
Also notice that the gradients are added through the f, i, g, and o gates
- better balancing of gradient values

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 113 April 28, 2022
Do LSTMs solve the vanishing gradient problem?

The LSTM architecture makes it easier for the RNN to preserve information
over many timesteps
- e.g. if the f = 1 and the i = 0, then the information of that cell is preserved
indefinitely.
- By contrast, it’s harder for a vanilla RNN to learn a recurrent weight matrix
Wh that preserves info in the hidden state

LSTM doesn’t guarantee that there is no vanishing/exploding gradient, but it does
provide an easier way for the model to learn long-distance dependencies

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 114 April 28, 2022
Long Short Term Memory (LSTM): Gradient Flow
[Hochreiter et al., 1997]

Uninterrupted gradient flow! c0 -> c1 -> c2 -> c3

Similar to ResNet!
[Figure: a ResNet architecture column (Input, 7x7 conv 64 / 2, Pool, a stack of 3x3 conv 64 blocks, ..., 3x3 conv 128 / 2, a stack of 3x3 conv 128 blocks, ..., Pool, FC 1000, Softmax); its skip connections give a similarly uninterrupted gradient path]

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 115 April 28, 2022
Long Short Term Memory (LSTM): Gradient Flow
[Hochreiter et al., 1997]

Uninterrupted gradient flow! c0 -> c1 -> c2 -> c3

Similar to ResNet!
In between: Highway Networks
Srivastava et al, “Highway Networks”, ICML DL Workshop 2015
[Figure: the same ResNet architecture column as on the previous slide]

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 116 April 28, 2022
Other RNN Variants

GRU [Learning phrase representations using RNN encoder-decoder for statistical machine translation, Cho et al., 2014]

[An Empirical Exploration of Recurrent Network Architectures, Jozefowicz et al., 2015]
[LSTM: A Search Space Odyssey, Greff et al., 2015]

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 117 April 28, 2022
Neural Architecture Search for RNN architectures

LSTM cell Cell they found

Zoph and Le, “Neural Architecture Search with Reinforcement Learning”, ICLR 2017
Figures copyright Zoph et al, 2017. Reproduced with permission.

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 118 April 28, 2022
Summary
- RNNs allow a lot of flexibility in architecture design
- Vanilla RNNs are simple but don’t work very well
- Common to use LSTM or GRU: their additive interactions
improve gradient flow
- Backward flow of gradients in RNN can explode or vanish.
Exploding is controlled with gradient clipping. Vanishing is
controlled with additive interactions (LSTM)
- Better/simpler architectures are a hot topic of current research,
as well as new paradigms for reasoning over sequences
- Better understanding (both theoretical and empirical) is needed.

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 119 April 28, 2022
Next time: Attention and Transformers

Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 10 - 120 April 28, 2022
