Administrative
- Project TA matchups are out; see Ed for the link
- A2 is due next Monday, May 2nd, 11:59pm
- Discussion section tomorrow, 2:30-3:30 PT
Last time: Detection and Segmentation
- Classification: no spatial extent
- Semantic Segmentation: no objects, just pixels
- Object Detection: multiple objects
- Instance Segmentation: multiple objects
(This image is CC0 public domain)
Training “Feedforward” Neural Networks
Today: Recurrent Neural Networks
“Vanilla” Neural Network
Recurrent Neural Networks: Process Sequences
(one to one / one to many / many to one / many to many)
Sequential Processing of Non-Sequence Data
Classify images by taking a series of "glimpses" of the image [Ba, Mnih, and Kavukcuoglu, "Multiple Object Recognition with Visual Attention", ICLR 2015].
Generate images one piece at a time! [Gregor et al., "DRAW: A Recurrent Neural Network For Image Generation", ICML 2015]
Figure copyright Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra, 2015. Reproduced with permission.
Recurrent Neural Network
(Diagram: input x → RNN → output y, with a recurrent loop on the RNN block.)
Key idea: RNNs have an "internal state" that is updated as a sequence is processed.
Unrolled RNN
(Diagram: the RNN unrolled over time, with inputs x1, x2, x3, ..., xt and outputs y1, y2, y3, ..., yt.)
RNN hidden state update
We can process a sequence of vectors x by applying a recurrence formula at every time step:

ht = fW(ht-1, xt)

where ht is the new state, ht-1 is the old state, xt is the input vector at some time step, and fW is some function with parameters W.
RNN output generation
At every time step we can also produce an output from the hidden state:

yt = fWo(ht)

where fWo is another function with parameters Wo.
Recurrent Neural Network
(Unrolled diagram: hidden states h0, h1, h2, h3, ... passed between successive RNN blocks, with inputs x1, x2, x3, ..., xt and outputs y1, y2, y3, ..., yt.)
(Vanilla) Recurrent Neural Network
The state consists of a single "hidden" vector h:

ht = fW(ht-1, xt)
ht = tanh(Whh * ht-1 + Wxh * xt)
yt = Why * ht

Sometimes called a "Vanilla RNN" or an "Elman RNN" after Prof. Jeffrey Elman.
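To make this concrete, here is a minimal NumPy sketch of one vanilla RNN step and a toy unrolled loop. The variable names (Wxh, Whh, Why) follow the formulas above; the dimensions, biases, and random initialization are illustrative choices, not values from the lecture.

import numpy as np

def rnn_step(x, h_prev, Wxh, Whh, Why, bh, by):
    # h_t = tanh(Whh * h_{t-1} + Wxh * x_t + b_h)
    h = np.tanh(Whh @ h_prev + Wxh @ x + bh)
    # y_t = Why * h_t + b_y  (unnormalized scores; a softmax would give probabilities)
    y = Why @ h + by
    return h, y

# Toy dimensions: input size 4 (e.g. one-hot characters), hidden size 8, output size 4.
D, H = 4, 8
rng = np.random.default_rng(0)
Wxh = rng.normal(0, 0.01, (H, D))
Whh = rng.normal(0, 0.01, (H, H))
Why = rng.normal(0, 0.01, (D, H))
bh, by = np.zeros(H), np.zeros(D)

h = np.zeros(H)                      # initial hidden state h0
for x in np.eye(D):                  # process a toy sequence of one-hot vectors
    h, y = rnn_step(x, h, Wxh, Whh, Why, bh, by)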
RNN: Computational Graph
(Unrolled graph: h0 → fW → h1 → fW → h2 → fW → h3 → ... → hT, with an input xt entering fW at each step; the same weight matrix W is re-used at every time step.)
RNN: Computational Graph: Many to Many
(Each hidden state ht also produces an output yt with a per-step loss Lt; the total loss L is the sum of L1, L2, L3, ..., LT.)
RNN: Computational Graph: Many to One
(Only the final hidden state hT is used to produce the output.)
RNN: Computational Graph: One to Many
(A single input x is used at the first step; at later steps the question is what to feed as input: one option is a fixed zero vector, another is to feed the previous output yt-1 back in as the next input.)
Sequence to Sequence: Many-to-one + one-to-many
Many to one: Encode the input sequence into a single vector (encoder RNN with parameters W1).
One to many: Produce the output sequence from that single input vector (decoder RNN with parameters W2).
Sutskever et al., "Sequence to Sequence Learning with Neural Networks", NIPS 2014
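A rough NumPy sketch of the encoder/decoder split described above. This is not the Sutskever et al. implementation; the shapes, the feed-the-output-back-in loop, and the variable names are illustrative assumptions.

import numpy as np

H, Din, Dout, T_out = 8, 5, 6, 4
rng = np.random.default_rng(0)
# Encoder parameters (W1) and decoder parameters (W2); shapes are illustrative.
Wxh, Whh = rng.normal(0, 0.1, (H, Din)), rng.normal(0, 0.1, (H, H))
Wyh, Whh2, Why = rng.normal(0, 0.1, (H, Dout)), rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (Dout, H))

# Many to one: encode the whole input sequence into a single vector h.
h = np.zeros(H)
for x in rng.normal(size=(3, Din)):            # toy input sequence of length 3
    h = np.tanh(Whh @ h + Wxh @ x)

# One to many: produce an output sequence from that single vector,
# feeding each output back in as the next input.
y = np.zeros(Dout)
outputs = []
for _ in range(T_out):
    h = np.tanh(Whh2 @ h + Wyh @ y)
    y = Why @ h
    outputs.append(y)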
Example: Character-level Language Model
Vocabulary: [h, e, l, o]
Example training sequence: "hello"
(At each time step the current character is fed into the RNN, the hidden state is updated, and the output layer predicts scores for the next character, which are compared against the next character of the training sequence.)
Example: Character-level Language Model Sampling
Vocabulary: [h, e, l, o]
At test time, sample characters one at a time and feed them back into the model.
(Figure: starting from "h", the softmax over the vocabulary gives a probability distribution at each step; a character is sampled from it, e.g. "e", "l", "l", "o", and fed back in as the next input.)
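A minimal sketch of this test-time sampling loop over the [h, e, l, o] vocabulary. The weights here are random, so the samples are meaningless; in practice they would come from training, as in min-char-rnn.py.

import numpy as np

vocab = ['h', 'e', 'l', 'o']
D, H = len(vocab), 16
rng = np.random.default_rng(0)
Wxh = rng.normal(0, 0.1, (H, D))
Whh = rng.normal(0, 0.1, (H, H))
Why = rng.normal(0, 0.1, (D, H))

def sample(seed_char, n_chars):
    x = np.zeros(D); x[vocab.index(seed_char)] = 1.0   # one-hot encode the seed character
    h = np.zeros(H)
    out = [seed_char]
    for _ in range(n_chars):
        h = np.tanh(Whh @ h + Wxh @ x)                 # update the hidden state
        scores = Why @ h
        p = np.exp(scores) / np.sum(np.exp(scores))    # softmax over the vocabulary
        idx = rng.choice(D, p=p)                       # sample the next character
        x = np.zeros(D); x[idx] = 1.0                  # feed it back in as the next input
        out.append(vocab[idx])
    return ''.join(out)

print(sample('h', 10))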
Backpropagation through time
Forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient.
Truncated Backpropagation through time
Run forward and backward through chunks of the sequence instead of the whole sequence: carry hidden states forward in time indefinitely, but only backpropagate for some smaller number of steps.
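A minimal sketch of truncated backpropagation through time for a tiny character RNN, loosely modeled on min-char-rnn.py (linked below). The data, chunk length, and learning rate are illustrative; the point is that each update uses gradients computed within one chunk only, while the hidden state is carried across chunk boundaries.

import numpy as np

data = "hello world " * 50
vocab = sorted(set(data))
ix = {c: i for i, c in enumerate(vocab)}
D, H, seq_len, lr = len(vocab), 32, 8, 1e-2

rng = np.random.default_rng(0)
Wxh = rng.normal(0, 0.01, (H, D))
Whh = rng.normal(0, 0.01, (H, H))
Why = rng.normal(0, 0.01, (D, H))

def chunk_loss_and_grads(inputs, targets, h):
    """Forward and backward through ONE chunk only (truncated BPTT)."""
    xs, hs, ps = {}, {-1: h}, {}
    loss = 0.0
    for t, (ci, ct) in enumerate(zip(inputs, targets)):
        xs[t] = np.zeros(D); xs[t][ci] = 1.0
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1])
        y = Why @ hs[t]
        ps[t] = np.exp(y) / np.sum(np.exp(y))
        loss -= np.log(ps[t][ct])                    # cross-entropy on the next character
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dh_next = np.zeros(H)
    for t in reversed(range(len(inputs))):
        dy = ps[t].copy(); dy[targets[t]] -= 1.0
        dWhy += np.outer(dy, hs[t])
        dh = Why.T @ dy + dh_next
        draw = (1.0 - hs[t] ** 2) * dh               # backprop through tanh
        dWxh += np.outer(draw, xs[t])
        dWhh += np.outer(draw, hs[t - 1])
        dh_next = Whh.T @ draw
    return loss, dWxh, dWhh, dWhy, hs[len(inputs) - 1]

h = np.zeros(H)
for pos in range(0, len(data) - seq_len - 1, seq_len):
    inputs = [ix[c] for c in data[pos:pos + seq_len]]
    targets = [ix[c] for c in data[pos + 1:pos + seq_len + 1]]
    # The hidden state h is carried forward across chunks,
    # but gradients stop at the chunk boundary.
    loss, dWxh, dWhh, dWhy, h = chunk_loss_and_grads(inputs, targets, h)
    for W, dW in ((Wxh, dWxh), (Whh, dWhh), (Why, dWhy)):
        W -= lr * dW                                 # plain SGD update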
min-char-rnn.py gist: 112 lines of Python
(https://gist.github.com/karpathy/d4dee566867f8291f086)
(Samples from the model at successive stages of training: at first, then after training more and more; sample quality improves with training.)
The Stacks Project: open source algebraic geometry textbook
(The character-level RNN can be trained on its LaTeX source to generate plausible-looking "mathematics".)
Generated C code
(Training the character-level RNN on C source code yields samples with plausible-looking syntax.)
https://openai.com/blog/openai-codex/
OpenAI GPT-2 generated text (source)
Output: The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.
Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.
Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.
Searching for interpretable cells
Visualize the activation of a single hidden-state cell as the character-level RNN reads text; some cells turn out to be interpretable, e.g.:
- quote/comment cell
- if statement cell
Karpathy, Johnson, and Fei-Fei, "Visualizing and Understanding Recurrent Networks", ICLR Workshop 2016
Figures copyright Karpathy, Johnson, and Fei-Fei, 2015; reproduced with permission
RNN tradeoffs
RNN Advantages:
- Can process inputs of any length
- Computation for step t can (in theory) use information from many steps back
- Model size doesn't increase for longer inputs
- The same weights are applied at every timestep, so there is symmetry in how inputs are processed
RNN Disadvantages:
- Recurrent computation is slow
- In practice, it is difficult to access information from many steps back
Image Captioning
(Image captioning combines a Convolutional Neural Network that encodes the image with a Recurrent Neural Network that generates the caption.)
Image captioning at test time, step by step:
- Run the test image through the CNN, remove the final classification layers, and keep a feature vector v describing the image.
- Feed the <START> token as the first input x0 to the RNN.
- Condition the recurrence on the image:
  before: h = tanh(Wxh * x + Whh * h)
  now:    h = tanh(Wxh * x + Whh * h + Wih * v)
- Sample a word from the output distribution y0 (e.g. "straw") and feed it back in as the next input.
- Repeat: update the hidden state, sample the next word from y1 (e.g. "hat"), and feed it back in.
- Sampling the <END> token finishes the caption.
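A minimal sketch of this test-time captioning loop. The CNN is stubbed out with a random feature vector v, and the vocabulary, weight names, and shapes are illustrative assumptions rather than the neuraltalk2 code.

import numpy as np

words = ['<START>', '<END>', 'straw', 'hat', 'man', 'in']
V, H, Dv = len(words), 32, 64            # vocab size, hidden size, CNN feature size
rng = np.random.default_rng(0)
Wxh = rng.normal(0, 0.1, (H, V))
Whh = rng.normal(0, 0.1, (H, H))
Wih = rng.normal(0, 0.1, (H, Dv))
Why = rng.normal(0, 0.1, (V, H))

v = rng.normal(size=Dv)                  # stand-in for the CNN's image feature vector

def caption(max_len=10):
    x = np.zeros(V); x[words.index('<START>')] = 1.0
    h = np.zeros(H)
    out = []
    for _ in range(max_len):
        # now: h = tanh(Wxh * x + Whh * h + Wih * v) -- the image feature enters every step
        h = np.tanh(Wxh @ x + Whh @ h + Wih @ v)
        scores = Why @ h
        p = np.exp(scores) / np.sum(np.exp(scores))
        idx = rng.choice(V, p=p)         # sample the next word
        if words[idx] == '<END>':
            break                        # sampling <END> finishes the caption
        out.append(words[idx])
        x = np.zeros(V); x[idx] = 1.0    # feed the sampled word back in
    return ' '.join(out)

print(caption())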
Image Captioning: Example Results
Captions generated using neuraltalk2. All images are CC0 public domain (cat suitcase, cat tree, dog, bear, surfers, tennis, giraffe, motorcycle).
- A cat sitting on a suitcase on the floor
- A cat is sitting on a tree branch
- A dog is running in the grass with a frisbee
- A white teddy bear sitting in the grass
- Two people walking on the beach with surfboards
- A tennis player in action on the court
- Two giraffes standing in a grassy field
- A man riding a dirt bike on a dirt track
Captions generated using neuraltalk2. All images are CC0 public domain: fur
- A bird is perched on a tree branch
- A man in a baseball uniform throwing a ball
- A woman standing on a beach holding a surfboard
- A person holding a computer mouse on a desk
Visual Question Answering (VQA)
Agrawal et al., "Visual 7W: Grounded Question Answering in Images", CVPR 2015
Figures from Agrawal et al., copyright IEEE 2015. Reproduced for educational purposes.
Visual Dialog: Conversations about images
Visual Language Navigation: Go to the living room
The agent encodes the instructions in language and uses an RNN to generate a series of movements as the visual input changes after each move.
Visual Question Answering example (image CC0 public domain: dog):
The model is given the image and the question "What is the dog playing with?" and produces the answer "Frisbee".
Multilayer RNNs
(Stack RNN layers in depth: the hidden states of one layer are the inputs to the layer above, and each layer is unrolled over time.)
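A minimal sketch of a multilayer (stacked) vanilla RNN, with illustrative sizes. Each layer has its own weights, and the hidden state of layer l at time t is the input to layer l+1 at the same time step.

import numpy as np

D, H, L, T = 10, 16, 3, 5                # input size, hidden size, depth, sequence length
rng = np.random.default_rng(0)
# One (Wxh, Whh) pair per layer; layer 0 reads the input x, deeper layers read the layer below.
Wxh = [rng.normal(0, 0.1, (H, D if l == 0 else H)) for l in range(L)]
Whh = [rng.normal(0, 0.1, (H, H)) for l in range(L)]

h = [np.zeros(H) for _ in range(L)]      # one hidden state per layer
for x in rng.normal(size=(T, D)):        # unroll over time
    inp = x
    for l in range(L):                   # unroll over depth
        h[l] = np.tanh(Wxh[l] @ inp + Whh[l] @ h[l])
        inp = h[l]                       # this layer's hidden state feeds the layer above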
Long Short Term Memory (LSTM)
Hochreiter and Schmidhuber, "Long Short-Term Memory", Neural Computation 1997
Vanilla RNN Gradient Flow
Bengio et al., "Learning long-term dependencies with gradient descent is difficult", IEEE Transactions on Neural Networks, 1994
Pascanu et al., "On the difficulty of training recurrent neural networks", ICML 2013
In a single cell, ht = tanh(Whh * ht-1 + Wxh * xt): the stacked inputs are multiplied by W and passed through a tanh. Backpropagation from ht to ht-1 therefore multiplies by W (actually Whh^T) and flows back through the tanh.
Vanilla RNN Gradient Flow: Gradients over multiple time steps
Computing the gradient of the loss with respect to h0 requires backpropagating through every cell, so it accumulates many repeated factors of Whh^T (and of the tanh derivative).
What if we assumed no non-linearity? Then the gradient contains repeated powers of Whh, and its behavior is governed by Whh's largest singular value:
- Largest singular value > 1: Exploding gradients. Remedy: gradient clipping, i.e. scale the gradient if its norm is too big.
- Largest singular value < 1: Vanishing gradients. Remedy: change the RNN architecture (this motivates the LSTM below).
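A minimal sketch of the norm-based gradient clipping mentioned above: if the global norm of the gradient exceeds a threshold, scale the whole gradient down. The threshold here is an arbitrary illustrative value.

import numpy as np

def clip_gradient(grads, max_norm=5.0):
    """Scale the whole gradient down if its global L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# Toy example: an exploding gradient gets rescaled to norm 5.
dWxh, dWhh = np.full((4, 4), 100.0), np.full((4, 4), -50.0)
dWxh, dWhh = clip_gradient([dWxh, dWhh])
print(np.sqrt(np.sum(dWxh ** 2) + np.sum(dWhh ** 2)))   # ~5.0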
Long Short Term Memory (LSTM)
[Hochreiter and Schmidhuber, "Long Short-Term Memory", Neural Computation 1997]
The LSTM keeps two state vectors at each time step: a cell state ct and a hidden state ht, computed via four gates:
- i: Input gate, whether to write to the cell
- f: Forget gate, whether to erase the cell
- o: Output gate, how much to reveal the cell
- g: "Gate gate" (?), how much to write to the cell
The vector from below (x) and the vector from before (h) are stacked and multiplied by a single weight matrix W of shape 4h x 2h; the resulting 4h vector is split into four h-sized pieces, with i, f, o passed through a sigmoid and g through a tanh.
The cell state and hidden state are then updated as (⊙ = elementwise multiplication):
ct = f ⊙ ct-1 + i ⊙ g
ht = o ⊙ tanh(ct)
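A minimal NumPy sketch of one LSTM step following the gate layout above, assuming for simplicity that x has the same size h as the hidden state, so a single W of shape 4h x 2h acts on the stacked [h; x]; biases are omitted.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM step. W has shape (4h, 2h); x and h_prev each have size h."""
    hdim = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x])          # stack h and x, one big matrix multiply
    i = sigmoid(z[0 * hdim:1 * hdim])            # input gate: whether to write to the cell
    f = sigmoid(z[1 * hdim:2 * hdim])            # forget gate: whether to erase the cell
    o = sigmoid(z[2 * hdim:3 * hdim])            # output gate: how much to reveal the cell
    g = np.tanh(z[3 * hdim:4 * hdim])            # gate gate: how much to write to the cell
    c = f * c_prev + i * g                       # cell state: elementwise, no matrix multiply
    h = o * np.tanh(c)                           # hidden state
    return h, c

Hdim = 8
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (4 * Hdim, 2 * Hdim))
h, c = np.zeros(Hdim), np.zeros(Hdim)
for x in rng.normal(size=(5, Hdim)):             # toy input sequence
    h, c = lstm_step(x, h, c, W)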
Long Short Term Memory (LSTM): Gradient Flow
[Hochreiter et al., 1997]
Backpropagation from ct to ct-1 is only an elementwise multiplication by f; there is no matrix multiply by W.
Do LSTMs solve the vanishing gradient problem?
The LSTM architecture makes it easier for the RNN to preserve information over many timesteps:
- e.g. if the forget gate f = 1 and the input gate i = 0, then the information in that cell is preserved indefinitely.
- By contrast, it is harder for a vanilla RNN to learn a recurrent weight matrix Whh that preserves information in the hidden state.
Long Short Term Memory (LSTM): Gradient Flow
The chain of cell states c0, c1, c2, c3, ... gives a direct path for gradients to flow backward through time, similar to how skip connections ease gradient flow in ResNet!
Other RNN Variants
GRU [Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation", 2014]
[Jozefowicz et al., "An Empirical Exploration of Recurrent Network Architectures", 2015]
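For comparison, a minimal sketch of a GRU step (Cho et al., 2014). The weight shapes are illustrative, and the update convention used here has the update gate z interpolate between the previous hidden state and the candidate state.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Wr, Wh):
    """One GRU step; each W acts on the stacked [h_prev; x]."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(Wz @ hx)                                       # update gate
    r = sigmoid(Wr @ hx)                                       # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x]))    # candidate state
    return (1.0 - z) * h_tilde + z * h_prev                    # interpolate old state and candidate

H, D = 8, 4
rng = np.random.default_rng(0)
Wz, Wr, Wh = (rng.normal(0, 0.1, (H, H + D)) for _ in range(3))
h = np.zeros(H)
for x in rng.normal(size=(6, D)):                              # toy input sequence
    h = gru_step(x, h, Wz, Wr, Wh)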
Neural Architecture Search for RNN architectures
Zoph and Le, "Neural Architecture Search with Reinforcement Learning", ICLR 2017
Figures copyright Zoph et al., 2017. Reproduced with permission.
Summary
- RNNs allow a lot of flexibility in architecture design
- Vanilla RNNs are simple but don't work very well
- Common to use LSTM or GRU: their additive interactions improve gradient flow
- Backward flow of gradients in RNNs can explode or vanish. Exploding is controlled with gradient clipping; vanishing is controlled with additive interactions (LSTM)
- Better/simpler architectures are a hot topic of current research, as well as new paradigms for reasoning over sequences
- Better understanding (both theoretical and empirical) is needed
Next time: Attention and Transformers