MIT 6.S191: Introduction to Deep Learning, Lecture 2
Ava Soleimany
January 27, 2020
Given an image of a ball, can you predict where it will go next?

From a single snapshot alone, the answer is ambiguous. Given the ball's previous positions, however, the prediction becomes much easier: this is the essence of sequence modeling, using the past to predict the future.
Sequences in the Wild

Audio: an audio waveform can be split into a sequence of sound waves.
Text: language can be split into a sequence of characters or a sequence of words.
A Sequence Modeling Problem: Predict the Next Word

"This morning I took my cat for a walk."

Given these words, predict the next word. (Example: H. Suresh, 6.S191 2018.)
Idea #1: Use a Fixed Window

"This morning I took my cat for a walk."

Given the previous two words ("for a"), predict the next word.

A one-hot feature encoding tells us what each word is:
[1 0 0 0 0 0 1 0 0 0]  ("for a")  →  prediction
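To make the fixed-window idea concrete, here is a minimal sketch (not from the original slides) that one-hot encodes a two-word window; the toy vocabulary and its ordering are assumptions chosen so that "for a" reproduces the vector above.

import numpy as np

# Hypothetical toy vocabulary (an assumption for illustration)
vocab = ["for", "a", "cat", "walk", "morning"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # One-hot vector identifying a single word
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

def encode_window(words):
    # Concatenate the one-hot vectors of a fixed-size window of words
    return np.concatenate([one_hot(w) for w in words])

print(encode_window(["for", "a"]))  # [1. 0. 0. 0. 0. 0. 1. 0. 0. 0.]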
Problem #1: Can't Model Long-Term Dependencies

"France is where I grew up, but I now live in Boston. I speak fluent ___."  (J'aime 6.S191!, that is, "I love 6.S191!")

We need information from the distant past to accurately predict the correct word, and a small fixed window cannot capture it.
Idea #2: Use Entire Sequence as Set of Counts

"Bag of words": represent the whole sequence as a vector of word counts.
[0 1 0 0 1 0 0 … 0 0 1 1 0 0 0 1]  →  prediction

Problem #2: Counts Don't Preserve Order

"The food was good, not bad at all." vs. "The food was bad, not good at all."
Both sentences have exactly the same word counts, but opposite meanings.

Idea #3: Use a Really Big Fixed Window

"This morning I took my cat for a walk."
Given all of these words, predict the next word:
[1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 … ]  →  prediction
(one-hot blocks for "this", "morning", "took", "the", "cat", …)
Problem #3: No Parameter Sharing

[0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 … ]
(one-hot blocks for "this", "morning", "took", "the", "cat", …)

Each of these inputs has a separate parameter, so things learned about the sequence at one position do not transfer to other positions.
Sequence Modeling: Design Criteria

To model sequences, we need to:
1. Handle variable-length sequences
2. Track long-term dependencies
3. Maintain information about order
4. Share parameters across the sequence

Recurrent neural networks (RNNs) meet these design criteria.
#"
1 9 1
6 . S
!
One to One
M I T
“Vanilla” neural network
#"
1 9 1
6 . S
!
One to One
M I T
“Vanilla” neural network
Many to One
Sentiment Classification
#"
1 9 1
6 . S
!
One to One
M I T
“Vanilla” neural network
Many to One
Sentiment Classification
Many to Many
Music Generation
6.S191 Lab!
#"
1 9 1
6 . S … and many other
architectures and
applications
!
One to One
M I T
“Vanilla” neural network
Many to One
Sentiment Classification
Many to Many
Music Generation
6.S191 Lab!
Recurrent Neural Networks (RNNs)

A recurrent cell takes an input vector x_t, maintains an internal hidden state h_t, and produces an output vector ŷ_t at each time step.

Apply a recurrence relation at every time step to process a sequence:

    h_t = f_W(h_{t-1}, x_t)

where h_t is the cell state, f_W is a function parameterized by weights W, h_{t-1} is the old state, and x_t is the input vector at time step t.

Note: the same function and set of parameters are used at every time step.
RNN Intuition (pseudocode):

my_rnn = RNN()
hidden_state = [0, 0, 0, 0]

sentence = ["I", "love", "recurrent", "neural"]

for word in sentence:
    prediction, hidden_state = my_rnn(word, hidden_state)

next_word_prediction = prediction
# >>> "networks!"
RNN State Update and Output

Input vector: x_t

Update hidden state:
    h_t = tanh(W_hh^T h_{t-1} + W_xh^T x_t)

Output vector:
    ŷ_t = W_hy^T h_t
RNNs: Computational Graph Across Time

Represent the RNN as a computational graph unrolled across time: the same recurrent cell is applied to the inputs x_1, x_2, x_3, …, x_t, producing outputs ŷ_1, ŷ_2, ŷ_3, …, ŷ_t. The weight matrices W_xh (input to hidden), W_hh (hidden to hidden), and W_hy (hidden to output) are re-used at every time step. A loss L_1, L_2, L_3, …, L_t can be computed at each time step, and the total loss L is their sum.
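As an illustrative sketch (not from the slides), the unrolled forward pass and the summed per-step loss can be written as a simple loop; the dimensions and the mean-squared-error loss below are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, T = 3, 4, 2, 5

# The same weight matrices are re-used at every time step
W_xh = 0.1 * rng.normal(size=(hidden_dim, input_dim))
W_hh = 0.1 * rng.normal(size=(hidden_dim, hidden_dim))
W_hy = 0.1 * rng.normal(size=(output_dim, hidden_dim))

xs = rng.normal(size=(T, input_dim))    # input sequence x_1 ... x_T
ys = rng.normal(size=(T, output_dim))   # target sequence y_1 ... y_T

h = np.zeros(hidden_dim)
total_loss = 0.0
for t in range(T):
    h = np.tanh(W_hh @ h + W_xh @ xs[t])            # update hidden state
    y_hat = W_hy @ h                                # output at time step t
    total_loss += np.mean((y_hat - ys[t]) ** 2)     # per-step loss L_t

print(total_loss)  # total loss L is the sum of the per-step losses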
Implementing RNNs from Scratch in TensorFlow

class MyRNNCell(tf.keras.layers.Layer):
    def __init__(self, rnn_units, input_dim, output_dim):
        super(MyRNNCell, self).__init__()
        # Initialize weight matrices
        self.W_xh = self.add_weight(shape=[rnn_units, input_dim])
        self.W_hh = self.add_weight(shape=[rnn_units, rnn_units])
        self.W_hy = self.add_weight(shape=[output_dim, rnn_units])
        # Initialize hidden state to zeros
        self.h = tf.zeros([rnn_units, 1])

    def call(self, x):
        # Update the hidden state
        self.h = tf.math.tanh(tf.matmul(self.W_hh, self.h) + tf.matmul(self.W_xh, x))
        # Compute the output vector
        output = tf.matmul(self.W_hy, self.h)
        # Return the current output and hidden state
        return output, self.h

TensorFlow also provides this as a built-in layer:

tf.keras.layers.SimpleRNN(rnn_units)
"
1 9 1
Backpropagation algorithm:
!
M I T minimize loss
1 9
(' 1 ()
$#" $#%
*,-
6
*,-. S $#&
*,-
$#' …
*,-
$#"
RNN
M I=T *+,
*,,
*+,
*,,
*+,
*,,
*+,
(% (&
1 9
(' ()
$#" $#%
*,-
6
*,-. S $#&
*,-
$#' …
*,-
$#"
RNN
M I=T *+,
*,,
*+,
*,,
*+,
*,,
*+,
…9
1 1 '))
ℎ&
'()
#" #$
6 . S #% #&
M I T
6.S191 Introduction to Deep Learning
1/27/20
introtodeeplearning.com @MITDeepLearning
Standard RNN Gradient Flow: Exploding and Vanishing Gradients

Computing the gradient with respect to the initial state h_0 involves many repeated factors of W_hh (and repeated gradient computation).

Many values > 1: exploding gradients. Remedy: gradient clipping to scale big gradients.

Many values < 1: vanishing gradients. Remedies:
1. Activation function
2. Weight initialization
3. Network architecture
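As a sketch of how gradient clipping might look in practice (not from the slides; the model, data shapes, and clip norm are illustrative assumptions):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(32, input_shape=(None, 8)),
    tf.keras.layers.Dense(1),
])
loss_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

x_batch = tf.random.normal([16, 20, 8])   # 16 sequences, 20 steps, 8 features (dummy data)
y_batch = tf.random.normal([16, 1])

with tf.GradientTape() as tape:
    loss = loss_fn(y_batch, model(x_batch))

grads = tape.gradient(loss, model.trainable_variables)
# Rescale gradients so their global norm is at most 1.0, guarding against exploding gradients
clipped, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)
optimizer.apply_gradients(zip(clipped, model.trainable_variables))

# Keras optimizers can also clip automatically, e.g. tf.keras.optimizers.Adam(clipnorm=1.0)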
The Problem of Long-Term Dependencies

Why are vanishing gradients a problem? Errors due to further-back time steps have smaller and smaller gradients, which biases the network's parameters to capture only short-term dependencies. As the gap grows between an output ŷ_t and the earlier inputs it depends on, a standard RNN becomes unable to connect the two.
Trick #1: Activation Functions. Using ReLU (whose derivative is 1 for x > 0) instead of tanh or sigmoid helps prevent the activation's derivative from shrinking the gradients.

Solution: Gated Cells

Idea: use a more complex recurrent unit with gates to control what information is passed through (a gated cell such as an LSTM or GRU).

Long Short Term Memory (LSTM) networks rely on a gated cell to track information throughout many time steps.
Long Short Term Memory (LSTMs)

In a standard RNN, the repeating module contains a single computation node (a tanh layer). In an LSTM, the repeating module contains interacting sigmoid and tanh layers that control the flow of information.

tf.keras.layers.LSTM(num_units)

Information is added to or removed from the cell state through structures called gates. Gates optionally let information through, for example via a sigmoid neural net layer and pointwise multiplication.
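A minimal sketch (illustrative assumptions, not the lecture's code) of using the built-in LSTM layer, here stacked two deep:

import tensorflow as tf

model = tf.keras.Sequential([
    # return_sequences=True passes the full sequence of hidden states to the next LSTM
    tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(None, 10)),
    # The second LSTM returns only its final hidden state
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")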
Long Short Term Memory (LSTMs): 1) Forget 2) Store 3) Update 4) Output

1) Forget: LSTMs forget irrelevant parts of the previous state.
2) Store: LSTMs store relevant new information into the cell state.
3) Update: LSTMs selectively update their cell state values.
4) Output: the output gate controls what information is sent to the next time step.

(Hochreiter & Schmidhuber, Neural Computation 1997; Olah, "Understanding LSTMs".)
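For completeness, one standard formulation of these four steps, following the notation of Olah's "Understanding LSTMs" (the equations are not spelled out on the slides; ⊙ denotes elementwise multiplication):

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)      (forget gate)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)      (store/input gate)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)   (candidate cell values)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t          (update the cell state)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)      (output gate)
h_t = o_t ⊙ tanh(c_t)                    (filtered output / new hidden state)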
LSTM Gradient Flow

The cell state provides a path for uninterrupted gradient flow: backpropagating from c_t to c_{t-1} requires only elementwise operations, with no repeated multiplication by W_hh.
LSTMs: Key Concepts

1. Maintain a separate cell state from what is outputted
2. Use gates to control the flow of information
   • Forget gate gets rid of irrelevant information
   • Store relevant information from current input
   • Selectively update cell state
   • Output gate returns a filtered version of the cell state
3. Backpropagation through time with uninterrupted gradient flow
Example Task: Music Generation

Input: sheet music (e.g., E F# G C). Output: the next character in the sheet music (e.g., F# G C A). Explored hands-on in the 6.S191 Lab!
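A minimal sketch of the kind of character-level model used for this task (the vocabulary size, embedding size, and layer width are illustrative assumptions; the lab builds the full version):

import tensorflow as tf

vocab_size = 83   # assumed number of unique characters in the sheet-music text
model = tf.keras.Sequential([
    # Map each character index to a learned embedding vector
    tf.keras.layers.Embedding(vocab_size, 256),
    # Return a hidden state at every time step (many-to-many)
    tf.keras.layers.LSTM(512, return_sequences=True),
    # Predict a distribution over the next character at each position
    tf.keras.layers.Dense(vocab_size),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)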
Example Task: Sentiment Classification

Input: sequence of words. Output: probability of positive sentiment.
Example: "I love this class!" → sentiment: <positive>
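A minimal many-to-one sketch for this task (vocabulary size, embedding size, and units are illustrative assumptions, not the lecture's code):

import tensorflow as tf

vocab_size = 10000   # assumed vocabulary size
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    # Only the final hidden state is used (many-to-one)
    tf.keras.layers.LSTM(64),
    # Probability that the sentiment is positive
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])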
Example Task: Machine Translation

An encoder RNN reads the source sentence ("the dog eats") and a decoder RNN generates the translation token by token, starting from a <start> token ("le chien …"). The entire input sentence must be compressed into a single encoded state, creating an encoding bottleneck.
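A minimal encoder-decoder sketch illustrating the bottleneck (all names, sizes, and the teacher-forcing setup are assumptions, not the lecture's code):

import tensorflow as tf

src_vocab, tgt_vocab, units = 8000, 8000, 256   # illustrative sizes

# Encoder: read the source sentence and compress it into a final state
enc_inputs = tf.keras.Input(shape=(None,))
enc_emb = tf.keras.layers.Embedding(src_vocab, units)(enc_inputs)
_, enc_h, enc_c = tf.keras.layers.LSTM(units, return_state=True)(enc_emb)

# Decoder: generate the target sentence, initialized from the encoder's final state
# (the entire source must fit into that single state: the encoding bottleneck)
dec_inputs = tf.keras.Input(shape=(None,))
dec_emb = tf.keras.layers.Embedding(tgt_vocab, units)(dec_inputs)
dec_out = tf.keras.layers.LSTM(units, return_sequences=True)(
    dec_emb, initial_state=[enc_h, enc_c])
logits = tf.keras.layers.Dense(tgt_vocab)(dec_out)

model = tf.keras.Model([enc_inputs, dec_inputs], logits)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)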
Example Task: Trajectory Prediction for Self-Driving Cars (Waymo)

Environmental Modeling

Predicting environmental quantities such as particulates, SO2, winds, and humidity (visualization: earth.nullschool.net).
Deep Learning for Sequence Modeling: Summary

1. RNNs are well suited for sequence modeling tasks
2. Model sequences via a recurrence relation
3. Training RNNs with backpropagation through time
4. Gated cells like LSTMs let us model long-term dependencies
5. Models for music generation, classification, machine translation, and more
6.S191: Introduction to Deep Learning
Lab 1: Introduction to TensorFlow and Music Generation with RNNs

1. Open the lab in Google Colab