6.S191 MIT Deep Learning, Lecture 2

This document discusses challenges in sequence modeling and introduces recurrent neural networks (RNNs) as an approach to address them. To model sequences effectively, a model needs to handle variable-length sequences, track long-term dependencies, maintain information about order, and share parameters across the sequence. RNNs are presented as a way to meet these criteria.

Deep Sequence Modeling

Ava Soleimany
MIT 6.S191
January 27, 2020

6.S191 Introduction to Deep Learning


introtodeeplearning.com @MITDeepLearning

Given an image of a ball, can you predict where it will go next?

Sequences in the Wild

Audio

Text: sequences at the character level or the word level

A Sequence Modeling Problem: Predict the Next Word

"This morning I took my cat for a walk."

Given these words, predict the next word.

(H. Suresh, 6.S191 2018.)

Idea #1: Use a Fixed Window

"This morning I took my cat for a walk."

Given the previous two words ("for a"), predict the next word.

One-hot feature encoding tells us what each word is:
[1 0 0 0 0 0 1 0 0 0]  ("for a")  →  prediction

(H. Suresh, 6.S191 2018.)

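As a rough sketch of this idea (not from the lecture; the toy vocabulary and helper names below are made up for illustration), a fixed two-word window can be encoded by concatenating one-hot vectors:

import numpy as np

vocab = ["this", "morning", "i", "took", "my", "cat", "for", "a", "walk"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # One-hot feature encoding: a vector with a single 1 marking which word it is
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

# Fixed window: the two preceding words, concatenated into a single input vector
window = ["for", "a"]
x = np.concatenate([one_hot(w) for w in window])   # shape: (2 * len(vocab),)

This fixed-size vector could then be fed to a standard feed-forward model to make the prediction.
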
Problem #1: Can't Model Long-Term Dependencies

"France is where I grew up, but I now live in Boston. I speak fluent ___."

J'aime 6.S191!  (French: "I love 6.S191!")

We need information from the distant past to accurately predict the correct word.

(H. Suresh, 6.S191 2018.)

Idea #2: Use Entire Sequence as Set of Counts ("Bag of Words")

"This morning I took my cat for a"

[0 1 0 0 1 0 0 … 0 0 1 1 0 0 0 1]  →  prediction

(H. Suresh, 6.S191 2018.)

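A minimal sketch of the counts idea (illustrative only; the tiny vocabulary is assumed), showing how a "bag of words" vector is built and why it discards order:

import numpy as np

vocab = ["this", "morning", "i", "took", "my", "cat", "for", "a", "walk"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

sentence = "this morning i took my cat for a".split()
counts = np.zeros(len(vocab))
for w in sentence:
    counts[word_to_idx[w]] += 1   # only how often each word occurs; word order is lost
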
Problem #2: Counts Don't Preserve Order

"The food was good, not bad at all."
vs.
"The food was bad, not good at all."

(H. Suresh, 6.S191 2018.)

Idea #3: Use a Really Big Fixed Window

"This morning I took my cat for a walk."

Given these words, predict the next word.

[1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 …]  →  prediction
(one-hot encoding of each word in the window, concatenated)

(H. Suresh, 6.S191 2018.)

Problem #3: No Parameter Sharing

[1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 …]   (this morning took the cat …)

Each of these inputs has a separate parameter, so if "this morning" appears later in the sequence instead:

[0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 …]   (… this morning)

things we learn about the sequence won't transfer if they appear elsewhere in the sequence.

(H. Suresh, 6.S191 2018.)

Sequence Modeling: Design Criteria

To model sequences, we need to:
1. Handle variable-length sequences
2. Track long-term dependencies
3. Maintain information about order
4. Share parameters across the sequence

Today: Recurrent Neural Networks (RNNs) as an approach to sequence modeling problems.

Recurrent Neural Networks (RNNs)

Standard Feed-Forward Neural Network

One to One: "Vanilla" neural network (a single input x mapped to a single output ŷ)

Recurrent Neural Networks for Sequence Modeling

One to One: "Vanilla" neural network
Many to One: Sentiment Classification
Many to Many: Music Generation (6.S191 Lab!)
… and many other architectures and applications

Standard "Vanilla" Neural Network

input vector x_t  →  output vector ŷ_t

Recurrent Neural Network (RNN)

input vector x_t  →  RNN (recurrent cell with hidden state h_t)  →  output vector ŷ_t

Recurrent Neural Network (RNN)

Apply a recurrence relation at every time step to process a sequence:

h_t = f_W(h_{t-1}, x_t)

where h_t is the cell state, f_W is a function parameterized by weights W, h_{t-1} is the old state, and x_t is the input vector at time step t.

Note: the same function and set of parameters are used at every time step.

RNN Intuition

my_rnn = RNN()
hidden_state = [0, 0, 0, 0]

sentence = ["I", "love", "recurrent", "neural"]

for word in sentence:
    prediction, hidden_state = my_rnn(word, hidden_state)

next_word_prediction = prediction
# >>> "networks!"

RNN State Update and Output

Input vector: x_t

Update hidden state: h_t = tanh(W_hh^T h_{t-1} + W_xh^T x_t)

Output vector: ŷ_t = W_hy^T h_t

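As a minimal numerical sketch of these two equations (the dimensions and random weights are illustrative, not from the slides; the weights here are shaped so they left-multiply the vectors directly, matching the MyRNNCell code later in the deck):

import numpy as np

hidden_dim, input_dim, output_dim = 4, 3, 2
rng = np.random.default_rng(0)
W_hh = rng.normal(size=(hidden_dim, hidden_dim))   # state-to-state weights
W_xh = rng.normal(size=(hidden_dim, input_dim))    # input-to-state weights
W_hy = rng.normal(size=(output_dim, hidden_dim))   # state-to-output weights

h_prev = np.zeros(hidden_dim)        # previous hidden state h_{t-1}
x_t = rng.normal(size=input_dim)     # input vector at time step t

h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)   # update hidden state
y_t = W_hy @ h_t                            # compute output vector
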
RNNs: Computational Graph Across Time

Represent the RNN as a computational graph unrolled across time: at each time step t, the input x_t enters through W_xh, the previous state is carried forward through W_hh, and the output ŷ_t is produced through W_hy.

Re-use the same weight matrices (W_xh, W_hh, W_hy) at every time step.

Forward pass: a loss L_t is computed at each time step, and the individual losses sum to the total loss L.

RNNs from Scratch

import tensorflow as tf

class MyRNNCell(tf.keras.layers.Layer):
    def __init__(self, rnn_units, input_dim, output_dim):
        super(MyRNNCell, self).__init__()
        # Initialize weight matrices
        self.W_xh = self.add_weight(shape=[rnn_units, input_dim])
        self.W_hh = self.add_weight(shape=[rnn_units, rnn_units])
        self.W_hy = self.add_weight(shape=[output_dim, rnn_units])
        # Initialize hidden state to zeros
        self.h = tf.zeros([rnn_units, 1])

    def call(self, x):
        # Update the hidden state: h_t = tanh(W_hh h_{t-1} + W_xh x_t)
        self.h = tf.math.tanh(tf.matmul(self.W_hh, self.h) + tf.matmul(self.W_xh, x))
        # Compute the output: y_t = W_hy h_t
        output = tf.matmul(self.W_hy, self.h)
        # Return the current output and hidden state
        return output, self.h

RNN Implementation in TensorFlow

tf.keras.layers.SimpleRNN(rnn_units)

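A possible usage sketch (the layer sizes and the surrounding model are assumptions for illustration, not part of the slide): SimpleRNN consumes a whole (time, features) sequence and returns its final hidden state.

import tensorflow as tf

model = tf.keras.Sequential([
    # input shape: (time steps, features); None allows variable-length sequences
    tf.keras.layers.SimpleRNN(64, input_shape=(None, 10)),
    tf.keras.layers.Dense(1)   # e.g. a single prediction from the final hidden state
])
model.compile(optimizer="adam", loss="mse")
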
Backpropagation Through Time (BPTT)

Recall: Backpropagation in Feed-Forward Models

Backpropagation algorithm:
1. Take the derivative (gradient) of the loss with respect to each parameter
2. Shift parameters in order to minimize loss

RNNs: Backpropagation Through Time

Forward pass: compute the loss L_t at each time step and sum the losses into the total loss L.

Backward pass: gradients flow backwards across the unrolled graph, from the total loss through each output and hidden state to the shared weights W_hy, W_hh, and W_xh.

(Mozer, Complex Systems 1989.)

Standard RNN Gradient Flow

(Unrolled chain from h_0 to h_t through repeated applications of W_hh, with inputs x_0 … x_t entering through W_xh.)

Computing the gradient with respect to h_0 involves many repeated factors of W_hh, and repeated gradient computation!

Standard RNN Gradient Flow: Exploding Gradients

Many values > 1: exploding gradients.
Fix: gradient clipping to scale big gradients.

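One way gradient clipping can be applied in TensorFlow (an illustrative sketch; the threshold of 1.0 and the choice of optimizer are assumptions): either pass clipnorm to a Keras optimizer, or clip the gradients manually inside a custom training step.

import tensorflow as tf

# Option 1: let the optimizer clip each gradient's norm
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)

# Option 2 (inside a custom training step, shown here as comments):
# grads = tape.gradient(loss, model.trainable_variables)
# grads = [tf.clip_by_norm(g, 1.0) for g in grads]
# optimizer.apply_gradients(zip(grads, model.trainable_variables))
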
Standard RNN Gradient Flow: Vanishing Gradients

Many values < 1: vanishing gradients.
Fixes:
1. Activation function
2. Weight initialization
3. Network architecture

The Problem of Long-Term Dependencies

Why are vanishing gradients a problem?
1. Multiply many small numbers together
2. Errors due to time steps further back have smaller and smaller gradients
3. Parameters become biased to capture short-term dependencies

"The clouds are in the ___": the relevant word is only a few steps back, so short-term dependencies suffice.

"I grew up in France, … and I speak fluent ___": the relevant information is many time steps back, which is exactly where vanishing gradients make learning difficult.

Trick #1: Activation Functions

Using ReLU prevents f' from shrinking the gradients when x > 0, since the ReLU derivative is 1 there; the tanh and sigmoid derivatives, by contrast, are below 1 over most of their domain and shrink gradients.

(Plot on the slide: ReLU, tanh, and sigmoid derivatives.)

(H. Suresh, 6.S191 2018.)

Trick #2: Parameter Initialization

Initialize weights to the identity matrix.
Initialize biases to zero.

This helps prevent the weights from shrinking to zero.

(H. Suresh, 6.S191 2018.)

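One way to express this trick in Keras (a hedged sketch; the layer size is illustrative, and mapping the trick onto the recurrent weights is an interpretation): initialize the recurrent weight matrix to the identity and the biases to zero.

import tensorflow as tf

rnn = tf.keras.layers.SimpleRNN(
    64,
    recurrent_initializer=tf.keras.initializers.Identity(),  # W_hh starts as the identity matrix
    bias_initializer="zeros",                                 # biases start at zero
)
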
Solution #3: Gated Cells

Idea: use a more complex recurrent unit with gates to control what information is passed through (gated cells: LSTM, GRU, etc.).

Long Short Term Memory (LSTM) networks rely on a gated cell to track information throughout many time steps.

(H. Suresh, 6.S191 2018.)

Long Short Term Memory (LSTM) Networks

Standard RNN

In a standard RNN, repeating modules contain a simple computation node: each repeating cell applies a single tanh layer to h_{t-1} and x_t to produce h_t.

Long Short Term Memory (LSTMs)

LSTM modules contain computational blocks that control information flow: each repeating module combines sigmoid (σ) and tanh blocks arranged as gates.

LSTM cells are able to track information throughout many time steps.

tf.keras.layers.LSTM(num_units)

(Hochreiter & Schmidhuber, Neural Computation 1997.)

Long Short Term Memory (LSTMs)

Information is added or removed through structures called gates.

Gates optionally let information through, for example via a sigmoid neural net layer and pointwise multiplication.

Long Short Term Memory (LSTMs)

How do LSTMs work?
1) Forget: LSTMs forget irrelevant parts of the previous state
2) Store: LSTMs store relevant new information into the cell state
3) Update: LSTMs selectively update cell state values
4) Output: the output gate controls what information is sent to the next time step

(Hochreiter & Schmidhuber, Neural Computation 1997; Olah, "Understanding LSTMs".)

introtodeeplearning.com @MITDeepLearning
LSTM Gradient Flow
Uninterrupted gradient flow!

#) #"

1 9 1 #*

+, +)

6 . S +" +*

T
tanh tanh tanh

!)
$

M I
tanh $

!"
$ $ tanh $ $

!*
$ tanh $

6.S191 Introduction to Deep Learning


1/27/20
introtodeeplearning.com @MITDeepLearning
LSTMs: Key Concepts

1. Maintain a separate cell state from what is outputted
2. Use gates to control the flow of information
   • Forget gate gets rid of irrelevant information
   • Store relevant information from current input
   • Selectively update cell state
   • Output gate returns a filtered version of the cell state
3. Backpropagation through time with uninterrupted gradient flow

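As an illustrative sketch (the hyperparameters and surrounding model are assumptions, not from the slides), swapping the simple recurrent cell for a gated LSTM layer in Keras is a one-line change:

import tensorflow as tf

model = tf.keras.Sequential([
    # LSTM gated cell in place of SimpleRNN; returns the final hidden state
    tf.keras.layers.LSTM(128, input_shape=(None, 10)),
    tf.keras.layers.Dense(1)
])
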
RNN Applications

Example Task: Music Generation

Input: sheet music
Output: next character in sheet music

(Sheet music shown on the slide: F# G C A … E F# G C)

6.S191 Lab!

(H. Suresh, 6.S191 2018.)

Example Task: Sentiment Classification

Input: sequence of words (e.g. "I love this class!")
Output: probability of having positive sentiment (<positive>)

loss = tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=predicted)

(Socher et al., EMNLP 2013.)

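A rough many-to-one sketch (the architecture, vocabulary size, and embedding size are assumptions for illustration): embed the words, run an RNN over the sequence, and predict sentiment from the final hidden state.

import tensorflow as tf

vocab_size, embedding_dim = 10000, 64
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),  # word indices -> vectors
    tf.keras.layers.SimpleRNN(64),                         # final hidden state summarizes the sequence
    tf.keras.layers.Dense(2)                               # logits for <negative>, <positive>
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
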
Tweet sentiment classification works the same way: for example, "I love this class!" → <positive>.

(H. Suresh, 6.S191 2018.)

Example Task: Machine Translation

Encoder (English) → Decoder (French): the encoder reads "the dog eats", and the decoder, fed <start> le chien, produces "le chien mange".

The entire input sentence must be squeezed into a single encoded state before decoding begins: an encoding bottleneck.

(H. Suresh, 6.S191 2018.)

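A minimal encoder-decoder sketch (vocabulary sizes, layer sizes, and the use of GRU cells are assumptions for illustration): the encoder compresses the English sentence into a single state vector that initializes the French decoder; that single vector is the encoding bottleneck shown on the slide.

import tensorflow as tf

vocab_en, vocab_fr, emb, units = 10000, 10000, 64, 128

# Encoder (English): read the whole source sentence, keep only the final state
encoder_inputs = tf.keras.Input(shape=(None,), dtype="int32")
enc_emb = tf.keras.layers.Embedding(vocab_en, emb)(encoder_inputs)
_, enc_state = tf.keras.layers.GRU(units, return_state=True)(enc_emb)

# Decoder (French): generate the translation, starting from the encoder's state
decoder_inputs = tf.keras.Input(shape=(None,), dtype="int32")
dec_emb = tf.keras.layers.Embedding(vocab_fr, emb)(decoder_inputs)
dec_out = tf.keras.layers.GRU(units, return_sequences=True)(dec_emb, initial_state=enc_state)
logits = tf.keras.layers.Dense(vocab_fr)(dec_out)

model = tf.keras.Model([encoder_inputs, decoder_inputs], logits)
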
Attention Mechanisms

Attention mechanisms in neural networks provide learnable memory access: rather than relying only on the final bottleneck encoding, the decoder can learn to attend to all of the encoder's states.

(Sutskever et al., NIPS 2014; Bahdanau et al., ICLR 2015.)

Trajectory Prediction: Self-Driving Cars

(Waymo.)

Environmental Modeling

Sequence models can also be applied to environmental data such as particulates, SO2, winds, and humidity.

(earth.nullschool.net)

Deep Learning for Sequence Modeling: Summary

1. RNNs are well suited for sequence modeling tasks
2. Model sequences via a recurrence relation
3. Training RNNs with backpropagation through time
4. Gated cells like LSTMs let us model long-term dependencies
5. Models for music generation, classification, machine translation, and more

6.S191: Introduction to Deep Learning

Lab 1: Introduction to TensorFlow and Music Generation with RNNs

Link to download labs: http://introtodeeplearning.com#schedule

1. Open the lab in Google Colab
2. Start executing code blocks and filling in the #TODOs
3. Need help? Find a TA or come to the front!