LSTM Lecture
[Title slide diagram: an unrolled RNN (states s0, s1, s2, … over inputs x0, x1, x2, …) and a deep neural network (inputs x0…xn, hidden layers, outputs o0…on)]
The Perceptron
[Diagram: a perceptron. Inputs x0…xn are multiplied by weights w0…wn, summed (Σ) together with a bias b, and passed through a non-linearity to produce the output.]
Perceptron Forward Pass
[Diagram: the forward pass. The weighted sum of the inputs plus the bias is passed through the non-linearity.]
output = g(w0*x0 + w1*x1 + … + wn*xn + b)
Perceptron Forward Pass
[Diagram: the same forward pass, highlighting the activation function (the non-linearity g) applied to the weighted sum]
Sigmoid Activation
[Diagram: the perceptron with the sigmoid function, g(z) = 1 / (1 + e^-z), as its non-linearity]
Common Activation Functions
Importance of Activation Functions
● Activation functions add non-linearity to our network’s function
● Most real-world problems + data are non-linear
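A minimal NumPy sketch of three commonly used activation functions (which particular functions were plotted on the original slide is an assumption):
import numpy as np

def sigmoid(z):
    # squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # squashes any real value into (-1, 1)
    return np.tanh(z)

def relu(z):
    # passes positive values through, zeroes out negatives
    return np.maximum(0.0, z)

print(sigmoid(0.0), tanh(0.0), relu(-2.0))   # 0.5 0.0 0.0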
Perceptron Forward Pass
[Diagram: example with inputs (2, 3, -1, 5), weights (0.1, 0.5, 2.5, 0.2), and bias weight 3.0]
(2*0.1) + (3*0.5) + (-1*2.5) + (5*0.2) + (1*3.0) = 3.2
output = g(3.2)
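A quick NumPy check of this worked example (treating g as a sigmoid is an assumption; the slide only calls it a non-linearity):
import numpy as np

x = np.array([2.0, 3.0, -1.0, 5.0])   # inputs
w = np.array([0.1, 0.5, 2.5, 0.2])    # weights
b = 3.0                                # bias weight (applied to a constant input of 1)

z = np.dot(w, x) + b                   # weighted sum: 3.2
output = 1.0 / (1.0 + np.exp(-z))      # sigmoid non-linearity g(z)
print(z, output)                       # 3.2, roughly 0.96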
How do we build neural networks
with perceptrons?
Perceptron Diagram Simplified
[Diagram: the weights, sum, and non-linearity are collapsed into a single node; inputs x0…xn map to a single output o0]
Multi-Output Perceptron
[Diagram: an input layer x0…xn fully connected to an output layer with outputs o0 and o1]
Multi-Layer Perceptron (MLP)
[Diagram: inputs x0…xn feed a hidden layer h0…hn, which feeds outputs o0…on]
Deep Neural Network
[Diagram: inputs x0…xn feed several stacked hidden layers, which feed outputs o0…on]
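A minimal NumPy sketch of a forward pass through such a stack of layers (the layer sizes and activation choices are illustrative assumptions, not taken from the slides):
import numpy as np

def dense(x, W, b, activation):
    # one fully connected layer: weighted sum plus bias, then a non-linearity
    return activation(np.dot(x, W) + b)

rng = np.random.RandomState(0)
x = rng.randn(4)                          # an example input with 4 features

W1, b1 = rng.randn(4, 8), np.zeros(8)     # hidden layer 1
W2, b2 = rng.randn(8, 8), np.zeros(8)     # hidden layer 2
W3, b3 = rng.randn(8, 2), np.zeros(2)     # output layer

h1 = dense(x, W1, b1, np.tanh)
h2 = dense(h1, W2, b2, np.tanh)
out = dense(h2, W3, b3, lambda z: 1.0 / (1.0 + np.exp(-z)))   # sigmoid outputs
print(out.shape)                          # (2,)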
Training Neural Networks
Training Neural Networks: Loss function
the loss compares the model's predictions to the actual labels, averaged over the training set:
J(θ) = (1/N) Σ_{i=1..N} loss( f(x_i; θ), y_i ),   where f(x_i; θ) is the predicted output, y_i is the actual label, and N = # examples
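A small NumPy sketch of a loss averaged over N examples (mean squared error is just one illustrative choice; the slide does not commit to a particular loss):
import numpy as np

predicted = np.array([0.9, 0.2, 0.7])     # model outputs f(x_i; theta)
actual    = np.array([1.0, 0.0, 1.0])     # true labels y_i
N = len(actual)                           # number of examples

loss = np.sum((predicted - actual) ** 2) / N   # mean squared error
print(loss)                                    # about 0.047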
Training Neural Networks: Objective
Loss is a function of the model’s parameters
How to minimize loss?
[Plot: the loss landscape over the parameters θ. Starting from a random initial point, compute the gradient of the loss, step in the opposite direction, and repeat.]
Repeat! This is called Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD)
● Initialize θ randomly
● For N Epochs
○ For each training example (x, y):
■ Compute the gradient of the loss with respect to θ
■ Update θ := θ − η · gradient   (η = learning rate)
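A minimal NumPy sketch of this procedure for a one-parameter model (the toy data, squared-error loss, and learning rate are illustrative assumptions):
import numpy as np

# toy dataset where y = 3x, so the best weight is 3.0 (made-up data)
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = 3.0 * xs

theta = np.random.randn()            # initialize theta randomly
eta = 0.01                           # learning rate

for epoch in range(100):             # for N epochs
    for x, y in zip(xs, ys):         # for each training example (x, y)
        grad = 2.0 * (theta * x - y) * x   # gradient of the squared error w.r.t. theta
        theta -= eta * grad                # step against the gradient
print(theta)                         # close to 3.0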
Calculating the Gradient: Backpropagation
[Diagram: a simple network, x0 → h0 (weights W1) → o0 (weights W2) → loss J(θ)]
apply the chain rule backwards from the loss:
∂J/∂W2 = ∂J/∂o0 · ∂o0/∂W2
∂J/∂W1 = ∂J/∂o0 · ∂o0/∂h0 · ∂h0/∂W1
What is a sequence?
[Example: a speech waveform]
Successes of deep models
[Screenshots: machine translation (left) and question answering (right)]
Left: https://research.googleblog.com/2016/09/a-neural-network-for-machine.html
Right: https://rajpurkar.github.io/SQuAD-explorer/
how do we model sequences?
idea: represent a sequence as a bag of words
“I dislike rain.” → [01010001] → prediction
problem: bag of words does not preserve order
[0001000100100000100000001]
[0001000100100000100000001]
vs
[1000001000000010001000100 ]
“In France, I had a great time and I learnt some of the _____ language.”
RNNs remember their previous state:
[Diagram: at t=0, input x0 (“it”) and previous state s0 combine through weights W and U to produce s1; at t=1, input x1 (“was”) and s1 produce s2]
“unfolding” the RNN across time:
[Diagram: the unrolled RNN. Inputs x0, x1, x2, … feed through W into states s0, s1, s2, …, and each state feeds the next through U.]
notice that W and U stay the same!
sn can contain information from all past timesteps
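A minimal NumPy sketch of unrolling a vanilla RNN, showing the same W and U reused at every timestep (the sizes and the tanh update rule are assumptions for illustration):
import numpy as np

rng = np.random.RandomState(0)
state_size, input_size, seq_len = 5, 3, 4

W = rng.randn(state_size, input_size) * 0.1   # input-to-state weights, shared across all timesteps
U = rng.randn(state_size, state_size) * 0.1   # state-to-state weights, shared across all timesteps

s = np.zeros(state_size)                      # initial state s0
xs = rng.randn(seq_len, input_size)           # the input sequence x0, x1, x2, ...

states = []
for x in xs:
    s = np.tanh(W @ x + U @ s)                # each state depends on the current input and the previous state
    states.append(s)

print(len(states), states[-1].shape)          # 4 (5,)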
possible task: language model
[Example: a language model trained on all the works of Shakespeare generates new text:]
KING LEAR:
O, if you were a feeble sight, the courtesy of your law,
Your sight and several breath, will wear the gods
With his heads, and my hands are wonder'd at the deeds,
So drop upon your lordship's head, and your opinion
Shall be against your honour.
possible task: language model
[Diagram: an RNN language model. Feeding <start>, “alas”, “my” as inputs x0, x1, x2 produces predictions y0, y1, y2 for “alas”, “my”, “honor”.]
yi is actually a probability distribution over possible next words, aka a softmax
possible task: language model
http://kingjamesprogramming.tumblr.com/
possible task: classification (e.g. sentiment)
[Diagram: the RNN reads the whole sequence and its final state sn produces a single prediction y, e.g. “negative” :( or “positive” :)]
y is a probability distribution over possible classes (like positive, negative, neutral), aka a softmax
possible task: machine translation
[Diagram: an encoder-decoder model. An encoder RNN (states s0, s1, s2) reads the source sentence; a decoder RNN (states c0, c1, c2, c3) generates the translation.]
backpropagation!
(through time)
remember: backpropagation
[Diagram: the unrolled RNN with outputs y0, y1, y2 (through V) and a loss J0, J1, J2 at each timestep]
we have a loss at each timestep:
(since we’re making a prediction at each timestep)
we sum the losses across time:
J(θ) = Σ_t J_t(θ)
let’s try it out for W with the chain rule:
[Diagram: the unrolled RNN with losses J0, J1, J2 at each timestep]
for example, at timestep 2: ∂J2/∂W = ∂J2/∂y2 · ∂y2/∂s2 · ∂s2/∂W
but wait… s1 also depends on W, so we can’t just treat it as a constant!
how does s2 depend on W?
[Diagram: the unrolled RNN. s2 depends on W directly (through x2) and also indirectly, because s2 is computed from s1, and s1 was itself computed using W (and likewise s0).]
backpropagation through time:
∂Jt/∂W = Σ_{k=0..t} ∂Jt/∂yt · ∂yt/∂st · ∂st/∂sk · ∂sk/∂W
the sum captures the contributions of W in previous timesteps to the error at timestep t
why are RNNs hard to train?
problem: vanishing gradient
[Diagram: the unrolled RNN. The term ∂st/∂sk in the sum above is itself a product of many Jacobians, ∂st/∂sk = Π_{j=k+1..t} ∂sj/∂sj-1, so contributions from far-back timesteps (e.g. k = 0) involve multiplying many of these terms together.]
problem: vanishing gradient
as the gap between timesteps grows, that product of many small terms shrinks toward zero.
so what?
errors due to further back timesteps have increasingly smaller gradients.
parameters become biased to capture shorter-term dependencies.
“In France, I had a great time and I learnt some of the _____ language.”
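A tiny NumPy illustration of the effect: backpropagating through many timesteps multiplies many tanh-derivative factors together, each at most 1, so the product shrinks rapidly (the pre-activation values here are made up):
import numpy as np

def dtanh(z):
    # derivative of tanh: 1 - tanh(z)^2, always <= 1
    return 1.0 - np.tanh(z) ** 2

grad = 1.0
pre_activations = np.linspace(-2.0, 2.0, 20)   # hypothetical pre-activations, one per timestep
for z in pre_activations:
    grad *= dtanh(z)                           # one factor for each timestep we backprop through

print(grad)   # a tiny number (around 1e-10): the signal from 20 steps back has nearly vanished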
solution #1: activation functions
[Plot: the derivatives of tanh and sigmoid, both of which shrink toward zero away from the origin]
solution #2: initialization
solution #3: gated cells
rather than each node being just a simple RNN cell, make each node a more complex unit with gates controlling what information is passed through.
[Diagram: a simple RNN cell vs a gated (LSTM-style) cell, mapping sj to sj+1]
solution #3: more on LSTMs
[Diagram: inside the LSTM cell, mapping sj to sj+1]
● forget irrelevant parts of previous state
● selectively update cell state values
● output certain parts of cell state
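A minimal NumPy sketch of a single LSTM step with these three gates, following the standard LSTM equations (the weight shapes, names, and initialization are illustrative assumptions, not taken from the slides):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    # params holds weight matrices (W*) and biases (b*) for the four internal layers
    z = np.concatenate([h_prev, x])
    f = sigmoid(params["Wf"] @ z + params["bf"])   # forget gate: drop irrelevant parts of the old cell state
    i = sigmoid(params["Wi"] @ z + params["bi"])   # input gate: decide which values to update
    g = np.tanh(params["Wg"] @ z + params["bg"])   # candidate values for the update
    o = sigmoid(params["Wo"] @ z + params["bo"])   # output gate: decide what to expose
    c = f * c_prev + i * g                         # selectively update the cell state
    h = o * np.tanh(c)                             # output certain parts of the cell state
    return h, c

rng = np.random.RandomState(0)
hidden, inp = 4, 3
params = {name: rng.randn(hidden, hidden + inp) * 0.1 for name in ["Wf", "Wi", "Wg", "Wo"]}
params.update({name: np.zeros(hidden) for name in ["bf", "bi", "bg", "bo"]})

h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.randn(inp), h, c, params)
print(h.shape, c.shape)   # (4,) (4,)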
[Diagram: the encoder-decoder translation model again. The decoder (states c0, c1, c2, c3) generates “le chien mange <end>”, but it only sees a single summary of the source sentence.]
solution: attend over all encoder states
[Diagram: at each decoding step, the decoder forms a weighted combination s* of all encoder states and uses it, together with its own state, to produce the next word (“le”, “chien”, “mange”).]
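A minimal NumPy sketch of attending over encoder states: score each encoder state against the current decoder state, softmax the scores, and use the weighted sum as s* (dot-product scoring is an assumption; the lecture does not specify the scoring function):
import numpy as np

def attend(decoder_state, encoder_states):
    # score each encoder state against the decoder state (dot-product scoring)
    scores = encoder_states @ decoder_state
    # softmax the scores into attention weights
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # s*: weighted combination of all encoder states
    return weights @ encoder_states, weights

rng = np.random.RandomState(0)
encoder_states = rng.randn(4, 5)      # s0..s3, one row per source position
decoder_state = rng.randn(5)          # current decoder state

context, weights = attend(decoder_state, encoder_states)
print(weights.round(2), context.shape)   # attention weights sum to 1; context is a 5-dim vector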
now we can model sequences!
● why recurrent neural networks?
● building models for language, classification, and machine translation
● training them with backpropagation through time
● solving the vanishing gradient problem with activation functions,
initialization, and gated cells (like LSTMs)
● using attention mechanisms
and there’s lots more to do!
● extending our models to timeseries + waveforms
● complex language models to generate long text or books
● language models to generate code
● controlling cars + robots
● predicting stock market trends
● summarizing books + articles
● handwriting generation
● multilingual translation models
● … many more!
Using TensorFlow
Deep Learning Frameworks
● GPU Acceleration
● Automatic Differentiation
https://cs224d.stanford.edu/lectures/CS224d-Lecture7.pdf
TensorFlow Basics
● Create a session
import tensorflow as tf
session = tf.InteractiveSession()
or
session = tf.Session()
What is a graph?
● Encapsulates the computation you want to perform
What are graphs made of?
● Placeholders (aka Graph Inputs)
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
What are graphs made of?
● Constants
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
k = tf.constant(1.0)
What are graphs made of?
● Operations
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
k = tf.constant(1.0)
c = tf.add(a, b)
d = tf.subtract(b, k)
e = tf.multiply(c, d)
How do we run the graph?
● Select nodes to evaluate
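For example, evaluating node e from the graph above could look like this (a minimal sketch; the feed values 2.0 and 3.0 are made up):
result = session.run(e, feed_dict={a: 2.0, b: 3.0})
# c = a + b = 5.0, d = b - k = 2.0, e = c * d = 10.0
print(result)   # 10.0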
● Enter: tf.Variable
tf.Variable: Initialization
● Can initialize to specific values
b1 = tf.Variable(tf.zeros((2,2)), name="bias")
w1 = tf.Variable(tf.random_normal((2,2)), name="w1")
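Variables also need to be initialized inside a session before they can be read; in TensorFlow 1.x this is typically done with a single op (a minimal sketch using the session created earlier):
init_op = tf.global_variables_initializer()   # an op that initializes all variables in the graph
session.run(init_op)
print(session.run(b1))   # [[0. 0.] [0. 0.]]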
Building a Neural Network Graph
n_input_nodes = 2
n_output_nodes = 1
x = tf.placeholder(tf.float32, (None, 2))
y = tf.placeholder(tf.float32, (None, 1))
W = tf.Variable(tf.random_normal((n_input_nodes, n_output_nodes)))
b = tf.Variable(tf.zeros(n_output_nodes))
z = tf.matmul(x, W) + b
out = tf.sigmoid(z)
Adding a loss function
n_input_nodes = 2
n_output_nodes = 1
x = tf.placeholder(tf.float32, (None, 2))
y = tf.placeholder(tf.float32, (None, 1))  # labels, used by the loss below
W = tf.Variable(tf.random_normal((n_input_nodes, n_output_nodes)))
b = tf.Variable(tf.zeros(n_output_nodes))
z = tf.matmul(x, W) + b
out = tf.sigmoid(z)
loss = tf.reduce_mean(
tf.nn.sigmoid_cross_entropy_with_logits(
logits=z, labels=y))
Add an optimizer: SGD
learning_rate = 0.02
loss = tf.reduce_mean(
tf.nn.sigmoid_cross_entropy_with_logits(
logits=z, labels=y))
optimizer = tf.train.GradientDescentOptimizer(
learning_rate).minimize(loss)
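A minimal sketch of running this optimizer in a loop (the toy data is made up; x, y, loss, and session come from the earlier slides):
session.run(tf.global_variables_initializer())
data_x = [[0.0, 1.0], [1.0, 0.0]]   # made-up inputs
data_y = [[1.0], [0.0]]             # made-up labels
for step in range(100):
    _, loss_value = session.run([optimizer, loss],
                                feed_dict={x: data_x, y: data_y})
print(loss_value)                   # should decrease over the course of training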
tensorboard --logdir=path/to/log-directory
Summary Logs
● Summaries are operations! So just part of the graph:
with tf.variable_scope("foo"):
with tf.variable_scope("bar"):
v = tf.Variable("v", [1])
v.name
>>> "foo/bar/v:0"
Sharing weights tf.get_variable()
with tf.variable_scope("foo"):
with tf.variable_scope("bar"):
v = tf.get_variable("v", [1])
v.name
>>> "foo/bar/v:0"
Why share weights?
● Imagine we want to learn a feature detector that we run over multiple inputs,
aggregate the features, and produce a prediction, all in one graph
● Need to share the weights to ensure:
○ A shared, single representation is learned
○ Gradients get propagated for all inputs
Attempt 1
def cnn_feature_extractor(image):
    ...
    with tf.variable_scope("feature_extractor"):
        v = tf.Variable(tf.random_normal([1]), name="v")
        ...
        features = tf.nn.relu(h4)
    return features

# each call creates its own, brand-new set of tf.Variable objects,
# so the two extractors do NOT share weights
feat_1 = cnn_feature_extractor(image_1)
feat_2 = cnn_feature_extractor(image_2)
pred = predict(feat_1, feat_2)
Name Scoping for cleaner code
● Networks often re-use similar structures; writing each of them out gets tedious
def make_layer(input, input_size, output_size, scope_name):
    with tf.variable_scope(scope_name):
        W = tf.get_variable("w", (input_size, output_size),
                            initializer=tf.random_normal_initializer())
        b = tf.get_variable("b", (output_size,),
                            initializer=tf.zeros_initializer())
        z = tf.matmul(input, W) + b
    return z
Name Scoping for cleaner code
● Networks often re-use similar structures; writing each of them out gets tedious
...
input = ...
h0 = make_layer(input, 10, 20, "h0")
h1 = make_layer(h0, 20, 20, "h1")
...
tf.get_variable("h0/w")
tf.get_variable("h1/b")
Name Scoping Makes for Clean Graph Visualizations
Checkpointing + Saving Models
# Create a saver.
saver = tf.train.Saver(...variables...)
# Launch the graph and train, saving the model every 1,000 steps.
sess = tf.Session()
for step in xrange(1000000):
sess.run(..training_op..)
if step % 1000 == 0:
# Append the step number to the checkpoint name:
saver.save(sess, 'my-model', global_step=step)
Loading Models
# Add ops to save and restore all the variables.
saver = tf.train.Saver()
# Later, launch the model, use the saver to restore variables from disk, and
# do some work with the model.
with tf.Session() as sess:
# Restore variables from disk.
saver.restore(sess, "/tmp/model.ckpt")
print("Model restored.")
# Do some work with the model
TensorFlow as core of other Frameworks
● Keras, TFLearn, TF-slim, others all based on TensorFlow
● Research often means tinkering with inner workings - worthwhile to
understand the core of any framework you are using
TensorFlow Tutorial:
- Pair up (groups of 2)
- Go to https://github.com/yala/introdeeplearning
- Follow install instructions
- If you need help, come down to the front