LSTM Lecture
[Title slide diagram: an unrolled RNN (states s0, s1, s2, … over inputs x0, x1, x2, …) and a deep neural network (inputs x0…xn, hidden layers, outputs o0…on)]
The Perceptron
[Diagram: a perceptron. Inputs x0…xn are multiplied by weights w0…wn, summed (Σ) together with a bias b, and passed through a non-linearity to produce the output.]
Perceptron Forward Pass
[Diagram: the forward pass. The weighted sum of the inputs plus the bias is passed through the non-linearity.]
output = g(w0*x0 + w1*x1 + … + wn*xn + b)
Perceptron Forward Pass
[Diagram: the same forward pass, highlighting the activation function (the non-linearity g) applied to the weighted sum]
Sigmoid Activation
[Diagram: the perceptron with the sigmoid function, g(z) = 1 / (1 + e^-z), as its non-linearity]
Common Activation Functions
Importance of Activation Functions
● Activation functions add non-linearity to our network’s function
● Most real-world problems + data are non-linear
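A minimal NumPy sketch of three commonly used activation functions (which particular functions were plotted on the original slide is an assumption):
import numpy as np

def sigmoid(z):
    # squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # squashes any real value into (-1, 1)
    return np.tanh(z)

def relu(z):
    # passes positive values through, zeroes out negatives
    return np.maximum(0.0, z)

print(sigmoid(0.0), tanh(0.0), relu(-2.0))   # 0.5 0.0 0.0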
Perceptron Forward Pass
[Diagram: example with inputs (2, 3, -1, 5), weights (0.1, 0.5, 2.5, 0.2), and bias weight 3.0]
(2*0.1) + (3*0.5) + (-1*2.5) + (5*0.2) + (1*3.0) = 3.2
output = g(3.2)
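A quick NumPy check of this worked example (treating g as a sigmoid is an assumption; the slide only calls it a non-linearity):
import numpy as np

x = np.array([2.0, 3.0, -1.0, 5.0])   # inputs
w = np.array([0.1, 0.5, 2.5, 0.2])    # weights
b = 3.0                                # bias weight (applied to a constant input of 1)

z = np.dot(w, x) + b                   # weighted sum: 3.2
output = 1.0 / (1.0 + np.exp(-z))      # sigmoid non-linearity g(z)
print(z, output)                       # 3.2, roughly 0.96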
How do we build neural networks
with perceptrons?
Perceptron Diagram Simplified
[Diagram: the weights, sum, and non-linearity are collapsed into a single node; inputs x0…xn map to a single output o0]
Multi-Output Perceptron
[Diagram: an input layer x0…xn fully connected to an output layer with outputs o0 and o1]
Multi-Layer Perceptron (MLP)
[Diagram: inputs x0…xn feed a hidden layer h0…hn, which feeds outputs o0…on]
Deep Neural Network
[Diagram: inputs x0…xn feed several stacked hidden layers, which feed outputs o0…on]
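A minimal NumPy sketch of a forward pass through such a stack of layers (the layer sizes and activation choices are illustrative assumptions, not taken from the slides):
import numpy as np

def dense(x, W, b, activation):
    # one fully connected layer: weighted sum plus bias, then a non-linearity
    return activation(np.dot(x, W) + b)

rng = np.random.RandomState(0)
x = rng.randn(4)                          # an example input with 4 features

W1, b1 = rng.randn(4, 8), np.zeros(8)     # hidden layer 1
W2, b2 = rng.randn(8, 8), np.zeros(8)     # hidden layer 2
W3, b3 = rng.randn(8, 2), np.zeros(2)     # output layer

h1 = dense(x, W1, b1, np.tanh)
h2 = dense(h1, W2, b2, np.tanh)
out = dense(h2, W3, b3, lambda z: 1.0 / (1.0 + np.exp(-z)))   # sigmoid outputs
print(out.shape)                          # (2,)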
Training Neural Networks
Training Neural Networks: Loss function
the loss compares the model's predictions to the actual labels, averaged over the training set:
J(θ) = (1/N) Σ_{i=1..N} loss( f(x_i; θ), y_i ),   where f(x_i; θ) is the predicted output, y_i is the actual label, and N = # examples
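A small NumPy sketch of a loss averaged over N examples (mean squared error is just one illustrative choice; the slide does not commit to a particular loss):
import numpy as np

predicted = np.array([0.9, 0.2, 0.7])     # model outputs f(x_i; theta)
actual    = np.array([1.0, 0.0, 1.0])     # true labels y_i
N = len(actual)                           # number of examples

loss = np.sum((predicted - actual) ** 2) / N   # mean squared error
print(loss)                                    # about 0.047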
Training Neural Networks: Objective
Loss is a function of the model’s parameters
How to minimize loss?
[Plot: the loss landscape over the parameters θ. Starting from a random initial point, compute the gradient of the loss, step in the opposite direction, and repeat.]
Repeat! This is called Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD)
● Initialize θ randomly
● For N Epochs
○ For each training example (x, y):
■ Compute the gradient of the loss with respect to θ
■ Update θ := θ − η · gradient   (η = learning rate)
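A minimal NumPy sketch of this procedure for a one-parameter model (the toy data, squared-error loss, and learning rate are illustrative assumptions):
import numpy as np

# toy dataset where y = 3x, so the best weight is 3.0 (made-up data)
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = 3.0 * xs

theta = np.random.randn()            # initialize theta randomly
eta = 0.01                           # learning rate

for epoch in range(100):             # for N epochs
    for x, y in zip(xs, ys):         # for each training example (x, y)
        grad = 2.0 * (theta * x - y) * x   # gradient of the squared error w.r.t. theta
        theta -= eta * grad                # step against the gradient
print(theta)                         # close to 3.0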
Calculating the Gradient: Backpropagation
[Diagram: a simple network, x0 → h0 (weights W1) → o0 (weights W2) → loss J(θ)]
apply the chain rule backwards from the loss:
∂J/∂W2 = ∂J/∂o0 · ∂o0/∂W2
∂J/∂W1 = ∂J/∂o0 · ∂o0/∂h0 · ∂h0/∂W1
What is a sequence?
[Example: a speech waveform]
Successes of deep models
[Screenshots: machine translation (left) and question answering (right)]
Left: https://research.googleblog.com/2016/09/a-neural-network-for-machine.html
Right: https://rajpurkar.github.io/SQuAD-explorer/
how do we model sequences?
idea: represent a sequence as a bag of words
“I dislike rain.” → [01010001] → prediction
problem: bag of words does not preserve order
[0001000100100000100000001]
[0001000100100000100000001]
vs
[1000001000000010001000100 ]
“In France, I had a great time and I learnt some of the _____ language.”
RNNs remember their previous state:
[Diagram: at t=0, input x0 (“it”) and previous state s0 combine through weights W and U to produce s1; at t=1, input x1 (“was”) and s1 produce s2]
“unfolding” the RNN across time:
[Diagram: the unrolled RNN. Inputs x0, x1, x2, … feed through W into states s0, s1, s2, …, and each state feeds the next through U.]
notice that W and U stay the same!
sn can contain information from all past timesteps
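A minimal NumPy sketch of unrolling a vanilla RNN, showing the same W and U reused at every timestep (the sizes and the tanh update rule are assumptions for illustration):
import numpy as np

rng = np.random.RandomState(0)
state_size, input_size, seq_len = 5, 3, 4

W = rng.randn(state_size, input_size) * 0.1   # input-to-state weights, shared across all timesteps
U = rng.randn(state_size, state_size) * 0.1   # state-to-state weights, shared across all timesteps

s = np.zeros(state_size)                      # initial state s0
xs = rng.randn(seq_len, input_size)           # the input sequence x0, x1, x2, ...

states = []
for x in xs:
    s = np.tanh(W @ x + U @ s)                # each state depends on the current input and the previous state
    states.append(s)

print(len(states), states[-1].shape)          # 4 (5,)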
possible task: language model
[Example: a language model trained on all the works of Shakespeare generates new text:]
KING LEAR:
O, if you were a feeble sight, the courtesy of your law,
Your sight and several breath, will wear the gods
With his heads, and my hands are wonder'd at the deeds,
So drop upon your lordship's head, and your opinion
Shall be against your honour.
possible task: language model
[Diagram: an RNN language model. Feeding <start>, “alas”, “my” as inputs x0, x1, x2 produces predictions y0, y1, y2 for “alas”, “my”, “honor”.]
yi is actually a probability distribution over possible next words, aka a softmax
possible task: language model
http://kingjamesprogramming.tumblr.com/
possible task: classification (e.g. sentiment)
[Diagram: the RNN reads the whole sequence and its final state sn produces a single prediction y, e.g. “negative” :( or “positive” :)]
y is a probability distribution over possible classes (like positive, negative, neutral), aka a softmax
possible task: machine translation
[Diagram: an encoder-decoder model. An encoder RNN (states s0, s1, s2) reads the source sentence; a decoder RNN (states c0, c1, c2, c3) generates the translation.]
backpropagation!
(through time)
remember: backpropagation
[Diagram: the unrolled RNN with outputs y0, y1, y2 (through V) and a loss J0, J1, J2 at each timestep]
we have a loss at each timestep:
(since we’re making a prediction at each timestep)
we sum the losses across time:
J(θ) = Σ_t J_t(θ)
let’s try it out for W with the chain rule:
[Diagram: the unrolled RNN with losses J0, J1, J2 at each timestep]
for example, at timestep 2: ∂J2/∂W = ∂J2/∂y2 · ∂y2/∂s2 · ∂s2/∂W
but wait… s1 also depends on W, so we can’t just treat it as a constant!
how does s2 depend on W?
[Diagram: the unrolled RNN. s2 depends on W directly (through x2) and also indirectly, because s2 is computed from s1, and s1 was itself computed using W (and likewise s0).]
backpropagation through time:
∂Jt/∂W = Σ_{k=0..t} ∂Jt/∂yt · ∂yt/∂st · ∂st/∂sk · ∂sk/∂W
the sum captures the contributions of W in previous timesteps to the error at timestep t
why are RNNs hard to train?
problem: vanishing gradient
[Diagram: the unrolled RNN. The term ∂st/∂sk in the sum above is itself a product of many Jacobians, ∂st/∂sk = Π_{j=k+1..t} ∂sj/∂sj-1, so contributions from far-back timesteps (e.g. k = 0) involve multiplying many of these terms together.]
problem: vanishing gradient
as the gap between timesteps grows, that product of many small terms shrinks toward zero.
so what?
errors due to further back timesteps have increasingly smaller gradients.
parameters become biased to capture shorter-term dependencies.
“In France, I had a great time and I learnt some of the _____ language.”
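A tiny NumPy illustration of the effect: backpropagating through many timesteps multiplies many tanh-derivative factors together, each at most 1, so the product shrinks rapidly (the pre-activation values here are made up):
import numpy as np

def dtanh(z):
    # derivative of tanh: 1 - tanh(z)^2, always <= 1
    return 1.0 - np.tanh(z) ** 2

grad = 1.0
pre_activations = np.linspace(-2.0, 2.0, 20)   # hypothetical pre-activations, one per timestep
for z in pre_activations:
    grad *= dtanh(z)                           # one factor for each timestep we backprop through

print(grad)   # a tiny number (around 1e-10): the signal from 20 steps back has nearly vanished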
solution #1: activation functions
[Plot: the derivatives of tanh and sigmoid, both of which shrink toward zero away from the origin]
solution #2: initialization
solution #3: gated cells
rather than each node being just a simple RNN cell, make each node a more complex unit with gates controlling what information is passed through.
[Diagram: a simple RNN cell vs a gated (LSTM-style) cell, mapping sj to sj+1]
solution #3: more on LSTMs
[Diagram: inside the LSTM cell, mapping sj to sj+1]
● forget irrelevant parts of previous state
● selectively update cell state values
● output certain parts of cell state
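A minimal NumPy sketch of a single LSTM step with these three gates, following the standard LSTM equations (the weight shapes, names, and initialization are illustrative assumptions, not taken from the slides):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    # params holds weight matrices (W*) and biases (b*) for the four internal layers
    z = np.concatenate([h_prev, x])
    f = sigmoid(params["Wf"] @ z + params["bf"])   # forget gate: drop irrelevant parts of the old cell state
    i = sigmoid(params["Wi"] @ z + params["bi"])   # input gate: decide which values to update
    g = np.tanh(params["Wg"] @ z + params["bg"])   # candidate values for the update
    o = sigmoid(params["Wo"] @ z + params["bo"])   # output gate: decide what to expose
    c = f * c_prev + i * g                         # selectively update the cell state
    h = o * np.tanh(c)                             # output certain parts of the cell state
    return h, c

rng = np.random.RandomState(0)
hidden, inp = 4, 3
params = {name: rng.randn(hidden, hidden + inp) * 0.1 for name in ["Wf", "Wi", "Wg", "Wo"]}
params.update({name: np.zeros(hidden) for name in ["bf", "bi", "bg", "bo"]})

h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.randn(inp), h, c, params)
print(h.shape, c.shape)   # (4,) (4,)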
[Diagram: the encoder-decoder translation model again. The decoder (states c0, c1, c2, c3) generates “le chien mange <end>”, but it only sees a single summary of the source sentence.]
solution: attend over all encoder states
[Diagram: at each decoding step, the decoder forms a weighted combination s* of all encoder states and uses it, together with its own state, to produce the next word (“le”, “chien”, “mange”).]
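A minimal NumPy sketch of attending over encoder states: score each encoder state against the current decoder state, softmax the scores, and use the weighted sum as s* (dot-product scoring is an assumption; the lecture does not specify the scoring function):
import numpy as np

def attend(decoder_state, encoder_states):
    # score each encoder state against the decoder state (dot-product scoring)
    scores = encoder_states @ decoder_state
    # softmax the scores into attention weights
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # s*: weighted combination of all encoder states
    return weights @ encoder_states, weights

rng = np.random.RandomState(0)
encoder_states = rng.randn(4, 5)      # s0..s3, one row per source position
decoder_state = rng.randn(5)          # current decoder state

context, weights = attend(decoder_state, encoder_states)
print(weights.round(2), context.shape)   # attention weights sum to 1; context is a 5-dim vector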
now we can model sequences!
● why recurrent neural networks?
● building models for language, classification, and machine translation
● training them with backpropagation through time
● solving the vanishing gradient problem with activation functions,
initialization, and gated cells (like LSTMs)
● using attention mechanisms
and there’s lots more to do!
● extending our models to timeseries + waveforms
● complex language models to generate long text or books
● language models to generate code
● controlling cars + robots
● predicting stock market trends
● summarizing books + articles
● handwriting generation
● multilingual translation models
● … many more!
Using TensorFlow
Deep Learning Frameworks
● GPU Acceleration
● Automatic Differentiation
https://cs224d.stanford.edu/lectures/CS224d-Lecture7.pdf
TensorFlow Basics
● Create a session
import tensorflow as tf
session = tf.InteractiveSession()
or
session = tf.Session()
What is a graph?
● Encapsulates the computation you want to perform
What are graphs made of?
● Placeholders (aka Graph Inputs)
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
What are graphs made of?
● Constants
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
k = tf.constant(1.0)
What are graphs made of?
● Operations
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
k = tf.constant(1.0)
c = tf.add(a, b)
d = tf.subtract(b, k)
e = tf.multiply(c, d)
How do we run the graph?
● Select nodes to evaluate
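For example, evaluating node e from the graph above could look like this (a minimal sketch; the feed values 2.0 and 3.0 are made up):
result = session.run(e, feed_dict={a: 2.0, b: 3.0})
# c = a + b = 5.0, d = b - k = 2.0, e = c * d = 10.0
print(result)   # 10.0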
● Enter: tf.Variable
tf.Variable: Initialization
● Can initialize to specific values
b1 = tf.Variable(tf.zeros((2,2)), name="bias")
w1 = tf.Variable(tf.random_normal((2,2)), name="w1")
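Variables also need to be initialized inside a session before they can be read; in TensorFlow 1.x this is typically done with a single op (a minimal sketch using the session created earlier):
init_op = tf.global_variables_initializer()   # an op that initializes all variables in the graph
session.run(init_op)
print(session.run(b1))   # [[0. 0.] [0. 0.]]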
Building a Neural Network Graph
n_input_nodes = 2
n_output_nodes = 1
x = tf.placeholder(tf.float32, (None, 2))
y = tf.placeholder(tf.float32, (None, 1))
W = tf.Variable(tf.random_normal((n_input_nodes, n_output_nodes)))
b = tf.Variable(tf.zeros(n_output_nodes))
z = tf.matmul(x, W) + b
out = tf.sigmoid(z)
Adding a loss function
n_input_nodes = 2
n_output_nodes = 1
x = tf.placeholder(tf.float32, (None, 2))
y = tf.placeholder(tf.float32, (None, 1))  # labels, used by the loss below
W = tf.Variable(tf.random_normal((n_input_nodes, n_output_nodes)))
b = tf.Variable(tf.zeros(n_output_nodes))
z = tf.matmul(x, W) + b
out = tf.sigmoid(z)
loss = tf.reduce_mean(
tf.nn.sigmoid_cross_entropy_with_logits(
logits=z, labels=y))
Add an optimizer: SGD
learning_rate = 0.02
loss = tf.reduce_mean(
tf.nn.sigmoid_cross_entropy_with_logits(
logits=z, labels=y))
optimizer = tf.train.GradientDescentOptimizer(
learning_rate).minimize(loss)
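A minimal sketch of running this optimizer in a loop (the toy data is made up; x, y, loss, and session come from the earlier slides):
session.run(tf.global_variables_initializer())
data_x = [[0.0, 1.0], [1.0, 0.0]]   # made-up inputs
data_y = [[1.0], [0.0]]             # made-up labels
for step in range(100):
    _, loss_value = session.run([optimizer, loss],
                                feed_dict={x: data_x, y: data_y})
print(loss_value)                   # should decrease over the course of training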
tensorboard --logdir=path/to/log-directory
Summary Logs
● Summaries are operations! So just part of the graph:
with tf.variable_scope("foo"):
with tf.variable_scope("bar"):
v = tf.Variable("v", [1])
v.name
>>> "foo/bar/v:0"
Sharing weights tf.get_variable()
with tf.variable_scope("foo"):
with tf.variable_scope("bar"):
v = tf.get_variable("v", [1])
v.name
>>> "foo/bar/v:0"
Why share weights?
● Imagine we want to learn a feature detector that we run over multiple inputs,
aggregate the features, and produce a prediction, all in one graph
● Need to share the weights to ensure:
○ A shared, single representation is learned
○ Gradients get propagated for all inputs
Attempt 1
def cnn_feature_extractor(image):
    ...
    with tf.variable_scope("feature_extractor"):
        v = tf.Variable(tf.random_normal([1]), name="v")
        ...
        features = tf.nn.relu(h4)
    return features

# each call creates its own, brand-new set of tf.Variable objects,
# so the two extractors do NOT share weights
feat_1 = cnn_feature_extractor(image_1)
feat_2 = cnn_feature_extractor(image_2)
pred = predict(feat_1, feat_2)
Name Scoping for cleaner code
● Networks often re-use similar structures; writing each of them out gets tedious
def make_layer(input, input_size, output_size, scope_name):
    with tf.variable_scope(scope_name):
        W = tf.get_variable("w", (input_size, output_size),
                            initializer=tf.random_normal_initializer())
        b = tf.get_variable("b", (output_size,),
                            initializer=tf.zeros_initializer())
        z = tf.matmul(input, W) + b
    return z
Name Scoping for cleaner code
● Networks often re-use similar structures; writing each of them out gets tedious
...
input = ...
h0 = make_layer(input, 10, 20, "h0")
h1 = make_layer(h0, 20, 20, "h1")
...
tf.get_variable("h0/w")
tf.get_variable("h1/b")
Name Scoping Makes for Clean Graph Visualizations
Checkpointing + Saving Models
# Create a saver.
saver = tf.train.Saver(...variables...)
# Launch the graph and train, saving the model every 1,000 steps.
sess = tf.Session()
for step in xrange(1000000):
sess.run(..training_op..)
if step % 1000 == 0:
# Append the step number to the checkpoint name:
saver.save(sess, 'my-model', global_step=step)
Loading Models
# Add ops to save and restore all the variables.
saver = tf.train.Saver()
# Later, launch the model, use the saver to restore variables from disk, and
# do some work with the model.
with tf.Session() as sess:
# Restore variables from disk.
saver.restore(sess, "/tmp/model.ckpt")
print("Model restored.")
# Do some work with the model
TensorFlow as core of other Frameworks
● Keras, TFLearn, TF-slim, others all based on TensorFlow
● Research often means tinkering with inner workings - worthwhile to
understand the core of any framework you are using
TensorFlow Tutorial:
- Pair up (groups of 2)
- Go to https://github.com/yala/introdeeplearning
- Follow install instructions
- If you need help, come down to the front