
RNN, LSTM and Deep Learning Libraries

UDRC Summer School

Muhammad Awais
[email protected]
Outline

➢ Recurrent Neural Network


➢ Applications of RNN
➢ LSTM
➢ Caffe
➢ Torch
➢ Theano
➢ TensorFlow
Flexibility of Recurrent Neural Networks

RNNs relax the fixed input/output structure of vanilla neural networks:

Vanilla Neural Networks: one fixed-size input to one fixed-size output (one to one)
Image Captioning: image -> sequence of words (one to many)
Sentiment Classification: sequence of words -> sentiment (many to one)
Machine Translation: sequence of words -> sequence of words (many to many)
Video classification on the frame level: a prediction at every time step (many to many)

Recurrent Neural Networks

An RNN reads an input x and maintains an internal state; we usually want to predict an output vector y at some (or all) time steps.

We can process a sequence of vectors x by applying a recurrence formula at every time step:

h_t = f_W(h_{t-1}, x_t)

where h_t is the new state, h_{t-1} is the old state, x_t is the input vector at that time step, and f_W is some function with parameters W.

Notice: the same function and the same set of parameters are used at every time step.

In the simplest case the state consists of a single "hidden" vector h, and the vanilla RNN uses

h_t = tanh(Whh * h_{t-1} + Wxh * x_t)
y_t = Why * h_t
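
A minimal numpy sketch of one vanilla RNN step (my own illustration, not the lecture's code; sizes and names are arbitrary):

import numpy as np

def rnn_step(x, h_prev, Wxh, Whh, Why):
    # h_t = tanh(Whh * h_{t-1} + Wxh * x_t);  y_t = Why * h_t
    h = np.tanh(Wxh.dot(x) + Whh.dot(h_prev))
    y = Why.dot(h)
    return h, y

# The same weights are reused at every time step:
D, H = 10, 20                       # input and hidden sizes (arbitrary)
Wxh = 0.01 * np.random.randn(H, D)
Whh = 0.01 * np.random.randn(H, H)
Why = 0.01 * np.random.randn(D, H)
h = np.zeros(H)
for x in np.random.randn(5, D):     # a length-5 input sequence
    h, y = rnn_step(x, h, Wxh, Whh, Why)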
Recurrent Neural Networks

Character-level language model example

Vocabulary: [h, e, l, o]

Example training sequence: "hello"

At each time step the RNN receives the current character (one-hot encoded over the vocabulary) and is trained to assign a high score to the next character in the sequence.
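
A small numpy sketch (mine, not from the slides) of how the training sequence is encoded with this vocabulary:

import numpy as np

vocab = ['h', 'e', 'l', 'o']
char_to_ix = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    v = np.zeros(len(vocab))
    v[char_to_ix[ch]] = 1.0
    return v

# Inputs are "h", "e", "l", "l"; targets are the next characters "e", "l", "l", "o"
inputs  = [one_hot(ch) for ch in 'hell']
targets = [char_to_ix[ch] for ch in 'ello']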
Recurrent Neural Networks
Image Captioning

Explain Images with Multimodal Recurrent Neural Networks, Mao et al.


Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei
Show and Tell: A Neural Image Caption Generator, Vinyals et al.
Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.
Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick
Recurrent Neural Networks

Image captioning combines a Convolutional Neural Network, which encodes the image, with a Recurrent Neural Network, which generates the caption one word at a time.

Recurrent Neural Networks

Image captioning at test time:
- Run the test image through the CNN to get an image feature vector v.
- Feed the special <START> token as the first input x0.
- Condition the recurrence on the image:
  before: h = tanh(Wxh * x + Whh * h)
  now:    h = tanh(Wxh * x + Whh * h + Wih * v)
- Sample a word from the output distribution y0 (e.g. "straw"), feed it back in as the next input x1, and sample again (e.g. "hat").
- Repeat until the <END> token is sampled, then finish.
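
A minimal numpy sketch of the image-conditioned step described above (my own illustration; v stands for the CNN feature of the test image):

import numpy as np

def captioning_step(x, h_prev, v, Wxh, Whh, Wih, Why):
    # now: h = tanh(Wxh * x + Whh * h + Wih * v)
    h = np.tanh(Wxh.dot(x) + Whh.dot(h_prev) + Wih.dot(v))
    y = Why.dot(h)      # unnormalized scores over the vocabulary
    return h, y

At test time you would take a softmax over y, sample a word, and feed that word's vector back in as the next input x.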
Recurrent Neural Networks
Image Sentence Datasets

Microsoft COCO
[Tsung-Yi Lin et al. 2014]
mscoco.org

currently:
~120K images
~5 sentences each
Recurrent Neural Networks

RNN layers can be stacked, so the network extends in depth as well as in time. An LSTM keeps the same depth/time layout but changes the recurrence used inside each cell.
Long Short Term Memory (LSTM)
[Hochreiter and Schmidhuber, 1997]

At every time step the LSTM takes the vector from below (x) and the vector from before (h), multiplies their concatenation by a single weight matrix W (size 4n x 2n for state size n, giving a 4n-dimensional output), and splits the result into four n-dimensional pieces:

i = sigmoid(...)   input gate
f = sigmoid(...)   forget gate
o = sigmoid(...)   output gate
g = tanh(...)      candidate update

These are used to update the cell state c and the hidden state h:

c_t = f ☉ c_{t-1} + i ☉ g
h_t = o ☉ tanh(c_t)

The hidden state h is sent to the higher layer (or to the prediction) and, along with c, forward to the next time step; the same cell, with the same weights, is applied at every time step.
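
A minimal numpy sketch of one LSTM step following these equations (my own illustration; it assumes x and h both have size n, so the concatenation has size 2n):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    n = h_prev.shape[0]
    z = W.dot(np.concatenate([x, h_prev]))   # (4n x 2n) matrix times 2n vector -> 4n
    i = sigmoid(z[0:n])          # input gate
    f = sigmoid(z[n:2*n])        # forget gate
    o = sigmoid(z[2*n:3*n])      # output gate
    g = np.tanh(z[3*n:4*n])      # candidate update
    c = f * c_prev + i * g       # additive update of the cell state
    h = o * np.tanh(c)           # hidden state passed up and to the next step
    return h, c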
Long Short Term Memory (LSTM)

Summary
- RNNs allow a lot of flexibility in architecture design
- Vanilla RNNs are simple but don't work very well
- Common to use LSTM or GRU: their additive interactions improve gradient flow
- Backward flow of gradients in an RNN can explode or vanish. Exploding is controlled with gradient clipping; vanishing is controlled with additive interactions (LSTM)
- Better/simpler architectures are a hot topic of current research
- Better understanding (both theoretical and empirical) is needed
Deep Learning Libraries
Caffe, Torch, Theano, TensorFlow
Caffe
http://caffe.berkeleyvision.org
Caffe overview

From U.C. Berkeley


Written in C++
Has Python and MATLAB bindings
Good for training or finetuning feedforward models
Caffe

Main classes
- Blob: stores data and diffs (derivatives)
- Layer: transforms bottom blobs to top blobs
- Net: many layers; computes gradients via forward / backward
- Solver: uses gradients to update the weights

Example net: DataLayer -> InnerProductLayer (fc1) -> SoftmaxLossLayer, with blobs (data + diffs) for the data, the weights W, the inputs X and the labels y.
Caffe

Protocol Buffers
- "Typed JSON" from Google
- Define "message types" in .proto files
- Serialize instances to text files (.prototxt), e.g.:

  name: "John Doe"
  id: 1234
  email: "[email protected]"

- Compile classes for different languages (C++, Java, ...)

All Caffe proto types are defined here, with good documentation:
https://github.com/BVLC/caffe/blob/master/src/caffe/proto/caffe.proto

https://developers.google.com/protocol-buffers/
Caffe

Training / Finetuning
No need to write code!
1. Convert data (run a script)
2. Define net (edit prototxt)
3. Define solver (edit prototxt)
4. Train (with pretrained weights) (run a script)
Caffe

Step 1: Convert Data
- A DataLayer reading from LMDB is the easiest option
- Create the LMDB using convert_imageset; it needs a text file where each line is "[path/to/image.jpeg] [label]"
- Alternatively, create an HDF5 file yourself using h5py
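
A minimal h5py sketch for the HDF5 route (my own illustration; it assumes the net's HDF5 data layer expects datasets named "data" and "label"):

import h5py
import numpy as np

# Fake data: N x C x H x W images and integer labels
X = np.random.randn(100, 3, 227, 227).astype(np.float32)
y = np.random.randint(0, 10, size=100).astype(np.float32)

with h5py.File('train.h5', 'w') as f:
    f.create_dataset('data', data=X)
    f.create_dataset('label', data=y)

# Caffe's HDF5 data layer takes a text file listing one .h5 path per line
with open('train_h5_list.txt', 'w') as f:
    f.write('train.h5\n')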
Caffe

Step 2: Define Net
- The net is written as a .prototxt file
- Layers and Blobs often have the same name!
- Each layer has its own learning rates and regularization settings (separate values for weight and bias); set the learning rates to 0 to freeze a layer
- num_output of the last InnerProduct layer is the number of output classes

- .prototxt can get ugly for big models
- The ResNet-152 prototxt is 6775 lines long!
- Not "compositional": you can't easily define a residual block and reuse it

https://github.com/KaimingHe/deep-residual-networks/blob/master/prototxt/ResNet-152-deploy.prototxt
Caffe

Step 2: Define Net (finetuning)

Pretrained weights are stored by layer name:
  "fc7.weight": [values]
  "fc7.bias":   [values]
  "fc8.weight": [values]
  "fc8.bias":   [values]

Original prototxt:

layer {
  name: "fc7"
  type: "InnerProduct"
  inner_product_param {
    num_output: 4096
  }
}
[... ReLU, Dropout]
layer {
  name: "fc8"
  type: "InnerProduct"
  inner_product_param {
    num_output: 1000
  }
}

Modified prototxt:

layer {
  name: "fc7"        # same name: weights copied
  type: "InnerProduct"
  inner_product_param {
    num_output: 4096
  }
}
[... ReLU, Dropout]
layer {
  name: "my-fc8"     # different name: weights reinitialized
  type: "InnerProduct"
  inner_product_param {
    num_output: 10
  }
}
Caffe

Step 3: Define Solver
- Write a prototxt file defining a SolverParameter
- If finetuning, copy an existing solver.prototxt file and:
  - change net to be your net
  - change snapshot_prefix to your output
  - reduce the base learning rate (divide by 100)
  - maybe change max_iter and snapshot
Caffe

Step 4: Train!

./build/tools/caffe train \
  -gpu 0 \
  -model path/to/trainval.prototxt \
  -solver path/to/solver.prototxt \
  -weights path/to/pretrained_weights.caffemodel

Pass -gpu -1 instead to train on the CPU, or -gpu all to use all available GPUs.

https://github.com/BVLC/caffe/blob/master/tools/caffe.cpp
Caffe

Pros / Cons
(+) Good for feedforward networks
(+) Good for finetuning existing networks
(+) Train models without writing any code!
(+) Python and MATLAB interfaces are pretty useful!
(-) Need to write C++ / CUDA for new GPU layers
(-) Not good for recurrent networks
(-) Cumbersome for big networks (GoogLeNet, ResNet)
Torch
http://torch.ch
Torch

From NYU + IDIAP

Written in C and Lua
Used a lot at Facebook and DeepMind
Torch

Lua
- High-level scripting language, easy to interface with C
- Similar to JavaScript:
  - one data structure: table == JS object
  - prototypical inheritance: metatable == JS prototype
  - first-class functions
- Some gotchas:
  - 1-indexed =(
  - variables global by default =(
  - small standard library

http://tylerneylon.com/a/learn-lua/
Torch

Tensors
- Torch tensors are just like numpy arrays
- Like numpy, you can easily change the data type
- Unlike numpy, the GPU is just a datatype away (e.g. torch.CudaTensor)

Documentation on GitHub:
https://github.com/torch/torch7/blob/master/doc/tensor.md
https://github.com/torch/torch7/blob/master/doc/maths.md
Torch

nn
The nn module lets you easily build and train neural nets:
- Build a two-layer ReLU net (e.g. an nn.Sequential containing nn.Linear and nn.ReLU modules)
- Get the weights and gradients for the entire network (e.g. with getParameters())
- Use a softmax loss function
- Generate random data
- Forward pass: compute scores and loss
- Backward pass: compute gradients; remember to set the weight gradients to zero first!
- Update: make a gradient descent step
Torch

cunn
Running on GPU is easy:
- Import a few new packages (cutorch, cunn)
- Cast the network and criterion to CUDA
- Cast the data and labels too
Torch

optim
The optim package implements different update rules: momentum, Adam, etc.
- Import the optim package
- Write a callback function that returns the loss and the gradients
- A state table holds hyperparameters, cached values, etc.; pass it to the update rule (e.g. optim.adam)
Torch

Modules
Caffe has Nets and Layers; Torch just has Modules.
- Modules are classes written in Lua; easy to read and write
- Forward / backward are written in Lua using Tensor methods, so the same code runs on CPU / GPU
- updateOutput: forward pass; compute the output
- updateGradInput: backward pass; compute the gradient of the input
- accGradParameters: backward pass; compute the gradient of the weights

https://github.com/torch/nn/blob/master/Linear.lua
Torch

Modules
Tons of built-in modules and loss functions

https://github.com/torch/nn
Torch

Modules
Writing your own modules is easy!
Torch

Modules
Container modules allow you to combine multiple modules:
- a sequential container feeds the output of mod1 into mod2 to produce a single output
- a table container can apply mod1 and mod2 to the same input x, returning {out[1], out[2]} (e.g. nn.ConcatTable)
- or apply mod1 and mod2 to separate inputs x1 and x2 (e.g. nn.ParallelTable)
Torch

nngraph
Use nngraph to build modules that combine their inputs in complex ways.

Example:
Inputs: x, y, z
Output: c
a = x + y
b = a ☉ z
c = a + b
Torch

Pretrained Models
- loadcaffe: load pretrained Caffe models: AlexNet, VGG, some others
  https://github.com/szagoruyko/loadcaffe
- GoogLeNet v1: https://github.com/soumith/inception.torch
- GoogLeNet v3: https://github.com/Moodstocks/inception-v3.torch
- ResNet: https://github.com/facebook/fb.resnet.torch
Torch

Package Management
After installing Torch, use luarocks to install or update Lua packages (similar to pip install in Python).

Torch

Torch: Other useful packages
- torch.cudnn: bindings for NVIDIA cuDNN kernels
  https://github.com/soumith/cudnn.torch
- torch-hdf5: read and write HDF5 files from Torch
  https://github.com/deepmind/torch-hdf5
- lua-cjson: read and write JSON files from Lua
  https://luarocks.org/modules/luarocks/lua-cjson
- cltorch, clnn: OpenCL backend for Torch, and port of nn
  https://github.com/hughperkins/cltorch, https://github.com/hughperkins/clnn
- torch-autograd: automatic differentiation; sort of like a more powerful nngraph, similar to Theano or TensorFlow
  https://github.com/twitter/torch-autograd
- fbcunn: Facebook extensions: FFT convolutions, multi-GPU (DataParallel, ModelParallel)
  https://github.com/facebook/fbcunn
Torch

Pros / Cons
(-) Lua
(-) Less plug-and-play than Caffe
You usually write your own training code
(+) Lots of modular pieces that are easy to combine
(+) Easy to write your own layer types and run on GPU
(+) Most of the library code is in Lua, easy to read
(+) Lots of pretrained models!
(-) Not great for RNNs
Theano
http://deeplearning.net/software/theano/
Theano

From Yoshua Bengio’s group at University of Montreal

Embracing computation graphs, symbolic computation

High-level wrappers: Keras, Lasagne


Theano

Computational Graphs

Same example as before: inputs x, y, z; a = x + y, b = a ☉ z, c = a + b.
- Define symbolic variables; these are the inputs to the graph
- Compute intermediates and outputs symbolically
- Compile a function that produces c from x, y, z (this generates code)
- Run the function, passing in some numpy arrays (may run on the GPU)
- The same computation can be repeated with numpy operations (runs on the CPU)
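
A minimal Theano sketch of this graph (my own illustration, not the slide's exact code):

import numpy as np
import theano
import theano.tensor as T

# Define symbolic inputs to the graph
x = T.vector('x')
y = T.vector('y')
z = T.vector('z')

# Compute intermediates and the output symbolically
a = x + y
b = a * z           # elementwise product
c = a + b

# Compile a function that produces c from x, y, z
f = theano.function(inputs=[x, y, z], outputs=c)

# Run it on actual numpy arrays
print(f(np.ones(3), 2 * np.ones(3), 3 * np.ones(3)))   # -> [12. 12. 12.]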
Theano

Simple Neural Net
- Define symbolic variables: x = data, y = labels, w1 = first-layer weights, w2 = second-layer weights
- Forward: compute scores (symbolically)
- Forward: compute probs and loss (symbolically)
- Compile a function that computes loss and scores
- Stuff actual numpy arrays into the function
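
A minimal Theano sketch of such a two-layer net (my own illustration; shapes and names are arbitrary):

import numpy as np
import theano
import theano.tensor as T

N, D, H, C = 64, 1000, 100, 10

# Symbolic variables: data, labels, and the two weight matrices
x = T.matrix('x')
y = T.ivector('y')
w1 = T.matrix('w1')
w2 = T.matrix('w2')

# Forward pass (symbolic): ReLU hidden layer, class scores, probs, loss
hidden = T.maximum(0, x.dot(w1))
scores = hidden.dot(w2)
probs = T.nnet.softmax(scores)
loss = T.nnet.categorical_crossentropy(probs, y).mean()

# Compile a function that computes loss and scores
f = theano.function(inputs=[x, y, w1, w2], outputs=[loss, scores])

# Stuff actual numpy arrays into the function
loss_val, scores_val = f(
    np.random.randn(N, D),
    np.random.randint(C, size=N).astype(np.int32),
    1e-2 * np.random.randn(D, H),
    1e-2 * np.random.randn(H, C),
)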
Theano

Computing Gradients
- Same as before: define variables, compute scores and loss symbolically
- Theano computes gradients for us symbolically!
- Now the compiled function returns loss, scores, and gradients
- Use the function to perform gradient descent!
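
Continuing the sketch from the previous slide, gradients come from T.grad and can drive a plain gradient descent loop (learning rate is illustrative):

# Gradients of the loss with respect to the weights, computed symbolically
dw1, dw2 = T.grad(loss, [w1, w2])
f_grad = theano.function(inputs=[x, y, w1, w2],
                         outputs=[loss, scores, dw1, dw2])

# Gradient descent using the compiled function
w1_val = 1e-2 * np.random.randn(D, H)
w2_val = 1e-2 * np.random.randn(H, C)
x_val = np.random.randn(N, D)
y_val = np.random.randint(C, size=N).astype(np.int32)
lr = 1e-1
for t in range(20):
    loss_val, _, dw1_val, dw2_val = f_grad(x_val, y_val, w1_val, w2_val)
    w1_val -= lr * dw1_val
    w2_val -= lr * dw2_val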
Theano
Pros / Cons
(+) Python + numpy
(+) Computational graph is nice abstraction
(+) RNNs fit nicely in computational graph
(-) Raw Theano is somewhat low-level
(+) High level wrappers (Keras, Lasagne) ease the pain
(-) Error messages can be unhelpful
(-) Large models can have long compile times
(-) Much “fatter” than Torch; more magic
(-) Patchy support for pretrained models
TensorFlow
https://www.tensorflow.org
TensorFlow

From Google

Very similar to Theano - all about computation graphs

Easy visualizations (TensorBoard)

Multi-GPU and multi-node training


TensorFlow

TensorFlow: Two-Layer Net
- Create placeholders for the data and labels: these will be fed to the graph
- Create Variables to hold the weights (similar to Theano shared variables) and initialize them with numpy arrays
- Forward: compute scores, probs, and loss (symbolically)
- Running train_step will use SGD to minimize the loss
- Create an artificial dataset; y is one-hot, like Keras
- Actually train the model
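
A minimal sketch in the TensorFlow 1.x style of the time (my own illustration, not the slide's exact code):

import numpy as np
import tensorflow as tf

N, D, H, C = 64, 1000, 100, 10

# Placeholders for data and labels: fed to the graph at run time
x = tf.placeholder(tf.float32, shape=[None, D])
y = tf.placeholder(tf.float32, shape=[None, C])    # one-hot labels

# Variables hold the weights, initialized from numpy arrays
w1 = tf.Variable(1e-2 * np.random.randn(D, H).astype(np.float32))
w2 = tf.Variable(1e-2 * np.random.randn(H, C).astype(np.float32))

# Forward pass: scores and softmax cross-entropy loss (symbolic)
hidden = tf.nn.relu(tf.matmul(x, w1))
scores = tf.matmul(hidden, w2)
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=scores, labels=y))

# Running train_step performs one SGD update
train_step = tf.train.GradientDescentOptimizer(1e-1).minimize(loss)

# Artificial dataset; y is one-hot
x_val = np.random.randn(N, D).astype(np.float32)
y_val = np.zeros((N, C), dtype=np.float32)
y_val[np.arange(N), np.random.randint(C, size=N)] = 1.0

# Actually train the model
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for t in range(20):
        loss_val, _ = sess.run([loss, train_step],
                               feed_dict={x: x_val, y: y_val})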


TensorFlow

TensorFlow: Multi-GPU
- Data parallelism: synchronous or asynchronous
- Model parallelism: split the model across GPUs
TensorFlow

TensorFlow: Distributed
- Single machine: like other frameworks
- Many machines: not open source (yet) =(
TensorFlow
TensorFlow: Pros / Cons
(+) Python + numpy
(+) Computational graph abstraction, like Theano; great for RNNs
(+) Much faster compile times than Theano
(+) Slightly more convenient than raw Theano?
(+) TensorBoard for visualization
(+) Data AND model parallelism; best of all frameworks
(+/-) Distributed models, but not open-source yet
(-) Slower than other frameworks right now
(-) Much “fatter” than Torch; more magic
(-) Not many pretrained models
Comparison between Libraries

                           Caffe         Torch                         Theano          TensorFlow
Language                   C++, Python   Lua                           Python          Python
Pretrained models          Yes ++        Yes ++                        Yes (Lasagne)   Inception
Multi-GPU: data parallel   Yes           Yes (cunn.DataParallelTable)  Yes (platoon)   Yes
Multi-GPU: model parallel  No            Yes (fbcunn.ModelParallel)    Experimental    Yes (best)
Readable source code       Yes (C++)     Yes (Lua)                     No              No
Good at RNN                No            Mediocre                      Yes             Yes (best)


Any Questions?
Thanks
