Large Scale Deep Learning
Vincent Vanhoucke
Quick Introduction
Emphasize what matters at scale, when models and data get large.
Talk about some of the most exciting lines of research in the field.
Plankton Identification
Galaxy classification
Speech Recognition
Speech Recognition with Deep Recurrent Neural Networks
Alex Graves, Abdel-rahman Mohamed, Geoffrey Hinton
Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks
Tara N. Sainath, Oriol Vinyals, Andrew Senior, Hasim Sak
Parsing
Grammar as a Foreign Language
Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, Geoffrey Hinton
Language Modeling
One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, Tony Robinson
Neural Networks … without the Neuro-Babble
max(0, X)
Step 3: Repeat!
X → A1 → A2 → A3 → A4 → Y
Your pick!
Orders of Magnitude
[Diagram: A1 → A2 → A3 → “kitten”]
15-25 layers
10-200M parameters
1-5B multiply-adds / image
Training Models
1. The Maths
2. The Stats
3. The Hacks
4. The Computer Science
The Maths
A Neural Network in Equations
Inputs: images, spectrograms, features, words, …
Outputs (labels, predictions): cat / no cat, phonemes, next word, …
y=nn(x,w)
…
w: the weights (parameters)
(x, y̅): a training sample with its true (correct) labels, drawn from the training data (the targets)
Objective: y=nn(x,w) ≈ y̅
The Loss: ‘How Close are We?’
y=nn(x,w)
Sum over all the training data: ∑ L(y̅, y)
w' = w - α ∂wL(w)
[Diagram: a single step from w to w' down the loss surface]
∂wL(w) is variously called the gradient, the derivative, or the delta.
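In code, one step of this update is tiny. Here is a minimal numpy sketch; the quadratic toy loss and the name sgd_step are illustrative assumptions, not part of the talk.

import numpy as np

def sgd_step(w, loss_grad, alpha=0.01):
    """One gradient-descent update: w' = w - alpha * dL/dw."""
    return w - alpha * loss_grad(w)

# Toy example: minimize L(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0, 3.0])
for _ in range(100):
    w = sgd_step(w, lambda w: 2.0 * w, alpha=0.1)
print(w)  # close to [0, 0, 0]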
The Loss is a Very Complicated Function of the Weights
L(y̅, nn(x,w))
1. nn() is a very non-linear, non-convex function!
2. Depends on all the Inputs and Targets in the training set!
Activations
Hidden States
Remember the ‘Chain Rule’ from High School:
g(f(x))’=g’(f(x)).f’(x)
∂(g∘f)/∂x = (∂g/∂f) · (∂f/∂x)
Chaining:
∂(j∘i∘h∘g∘f)/∂x = (∂j/∂i) · (∂i/∂h) · (∂h/∂g) · (∂g/∂f) · (∂f/∂x)
Graphical View of the Chain Rule
[Diagram: x → f → f(x) → g → g∘f(x), with local gradients ∂f(x) and ∂g(f(x))]
∂g∘f/∂x = ∂f(x)·∂g(f(x))
Back-Propagation using the Chain Rule
[Diagram: x → f → f(x); the gradient ∂x arriving from up the chain is multiplied by the local gradient ∂f(x) and sent back as ∂f(x)·∂x]
You can compute the gradient with respect to any quantity by:
● Taking the gradient ∂x sent back to you from up the chain.
● Multiplying it by your local gradient ∂f(x) with respect to that quantity.
Back-Propagation using the Chain Rule
[Diagram: several functions f, g, h composed into a graph ending in the loss L(y)]
Sharing Gradient Computation via Back-Propagation
[Diagram: two weights w0 and w1; the gradients back-propagated through the layers above them are computed once and shared by both]
Example: 1-layer Neural Network
Y = max(W·X + B, 0)
X → [W·_] → H0 → [_ + B] → H1 → [max(_, 0)] → Y
Compute the local gradients analytically:
∂H0/∂X = W    ∂H1/∂H0 = 1    ∂Y/∂H1 = (H1 > 0 ? 1 : 0)
Back-propagate from the loss to the input:
∂Y = ∂L(y̅, y) = -2(y̅ - y)
∂H1 = (H1 > 0 ? ∂Y : 0)
∂H0 = ∂H1
∂X = W⊤·∂H0
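A minimal numpy sketch of this exact forward/backward pass; the squared-error loss implied by -2(y̅ - y) and the toy shapes are assumptions.

import numpy as np

def forward(W, B, X):
    """Forward pass: X -> H0 = W.X -> H1 = H0 + B -> Y = max(H1, 0)."""
    H0 = W @ X
    H1 = H0 + B
    Y = np.maximum(H1, 0)
    return H0, H1, Y

def backward(W, X, H1, Y, Y_bar):
    """Backward pass, chaining the local gradients as in the slide."""
    dY = -2.0 * (Y_bar - Y)          # dL/dY for L = (Y_bar - Y)^2
    dH1 = np.where(H1 > 0, dY, 0.0)  # gradient through the ReLU
    dH0 = dH1                        # adding B has local gradient 1
    dX = W.T @ dH0                   # gradient w.r.t. the input
    dW = np.outer(dH0, X)            # gradient w.r.t. the weights
    dB = dH0                         # gradient w.r.t. the bias
    return dX, dW, dB

# Tiny example: 3 inputs -> 2 outputs.
rng = np.random.default_rng(0)
W, B = rng.normal(size=(2, 3)), np.zeros(2)
X, Y_bar = rng.normal(size=3), np.array([1.0, 0.0])
H0, H1, Y = forward(W, B, X)
dX, dW, dB = backward(W, X, H1, Y, Y_bar)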
The Loss is a Very Complicated Function of the Weights
L(y̅, nn(x,w))
1. nn() is a very non-linear, non-convex function!
2. Depends on all the Inputs and Targets in the training set!
Two strategies
1. Momentum + learning rate decay:
Works best if you manage to get it to run.
2. AdaGrad:
Works more often, but doesn't always get you the best result.
Two more tricks:
1. Parameter averaging.
2. Gradient clipping.
Momentum
g’ = μ g + ∂wL(w)
w’ = w - α g’
μ=0.9
Learning Rate Decay
α = α₀ e^(-βt)
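A minimal numpy sketch combining the momentum update with exponential learning-rate decay; the toy quadratic loss and the hyperparameter values are assumptions.

import numpy as np

def momentum_step(w, g, grad, mu=0.9, alpha=0.01):
    """g' = mu * g + dL/dw ;  w' = w - alpha * g'."""
    g = mu * g + grad(w)
    return w - alpha * g, g

def decayed_rate(alpha0, beta, t):
    """Exponential learning-rate decay: alpha = alpha0 * exp(-beta * t)."""
    return alpha0 * np.exp(-beta * t)

# Toy usage on L(w) = ||w||^2.
w, g = np.array([1.0, -2.0]), np.zeros(2)
for t in range(100):
    alpha = decayed_rate(0.1, 0.01, t)
    w, g = momentum_step(w, g, lambda w: 2.0 * w, mu=0.9, alpha=alpha)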
AdaGrad
Accumulate the squared gradients: n ← n + (∂wL)²
Use them to discount the learning rate: w ← w - (α/√n) ∂wL
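A minimal numpy sketch of the AdaGrad update above; the small eps added for numerical safety and the toy loss are assumptions.

import numpy as np

def adagrad_step(w, n, grad, alpha=0.1, eps=1e-8):
    """n <- n + (dL/dw)^2 ;  w <- w - alpha / sqrt(n) * dL/dw."""
    g = grad(w)
    n = n + g ** 2
    w = w - alpha * g / (np.sqrt(n) + eps)
    return w, n

# Toy usage on L(w) = ||w||^2.
w, n = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    w, n = adagrad_step(w, n, lambda w: 2.0 * w)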
Parameter Averaging
Only use the averaged parameters at test time, not for training!
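A minimal sketch of parameter averaging, assuming a simple running mean; training keeps using the raw weights, and only evaluation uses the average.

import numpy as np

class ParameterAverage:
    """Running average of the weights; use only the average at test time."""
    def __init__(self, w):
        self.avg, self.count = np.array(w, dtype=float), 1

    def update(self, w):
        self.count += 1
        self.avg += (w - self.avg) / self.count  # incremental mean

# Training keeps updating w as usual; the average is for evaluation only.
w = np.array([1.0, -2.0])
averager = ParameterAverage(w)
for _ in range(10):
    w = w - 0.1 * 2.0 * w          # ordinary SGD step on L(w) = ||w||^2
    averager.update(w)
test_time_weights = averager.avg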
Gradient Clipping
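The figure for this slide is not reproduced here; below is a hedged sketch of one common form of gradient clipping (rescaling by the norm, with a hypothetical max_norm threshold), not necessarily the exact variant used in the talk.

import numpy as np

def clip_gradient(g, max_norm=1.0):
    """Rescale the gradient if its norm exceeds max_norm."""
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)
    return g

g = np.array([30.0, -40.0])            # norm 50: too large
print(clip_gradient(g, max_norm=5.0))  # rescaled to norm 5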
Weight Initialization
X → A1 → A2 → A3 → A4 → Y
|Y| ~ |A4|·|A3|·|A2|·|A1|·|X|: the magnitudes multiply from layer to layer.
Initialize weights using N(0, σ) such that:
output activations ~ input activations.
Initialize biases to be positive: start in the linear regime of the ReLU.
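A minimal sketch of this recipe; the specific choice σ = 1/√fan_in and the bias value 0.1 are assumptions, picked so output activations stay roughly on the scale of the inputs.

import numpy as np

def init_layer(fan_in, fan_out, rng=np.random.default_rng(0)):
    """Weights ~ N(0, 1/sqrt(fan_in)) so output activations keep roughly the
    scale of the input activations; small positive biases keep the ReLU in
    its linear regime at the start of training."""
    W = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))
    b = 0.1 * np.ones(fan_out)
    return W, b

W, b = init_layer(256, 128)
x = np.random.default_rng(1).normal(size=256)
y = np.maximum(W @ x + b, 0)
print(x.std(), y.std())  # same order of magnitude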
Weight Initialization just before the Loss
Your loss is typically very dependent on the scale of the top activations:
High Temperature:
soft distribution, classifier not certain, small gradients.
Low Temperature:
peaky distribution, classifier very (over?)confident, big gradients.
Key: Start with a very high temperature, small weights in your last layer.
They will anneal to a peakier distribution as the classifier gets more confident.
First, lower your learning rate
Unless you have shared memory, this means a lot more memory and data transfer.
Model Parallelism
On a single core: Instruction parallelism (SIMD, SIMT). Pretty much free.
1- Data reuse: compute is limited by how much data can fit at any time on
the lowest level cache (e.g. L1 cache on CPU). Try to maximally reuse the
data in cache, or get more cache (i.e. more machines!).
Limits:
● Per-example efficiency of gradient descent diminishes as the batch
size increases.
● Cutting a batch into smaller pieces yields diminishing returns as matrix multiplies become less efficient.
● Cost of synchronization grows with K: need to wait for stragglers.
Asynchronous Data Parallelism - Pipelining
[Diagram: examples X1, X2, X3 flow through layers A1, A2, A3 in a pipeline across time steps]
Wt+1 ← Wt - α ∂WL(Wt-k)
Each update applies a gradient that was computed from weights k steps stale.
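A minimal numpy sketch simulating this stale-gradient update; the delay k, the toy loss, and the name async_sgd are illustrative assumptions.

import numpy as np

def async_sgd(grad, w0, alpha=0.05, delay=3, steps=100):
    """Simulate asynchronous updates: each gradient is computed from weights
    that are `delay` steps stale (W_{t+1} = W_t - alpha * dL/dW(W_{t-k}))."""
    history = [np.array(w0, dtype=float)]
    for t in range(steps):
        stale = history[max(0, t - delay)]   # a replica read old weights
        history.append(history[-1] - alpha * grad(stale))
    return history[-1]

# Toy check on L(w) = ||w||^2: still converges despite the staleness.
print(async_sgd(lambda w: 2.0 * w, [1.0, -2.0]))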
● Too big (Overfitting): too many degrees of freedom. The model will
attempt to explain every little detail of the training data and will fail to
generalize.
Regularization
Solution: train a model that is way too big, but nudge the parameters
towards a more parsimonious representation.
Regularization Techniques
L(y̅, y) + ½ε‖w‖²
[Diagram: X → A1 → A2 → A3 → Y]
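A minimal sketch of L2 regularization as written above: adding ½ε‖w‖² to the loss adds εw to the gradient. The toy values are assumptions.

import numpy as np

def l2_regularized_loss(loss, w, grad_loss, eps=1e-4):
    """Add (eps/2) * ||w||^2 to the loss; its gradient adds eps * w."""
    reg_loss = loss + 0.5 * eps * np.sum(w ** 2)
    reg_grad = grad_loss + eps * w
    return reg_loss, reg_grad

# Toy usage: pretend the data loss and its gradient came from backprop.
w = np.array([0.5, -1.5])
data_loss, data_grad = 0.42, np.array([0.1, -0.2])
loss, grad = l2_regularized_loss(data_loss, w, data_grad)
w = w - 0.1 * grad  # the regularizer nudges the weights toward zero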
Models for Text
Embeddings
One-hot encoding: [ 0 0 0 0 0 1 0 0 0 0 0 0 0 ]
Instead of training embeddings on the supervised task at hand, train them first to
represent semantic similarity using unsupervised training on a large text corpus.
word2vec (skip-gram): use Vjump, the embedding of the center word, to predict the surrounding words: “The quick brown fox jumps over the lazy dog”
CBOW: use the embeddings of the surrounding words to predict the center word: “The quick brown fox jumps over the lazy dog”
Word Embeddings
Training a very simple model on lots of text mitigates the rare word problem.
The spaces learned have very good syntactic and semantic clustering.
Finding good embeddings for large bodies of text is a very active area of
research. (rel: topic modeling, paraphrasing, document understanding)
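A minimal sketch of the kind of arithmetic these embedding spaces support; the tiny random table is a placeholder, and real analogies only emerge with embeddings trained on a large corpus.

import numpy as np

# Hypothetical tiny embedding table (in practice, learned on a large corpus).
rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "fox", "dog"]
E = {w: rng.normal(size=32) for w in vocab}

def nearest(vec, exclude=()):
    """Return the vocabulary word whose embedding has the highest cosine
    similarity with `vec`."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vocab if w not in exclude),
               key=lambda w: cos(E[w], vec))

# The classic analogy query: king - man + woman ≈ queen
# (only meaningful with real trained embeddings, not these random ones).
query = E["king"] - E["man"] + E["woman"]
print(nearest(query, exclude=("king", "man", "woman")))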
[Diagram: the same weights W applied to several inputs: nn(x0, W), nn(x1, W), nn(t0, W), nn(t1, W)]
Good news: Backprop ‘just works’: simply add up all the gradients.
Models for Images
Convolutional Networks
Filters are applied over patches as a sliding window, producing feature maps of a certain depth.
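A minimal numpy sketch of this sliding-window computation; the filter counts, sizes, and the naive loop are assumptions for clarity, not an efficient implementation.

import numpy as np

def conv2d(image, filters, stride=1):
    """Slide each filter over the image as a window; the number of filters
    becomes the depth of the output feature map."""
    H, W, C = image.shape                  # height, width, input depth
    K, kh, kw, _ = filters.shape           # K filters of size kh x kw x C
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    out = np.zeros((out_h, out_w, K))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw, :]
            out[i, j, :] = np.tensordot(filters, patch,
                                        axes=([1, 2, 3], [0, 1, 2]))
    return out

image = np.random.default_rng(0).normal(size=(32, 32, 3))
filters = np.random.default_rng(1).normal(size=(16, 3, 3, 3))
print(conv2d(image, filters).shape)  # (30, 30, 16)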
[Diagram: the same trainable weights (tied weights) are applied at every step t ← t+1 to inputs X1, X2, X3, …, with recurrent connections carrying state forward]
Recurrent Neural Networks
Can be implemented via explicit unrolling or dynamically by keeping
state across invocations, or a combination of both.
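A minimal numpy sketch of an explicitly unrolled vanilla RNN with tied weights; the tanh nonlinearity and the sizes are assumptions.

import numpy as np

def rnn_forward(xs, Wx, Wh, b, h0):
    """Explicitly unrolled RNN: the same (tied) weights Wx, Wh, b are applied
    at every time step; the hidden state h carries information forward."""
    h, states = h0, []
    for x in xs:                              # X1, X2, X3, ...
        h = np.tanh(Wx @ x + Wh @ h + b)      # recurrent connection
        states.append(h)
    return states

rng = np.random.default_rng(0)
D, H = 8, 16                                  # input and state sizes
Wx, Wh, b = rng.normal(size=(H, D)), rng.normal(size=(H, H)) * 0.1, np.zeros(H)
xs = [rng.normal(size=D) for _ in range(3)]   # X1, X2, X3
states = rnn_forward(xs, Wx, Wh, b, np.zeros(H))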
Thankfully, LSTMs...
LSTMs: Long Short-Term Memory Networks
But:
● Very effective at modeling long-term dependencies.
● Very sound theoretical and practical justifications.
● A central inspiration behind lots of recent work on using deep
learning to learn complex programs:
Memory Networks, Neural Turing Machines.
A Simple Model of Memory
Instructions: WRITE X, M · READ M, Y · FORGET M
[Diagram: input X, memory cell M, output Y, with WRITE?, READ?, and FORGET? decisions]
Key Idea: Make Your Program Differentiable
Sigmoids
[Diagram: input X, memory cell M, output Y, with the WRITE?, READ?, and FORGET? decisions replaced by sigmoid gates W, R, F]
LSTM Cells as replacement for Recurrent Connections
R, W, and F are ‘control’ connections that affect the state of the memory
through a sigmoidal [0, 1] multiplicative gate.
Gating behavior makes it possible for the memory cell to retain information
longer and discard it quickly, while keeping the whole machine continuous
and differentiable.
This translates into much better stability in training and modeling of much longer-range interactions compared to an RNN.
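A minimal numpy sketch of a single LSTM step with sigmoid write/forget/read gates; biases and variant details are omitted, and this is one common formulation rather than necessarily the exact one in the cited papers.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W):
    """One LSTM step: write (input), forget, and read (output) gates are
    sigmoids in [0, 1] that multiplicatively control the memory cell c."""
    z = np.concatenate([x, h])
    i = sigmoid(W["i"] @ z)          # write gate
    f = sigmoid(W["f"] @ z)          # forget gate
    o = sigmoid(W["o"] @ z)          # read gate
    g = np.tanh(W["g"] @ z)          # candidate content to write
    c = f * c + i * g                # memory retained or discarded smoothly
    h = o * np.tanh(c)               # what the cell exposes to the next step
    return h, c

rng = np.random.default_rng(0)
D, H = 8, 16
W = {k: rng.normal(scale=0.1, size=(H, D + H)) for k in "ifog"}
h, c = np.zeros(H), np.zeros(H)
for x in [rng.normal(size=D) for _ in range(3)]:
    h, c = lstm_step(x, h, c, W)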
Unsupervised Learning
Generative Models and Unsupervised Learning
General themes:
Variational Auto-Encoders
Adversarial Learning
[Diagram: an auto-encoder maps X → Z → X, training the output to reconstruct the input]
Bottlenecks: force Z to be a compressed code, e.g. by making it much smaller than X or by adding an L1 sparsity penalty on Z.
Noise: Denoising Autoencoders: corrupt the input (X + N → Z → X) and train the network to reconstruct the clean X.
Once trained, the encoder X → Z can be kept as a learned representation.
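A minimal numpy sketch of a denoising autoencoder step; the linear encoder/decoder, the single training example, and the noise level are simplifying assumptions.

import numpy as np

rng = np.random.default_rng(0)
D, K = 20, 5                              # input size, bottleneck code size
We = rng.normal(scale=0.1, size=(K, D))   # encoder  X -> Z
Wd = rng.normal(scale=0.1, size=(D, K))   # decoder  Z -> X

def denoising_step(x, We, Wd, alpha=0.01, noise=0.1):
    """Corrupt the input, encode to the bottleneck Z, decode, and train the
    reconstruction to match the *clean* input."""
    x_noisy = x + noise * rng.normal(size=x.shape)
    z = We @ x_noisy                    # code
    x_hat = Wd @ z                      # reconstruction
    err = x_hat - x                     # gradient of 0.5 * ||x_hat - x||^2
    dWd = np.outer(err, z)
    dWe = np.outer(Wd.T @ err, x_noisy)
    return We - alpha * dWe, Wd - alpha * dWd

x = rng.normal(size=D)
for _ in range(200):
    We, Wd = denoising_step(x, We, Wd)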
This is true for the inputs, but also for every layer up the stack: we would like activations with mean 0 and variance 1.
Idea #2: Ok, let’s just subtract the mean, and divide by the variance.
Problem: leads to degenerate gradients!
Idea #3: Let's use a noisy, local estimate of the mean and variance, e.g. one computed per mini-batch.
Problem: still strictly less powerful representationally: all filters in the layer
are constrained to the same dynamic range.
Solutions
Idea #4: Add a learned affine transform per activation to rescale the inputs.
Doesn’t that defeat the purpose? No! Tightly bounds the rate of change of the
input distribution: a few linear weights instead of many, many nonlinear factors.
Problem: What happens at test time, when there is no such thing as a mini-batch
to normalize over?
Idea #5: Replace the mini-batch mean and variance by the global mean and
variance over the training set, at test time only.
Problem: That sounds really crazy…
Batch Normalization
Normalize: x̂ = (x - μ)/σ, then apply the learned affine transform: αx̂ + β
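A minimal numpy sketch of batch normalization; the running-average statistics used at test time are a common stand-in for the global training-set mean and variance mentioned above.

import numpy as np

def batch_norm(x, gamma, beta, running, momentum=0.99, training=True, eps=1e-5):
    """Normalize each activation with mini-batch statistics during training,
    then apply the learned affine (gamma * x_hat + beta). At test time, use
    the running estimates of the mean and variance instead."""
    if training:
        mu, var = x.mean(axis=0), x.var(axis=0)
        running["mean"] = momentum * running["mean"] + (1 - momentum) * mu
        running["var"] = momentum * running["var"] + (1 - momentum) * var
    else:
        mu, var = running["mean"], running["var"]
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Mini-batch of 32 examples with 8 activations each.
x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(32, 8))
gamma, beta = np.ones(8), np.zeros(8)
running = {"mean": np.zeros(8), "var": np.ones(8)}
y = batch_norm(x, gamma, beta, running, training=True)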
Results
A turning point: speech recognition went from “it mostly doesn’t work” to
“it mostly works” in the public’s perception.
In The Beginning
Fully-connected networks.
Irrelevant to non-tonal languages, and surprisingly weak cues for tonal languages.
Train recurrent models that also incorporate Lexical and Language Modeling.
Concept #2: Look at each feature map using a variety of filter sizes, not
just one, and concatenate them.
1x1, 3x3, 5x5
The Inception Architecture
[Diagram: an Inception module combining convolution branches with pooling and projection, concatenated along depth]
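A minimal numpy sketch of the multi-filter-size idea: compute 1x1, 3x3, and 5x5 feature maps over the same input and concatenate them along depth. The pooling and projection branches are omitted and the filter counts are arbitrary.

import numpy as np

def conv_same(image, filters):
    """Naive 'same'-padded convolution: output keeps the spatial size."""
    H, W, C = image.shape
    K, kh, kw, _ = filters.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw), (0, 0)))
    out = np.zeros((H, W, K))
    for i in range(H):
        for j in range(W):
            patch = padded[i:i+kh, j:j+kw, :]
            out[i, j, :] = np.tensordot(filters, patch,
                                        axes=([1, 2, 3], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 16, 8))
branches = [
    conv_same(x, rng.normal(size=(4, 1, 1, 8))),   # 1x1 branch
    conv_same(x, rng.normal(size=(4, 3, 3, 8))),   # 3x3 branch
    conv_same(x, rng.normal(size=(4, 5, 5, 8))),   # 5x5 branch
]
module_output = np.concatenate(branches, axis=-1)  # concatenate along depth
print(module_output.shape)                          # (16, 16, 12)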
The Inception Architecture
Main classifier, plus auxiliary classifiers that are only used during training.
The Inception Architecture
ImageNet challenge:
Sequence in: X1, X2, X3 → Y
Sequence out: X → Y1, Y2, Y3, each output fed back in as the next input
Sequence to sequence: X1, X2, X3 → Y1, Y2, Y3, Y4 (encode the input, then decode the output)
Machine Translation:
Parsing:
Out-of-vocabulary words:
Addressing the Rare Word Problem in Neural Machine Translation
Thang Luong et al., ACL’15
[Diagram: a sequence-to-sequence model mapping X1, X2, X3 to Y1, Y2, Y3, Y4]
Differentiable Attention
During decoding, look back at the input sequence and derive ‘attentional’ embeddings A1, A2, A3.
[Diagram: inputs X1, X2, X3; attentional embeddings A1, A2, A3 feeding the outputs Y1, Y2, Y3, Y4]
Main idea: if X2 translates to Y2,
the model can make A2 look like X2.
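A minimal numpy sketch of differentiable attention for one decoding step; dot-product scoring is one common choice and may differ from the scoring used in the cited work.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention(encoder_states, decoder_state):
    """Score each input position against the current decoder state, turn the
    scores into weights with a softmax, and return the weighted sum as the
    'attentional' embedding A for this output step."""
    scores = encoder_states @ decoder_state   # dot-product scoring
    weights = softmax(scores)                 # differentiable 'where to look'
    return weights @ encoder_states, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 16))      # encoder states for X1, X2, X3
h = rng.normal(size=16)           # decoder state while producing Y2
A2, w = attention(X, h)           # A2 can end up looking like X2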
Example generated caption: “a close-up of a child”
MSCOCO Challenge: http://mscoco.org
Hot Topics In Deep Learning
Speech Recognition
Object Recognition
Machine Translation
Image Captioning
Memory and Computation
Hot Topics in Deep Learning: Memory and Computation
Also:
Memory Networks
Jason Weston et al. ICLR’15
Memory
The cost is that these models don’t scale well with the size of the space
to be explored...yet.
Courtesy: Alexander Mordvintsev
Factoring Style and Content!