Training Neural Networks - Part II
- Optimization
- Regularization
- Transfer Learning

Optimization
[Figure: loss contours over two weights, W_1 and W_2]
Optimization: Problems with SGD
What if the loss changes quickly in one direction and slowly in another? What does gradient descent do?
Very slow progress along the shallow dimension, jitter along the steep direction.

What if the loss has a point of zero gradient (a local minimum or saddle point)? Gradient descent gets stuck.

Our gradients come from minibatches, so they can be noisy!
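To make the update being criticized concrete, here is a minimal sketch of vanilla SGD (not from the slides; `grad_fn` and the toy poorly conditioned quadratic are illustrative assumptions):

```python
import numpy as np

def sgd(w, grad_fn, learning_rate=1e-2, num_steps=100):
    """Vanilla SGD: repeatedly step along the negative (minibatch) gradient."""
    for _ in range(num_steps):
        dw = grad_fn(w)              # gradient estimate, e.g. from a random minibatch (noisy)
        w = w - learning_rate * dw
    return w

# Toy poorly conditioned quadratic f(w) = 0.5 * (w1^2 + 50 * w2^2):
# progress is slow along w1 and jittery along w2.
print(sgd(np.array([1.0, 1.0]), lambda w: np.array([1.0, 50.0]) * w))
```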
SGD + Momentum
[Figure: trajectories of SGD vs SGD+Momentum on the gradient-noise and poor-conditioning examples]

Momentum update: build up a "velocity" as a running mean of gradients; the actual step combines the velocity with the current gradient.
[Figure: velocity, gradient, and actual-step vectors]
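A sketch of the momentum update described above, in the same style as the SGD sketch (rho = 0.9 is a common default, used here as an assumption):

```python
import numpy as np

def sgd_momentum(w, grad_fn, learning_rate=1e-2, rho=0.9, num_steps=100):
    """SGD + Momentum: the velocity is a decaying running sum of past gradients."""
    v = np.zeros_like(w)
    for _ in range(num_steps):
        dw = grad_fn(w)
        v = rho * v + dw              # rho acts like friction; 0.9 is a common choice
        w = w - learning_rate * v     # the actual step follows the velocity
    return w
```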
Nesterov Momentum
[Figure: momentum update vs Nesterov momentum update. In the standard update, the velocity and the gradient at the current point combine into the actual step; Nesterov instead evaluates the gradient at the point reached by the velocity ("look-ahead") before combining.]
Nesterov, "A method of solving a convex programming problem with convergence rate O(1/k^2)", 1983
Nesterov, "Introductory lectures on convex optimization: a basic course", 2004
Sutskever et al, "On the importance of initialization and momentum in deep learning", ICML 2013
Nesterov Momentum
Evaluate the gradient at the look-ahead point (the current position plus rho times the velocity) rather than at the current position.
Annoying: usually we want the update in terms of the current parameters and the gradient evaluated there, so apply a change of variables and rearrange (see the sketch below).
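A sketch of the rearranged Nesterov update in the same style as the previous sketches (this is the standard change-of-variables form, not text from the slides):

```python
import numpy as np

def nesterov_momentum(w, grad_fn, learning_rate=1e-2, rho=0.9, num_steps=100):
    """Nesterov momentum, rearranged so the gradient is taken at the current parameters."""
    v = np.zeros_like(w)
    for _ in range(num_steps):
        dw = grad_fn(w)
        old_v = v
        v = rho * v - learning_rate * dw
        w = w - rho * old_v + (1 + rho) * v   # equivalent to stepping from the look-ahead point
    return w
```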
Nesterov Momentum
[Figure: optimization trajectories comparing SGD, SGD+Momentum, and Nesterov]
AdaGrad
Scale each element of the gradient by the inverse square root of the historical sum of its squared values. Because the accumulated sum only grows, the effective step size decays over the course of training.
Duchi et al, "Adaptive subgradient methods for online learning and stochastic optimization", JMLR 2011
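A sketch of AdaGrad matching the description above (the epsilon value, which avoids division by zero, is illustrative):

```python
import numpy as np

def adagrad(w, grad_fn, learning_rate=1e-2, eps=1e-7, num_steps=100):
    """AdaGrad: per-element step sizes from the historical sum of squared gradients."""
    grad_squared = np.zeros_like(w)
    for _ in range(num_steps):
        dw = grad_fn(w)
        grad_squared += dw * dw                                    # only ever grows
        w = w - learning_rate * dw / (np.sqrt(grad_squared) + eps)
    return w
```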
RMSProp
Like AdaGrad, but the squared-gradient accumulator is a leaky (exponentially decaying) moving average, so the effective step size does not decay to zero.

[Figure: optimization trajectories comparing SGD, SGD+Momentum, and RMSProp]
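A sketch of RMSProp; the only change from the AdaGrad sketch is the decaying accumulator (decay_rate = 0.99 is a common choice, used here as an assumption):

```python
import numpy as np

def rmsprop(w, grad_fn, learning_rate=1e-3, decay_rate=0.99, eps=1e-7, num_steps=100):
    """RMSProp: AdaGrad with a leaky accumulator, so step sizes do not shrink to zero."""
    grad_squared = np.zeros_like(w)
    for _ in range(num_steps):
        dw = grad_fn(w)
        grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dw * dw
        w = w - learning_rate * dw / (np.sqrt(grad_squared) + eps)
    return w
```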
Adam (almost)
First moment -> a momentum-like exponential moving average of the gradient.
Second moment -> an RMSProp-like moving average of the squared gradient (the uncentered variance).
The update divides the first moment by the square root of the second moment, combining Momentum with AdaGrad / RMSProp.
Kingma and Ba, "Adam: A method for stochastic optimization", ICLR 2015
Adam (full form)
Momentum (first moment) + AdaGrad / RMSProp (second moment) + bias correction.
Bias correction compensates for the fact that the first and second moment estimates start at zero.
Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!
Kingma and Ba, "Adam: A method for stochastic optimization", ICLR 2015
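A sketch of the full Adam update using the hyperparameters recommended above (eps = 1e-8 is a common default, included here as an assumption):

```python
import numpy as np

def adam(w, grad_fn, learning_rate=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, num_steps=100):
    """Adam: momentum-style first moment, RMSProp-style second moment, plus bias correction."""
    m = np.zeros_like(w)   # first moment estimate
    v = np.zeros_like(w)   # second moment estimate
    for t in range(1, num_steps + 1):
        dw = grad_fn(w)
        m = beta1 * m + (1 - beta1) * dw            # momentum
        v = beta2 * v + (1 - beta2) * dw * dw       # AdaGrad / RMSProp
        m_hat = m / (1 - beta1 ** t)                # bias correction: estimates start at zero
        v_hat = v / (1 - beta2 ** t)
        w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w
```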
Adam
[Figure: optimization trajectories comparing SGD, SGD+Momentum, RMSProp, and Adam]
RAdam
SGD, SGD+Momentum, Adagrad, RMSProp, and Adam all have learning rate as a hyperparameter.
=> Decay the learning rate over time!
exponential decay: alpha = alpha_0 * exp(-k * t)
1/t decay: alpha = alpha_0 / (1 + k * t)
[Figure: training loss vs. epoch; sharp drops in the loss curve coincide with learning rate decay steps]
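A small sketch of the two decay schedules named above (the alpha_0 and k values are arbitrary illustrations):

```python
import numpy as np

def exponential_decay(alpha_0, k, t):
    """alpha = alpha_0 * exp(-k * t)"""
    return alpha_0 * np.exp(-k * t)

def one_over_t_decay(alpha_0, k, t):
    """alpha = alpha_0 / (1 + k * t)"""
    return alpha_0 / (1.0 + k * t)

for epoch in range(0, 50, 10):   # compare the schedules every 10 epochs
    print(epoch, exponential_decay(1e-2, 0.1, epoch), one_over_t_decay(1e-2, 0.1, epoch))
```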
First-Order Optimization
(1) Use the gradient to form a linear approximation
(2) Step to minimize the approximation
[Figure: loss as a function of w1, with a linear (tangent) approximation at the current point]
Second-Order Optimization
(1) Use the gradient and Hessian to form a quadratic approximation
(2) Step to the minimum of the approximation
[Figure: loss as a function of w1, with a quadratic approximation at the current point]
Second-Order Optimization
Second-order Taylor expansion:
J(theta) ≈ J(theta_0) + (theta - theta_0)^T grad J(theta_0) + 1/2 (theta - theta_0)^T H (theta - theta_0)
Solving for the critical point we obtain the Newton parameter update:
theta* = theta_0 - H^(-1) grad J(theta_0)
No hyperparameters! No learning rate!
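A minimal sketch of one Newton step on a toy quadratic (the quadratic and its closed-form gradient/Hessian are illustrative assumptions):

```python
import numpy as np

def newton_step(theta, grad_fn, hessian_fn):
    """theta* = theta - H^(-1) grad J(theta); no learning rate involved."""
    g = grad_fn(theta)
    H = hessian_fn(theta)
    return theta - np.linalg.solve(H, g)   # solve H x = g rather than forming H^(-1) explicitly

# Toy quadratic J(theta) = 0.5 * theta^T A theta - b^T theta: Newton jumps to the minimum in one step.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, 2.0])
theta = newton_step(np.zeros(2), lambda th: A @ th - b, lambda th: A)
print(theta, np.linalg.solve(A, b))   # the two results match
```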
Second-Order Optimization
Q: Why is this bad for deep learning?
- The Hessian has O(N^2) elements
- Inverting it takes O(N^3)
- N = (tens or hundreds of) millions of parameters
L-BFGS
- Usually works very well in full-batch, deterministic mode: if you have a single, deterministic f(x), then L-BFGS will probably work very nicely.
- Does not transfer very well to the mini-batch setting; it gives bad results. Adapting L-BFGS to the large-scale, stochastic setting is an active area of research.
In practice:
- Adam is a good default choice in most cases.
- If you can afford full-batch updates, try L-BFGS (and remember to disable all sources of noise).
Agenda: Training Neural Networks – Part II
- Optimization
- Regularization
- Transfer Learning
Beyond Training Error
Better optimization helps us reduce training loss, but what we actually care about is performance on unseen data: how do we reduce the gap between train and test error?
Model Ensembles
1. Train multiple independent models
2. At test time, average their results
Model Ensembles: Tips and Tricks
Instead of training independent models, use multiple snapshots of a single model during training! Cyclic learning rate schedules can make this work even better!
Loshchilov and Hutter, "SGDR: Stochastic gradient descent with restarts", arXiv 2016
Huang et al, "Snapshot ensembles: train 1, get M for free", ICLR 2017
Model Ensembles: Tips and Tricks
Instead of using the actual parameter vector, keep a moving average of the parameter vector and use that at test time (Polyak averaging); see the sketch below.
Polyak and Juditsky, "Acceleration of stochastic approximation by averaging", SIAM Journal on Control and Optimization, 1992.
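A sketch of Polyak-style exponential-moving-average parameter averaging wrapped around plain SGD (the 0.995 decay value is an illustrative assumption):

```python
import numpy as np

def sgd_with_polyak(w, grad_fn, learning_rate=1e-2, ema_decay=0.995, num_steps=100):
    """Train with SGD, but keep an exponential moving average of the parameters for test time."""
    w_test = w.copy()
    for _ in range(num_steps):
        dw = grad_fn(w)
        w = w - learning_rate * dw
        w_test = ema_decay * w_test + (1 - ema_decay) * w   # moving average of the parameter vector
    return w, w_test   # evaluate with w_test at test time
```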
How to improve single-model performance?
Regularization
Regularization: Add term to loss
L = (1/N) * sum_i L_i(f(x_i, W), y_i) + lambda * R(W)
In common use:
- L2 regularization (weight decay): R(W) = sum over k,l of W_{k,l}^2
- L1 regularization: R(W) = sum over k,l of |W_{k,l}|
- Elastic net (L1 + L2): R(W) = sum over k,l of (beta * W_{k,l}^2 + |W_{k,l}|)
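A tiny sketch of adding one of these penalties to a data loss (the lambda and beta values are illustrative, not recommendations):

```python
import numpy as np

def regularized_loss(data_loss, W, lam=1e-4, kind="L2", beta=0.5):
    """L = data_loss + lambda * R(W), for the three penalties listed above."""
    if kind == "L2":                        # weight decay
        R = np.sum(W * W)
    elif kind == "L1":
        R = np.sum(np.abs(W))
    else:                                   # elastic net: L1 + L2
        R = np.sum(beta * W * W + np.abs(W))
    return data_loss + lam * R
```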
Regularization: Dropout
In each forward pass, randomly set some neurons to zero
Probability of dropping is a hyperparameter; 0.5 is common
Srivastava et al, “Dropout: A simple way to prevent neural networks from overfitting”, JMLR 2014
Regularization: Dropout
Example forward pass with a 3-layer network using dropout (a sketch follows below).
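A sketch of that forward pass in NumPy, assuming weights W1..W3 and biases b1..b3 already exist (the backward pass, not shown, must also pass through the same masks):

```python
import numpy as np

p = 0.5  # probability of keeping a unit active; higher p means less dropout

def train_step(X, W1, b1, W2, b2, W3, b3):
    """Forward pass of a 3-layer network with dropout on both hidden layers."""
    H1 = np.maximum(0, W1 @ X + b1)
    U1 = np.random.rand(*H1.shape) < p    # first dropout mask
    H1 = H1 * U1                          # drop!
    H2 = np.maximum(0, W2 @ H1 + b2)
    U2 = np.random.rand(*H2.shape) < p    # second dropout mask
    H2 = H2 * U2                          # drop!
    return W3 @ H2 + b3
```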
Regularization: Dropout
How can this possibly be a good idea?
It forces the network to have a redundant representation and prevents co-adaptation of features.
[Figure: a cat score computed from features (has an ear, has a tail, is furry, has claws, mischievous); dropout randomly knocks out some of the features, marked with an X]
Regularization: Dropout
How can this possibly be a good idea?
Another interpretation: dropout is training a large ensemble of models that share parameters; each binary mask is one model.
Dropout: Test time
Dropout makes the output random, so at test time we want to "average out" the randomness, i.e. approximate the integral of the output over all random dropout masks.
Consider a single neuron a with inputs x, y and weights w1, w2.
At test time we have: E[a] = w1*x + w2*y
During training (dropping each input with probability 1/2) we have:
E[a] = 1/4*(w1*x + w2*y) + 1/4*(w1*x + 0*y) + 1/4*(0*x + w2*y) + 1/4*(0*x + 0*y) = 1/2*(w1*x + w2*y)
At test time, multiply by the dropout probability.
Dropout: Test time
At test time all neurons are active; multiply each layer's output by p so its expected value matches training (see the sketch below).
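A matching test-time forward pass for the 3-layer example above, with the output of each hidden layer scaled by p:

```python
import numpy as np

def predict(X, W1, b1, W2, b2, W3, b3, p=0.5):
    """Test-time forward pass: no masks, but activations are scaled by p."""
    H1 = np.maximum(0, W1 @ X + b1) * p   # scale so E[H1] matches the training-time expectation
    H2 = np.maximum(0, W2 @ H1 + b2) * p  # scale so E[H2] matches the training-time expectation
    return W3 @ H2 + b3
```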
Dropout Summary
Drop units randomly in the forward pass at training time; scale activations by p (the keep probability) at test time.
More common: "Inverted dropout"
Divide by p at training time instead, so the test-time forward pass stays completely unchanged (test time is all about efficiency); a sketch follows below.
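A sketch of inverted dropout for the same 3-layer example; the scaling moves into the training pass and the prediction code needs no dropout-specific changes:

```python
import numpy as np

p = 0.5  # keep probability

def train_step_inverted(X, W1, b1, W2, b2, W3, b3):
    """Training pass: masks are divided by p, so expected activations already match test time."""
    H1 = np.maximum(0, W1 @ X + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p   # mask scaled by 1/p at training time
    H1 = H1 * U1
    H2 = np.maximum(0, W2 @ H1 + b2)
    U2 = (np.random.rand(*H2.shape) < p) / p
    H2 = H2 * U2
    return W3 @ H2 + b3

def predict_inverted(X, W1, b1, W2, b2, W3, b3):
    """Test-time pass: identical to a plain forward pass, no extra scaling."""
    H1 = np.maximum(0, W1 @ X + b1)
    H2 = np.maximum(0, W2 @ H1 + b2)
    return W3 @ H2 + b3
```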
Regularization: A common pattern
Training: Add some kind of randomness
Testing: Average out the randomness (sometimes approximately)

Example: Batch Normalization
Training: Normalize using stats from random minibatches
Testing: Use fixed stats to normalize
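A sketch of how batch normalization follows this pattern (the momentum and eps values are common defaults, used here as assumptions):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      train=True, momentum=0.9, eps=1e-5):
    """Training: normalize with minibatch stats (random). Testing: normalize with fixed stats."""
    if train:
        mu, var = x.mean(axis=0), x.var(axis=0)                    # stats of this random minibatch
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var                        # fixed statistics at test time
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, running_mean, running_var
```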
Regularization: Data Augmentation
[Figure: training pipeline: load image and label ("cat") -> randomly transform the image -> CNN -> compute loss. The label is kept while the input image is transformed.]
Data Augmentation
Horizontal Flips
Data Augmentation
Random crops and scales
Training: sample random crops / scales. ResNet:
1. Pick random L in range [256, 480]
2. Resize the training image, short side = L
3. Sample a random 224 x 224 patch
Testing: average predictions over a fixed set of crops and scales. (A sketch of the training-time crop follows below.)
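A sketch of the training-time crop procedure using Pillow, which is my choice of library here (the function name and use of PIL are assumptions, not from the slides):

```python
import numpy as np
from PIL import Image

def random_crop_and_scale(img, crop=224, lo=256, hi=480):
    """ResNet-style augmentation: random short-side resize, then a random square crop."""
    L = np.random.randint(lo, hi + 1)                   # 1. pick random L in [256, 480]
    w, h = img.size
    scale = L / min(w, h)
    img = img.resize((int(round(w * scale)), int(round(h * scale))))   # 2. short side = L
    w, h = img.size
    x = np.random.randint(0, w - crop + 1)              # 3. random 224 x 224 patch
    y = np.random.randint(0, h - crop + 1)
    return img.crop((x, y, x + crop, y + crop))
```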
Data Augmentation: Color Jitter
Simple: randomize contrast and brightness.
More complex (PCA color jitter):
1. Apply PCA to all [R, G, B] pixel values in the training set
2. Sample a "color offset" along the principal component directions
3. Add the offset to all pixels of a training image
(As seen in [Krizhevsky et al. 2012], ResNet, etc.)
Data Augmentation
Get creative for your problem!
Random mix / combinations of:
- translation
- rotation
- stretching
- shearing
- lens distortions, ... (go crazy)
Regularization: A common pattern
Training: Add random noise
Testing: Marginalize over the noise
Examples:
- Dropout
- Batch Normalization
- Data Augmentation
- DropConnect
- Fractional Max Pooling
- Stochastic Depth
Huang et al, "Deep Networks with Stochastic Depth", ECCV 2016
Agenda: Training Neural Networks – Part II
- Optimization
- Regularization
- Transfer Learning
Transfer Learning
Transfer Learning
Myth: "You need a lot of data if you want to train / use Deep Neural Networks." BUSTED.
Donahue et al, "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition", ICML 2014
1. Train on ImageNet
FC-1000
FC-4096
FC-4096
MaxPool
Conv-512
Conv-512
MaxPool
Conv-512
Conv-512
MaxPool
Conv-256
Conv-256
MaxPool
Conv-128
Conv-128
MaxPool
Conv-64
Conv-64
Image
2. Small dataset (C classes): swap the FC-1000 layer for a new FC-C layer; reinitialize this and train it, and freeze these earlier layers.

3. Bigger dataset: with a bigger dataset, train more layers near the top of the network, freezing only the earliest layers.
What to do with a pretrained CNN depends on your dataset: layers near the image are more generic, layers near the top (FC) are more specific (see the sketch below).
- Very similar dataset, very little data: use a linear classifier on the top layer.
- Very similar dataset, quite a lot of data: finetune a few layers.
- Very different dataset, very little data: you're in trouble... try a linear classifier from different stages.
- Very different dataset, quite a lot of data: finetune a larger number of layers.
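For concreteness, a hedged PyTorch sketch of the "freeze the backbone, retrain the top layer" case (PyTorch/torchvision are my choice here, not the lecture's; the `pretrained=True` flag and `classifier[6]` indexing match older torchvision versions of VGG-16 and may differ in yours):

```python
import torch
import torch.nn as nn
import torchvision

num_classes = 20                                     # C classes in your target dataset (assumption)
model = torchvision.models.vgg16(pretrained=True)    # start from an ImageNet-pretrained model

for param in model.parameters():                     # "freeze these": all pretrained weights
    param.requires_grad = False

model.classifier[6] = nn.Linear(4096, num_classes)   # reinitialize the top FC-1000 layer as FC-C

# Train only the new layer; with a bigger dataset, unfreeze and finetune more of the top layers,
# typically with a smaller learning rate than the one used for the original training.
optimizer = torch.optim.SGD(model.classifier[6].parameters(), lr=1e-3, momentum=0.9)
```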
Transfer learning with CNNs is pervasive... (it's the norm, not an exception)
- Object Detection (Fast R-CNN): built on a CNN pretrained on ImageNet
- Image Captioning (CNN + RNN): built on a CNN pretrained on ImageNet
Takeaway for your projects and beyond:
Have some dataset of interest, but it has < ~1M images?
1. Find a very large dataset with similar data and train a big ConvNet there, or grab an existing pretrained model (deep learning frameworks provide a "Model Zoo" of pretrained models).
2. Transfer learn to your dataset.
http://cs231n.stanford.edu/2017/syllabus.html