cs231n Training Neural Networks II
Fei-Fei Li & Justin Johnson & Serena Yeung, April 25, 2017
Administrative
Administrative: Google Cloud
- STOP YOUR INSTANCES when not in use!
- Keep track of your spending!
- GPU instances are much more expensive than CPU instances; only use a GPU instance when you need it (e.g. for A2, only on the TensorFlow / PyTorch notebooks)
Last time: Activation Functions
Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU
Last time: Weight Initialization
Initialization too small: activations go to zero, gradients also go to zero, no learning.
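To see why, here is a minimal numpy sketch (not from the slides; the layer sizes, tanh nonlinearity, and the Xavier-style comparison are illustrative assumptions): with a tiny 0.01 * randn initialization the activation magnitudes collapse toward zero layer by layer, so the gradients vanish.

import numpy as np

np.random.seed(0)
x = np.random.randn(64, 500)                 # batch of 64 inputs, 500 features

for name in ["0.01 * randn", "Xavier"]:
    h = x
    stds = []
    for layer in range(6):                   # 6 tanh layers, 500 units each
        fan_in = h.shape[1]
        if name == "Xavier":
            W = np.random.randn(fan_in, 500) / np.sqrt(fan_in)
        else:
            W = 0.01 * np.random.randn(fan_in, 500)
        h = np.tanh(h.dot(W))
        stds.append(h.std())
    print(name, ["%.4f" % s for s in stds])
# tiny init: activation std collapses toward 0 layer by layer -> gradients vanish
# Xavier init: activation std stays roughly constant across layers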
Last time: Data Preprocessing
Before normalization: classification loss very sensitive to changes in the weight matrix; hard to optimize.
After normalization: less sensitive to small changes in weights; easier to optimize.
Last time: Batch Normalization
Input: x, a minibatch of activations of shape N x D
Learnable params: scale gamma and shift beta, each of shape D
Intermediates: per-dimension minibatch mean mu and variance sigma^2; normalized x_hat = (x - mu) / sqrt(sigma^2 + eps)
Output: y = gamma * x_hat + beta
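A minimal numpy sketch of the train-time batch normalization forward pass described above; the running-average bookkeeping used at test time is omitted, and the batch size, feature dimension, and eps value are illustrative.

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, D) minibatch of activations; gamma, beta: (D,) learnable params
    mu = x.mean(axis=0)                    # per-dimension mean
    var = x.var(axis=0)                    # per-dimension variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to zero mean / unit variance
    return gamma * x_hat + beta            # learnable scale and shift

x = np.random.randn(32, 100) * 5 + 3
out = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(out.mean(), out.std())               # roughly 0 and 1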
Last time: Babysitting Learning
Last time: Hyperparameter Search
Coarse to fine search
[Figure: grid layout vs. random layout of hyperparameter samples, plotted over an important parameter and an unimportant parameter]
Today
- Fancier optimization
- Regularization
- Transfer Learning
Optimization
[Figure: contour plot of a loss landscape over weights W_1 and W_2]
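For reference, a runnable sketch of the plain gradient descent loop that the following slides improve on, using a toy ill-conditioned quadratic loss (the loss, step size, and iteration count are illustrative assumptions, not from the slides).

import numpy as np

# Toy loss: f(x) = 0.5 * (x1^2 + 50 * x2^2), an ill-conditioned quadratic
def grad(x):
    return np.array([1.0, 50.0]) * x

x = np.array([-3.0, 1.0])
learning_rate = 0.03
for t in range(200):
    dx = grad(x)
    x -= learning_rate * dx        # vanilla gradient descent step
print(x)  # slow progress along the shallow (x1) direction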
Optimization: Problems with SGD
What if loss changes quickly in one direction and slowly in another?
What does gradient descent do?
Very slow progress along shallow dimension, jitter along steep direction
What if the loss function has a local minimum or saddle point?
Zero gradient: gradient descent gets stuck.
Our gradients come from minibatches, so they can be noisy!
SGD + Momentum
[Animation: SGD vs. SGD+Momentum trajectories on the same loss surface]
[Figures: SGD+Momentum on the problem cases above - local minima, saddle points, poor conditioning, gradient noise]
Momentum update: build up "velocity" as a running mean of gradients, and take the actual step along the velocity rather than the raw gradient:
v_{t+1} = rho * v_t + grad f(x_t)
x_{t+1} = x_t - alpha * v_{t+1}
(rho gives "friction"; typically rho = 0.9 or 0.99)
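A minimal numpy sketch of the momentum update above, on the same kind of toy ill-conditioned quadratic (loss and hyperparameter values are illustrative).

import numpy as np

def grad(x):                        # toy ill-conditioned quadratic loss
    return np.array([1.0, 50.0]) * x

x = np.array([-3.0, 1.0])
v = np.zeros_like(x)                # velocity
rho, learning_rate = 0.9, 0.01
for t in range(200):
    dx = grad(x)
    v = rho * v + dx                # accumulate a running mean of gradients
    x -= learning_rate * v          # step along the velocity
print(x)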
Nesterov Momentum
Momentum update: step in the direction of the current velocity plus the gradient at the current point.
Nesterov momentum: first step in the direction of the velocity, evaluate the gradient at that "look-ahead" point, and then combine the two.
Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2)”, 1983
Nesterov, “Introductory lectures on convex optimization: a basic course”, 2004
Sutskever et al, “On the importance of initialization and momentum in deep learning”, ICML 2013
Nesterov Momentum
Evaluate the gradient at the "look-ahead" point x_t + rho * v_t:
v_{t+1} = rho * v_t - alpha * grad f(x_t + rho * v_t)
x_{t+1} = x_t + v_{t+1}
Annoying: usually we want the update in terms of x_t and grad f(x_t). With the change of variables x~_t = x_t + rho * v_t, the update can be rearranged as
v_{t+1} = rho * v_t - alpha * grad f(x~_t)
x~_{t+1} = x~_t + v_{t+1} + rho * (v_{t+1} - v_t)
i.e. a standard momentum step plus a correction term involving the current and previous velocity.
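A sketch of Nesterov momentum in the rearranged form above (toy loss and hyperparameters are again illustrative); note the update is written purely in terms of the current variable and its gradient.

import numpy as np

def grad(x):                        # toy ill-conditioned quadratic loss
    return np.array([1.0, 50.0]) * x

x = np.array([-3.0, 1.0])
v = np.zeros_like(x)
rho, learning_rate = 0.9, 0.01
for t in range(200):
    old_v = v
    v = rho * v - learning_rate * grad(x)   # gradient at the current variable
    x += -rho * old_v + (1 + rho) * v       # rearranged Nesterov update
print(x)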
[Animation: SGD, SGD+Momentum, and Nesterov momentum trajectories on the same loss surface]
AdaGrad
Duchi et al, “Adaptive subgradient methods for online learning and stochastic optimization”, JMLR 2011
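AdaGrad keeps a per-parameter sum of squared gradients and divides each step by its square root, so steep directions are damped and shallow directions are relatively amplified. A minimal sketch with an illustrative toy loss and step size; one caveat visible here is that the sum only grows, so the effective step size decays toward zero over long training runs.

import numpy as np

def grad(x):                          # toy ill-conditioned quadratic loss
    return np.array([1.0, 50.0]) * x

x = np.array([-3.0, 1.0])
grad_squared = np.zeros_like(x)
learning_rate = 1.0                   # illustrative
for t in range(200):
    dx = grad(x)
    grad_squared += dx * dx                                    # accumulate squared gradients
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)   # per-parameter step size
print(x)  # note: the effective step size only shrinks over time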
RMSProp
[Animation: SGD, SGD+Momentum, and RMSProp trajectories on the same loss surface]
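RMSProp replaces AdaGrad's ever-growing sum with a leaky (exponentially decaying) average of squared gradients, so the effective step size does not shrink to zero; a minimal sketch with illustrative hyperparameters.

import numpy as np

def grad(x):                          # toy ill-conditioned quadratic loss
    return np.array([1.0, 50.0]) * x

x = np.array([-3.0, 1.0])
grad_squared = np.zeros_like(x)
decay_rate, learning_rate = 0.99, 0.02
for t in range(300):
    dx = grad(x)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
print(x)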
Adam (full form)
Combines momentum (first moment), bias correction, and AdaGrad / RMSProp-style second-moment scaling.
Bias correction compensates for the fact that the first and second moment estimates start at zero.
Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!
Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015
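A minimal sketch of the full Adam update with bias correction; beta1 and beta2 follow the recommendation above, while the learning rate is scaled up from the recommended 1e-3 only because the toy problem is tiny.

import numpy as np

def grad(x):                          # toy ill-conditioned quadratic loss
    return np.array([1.0, 50.0]) * x

x = np.array([-3.0, 1.0])
first_moment = np.zeros_like(x)
second_moment = np.zeros_like(x)
beta1, beta2, learning_rate = 0.9, 0.999, 0.1
for t in range(1, 1001):
    dx = grad(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx          # momentum
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx   # AdaGrad / RMSProp
    first_unbias = first_moment / (1 - beta1 ** t)                  # bias correction
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
print(x)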
[Animation: SGD, SGD+Momentum, RMSProp, and Adam trajectories on the same loss surface]
SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have
learning rate as a hyperparameter.
Learning rate decay over time:
step decay: e.g. decay the learning rate by half every few epochs
exponential decay: alpha = alpha_0 * exp(-k * t)
1/t decay: alpha = alpha_0 / (1 + k * t)
[Plot: training loss vs. epoch; learning rate decay shows up as sudden drops in the loss curve at the epochs where the rate is reduced]
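A minimal sketch of these decay schedules (step decay, exponential decay, 1/t decay); the initial rate, decay constant, and drop interval are illustrative.

import numpy as np

alpha_0 = 1e-2   # initial learning rate (illustrative)
k = 0.1          # decay constant (illustrative)

def step_decay(epoch, drop_every=10, factor=0.5):
    return alpha_0 * (factor ** (epoch // drop_every))    # halve every few epochs

def exponential_decay(epoch):
    return alpha_0 * np.exp(-k * epoch)                   # alpha = alpha_0 * e^(-k t)

def one_over_t_decay(epoch):
    return alpha_0 / (1 + k * epoch)                      # alpha = alpha_0 / (1 + k t)

for epoch in [0, 10, 50, 100]:
    print(epoch, step_decay(epoch), exponential_decay(epoch), one_over_t_decay(epoch))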
First-Order Optimization
(1) Use the gradient to form a linear approximation
(2) Step to minimize the approximation
[Plot: loss vs. w1, with the linear approximation at the current point]
Second-Order Optimization
(1) Use gradient and Hessian to form quadratic approximation
(2) Step to the minimum of the approximation
[Plot: loss vs. w1, with the quadratic approximation at the current point]
Second-Order Optimization
Second-order Taylor expansion:
J(theta) ~= J(theta_0) + (theta - theta_0)^T grad J(theta_0) + 1/2 (theta - theta_0)^T H (theta - theta_0)
Solving for the critical point we obtain the Newton parameter update:
theta* = theta_0 - H^{-1} grad J(theta_0)
No hyperparameters! No learning rate!
But: the Hessian has O(N^2) elements and inverting it takes O(N^3), with N = (tens or hundreds of) millions of parameters.
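A minimal sketch of a single Newton update on a tiny quadratic (the toy objective and its analytic gradient and Hessian are assumptions); for a quadratic the step lands exactly on the minimum, with no learning rate.

import numpy as np

# Toy quadratic loss: J(theta) = 0.5 * theta^T A theta - b^T theta
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])          # Hessian (constant for a quadratic)
b = np.array([1.0, -2.0])

theta = np.array([5.0, 5.0])
grad = A.dot(theta) - b                         # gradient at theta
theta_new = theta - np.linalg.solve(A, grad)    # Newton step: theta - H^{-1} grad
print(theta_new, np.linalg.solve(A, b))         # both equal the exact minimizer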
Quasi-Newton methods (BFGS being the most popular): instead of inverting the Hessian (O(N^3)), build up an approximation to the inverse Hessian with low-rank updates over time.
L-BFGS (Limited memory BFGS): does not even form or store the full inverse Hessian.
L-BFGS
- Usually works very well in full batch, deterministic mode: if you have a single, deterministic f(x) then L-BFGS will probably work very nicely
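In the full-batch, deterministic setting you can often just call an off-the-shelf implementation; a usage sketch assuming SciPy is available, with the Rosenbrock function standing in for a deterministic f(x).

import numpy as np
from scipy.optimize import minimize

def f(x):   # deterministic objective: the Rosenbrock function
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

result = minimize(f, x0=np.array([-1.5, 2.0]), method="L-BFGS-B")
print(result.x)   # close to the minimum at [1, 1]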
In practice:
- Adam is a good default choice in most cases
- If you can afford to do full batch updates then try out L-BFGS (and don't forget to disable all sources of noise)
Beyond Training Error
Better optimization algorithms help us reduce training loss, but what we really care about is error on unseen data. How do we reduce the gap between training and test performance?
Model Ensembles
1. Train multiple independent models
2. At test time average their results
Model Ensembles: Tips and Tricks
Instead of training independent models, use multiple
snapshots of a single model during training!
Loshchilov and Hutter, “SGDR: Stochastic gradient descent with restarts”, arXiv 2016
Huang et al, “Snapshot ensembles: train 1, get M for free”, ICLR 2017
Figures copyright Yixuan Li and Geoff Pleiss, 2017. Reproduced with permission.
Cyclic learning rate schedules can make this work even better!
Model Ensembles: Tips and Tricks
Instead of using the actual parameter vector, keep a moving average of the parameter vector and use that at test time (Polyak averaging).
Polyak and Juditsky, “Acceleration of stochastic approximation by averaging”, SIAM Journal on Control and Optimization, 1992.
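A minimal sketch of Polyak averaging inside a (noisy) training loop; the toy noisy gradient and the 0.995 decay are illustrative.

import numpy as np

def grad(x):                              # toy noisy quadratic: minimum at 0
    return x + 0.5 * np.random.randn(*x.shape)

x = np.array([3.0, -2.0])
x_test = x.copy()                         # moving average of parameters, used at test time
learning_rate = 0.1
for t in range(500):
    x -= learning_rate * grad(x)          # normal (noisy) SGD update
    x_test = 0.995 * x_test + 0.005 * x   # Polyak / exponential moving average
print(x, x_test)   # x_test is a smoother estimate near the optimum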
How to improve single-model performance?
Regularization
Regularization: Add term to loss
Total loss = data loss + lambda * R(W). In common use:
L2 regularization (weight decay): R(W) = sum over k, l of W_{k,l}^2
L1 regularization: R(W) = sum over k, l of |W_{k,l}|
Elastic net (L1 + L2): R(W) = sum over k, l of (beta * W_{k,l}^2 + |W_{k,l}|)
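A minimal sketch of adding L2 regularization (weight decay) to a toy loss and its gradient; the data, regularization strength, and step size are illustrative.

import numpy as np

np.random.seed(0)
X, y = np.random.randn(100, 5), np.random.randn(100)
W = np.zeros(5)
reg = 1e-2          # regularization strength lambda (illustrative)

for t in range(200):
    scores = X.dot(W)
    data_grad = X.T.dot(scores - y) / len(y)   # gradient of the data loss (MSE)
    reg_grad = 2 * reg * W                     # gradient of lambda * sum(W^2)
    W -= 0.1 * (data_grad + reg_grad)

loss = 0.5 * np.mean((X.dot(W) - y) ** 2) + reg * np.sum(W * W)
print(loss)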
Regularization: Dropout
In each forward pass, randomly set some neurons to zero
Probability of dropping is a hyperparameter; 0.5 is common
Srivastava et al, “Dropout: A simple way to prevent neural networks from overfitting”, JMLR 2014
Example forward pass with a 3-layer network using dropout (sketch below).
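The sketch is in the spirit of the example on the original slide, with illustrative layer sizes and p = 0.5 as the probability of keeping a unit; the backward pass and parameter update are omitted.

import numpy as np

p = 0.5  # probability of keeping a unit active; higher = less dropout

W1, b1 = np.random.randn(20, 100), np.zeros(100)
W2, b2 = np.random.randn(100, 100), np.zeros(100)
W3, b3 = np.random.randn(100, 10), np.zeros(10)

def train_step(x):
    H1 = np.maximum(0, x.dot(W1) + b1)
    U1 = np.random.rand(*H1.shape) < p   # first dropout mask
    H1 *= U1                             # drop!
    H2 = np.maximum(0, H1.dot(W2) + b2)
    U2 = np.random.rand(*H2.shape) < p   # second dropout mask
    H2 *= U2                             # drop!
    out = H2.dot(W3) + b3
    return out                           # backward pass / update omitted

print(train_step(np.random.randn(4, 20)).shape)   # (4, 10)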
Regularization: Dropout
How can this possibly be a good idea?
Forces the network to have a redundant representation;
Prevents co-adaptation of features
[Diagram: features such as "has an ear", "has a tail", "is furry", "has claws", "mischievous look" feed into the cat score; in each forward pass a random subset of them is dropped (X)]
Another interpretation: dropout is training a large ensemble of models that share parameters; each binary dropout mask is one model.
Dropout: Test time
Dropout makes the output random, so at test time we want to "average out" the randomness, i.e. approximate the integral over masks.
Consider a single neuron a with inputs x, y and weights w1, w2.
At test time we have: E[a] = w1*x + w2*y
During training (dropout with p = 0.5) we have:
E[a] = 1/4 (w1*x + w2*y) + 1/4 (w1*x + 0) + 1/4 (0 + w2*y) + 1/4 (0 + 0) = 1/2 (w1*x + w2*y)
At test time all neurons are always active, so multiply each activation by the dropout probability p; then the output at test time equals the expected output at training time.
Dropout Summary: drop units randomly at train time; at test time, scale the activations by p.
More common: “Inverted dropout”
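A minimal sketch of inverted dropout for one layer: scale the mask by 1/p at training time so the test-time code needs no change (shapes and p are illustrative).

import numpy as np

p = 0.5  # probability of keeping a unit active

def dropout_train(H):
    U = (np.random.rand(*H.shape) < p) / p   # mask scaled by 1/p at train time
    return H * U

def dropout_test(H):
    return H                                 # test time: no scaling needed

H = np.maximum(0, np.random.randn(4, 100))
print(dropout_train(H).mean(), dropout_test(H).mean())  # similar expected activation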
Regularization: A common pattern
Training: Add some kind of randomness
Testing: Average out the randomness (sometimes approximately)
Example: Batch Normalization. Training: normalize using stats from random minibatches. Testing: normalize using fixed (running-average) stats.
Regularization: Data Augmentation
[Pipeline: load image and label ("cat") -> transform image -> CNN -> compute loss]
Data Augmentation
Horizontal Flips
Data Augmentation
Random crops and scales
Training: sample random crops / scales
ResNet:
1. Pick random L in range [256, 480]
2. Resize training image, short side = L
3. Sample random 224 x 224 patch
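A minimal numpy sketch of the training-time random scale + crop sampling described above; the nearest-neighbor resize is an assumption for brevity (a real pipeline would use a proper image library).

import numpy as np

def random_resized_crop(img, crop=224, scale_range=(256, 480)):
    # img: (H, W, 3) uint8 array; resize so the short side is L, then take a random crop
    H, W, _ = img.shape
    L = np.random.randint(scale_range[0], scale_range[1] + 1)    # 1. pick random L in [256, 480]
    ratio = L / min(H, W)
    newH, newW = int(round(H * ratio)), int(round(W * ratio))
    rows = (np.arange(newH) / ratio).astype(int).clip(0, H - 1)  # 2. nearest-neighbor resize
    cols = (np.arange(newW) / ratio).astype(int).clip(0, W - 1)
    resized = img[rows][:, cols]
    y = np.random.randint(0, newH - crop + 1)                    # 3. sample random 224 x 224 patch
    x = np.random.randint(0, newW - crop + 1)
    return resized[y:y + crop, x:x + crop]

img = (np.random.rand(300, 400, 3) * 255).astype(np.uint8)
print(random_resized_crop(img).shape)   # (224, 224, 3)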
Data Augmentation: Color Jitter
Simple: randomize contrast and brightness.
More complex (as seen in [Krizhevsky et al. 2012], ResNet, etc.):
1. Apply PCA to all [R, G, B] pixel values in the training set
2. Sample a "color offset" along the principal component directions
3. Add the offset to all pixels of a training image
Data Augmentation
Get creative for your problem!
Random mix/combinations of:
- translation
- rotation
- stretching
- shearing
- lens distortions, … (go crazy)
Regularization: A common pattern
Training: Add random noise
Testing: Marginalize over the noise
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fractional Max Pooling
Stochastic Depth
Huang et al, “Deep Networks with Stochastic Depth”, ECCV 2016
Transfer Learning
“You need a lot of data if you want to train/use CNNs” - BUSTED
Donahue et al, “DeCAF: A Deep Convolutional Activation
Feature for Generic Visual Recognition”, ICML 2014
1. Train on Imagenet
[Network: Image -> Conv-64 x2 -> MaxPool -> Conv-128 x2 -> MaxPool -> Conv-256 x2 -> MaxPool -> Conv-512 x2 -> MaxPool -> Conv-512 x2 -> MaxPool -> FC-4096 -> FC-4096 -> FC-1000]
[Figure: steps 2 and 3 - with a small target dataset, reinitialize and train only the last FC layer and freeze the rest of the network; with a bigger dataset, finetune more of the top layers]
The earlier conv layers of the pretrained network are more generic; the later FC layers are more specific to the original task.

                        | very similar dataset          | very different dataset
very little data        | Use a linear classifier on    | You’re in trouble… Try a linear
                        | the top layer                 | classifier from different stages
quite a lot of data     | Finetune a few layers         | Finetune a larger number of layers
Transfer learning with CNNs is pervasive…
(it’s the norm, not an exception)
Object Detection (Fast R-CNN) and Image Captioning (CNN + RNN) both start from a CNN pretrained on ImageNet.
Takeaway for your projects and beyond:
Have some dataset of interest but it has < ~1M images?
1. Find a very large dataset that has similar data and train a big ConvNet there
2. Transfer learn to your dataset
Deep learning frameworks provide a “Model Zoo” of pretrained models so you don’t need to train your own.
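A hedged PyTorch/torchvision sketch of the "use a linear classifier on the top layer" strategy from the table above; it is not from the slides, and the pretrained ResNet-18, num_classes value, and dummy batch are illustrative assumptions.

import torch
import torch.nn as nn
import torchvision

num_classes = 10   # hypothetical: number of classes in your small dataset

model = torchvision.models.resnet18(pretrained=True)   # CNN pretrained on ImageNet
for param in model.parameters():
    param.requires_grad = False                         # freeze all pretrained layers

model.fc = nn.Linear(model.fc.in_features, num_classes) # reinitialize the last layer

# Only the new final layer is trained; use a small learning rate if you later
# unfreeze and finetune more layers.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 3, 224, 224)            # dummy batch standing in for your data
y = torch.randint(0, num_classes, (8,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
print(loss.item())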
Summary
- Optimization
- Momentum, RMSProp, Adam, etc
- Regularization
- Dropout, etc
- Transfer learning
- Use this for your projects!
Next time: Deep Learning Software!