
Training Neural Networks – Part II

Lorenzo Servadei, Daniela Lopera, Sebastian Schober, Wolfgang Ecker


Agenda: Training Neural Networks – Part II
- Optimization

- Regularization

- Transfer Learning

Optimization
[Figure: contours of a loss function over two weights w_1 and w_2]
Optimization: Problems with SGD
What if the loss changes quickly in one direction and slowly in another? What does gradient descent do?
Very slow progress along the shallow dimension, jitter along the steep direction.

The loss function has a high condition number: the ratio of the largest to smallest singular value of the Hessian matrix is large.
Optimization: Problems with SGD
What if the loss function has a local minimum or saddle point? Zero gradient, and gradient descent gets stuck.
Saddle points are much more common in high dimensions.

Dauphin et al, "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization", NIPS 2014
Optimization: Problems with SGD
Our gradients come from minibatches, so they can be noisy!
SGD + Momentum
- Build up "velocity" as a running mean of gradients
- rho gives "friction"; typically rho = 0.9 or 0.99
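As a reference, a minimal numpy sketch of the update rule on a toy ill-conditioned quadratic (the toy loss, learning_rate, and rho values are assumptions for illustration, not taken from the slides):

import numpy as np

A = np.diag([1.0, 50.0])             # toy ill-conditioned quadratic: f(x) = 0.5 * x^T A x
grad = lambda x: A @ x               # its gradient

x = np.array([1.0, 1.0])
learning_rate, rho = 1e-2, 0.9       # rho gives "friction"
vx = np.zeros_like(x)
for _ in range(100):
    dx = grad(x)
    # vanilla SGD would be:  x -= learning_rate * dx
    vx = rho * vx + dx               # build up velocity as a running mean of gradients
    x -= learning_rate * vx          # step along the velocity instead of the raw gradient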
SGD + Momentum
[Figures: SGD vs. SGD+Momentum trajectories under local minima, saddle points, poor conditioning, and gradient noise]
SGD + Momentum
Momentum update: the actual step is the combination of the velocity and the current gradient. [Figure: velocity and gradient vectors combining into the actual step]
Nesterov Momentum
In the plain momentum update, the actual step combines the velocity with the gradient taken at the current point. Nesterov momentum instead "looks ahead": it takes the gradient at the point reached by the velocity step and combines it with the velocity. [Figure: momentum vs. Nesterov vector diagrams]

Nesterov, "A method of solving a convex programming problem with convergence rate O(1/k^2)", 1983
Nesterov, "Introductory lectures on convex optimization: a basic course", 2004
Sutskever et al, "On the importance of initialization and momentum in deep learning", ICML 2013
Nesterov Momentum
The velocity update evaluates the gradient at the look-ahead point, which is annoying; usually we want the update purely in terms of the current parameters and their gradient. A change of variables and rearranging gives an equivalent form.
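The formulas on these slides are only images in the source; as a reference, the updates in the usual notation (a reconstruction, with alpha the learning rate and rho the friction) are:

v_{t+1} = \rho\, v_t - \alpha \nabla f(x_t + \rho\, v_t), \qquad x_{t+1} = x_t + v_{t+1}

and, with the change of variables \tilde{x}_t = x_t + \rho\, v_t,

v_{t+1} = \rho\, v_t - \alpha \nabla f(\tilde{x}_t), \qquad \tilde{x}_{t+1} = \tilde{x}_t - \rho\, v_t + (1 + \rho)\, v_{t+1}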
Nesterov Momentum
[Figure: optimization trajectories of SGD, SGD+Momentum, and Nesterov on a toy loss surface]
AdaGrad
Added element-wise scaling of the gradient based on the historical sum of squares in each dimension.

Duchi et al, "Adaptive subgradient methods for online learning and stochastic optimization", JMLR 2011
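A minimal numpy sketch of the AdaGrad update on the same toy quadratic (variable names and constants are assumptions, not copied from the slide):

import numpy as np

A = np.diag([1.0, 50.0])
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
learning_rate = 0.1
grad_squared = np.zeros_like(x)
for _ in range(100):
    dx = grad(x)
    grad_squared += dx * dx                                    # historical sum of squares, per dimension
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)   # element-wise scaled step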
AdaGrad
Q: What happens with AdaGrad? (Progress along steep directions is damped; progress along shallow directions is accelerated.)

Q2: What happens to the step size over a long time? (The squared-gradient sum only grows, so the effective step size decays toward zero.)
RMSProp
RMSProp modifies AdaGrad by letting the squared-gradient accumulator decay (a leaky running average), so the effective step size no longer shrinks to zero.

Tieleman and Hinton, 2012
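A minimal numpy sketch of the RMSProp update (decay_rate and the other constants are assumptions):

import numpy as np

A = np.diag([1.0, 50.0])
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
learning_rate, decay_rate = 1e-2, 0.99
grad_squared = np.zeros_like(x)
for _ in range(100):
    dx = grad(x)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx   # leaky average of squared gradients
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)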
RMSProp
[Figure: trajectories of SGD, SGD+Momentum, and RMSProp on a toy loss surface]
Adam (almost)
First moment: a running average of the gradient (like momentum).
Second moment: a running average of the squared gradient (the uncentered variance, like AdaGrad/RMSProp).

Kingma and Ba, "Adam: A method for stochastic optimization", ICLR 2015
Adam (almost)
Combining the momentum-style first moment with the AdaGrad/RMSProp-style second moment gives something sort of like RMSProp with momentum.

Q: What happens at the first timestep?

Kingma and Ba, "Adam: A method for stochastic optimization", ICLR 2015
Adam (full form)
Momentum (first moment) plus AdaGrad/RMSProp scaling (second moment), with bias correction for the fact that the first and second moment estimates start at zero.

Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!

Kingma and Ba, "Adam: A method for stochastic optimization", ICLR 2015
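A minimal numpy sketch of the full Adam update, including bias correction (the toy loss and the eps value are assumptions):

import numpy as np

A = np.diag([1.0, 50.0])
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
learning_rate, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
first_moment = np.zeros_like(x)
second_moment = np.zeros_like(x)
for t in range(1, 1001):                                            # t starts at 1
    dx = grad(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx          # momentum-like
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx   # RMSProp-like
    first_unbias = first_moment / (1 - beta1 ** t)                  # bias correction: both
    second_unbias = second_moment / (1 - beta2 ** t)                # estimates start at zero
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + eps)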
Adam
[Figure: trajectories of SGD, SGD+Momentum, RMSProp, and Adam on a toy loss surface]
RAdam

SGD, SGD+Momentum, AdaGrad, RMSProp, and Adam all have learning rate as a hyperparameter.

Q: Which one of these learning rates is best to use?
SGD, SGD+Momentum, AdaGrad, RMSProp, and Adam all have learning rate as a hyperparameter.

=> Decay the learning rate over time!

step decay: e.g. decay the learning rate by half every few epochs.
exponential decay: alpha = alpha_0 * exp(-k t)
1/t decay: alpha = alpha_0 / (1 + k t)
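A minimal sketch of the three schedules (the constants alpha0 and k and the 5-epoch step interval are assumptions):

import math

alpha0, k = 1e-1, 0.1
for epoch in range(30):
    step_lr = alpha0 * (0.5 ** (epoch // 5))   # step decay: halve every 5 epochs
    exp_lr = alpha0 * math.exp(-k * epoch)     # exponential decay
    inv_lr = alpha0 / (1 + k * epoch)          # 1/t decay
    print(epoch, step_lr, exp_lr, inv_lr)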
SGD, SGD+Momentum, AdaGrad, RMSProp, and Adam all have learning rate as a hyperparameter.

Learning rate decay in practice: the loss-vs-epoch curve shows a sharp drop each time the learning rate is decayed. Learning rate decay is more critical with SGD+Momentum and less common with Adam. [Figure: loss vs. epoch with step decays]
First-Order Optimization
(1) Use the gradient to form a linear approximation of the loss
(2) Step to minimize the approximation
[Figure: loss vs. w1 with a linear (tangent) approximation]
Second-Order Optimization
(1) Use the gradient and Hessian to form a quadratic approximation
(2) Step to the minimum of the approximation
[Figure: loss vs. w1 with a quadratic approximation]
Second-Order Optimization
Take a second-order Taylor expansion of the loss around the current parameters; solving for the critical point gives the Newton parameter update.

Q: What is nice about this update? No hyperparameters! No learning rate!
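The expansion and update appear only as images in the source; in standard notation (a reconstruction, with H the Hessian of the loss J at theta_0):

J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^\top \nabla_\theta J(\theta_0) + \tfrac{1}{2} (\theta - \theta_0)^\top H (\theta - \theta_0)

\theta^{*} = \theta_0 - H^{-1} \nabla_\theta J(\theta_0)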
Second-Order Optimization
Hessian: the square matrix of second-order partial derivatives.

Q2: Why is this bad for deep learning? The Hessian has O(N^2) elements and inverting it takes O(N^3), where N is the number of parameters, i.e. (tens or hundreds of) millions.
Second-Order Optimization
- Quasi-Newton methods (BFGS most popular): instead of inverting the Hessian (O(n^3)), approximate the inverse Hessian with rank-1 updates over time (O(n^2) each).
- L-BFGS (Limited-memory BFGS): does not form/store the full inverse Hessian.
L-BFGS
- Usually works very well in full-batch, deterministic mode: if you have a single, deterministic f(x), then L-BFGS will probably work very nicely.
- Does not transfer very well to the mini-batch setting and gives bad results. Adapting L-BFGS to the large-scale, stochastic setting is an active area of research.

Le et al, "On optimization methods for deep learning", ICML 2011
In practice:
- Adam is a good default choice in most cases
- If you can afford to do full-batch updates, then try out L-BFGS (and don't forget to disable all sources of noise)
Agenda: Training Neural Networks – Part II
- Optimization

- Regularization

- Transfer Learning

Beyond Training Error
Better optimization algorithms help reduce training loss. But we really care about error on new data: how do we reduce the gap?
Model Ensembles
1. Train multiple independent models
2. At test time, average their results

Enjoy 2% extra performance
Model Ensembles: Tips and Tricks
Instead of training independent models, use multiple snapshots of a single model during training! Cyclic learning rate schedules can make this work even better!

Loshchilov and Hutter, "SGDR: Stochastic gradient descent with restarts", arXiv 2016
Huang et al, "Snapshot ensembles: train 1, get M for free", ICLR 2017
Figures copyright Yixuan Li and Geoff Pleiss, 2017. Reproduced with permission.
Model Ensembles: Tips and Tricks
Instead of using the actual parameter vector, keep a moving average of the parameter vector and use that at test time (Polyak averaging).

Polyak and Juditsky, "Acceleration of stochastic approximation by averaging", SIAM Journal on Control and Optimization, 1992.
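A minimal sketch of keeping such a moving average (the 0.999 decay and the fake update are assumptions for illustration):

import numpy as np

decay = 0.999
params = np.random.randn(10)                  # current (toy) parameter vector
test_params = params.copy()                   # moving average used at test time
for step in range(1000):
    params -= 1e-2 * np.random.randn(10)      # stand-in for one SGD update
    test_params = decay * test_params + (1 - decay) * params   # exponential moving average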
How to improve single-model performance?

Regularization
Regularization: Add a term to the loss

In common use:
- L2 regularization (weight decay): R(W) = sum_k sum_l W_{k,l}^2
- L1 regularization: R(W) = sum_k sum_l |W_{k,l}|
- Elastic net (L1 + L2): R(W) = sum_k sum_l (beta * W_{k,l}^2 + |W_{k,l}|)
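A minimal numpy sketch of adding an L2 penalty to a data loss (the regularization strength and names are assumptions):

import numpy as np

W = np.random.randn(10, 5)                # toy weight matrix
data_loss = 1.23                          # stand-in for e.g. a softmax loss on a minibatch
reg = 1e-4                                # regularization strength
loss = data_loss + reg * np.sum(W * W)    # add the L2 (weight decay) term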
Regularization: Dropout
In each forward pass, randomly set some neurons to zero
Probability of dropping is a hyperparameter; 0.5 is common

Srivastava et al, “Dropout: A simple way to prevent neural networks from overfitting”, JMLR 2014

Regularization: Dropout
Example forward pass with a 3-layer network using dropout.
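The slide's code is not in this transcript; a minimal numpy sketch of such a forward pass (layer sizes, initialization, and names are assumptions):

import numpy as np

drop_p = 0.5                                            # probability of dropping a unit
W1, b1 = 0.01 * np.random.randn(100, 20), np.zeros(100)
W2, b2 = 0.01 * np.random.randn(100, 100), np.zeros(100)
W3, b3 = 0.01 * np.random.randn(10, 100), np.zeros(10)

def train_step(x):
    H1 = np.maximum(0, W1 @ x + b1)
    H1 *= np.random.rand(*H1.shape) >= drop_p           # first dropout mask: drop!
    H2 = np.maximum(0, W2 @ H1 + b2)
    H2 *= np.random.rand(*H2.shape) >= drop_p           # second dropout mask: drop!
    return W3 @ H2 + b3

out = train_step(np.random.randn(20))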
Regularization: Dropout
How can this possibly be a good idea?

Forces the network to have a redundant representation; prevents co-adaptation of features.
[Figure: a cat score built from features such as "has an ear", "has a tail", "is furry", "has claws", "mischievous", with some features randomly dropped]
Regularization: Dropout
How can this possibly be a good idea?

Another interpretation: dropout is training a large ensemble of models (that share parameters). Each binary mask is one model.

An FC layer with 4096 units has 2^4096 ~ 10^1233 possible masks! There are only ~10^82 atoms in the universe...
Dropout: Test time
Dropout makes our output random: the output y now depends on the input x (image) and on a random dropout mask z.

We want to "average out" the randomness at test time: y = f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz. But this integral seems hard to evaluate...
Dropout: Test time
Want to approximate the integral. Consider a single neuron a with inputs x, y and weights w1, w2.

At test time we have: E[a] = w1*x + w2*y

During training (dropping each input with probability 1/2) we have:
E[a] = 1/4 (w1*x + w2*y) + 1/4 (w1*x + 0*y) + 1/4 (0*x + w2*y) + 1/4 (0*x + 0*y) = 1/2 (w1*x + w2*y)

At test time, multiply by the dropout probability.
Dropout: Test time
At test time all neurons are always active => we must scale the activations so that, for each neuron, output at test time = expected output at training time.
Dropout Summary
- Drop in the forward pass
- Scale at test time
More common: "Inverted dropout"
Do the scaling at training time instead, so test time is unchanged!
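A minimal numpy sketch of inverted dropout for one layer (names, sizes, and the keep probability are assumptions):

import numpy as np

keep_p = 0.5
W1, b1 = 0.01 * np.random.randn(100, 20), np.zeros(100)

def train_forward(x):
    H1 = np.maximum(0, W1 @ x + b1)
    U1 = (np.random.rand(*H1.shape) < keep_p) / keep_p   # drop and rescale during training
    return H1 * U1

def test_forward(x):
    return np.maximum(0, W1 @ x + b1)                    # test time is unchanged

train_forward(np.random.randn(20)); test_forward(np.random.randn(20))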
Regularization: A common pattern
Training: Add some kind of randomness
Testing: Average out the randomness (sometimes approximately)
Regularization: A common pattern
Training: Add some kind of randomness
Testing: Average out the randomness (sometimes approximately)

Example: Batch Normalization
Training: Normalize using stats from random minibatches
Testing: Use fixed stats to normalize
Regularization: Data Augmentation
[Figure: load an image and its label ("cat"), run the CNN, compute the loss. Image by Nikita, licensed under CC-BY 2.0]
Regularization: Data Augmentation
[Figure: the same pipeline, but transform the image before feeding it to the CNN]
Data Augmentation
Horizontal Flips

Data Augmentation
Random crops and scales

Training: sample random crops / scales
ResNet:
1. Pick a random L in the range [256, 480]
2. Resize the training image, short side = L
3. Sample a random 224 x 224 patch

Testing: average a fixed set of crops
ResNet:
1. Resize the image at 5 scales: {224, 256, 384, 480, 640}
2. For each size, use 10 224 x 224 crops: 4 corners + center, plus flips
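A minimal, dependency-free numpy sketch of the training-time random crop plus a horizontal flip (the nearest-neighbor resize and all names here are assumptions; real pipelines use library image transforms):

import numpy as np

def random_crop_flip(img):                      # img: H x W x 3 array
    L = np.random.randint(256, 481)             # random short-side length in [256, 480]
    H, W, _ = img.shape
    scale = L / min(H, W)
    ys = (np.arange(int(H * scale)) / scale).astype(int)   # nearest-neighbor resize indices
    xs = (np.arange(int(W * scale)) / scale).astype(int)
    img = img[ys][:, xs]
    top = np.random.randint(0, img.shape[0] - 224 + 1)     # random 224 x 224 patch
    left = np.random.randint(0, img.shape[1] - 224 + 1)
    patch = img[top:top + 224, left:left + 224]
    if np.random.rand() < 0.5:                  # random horizontal flip
        patch = patch[:, ::-1]
    return patch

patch = random_crop_flip(np.random.rand(300, 400, 3))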
Data Augmentation
Color Jitter
Simple: Randomize contrast and brightness

More complex (as seen in [Krizhevsky et al. 2012], ResNet, etc.):
1. Apply PCA to all [R, G, B] pixels in the training set
2. Sample a "color offset" along the principal component directions
3. Add the offset to all pixels of a training image
Data Augmentation
Get creative for your problem!
Random mixes/combinations of:
- translation
- rotation
- stretching
- shearing
- lens distortions, ... (go crazy)
Regularization: A common pattern
Training: Add random noise
Testing: Marginalize over the noise

Examples:
- Dropout
- Batch Normalization
- Data Augmentation
- DropConnect (Wan et al, "Regularization of Neural Networks using DropConnect", ICML 2013)
- Fractional Max Pooling (Graham, "Fractional Max Pooling", arXiv 2014)
- Stochastic Depth (Huang et al, "Deep Networks with Stochastic Depth", ECCV 2016)
Agenda: Training Neural Networks – Part II
- Optimization

- Regularization

- Transfer Learning

Transfer Learning
"You need a lot of data if you want to train/use Deep Neural Networks" ... BUSTED
Transfer Learning with CNNs
Donahue et al, "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition", ICML 2014
Razavian et al, "CNN Features Off-the-Shelf: An Astounding Baseline for Recognition", CVPR Workshops 2014

1. Train on ImageNet
[Figure: a VGG-style network: Image, Conv-64 x2, MaxPool, Conv-128 x2, MaxPool, Conv-256 x2, MaxPool, Conv-512 x2, MaxPool, Conv-512 x2, MaxPool, FC-4096, FC-4096, FC-1000]

2. Small dataset (C classes)
Reinitialize the final layer as FC-C and train it; freeze all the other layers.

3. Bigger dataset
With a bigger dataset, train more layers: reinitialize and train the last few fully connected layers (FC-C, FC-4096, FC-4096) and freeze the earlier layers. Use a lower learning rate when finetuning; 1/10 of the original LR is a good starting point.
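A minimal, hedged PyTorch/torchvision sketch of step 2 (the class count, learning rate, and optimizer are assumptions; older torchvision versions take pretrained=True instead of the weights argument):

import torch.nn as nn
import torch.optim as optim
import torchvision.models as models

C = 10                                           # number of classes in the new, small dataset (assumed)
model = models.vgg16(weights="IMAGENET1K_V1")    # 1. start from an ImageNet-pretrained network

for param in model.parameters():                 # freeze all pretrained layers
    param.requires_grad = False

model.classifier[6] = nn.Linear(4096, C)         # 2. reinitialize the last layer (FC-1000 -> FC-C)

# Only the new layer's parameters are optimized:
optimizer = optim.SGD(model.classifier[6].parameters(), lr=1e-3, momentum=0.9)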
How much to finetune depends on your data. Earlier layers are more generic; later layers are more specific to the original task.

                        very similar dataset           very different dataset
very little data        Use a linear classifier on     You're in trouble... Try a linear
                        the top layer                  classifier from different stages
quite a lot of data     Finetune a few layers          Finetune a larger number of layers
Transfer learning with CNNs is pervasive... (it's the norm, not an exception)

Object Detection (Fast R-CNN): CNN pretrained on ImageNet.
Image Captioning (CNN + RNN): CNN pretrained on ImageNet; word vectors pretrained with word2vec.

Girshick, "Fast R-CNN", ICCV 2015. Figure copyright Ross Girshick, 2015. Reproduced with permission.
Karpathy and Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions", CVPR 2015. Figure copyright IEEE, 2015. Reproduced for educational purposes.
Takeaway for your projects and beyond:
Have some dataset of interest but it has < ~1M images?

1. Find a very large dataset that has similar data and train a big network there
2. Transfer learn to your dataset

Deep learning frameworks provide a "Model Zoo" of pretrained models so you don't need to train your own:
Caffe: https://github.com/BVLC/caffe/wiki/Model-Zoo
TensorFlow: https://github.com/tensorflow/models
PyTorch: https://github.com/pytorch/vision
Sources

http://cs231n.stanford.edu/2017/syllabus.html
