cs231n Training Neural Networks II
Fei-Fei Li & Justin Johnson & Serena Yeung, April 25, 2017
Administrative
Administrative: Google Cloud
- STOP YOUR INSTANCES when not in use!
- Keep track of your spending!
- GPU instances are much more expensive than CPU instances; only use a GPU instance when you need it (e.g. for A2, only on the TensorFlow / PyTorch notebooks)
Last time: Activation Functions
Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU
Last time: Weight Initialization
Initialization too small: activations go to zero, gradients also go to zero, no learning.
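To see why, here is a minimal numpy sketch (not from the slides; the layer sizes, tanh nonlinearity, and the Xavier-style comparison are illustrative assumptions): with a tiny 0.01 * randn initialization the activation magnitudes collapse toward zero layer by layer, so the gradients vanish.

import numpy as np

np.random.seed(0)
x = np.random.randn(64, 500)                 # batch of 64 inputs, 500 features

for name in ["0.01 * randn", "Xavier"]:
    h = x
    stds = []
    for layer in range(6):                   # 6 tanh layers, 500 units each
        fan_in = h.shape[1]
        if name == "Xavier":
            W = np.random.randn(fan_in, 500) / np.sqrt(fan_in)
        else:
            W = 0.01 * np.random.randn(fan_in, 500)
        h = np.tanh(h.dot(W))
        stds.append(h.std())
    print(name, ["%.4f" % s for s in stds])
# tiny init: activation std collapses toward 0 layer by layer -> gradients vanish
# Xavier init: activation std stays roughly constant across layers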
Last time: Data Preprocessing
Before normalization: classification loss very sensitive to changes in the weight matrix; hard to optimize.
After normalization: less sensitive to small changes in weights; easier to optimize.
Last time: Batch Normalization
Input: x, a minibatch of activations of shape N x D
Learnable params: scale gamma and shift beta, each of shape D
Intermediates: per-dimension minibatch mean mu and variance sigma^2; normalized x_hat = (x - mu) / sqrt(sigma^2 + eps)
Output: y = gamma * x_hat + beta
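A minimal numpy sketch of the train-time batch normalization forward pass described above; the running-average bookkeeping used at test time is omitted, and the batch size, feature dimension, and eps value are illustrative.

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, D) minibatch of activations; gamma, beta: (D,) learnable params
    mu = x.mean(axis=0)                    # per-dimension mean
    var = x.var(axis=0)                    # per-dimension variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to zero mean / unit variance
    return gamma * x_hat + beta            # learnable scale and shift

x = np.random.randn(32, 100) * 5 + 3
out = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(out.mean(), out.std())               # roughly 0 and 1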
Last time: Babysitting Learning
Last time: Hyperparameter Search
Coarse to fine search
[Figure: grid layout vs. random layout of hyperparameter samples, plotted over an important parameter and an unimportant parameter]
Today
- Fancier optimization
- Regularization
- Transfer Learning
Optimization
[Figure: contour plot of a loss landscape over weights W_1 and W_2]
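For reference, a runnable sketch of the plain gradient descent loop that the following slides improve on, using a toy ill-conditioned quadratic loss (the loss, step size, and iteration count are illustrative assumptions, not from the slides).

import numpy as np

# Toy loss: f(x) = 0.5 * (x1^2 + 50 * x2^2), an ill-conditioned quadratic
def grad(x):
    return np.array([1.0, 50.0]) * x

x = np.array([-3.0, 1.0])
learning_rate = 0.03
for t in range(200):
    dx = grad(x)
    x -= learning_rate * dx        # vanilla gradient descent step
print(x)  # slow progress along the shallow (x1) direction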
Optimization: Problems with SGD
What if loss changes quickly in one direction and slowly in another?
What does gradient descent do?
Very slow progress along shallow dimension, jitter along steep direction
What if the loss function has a local minimum or saddle point?
Zero gradient: gradient descent gets stuck.
Our gradients come from minibatches, so they can be noisy!
SGD + Momentum
[Animation: SGD vs. SGD+Momentum trajectories on the same loss surface]
[Figures: SGD+Momentum on the problem cases above - local minima, saddle points, poor conditioning, gradient noise]
Momentum update: build up "velocity" as a running mean of gradients, and take the actual step along the velocity rather than the raw gradient:
v_{t+1} = rho * v_t + grad f(x_t)
x_{t+1} = x_t - alpha * v_{t+1}
(rho gives "friction"; typically rho = 0.9 or 0.99)
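A minimal numpy sketch of the momentum update above, on the same kind of toy ill-conditioned quadratic (loss and hyperparameter values are illustrative).

import numpy as np

def grad(x):                        # toy ill-conditioned quadratic loss
    return np.array([1.0, 50.0]) * x

x = np.array([-3.0, 1.0])
v = np.zeros_like(x)                # velocity
rho, learning_rate = 0.9, 0.01
for t in range(200):
    dx = grad(x)
    v = rho * v + dx                # accumulate a running mean of gradients
    x -= learning_rate * v          # step along the velocity
print(x)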
Nesterov Momentum
Momentum update: step in the direction of the current velocity plus the gradient at the current point.
Nesterov momentum: first step in the direction of the velocity, evaluate the gradient at that "look-ahead" point, and then combine the two.
Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2)”, 1983
Nesterov, “Introductory lectures on convex optimization: a basic course”, 2004
Sutskever et al, “On the importance of initialization and momentum in deep learning”, ICML 2013
Nesterov Momentum
Evaluate the gradient at the "look-ahead" point x_t + rho * v_t:
v_{t+1} = rho * v_t - alpha * grad f(x_t + rho * v_t)
x_{t+1} = x_t + v_{t+1}
Annoying: usually we want the update in terms of x_t and grad f(x_t). With the change of variables x~_t = x_t + rho * v_t, the update can be rearranged as
v_{t+1} = rho * v_t - alpha * grad f(x~_t)
x~_{t+1} = x~_t + v_{t+1} + rho * (v_{t+1} - v_t)
i.e. a standard momentum step plus a correction term involving the current and previous velocity.
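A sketch of Nesterov momentum in the rearranged form above (toy loss and hyperparameters are again illustrative); note the update is written purely in terms of the current variable and its gradient.

import numpy as np

def grad(x):                        # toy ill-conditioned quadratic loss
    return np.array([1.0, 50.0]) * x

x = np.array([-3.0, 1.0])
v = np.zeros_like(x)
rho, learning_rate = 0.9, 0.01
for t in range(200):
    old_v = v
    v = rho * v - learning_rate * grad(x)   # gradient at the current variable
    x += -rho * old_v + (1 + rho) * v       # rearranged Nesterov update
print(x)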
[Animation: SGD, SGD+Momentum, and Nesterov momentum trajectories on the same loss surface]
AdaGrad
Duchi et al, “Adaptive subgradient methods for online learning and stochastic optimization”, JMLR 2011
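AdaGrad keeps a per-parameter sum of squared gradients and divides each step by its square root, so steep directions are damped and shallow directions are relatively amplified. A minimal sketch with an illustrative toy loss and step size; one caveat visible here is that the sum only grows, so the effective step size decays toward zero over long training runs.

import numpy as np

def grad(x):                          # toy ill-conditioned quadratic loss
    return np.array([1.0, 50.0]) * x

x = np.array([-3.0, 1.0])
grad_squared = np.zeros_like(x)
learning_rate = 1.0                   # illustrative
for t in range(200):
    dx = grad(x)
    grad_squared += dx * dx                                    # accumulate squared gradients
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)   # per-parameter step size
print(x)  # note: the effective step size only shrinks over time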
RMSProp
[Animation: SGD, SGD+Momentum, and RMSProp trajectories on the same loss surface]
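RMSProp replaces AdaGrad's ever-growing sum with a leaky (exponentially decaying) average of squared gradients, so the effective step size does not shrink to zero; a minimal sketch with illustrative hyperparameters.

import numpy as np

def grad(x):                          # toy ill-conditioned quadratic loss
    return np.array([1.0, 50.0]) * x

x = np.array([-3.0, 1.0])
grad_squared = np.zeros_like(x)
decay_rate, learning_rate = 0.99, 0.02
for t in range(300):
    dx = grad(x)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
print(x)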
Adam (full form)
Combines momentum (first moment), bias correction, and AdaGrad / RMSProp-style second-moment scaling.
Bias correction compensates for the fact that the first and second moment estimates start at zero.
Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!
Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015
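A minimal sketch of the full Adam update with bias correction; beta1 and beta2 follow the recommendation above, while the learning rate is scaled up from the recommended 1e-3 only because the toy problem is tiny.

import numpy as np

def grad(x):                          # toy ill-conditioned quadratic loss
    return np.array([1.0, 50.0]) * x

x = np.array([-3.0, 1.0])
first_moment = np.zeros_like(x)
second_moment = np.zeros_like(x)
beta1, beta2, learning_rate = 0.9, 0.999, 0.1
for t in range(1, 1001):
    dx = grad(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx          # momentum
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx   # AdaGrad / RMSProp
    first_unbias = first_moment / (1 - beta1 ** t)                  # bias correction
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
print(x)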
[Animation: SGD, SGD+Momentum, RMSProp, and Adam trajectories on the same loss surface]
SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have
learning rate as a hyperparameter.
Learning rate decay over time:
step decay: e.g. decay the learning rate by half every few epochs
exponential decay: alpha = alpha_0 * exp(-k * t)
1/t decay: alpha = alpha_0 / (1 + k * t)
[Plot: training loss vs. epoch; learning rate decay shows up as sudden drops in the loss curve at the epochs where the rate is reduced]
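A minimal sketch of these decay schedules (step decay, exponential decay, 1/t decay); the initial rate, decay constant, and drop interval are illustrative.

import numpy as np

alpha_0 = 1e-2   # initial learning rate (illustrative)
k = 0.1          # decay constant (illustrative)

def step_decay(epoch, drop_every=10, factor=0.5):
    return alpha_0 * (factor ** (epoch // drop_every))    # halve every few epochs

def exponential_decay(epoch):
    return alpha_0 * np.exp(-k * epoch)                   # alpha = alpha_0 * e^(-k t)

def one_over_t_decay(epoch):
    return alpha_0 / (1 + k * epoch)                      # alpha = alpha_0 / (1 + k t)

for epoch in [0, 10, 50, 100]:
    print(epoch, step_decay(epoch), exponential_decay(epoch), one_over_t_decay(epoch))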
First-Order Optimization
(1) Use the gradient to form a linear approximation
(2) Step to minimize the approximation
[Plot: loss vs. w1, with the linear approximation at the current point]
Second-Order Optimization
(1) Use gradient and Hessian to form quadratic approximation
(2) Step to the minimum of the approximation
[Plot: loss vs. w1, with the quadratic approximation at the current point]
Second-Order Optimization
Second-order Taylor expansion:
J(theta) ~= J(theta_0) + (theta - theta_0)^T grad J(theta_0) + 1/2 (theta - theta_0)^T H (theta - theta_0)
Solving for the critical point we obtain the Newton parameter update:
theta* = theta_0 - H^{-1} grad J(theta_0)
No hyperparameters! No learning rate!
But: the Hessian has O(N^2) elements and inverting it takes O(N^3), with N = (tens or hundreds of) millions of parameters.
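A minimal sketch of a single Newton update on a tiny quadratic (the toy objective and its analytic gradient and Hessian are assumptions); for a quadratic the step lands exactly on the minimum, with no learning rate.

import numpy as np

# Toy quadratic loss: J(theta) = 0.5 * theta^T A theta - b^T theta
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])          # Hessian (constant for a quadratic)
b = np.array([1.0, -2.0])

theta = np.array([5.0, 5.0])
grad = A.dot(theta) - b                         # gradient at theta
theta_new = theta - np.linalg.solve(A, grad)    # Newton step: theta - H^{-1} grad
print(theta_new, np.linalg.solve(A, b))         # both equal the exact minimizer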
Quasi-Newton methods (BFGS being the most popular): instead of inverting the Hessian (O(N^3)), build up an approximation to the inverse Hessian with low-rank updates over time.
L-BFGS (Limited memory BFGS): does not even form or store the full inverse Hessian.
L-BFGS
- Usually works very well in full batch, deterministic mode: if you have a single, deterministic f(x) then L-BFGS will probably work very nicely
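In the full-batch, deterministic setting you can often just call an off-the-shelf implementation; a usage sketch assuming SciPy is available, with the Rosenbrock function standing in for a deterministic f(x).

import numpy as np
from scipy.optimize import minimize

def f(x):   # deterministic objective: the Rosenbrock function
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

result = minimize(f, x0=np.array([-1.5, 2.0]), method="L-BFGS-B")
print(result.x)   # close to the minimum at [1, 1]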
In practice:
- Adam is a good default choice in most cases
- If you can afford to do full batch updates then try out L-BFGS (and don't forget to disable all sources of noise)
Beyond Training Error
Better optimization algorithms help us reduce training loss, but what we really care about is error on unseen data. How do we reduce the gap between training and test performance?
Model Ensembles
1. Train multiple independent models
2. At test time average their results
Model Ensembles: Tips and Tricks
Instead of training independent models, use multiple
snapshots of a single model during training!
Loshchilov and Hutter, “SGDR: Stochastic gradient descent with restarts”, arXiv 2016
Huang et al, “Snapshot ensembles: train 1, get M for free”, ICLR 2017
Figures copyright Yixuan Li and Geoff Pleiss, 2017. Reproduced with permission.
Cyclic learning rate schedules can make this work even better!
Model Ensembles: Tips and Tricks
Instead of using the actual parameter vector, keep a moving average of the parameter vector and use that at test time (Polyak averaging).
Polyak and Juditsky, “Acceleration of stochastic approximation by averaging”, SIAM Journal on Control and Optimization, 1992.
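A minimal sketch of Polyak averaging inside a (noisy) training loop; the toy noisy gradient and the 0.995 decay are illustrative.

import numpy as np

def grad(x):                              # toy noisy quadratic: minimum at 0
    return x + 0.5 * np.random.randn(*x.shape)

x = np.array([3.0, -2.0])
x_test = x.copy()                         # moving average of parameters, used at test time
learning_rate = 0.1
for t in range(500):
    x -= learning_rate * grad(x)          # normal (noisy) SGD update
    x_test = 0.995 * x_test + 0.005 * x   # Polyak / exponential moving average
print(x, x_test)   # x_test is a smoother estimate near the optimum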
How to improve single-model performance?
Regularization
Regularization: Add term to loss
Total loss = data loss + lambda * R(W). In common use:
L2 regularization (weight decay): R(W) = sum over k, l of W_{k,l}^2
L1 regularization: R(W) = sum over k, l of |W_{k,l}|
Elastic net (L1 + L2): R(W) = sum over k, l of (beta * W_{k,l}^2 + |W_{k,l}|)
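A minimal sketch of adding L2 regularization (weight decay) to a toy loss and its gradient; the data, regularization strength, and step size are illustrative.

import numpy as np

np.random.seed(0)
X, y = np.random.randn(100, 5), np.random.randn(100)
W = np.zeros(5)
reg = 1e-2          # regularization strength lambda (illustrative)

for t in range(200):
    scores = X.dot(W)
    data_grad = X.T.dot(scores - y) / len(y)   # gradient of the data loss (MSE)
    reg_grad = 2 * reg * W                     # gradient of lambda * sum(W^2)
    W -= 0.1 * (data_grad + reg_grad)

loss = 0.5 * np.mean((X.dot(W) - y) ** 2) + reg * np.sum(W * W)
print(loss)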
Regularization: Dropout
In each forward pass, randomly set some neurons to zero
Probability of dropping is a hyperparameter; 0.5 is common
Srivastava et al, “Dropout: A simple way to prevent neural networks from overfitting”, JMLR 2014
Example forward pass with a 3-layer network using dropout (sketch below).
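The sketch is in the spirit of the example on the original slide, with illustrative layer sizes and p = 0.5 as the probability of keeping a unit; the backward pass and parameter update are omitted.

import numpy as np

p = 0.5  # probability of keeping a unit active; higher = less dropout

W1, b1 = np.random.randn(20, 100), np.zeros(100)
W2, b2 = np.random.randn(100, 100), np.zeros(100)
W3, b3 = np.random.randn(100, 10), np.zeros(10)

def train_step(x):
    H1 = np.maximum(0, x.dot(W1) + b1)
    U1 = np.random.rand(*H1.shape) < p   # first dropout mask
    H1 *= U1                             # drop!
    H2 = np.maximum(0, H1.dot(W2) + b2)
    U2 = np.random.rand(*H2.shape) < p   # second dropout mask
    H2 *= U2                             # drop!
    out = H2.dot(W3) + b3
    return out                           # backward pass / update omitted

print(train_step(np.random.randn(4, 20)).shape)   # (4, 10)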
Regularization: Dropout
How can this possibly be a good idea?
Forces the network to have a redundant representation;
Prevents co-adaptation of features
[Diagram: features such as "has an ear", "has a tail", "is furry", "has claws", "mischievous look" feed into the cat score; in each forward pass a random subset of them is dropped (X)]
Another interpretation: dropout is training a large ensemble of models that share parameters; each binary dropout mask is one model.
Dropout: Test time
Dropout makes the output random, so at test time we want to "average out" the randomness, i.e. approximate the integral over masks.
Consider a single neuron a with inputs x, y and weights w1, w2.
At test time we have: E[a] = w1*x + w2*y
During training (dropout with p = 0.5) we have:
E[a] = 1/4 (w1*x + w2*y) + 1/4 (w1*x + 0) + 1/4 (0 + w2*y) + 1/4 (0 + 0) = 1/2 (w1*x + w2*y)
At test time all neurons are always active, so multiply each activation by the dropout probability p; then the output at test time equals the expected output at training time.
Dropout Summary: drop units randomly at train time; at test time, scale the activations by p.
More common: “Inverted dropout”
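A minimal sketch of inverted dropout for one layer: scale the mask by 1/p at training time so the test-time code needs no change (shapes and p are illustrative).

import numpy as np

p = 0.5  # probability of keeping a unit active

def dropout_train(H):
    U = (np.random.rand(*H.shape) < p) / p   # mask scaled by 1/p at train time
    return H * U

def dropout_test(H):
    return H                                 # test time: no scaling needed

H = np.maximum(0, np.random.randn(4, 100))
print(dropout_train(H).mean(), dropout_test(H).mean())  # similar expected activation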
Regularization: A common pattern
Training: Add some kind of randomness
Testing: Average out the randomness (sometimes approximately)
Example: Batch Normalization. Training: normalize using stats from random minibatches. Testing: normalize using fixed (running-average) stats.
Regularization: Data Augmentation
[Pipeline: load image and label ("cat") -> transform image -> CNN -> compute loss]
Data Augmentation
Horizontal Flips
Data Augmentation
Random crops and scales
Training: sample random crops / scales
ResNet:
1. Pick random L in range [256, 480]
2. Resize training image, short side = L
3. Sample random 224 x 224 patch
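A minimal numpy sketch of the training-time random scale + crop sampling described above; the nearest-neighbor resize is an assumption for brevity (a real pipeline would use a proper image library).

import numpy as np

def random_resized_crop(img, crop=224, scale_range=(256, 480)):
    # img: (H, W, 3) uint8 array; resize so the short side is L, then take a random crop
    H, W, _ = img.shape
    L = np.random.randint(scale_range[0], scale_range[1] + 1)    # 1. pick random L in [256, 480]
    ratio = L / min(H, W)
    newH, newW = int(round(H * ratio)), int(round(W * ratio))
    rows = (np.arange(newH) / ratio).astype(int).clip(0, H - 1)  # 2. nearest-neighbor resize
    cols = (np.arange(newW) / ratio).astype(int).clip(0, W - 1)
    resized = img[rows][:, cols]
    y = np.random.randint(0, newH - crop + 1)                    # 3. sample random 224 x 224 patch
    x = np.random.randint(0, newW - crop + 1)
    return resized[y:y + crop, x:x + crop]

img = (np.random.rand(300, 400, 3) * 255).astype(np.uint8)
print(random_resized_crop(img).shape)   # (224, 224, 3)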
Data Augmentation: Color Jitter
Simple: randomize contrast and brightness.
More complex (as seen in [Krizhevsky et al. 2012], ResNet, etc.):
1. Apply PCA to all [R, G, B] pixel values in the training set
2. Sample a "color offset" along the principal component directions
3. Add the offset to all pixels of a training image
Data Augmentation
Get creative for your problem!
Random mix/combinations of:
- translation
- rotation
- stretching
- shearing
- lens distortions, … (go crazy)
Regularization: A common pattern
Training: Add random noise
Testing: Marginalize over the noise
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fractional Max Pooling
Stochastic Depth
Huang et al, “Deep Networks with Stochastic Depth”, ECCV 2016
Transfer Learning
“You need a lot of data if you want to train/use CNNs” - BUSTED
Donahue et al, “DeCAF: A Deep Convolutional Activation
Feature for Generic Visual Recognition”, ICML 2014
1. Train on Imagenet
[Network: Image -> Conv-64 x2 -> MaxPool -> Conv-128 x2 -> MaxPool -> Conv-256 x2 -> MaxPool -> Conv-512 x2 -> MaxPool -> Conv-512 x2 -> MaxPool -> FC-4096 -> FC-4096 -> FC-1000]
[Figure: steps 2 and 3 - with a small target dataset, reinitialize and train only the last FC layer and freeze the rest of the network; with a bigger dataset, finetune more of the top layers]
The earlier conv layers of the pretrained network are more generic; the later FC layers are more specific to the original task.

                        | very similar dataset          | very different dataset
very little data        | Use a linear classifier on    | You’re in trouble… Try a linear
                        | the top layer                 | classifier from different stages
quite a lot of data     | Finetune a few layers         | Finetune a larger number of layers
Transfer learning with CNNs is pervasive…
(it’s the norm, not an exception)
Object Detection (Fast R-CNN) and Image Captioning (CNN + RNN) both start from a CNN pretrained on ImageNet.
Takeaway for your projects and beyond:
Have some dataset of interest but it has < ~1M images?
1. Find a very large dataset that has similar data and train a big ConvNet there
2. Transfer learn to your dataset
Deep learning frameworks provide a “Model Zoo” of pretrained models so you don’t need to train your own.
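A hedged PyTorch/torchvision sketch of the "use a linear classifier on the top layer" strategy from the table above; it is not from the slides, and the pretrained ResNet-18, num_classes value, and dummy batch are illustrative assumptions.

import torch
import torch.nn as nn
import torchvision

num_classes = 10   # hypothetical: number of classes in your small dataset

model = torchvision.models.resnet18(pretrained=True)   # CNN pretrained on ImageNet
for param in model.parameters():
    param.requires_grad = False                         # freeze all pretrained layers

model.fc = nn.Linear(model.fc.in_features, num_classes) # reinitialize the last layer

# Only the new final layer is trained; use a small learning rate if you later
# unfreeze and finetune more layers.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 3, 224, 224)            # dummy batch standing in for your data
y = torch.randint(0, num_classes, (8,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
print(loss.item())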
Summary
- Optimization
- Momentum, RMSProp, Adam, etc
- Regularization
- Dropout, etc
- Transfer learning
- Use this for your projects!
Next time: Deep Learning Software!