This Session
Training Aspects of CNNs
Activation Functions
Dataset Preparation
Data Preprocessing
Weight Initialization
Activation Functions
Non-linearity Layer
Source: cs231n, Stanford University
Activation Functions: Sigmoid
sigmoid(x) = 1 / (1 + exp(-x)), squashes numbers to the range [0, 1]
- Sigmoids saturate and kill gradients: when the neuron's output is near 0 or 1, the local gradient is nearly zero.
- Sigmoid outputs are not zero-centered: the inputs to the next layer are then always all positive, so the gradients on its weights are always all positive or all negative (this is also why you want zero-mean data!).
- exp() is a bit compute-expensive.
Source: cs231n, Stanford University
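A minimal NumPy sketch (my own illustration, not from the slides) of why sigmoid saturation kills gradients: the local gradient sigmoid(x)·(1 − sigmoid(x)) peaks at 0.25 and collapses towards zero for large |x|.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # local gradient, at most 0.25 (at x = 0)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.4f}  d(sigmoid)/dx={sigmoid_grad(x):.6f}")
# at x = 10 the local gradient is ~4.5e-05: the gradient is effectively "killed"
```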
Activation Functions: tanh
- The tanh neuron is simply a scaled sigmoid neuron: tanh(x) = 2·sigmoid(2x) − 1, squashing numbers to the range [−1, 1].
- Like the sigmoid neuron, its activations saturate.
- Unlike the sigmoid neuron, its output is zero-centered.
- In practice the tanh non-linearity is always preferred to the sigmoid non-linearity.
[LeCun et al., 1991]
Source: http://cs231n.github.io
Activation Functions: ReLU
ReLU (Rectified Linear Unit): f(x) = max(0, x)
- ReLU speeds up the convergence of stochastic gradient descent by about 6x compared to sigmoid/tanh (Krizhevsky et al.).
- ReLU is cheap to compute compared to tanh/sigmoid, which involve expensive operations (exponentials, etc.).
- Dying ReLU problem: a large gradient flowing through a ReLU neuron can update the weights in such a way that the neuron never activates on any datapoint again.
[Krizhevsky et al., 2012]
Source: http://cs231n.github.io
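A short NumPy sketch (my own, not from the slides) of ReLU and its gradient. Note that the backward pass is exactly zero wherever the input was negative, which is the mechanism behind the dying-ReLU problem.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x, upstream):
    # gradient flows only where the input was positive
    return upstream * (x > 0)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))                        # [0.  0.  0.  0.5 3. ]
print(relu_grad(x, np.ones_like(x)))  # [0. 0. 0. 1. 1.]
# If a bad update pushes all of a neuron's pre-activations negative, every gradient
# above is zero and the neuron can no longer recover: it is "dead".
```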
Activation Functions: Leaky ReLU
Leaky ReLU: f(x) = max(αx, x) with a small fixed slope α (e.g. 0.01) in the negative region, so the unit never has exactly zero gradient.
- Succeeded in some cases, but the results are not always consistent.
[Maas et al., 2013]
Source: http://cs231n.github.io
Activation Functions: Parametric ReLU (PReLU)
In PReLU, the slope in the negative region is treated as a parameter of each neuron and learnt from the data.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. IEEE International Conference on Computer Vision (ICCV).
Source: http://cs231n.github.io
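A minimal NumPy sketch (my own; treating the slope as one learnable scalar is a simplifying assumption, the He et al. formulation uses one slope per channel) contrasting Leaky ReLU's fixed slope with PReLU's learnable slope.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # fixed small negative slope

def prelu(x, alpha):
    # alpha is a learnable parameter, updated by backprop like any weight
    return np.where(x > 0, x, alpha * x)

def prelu_grad_alpha(x, upstream):
    # gradient of the loss w.r.t. the learnable slope: nonzero only where x <= 0
    return np.sum(upstream * np.where(x > 0, 0.0, x))

x = np.array([-2.0, -0.5, 1.5])
print(leaky_relu(x))        # [-0.02  -0.005  1.5 ]
print(prelu(x, 0.25))       # [-0.5   -0.125  1.5 ]
```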
Activation Functions: Maxout
- The Maxout neuron (introduced by Goodfellow et al.) generalizes the ReLU and its leaky version.
- The Maxout neuron computes the function: max(w1ᵀx + b1, w2ᵀx + b2)
- Both ReLU and Leaky ReLU are special cases of this form (for example, ReLU corresponds to w1 = 0 and b1 = 0).
- Unlike ReLU neurons, it doubles the number of parameters per neuron.
[Goodfellow et al., 2013]
Source: http://cs231n.github.io
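A minimal NumPy sketch (my own illustration) of a Maxout layer: each output unit keeps the element-wise maximum over two affine transforms of the input, which is why the parameter count doubles relative to ReLU.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)                      # input vector with 8 features

# two sets of weights/biases per output unit -> 2x the parameters of a ReLU layer
W1, b1 = rng.standard_normal((4, 8)), np.zeros(4)
W2, b2 = rng.standard_normal((4, 8)), np.zeros(4)

maxout = np.maximum(W1 @ x + b1, W2 @ x + b2)   # element-wise max of the two affine maps
relu_as_special_case = np.maximum(0.0, W2 @ x + b2)  # recovered when W1 = 0, b1 = 0
print(maxout.shape)                             # (4,)
```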
Activation Functions: ELU (Exponential Linear Unit)
ELU: f(x) = x for x > 0, and α·(exp(x) − 1) for x ≤ 0
- All benefits of ReLU
- Negative saturation regime (compared with Leaky ReLU) adds some robustness to noise
- Computation requires exp()
Clevert, Djork-Arné, Thomas Unterthiner, and Sepp Hochreiter. "Fast and accurate deep network learning by exponential linear units (ELUs)." International Conference on Learning Representations (ICLR), 2016.
Activation Functions: SELU
Scaled Exponential Linear Unit (SELU): a scaled ELU whose fixed constants induce self-normalization, so activations automatically converge towards zero mean and unit variance across layers.
Klambauer, Günter, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. "Self-normalizing neural networks." Advances in Neural Information Processing Systems (NIPS), 2017.
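A short NumPy sketch (my own; the SELU constants below are the approximate fixed values from the Klambauer et al. paper) of ELU and SELU.

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, alpha=1.6732632, scale=1.0507010):
    # scaled ELU; with these constants activations tend towards zero mean / unit variance
    return scale * elu(x, alpha)

x = np.linspace(-3, 3, 7)
print(np.round(elu(x), 3))
print(np.round(selu(x), 3))
```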
Activation Functions: Swish
Swish: f(x) = x · sigmoid(βx)
- ReLU is a special (limiting) case of Swish, recovered as β → ∞
- (Figure in the original slide: CIFAR-10 accuracy comparison across activation functions)
Ramachandran et al. "Swish: a self-gated activation function." ICLR Workshops, 2018.
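A tiny NumPy sketch (my own) of Swish: with β = 1 this is the x·sigmoid(x) form, and as β grows it approaches ReLU.

```python
import numpy as np

def swish(x, beta=1.0):
    return x / (1.0 + np.exp(-beta * x))    # x * sigmoid(beta * x)

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(np.round(swish(x), 4))                # smooth, slightly negative for small negative x
print(np.round(swish(x, beta=50.0), 4))     # close to ReLU: ~[0, 0, 0, 1, 4]
```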
Activation Functions: abReLU
Average Biased Rectified Linear Unit (abReLU), where A is the average of the input map.
abReLU ensures that neurons producing values greater than the average of all values in that layer are not dead.
Dubey, S.R. and Chakraborty, S. "Average Biased ReLU Based CNN Descriptor for Improved Face Retrieval." arXiv preprint arXiv:1804.02051, 2018.
Activation Functions: In Practice
- Use ReLU. Be careful with your learning rates.
- Try out Swish / Leaky ReLU / Maxout / ELU.
- Try out tanh, but don't expect much.
- Don't use sigmoid.
Dataset Preparation
Train/Val/Test sets
In General People Do: Train/Test
- Split data into train and test
- Choose hyperparameters that work best on the test data
BAD: No idea how the algorithm will perform on new data.
K-Fold Cross-Validation
- Split data into folds
- Try each fold as validation and average the results
Useful for small datasets, but not used too frequently in deep learning.
Better Approach: Train/Val/Test Sets
- Split data into train, val, and test
- Choose hyperparameters on val and evaluate once on test
Division can be based on the size of the dataset: roughly 10k examples or 10% of the data, whichever is less, for each of the val and test sets; the rest goes into the train set.
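A minimal NumPy sketch (my own; the holdout size follows the 10%-or-10k heuristic above) of a random train/val/test split.

```python
import numpy as np

def train_val_test_split(n_samples, seed=0):
    """Return index arrays for train/val/test using the 10k-or-10% heuristic."""
    holdout = min(10_000, int(0.1 * n_samples))       # size of val set (and of test set)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    val, test, train = idx[:holdout], idx[holdout:2 * holdout], idx[2 * holdout:]
    return train, val, test

train_idx, val_idx, test_idx = train_val_test_split(60_000)
print(len(train_idx), len(val_idx), len(test_idx))    # 48000 6000 6000
```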
Data Preprocessing
If the inputs to a neuron are always all positive (or all negative), the gradients on its weights are also always all positive or all negative; this is also why you want zero-mean data.
In practice, for images only centering is preferred. E.g. consider CIFAR-10 with [32, 32, 3] images:
- Subtract the mean image (e.g. AlexNet); the mean image is a [32, 32, 3] array
- Subtract the per-channel mean (e.g. VGGNet, ResNet, etc.); the mean along each channel is 3 numbers
Source: cs231n
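A short NumPy sketch (my own; the array shapes follow the CIFAR-10 example above, and the data is random placeholder data) of the two common centering schemes.

```python
import numpy as np

# fake CIFAR-10-like batch: N images of shape [32, 32, 3]
X_train = np.random.default_rng(0).uniform(0, 255, size=(5_000, 32, 32, 3)).astype(np.float32)

# Option 1: subtract the mean image (AlexNet-style); mean image has shape [32, 32, 3]
mean_image = X_train.mean(axis=0)
X_centered_img = X_train - mean_image

# Option 2: subtract the per-channel mean (VGG/ResNet-style); 3 numbers
mean_per_channel = X_train.mean(axis=(0, 1, 2))       # shape (3,)
X_centered_ch = X_train - mean_per_channel

# The same training-set statistics must be reused for the val/test data.
```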
Weight Initialization
Weight Initialization: Constant
Q: What happens when a constant initialization (W = constant) is used?
- Every neuron computes the same output and undergoes exactly the same parameter updates.
- There is no source of asymmetry between neurons if their weights are initialized to be the same.
Source: cs231n
Weight Initialization: Gaussian
First idea: small random numbers (Gaussian with zero mean and 1e-2 standard deviation).
- Symmetry breaking: weights are different for different neurons.
- Works ~okay for small networks, but causes problems in deeper networks: almost all activations in the later layers become zero -> gradient diminishing problem.
- Increasing the standard deviation to 1 does not help: almost all neurons saturate completely at either -1 or 1, so the gradients are all zero -> again a gradient diminishing problem.
Source: cs231n
Weight Initialization: Xavier [Glorot et al., 2010]
- Calibrate the variances by scaling the random weights with 1/sqrt(fan_in).
- A reasonable initialization (the mathematical derivation assumes linear activations).
Source: cs231n
Weight Initialization: XavierImproved (He initialization)
- Xavier initialization breaks down with ReLU non-linearities; He et al. (2015) fix this by scaling with sqrt(2/fan_in) instead (note the additional factor of 2).
Source: cs231n
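A minimal NumPy sketch (my own, using the convention of W with shape [fan_in, fan_out]) of the three initializations discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 512, 512

W_small  = 0.01 * rng.standard_normal((fan_in, fan_out))             # small Gaussian: activations shrink towards 0 in deep nets
W_xavier = rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)  # Xavier: keeps variance roughly constant (linear/tanh)
W_he     = rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)  # He: extra factor of 2 accounts for ReLU

# quick check: propagate a unit-variance input through one layer + ReLU
x = rng.standard_normal(fan_in)
for name, W in [("small", W_small), ("xavier", W_xavier), ("he", W_he)]:
    h = np.maximum(0.0, x @ W)
    print(f"{name:6s} std of ReLU activations: {h.std():.3f}")
```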
Proper initialization is an active area of research:
- Understanding the difficulty of training deep feedforward neural networks, Glorot and Bengio, 2010
- Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, Saxe et al., 2013
- Random walk initialization for training very deep feedforward networks, Sussillo and Abbott, 2014
- Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, He et al., 2015
- Data-dependent Initializations of Convolutional Neural Networks, Krähenbühl et al., 2015
- All you need is a good init, Mishkin and Matas, 2015
Source: cs231n
Things to Remember
Training CNNs:
- Activation functions: ReLU is common, Swish can be tried
- Data preparation: Train/Val/Test split
- Data preprocessing: centering is common
- Weight initialization: XavierImproved (He initialization) works well
Optimization
Source: cs231n
Mini-batch SGD
Loop:
1. Sample a batch of data
2. Forward-propagate it through the network to get the loss
3. Backpropagate to calculate the gradients
4. Update the parameters using the gradients
Source: cs231n
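A compact, runnable sketch of this loop (my own; the tiny linear-regression model and synthetic data are placeholders standing in for a real CNN).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))
y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(1000)   # synthetic regression data

w = np.zeros(10)
learning_rate, batch_size = 1e-2, 64

for step in range(500):
    # 1. sample a batch of data
    idx = rng.integers(0, len(X), size=batch_size)
    xb, yb = X[idx], y[idx]
    # 2. forward pass: predictions and loss
    pred = xb @ w
    loss = np.mean((pred - yb) ** 2)
    # 3. backward pass: gradient of the loss w.r.t. the weights
    grad = 2.0 / batch_size * xb.T @ (pred - yb)
    # 4. parameter update (vanilla SGD)
    w -= learning_rate * grad
```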
Stochastic Gradient Descent (SGD)
The procedure of repeatedly evaluating the gradient of the loss function (on mini-batches) and then performing a parameter update.
Vanilla (original) gradient descent update: x = x − learning_rate · ∇f(x)
Source: cs231n
SGD: Problems
- What if the loss changes quickly in one direction and slowly in another? Very slow progress along the shallow dimension, jitter along the steep direction.
- What if the loss function has a local minimum or saddle point? Zero gradient, so gradient descent gets stuck. Saddle points are much more common than local minima in high-dimensional, non-convex optimization.
- Our gradients come from minibatches, so they can be noisy!
Source: cs231n
SGD + Momentum
- Maintain a "velocity" as a running mean of gradients: v = ρ·v + ∇f(x), then x = x − learning_rate·v
- The momentum coefficient ρ gives "friction"; typically ρ = 0.9 or 0.99
- Builds up a consistent gradient direction, damping oscillations and rolling through local minima and saddle points
Source: cs231n
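A minimal sketch (my own, following the formulation above) of one SGD+Momentum update.

```python
import numpy as np

def sgd_momentum_step(x, grad, v, learning_rate=1e-2, rho=0.9):
    """One SGD+Momentum update: v accumulates a decaying sum of past gradients."""
    v = rho * v + grad
    x = x - learning_rate * v
    return x, v

# usage: keep the velocity across iterations
x = np.array([1.0, -2.0])
v = np.zeros_like(x)
for _ in range(3):
    grad = 2 * x               # gradient of f(x) = ||x||^2
    x, v = sgd_momentum_step(x, grad, v)
print(x)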
Nesterov Momentum
- "Look-ahead" momentum: evaluate the gradient at the point the velocity would carry us to, then update: v = ρ·v − learning_rate·∇f(x + ρ·v), x = x + v
Sutskever et al., "On the importance of initialization and momentum in deep learning", ICML 2013
Source: cs231n
AdaGrad
- Adds element-wise scaling of the gradient based on the historical sum of squared gradients in each dimension.
- What happens to the step size over a long time? The accumulated sum only grows, so the step size keeps shrinking -> effective learning-rate diminishing problem.
Duchi et al., "Adaptive subgradient methods for online learning and stochastic optimization", JMLR 2011
Source: cs231n
RMSProp
- RMSProp fixes AdaGrad's shrinking step size by replacing the sum of squared gradients with an exponentially decaying moving average (decay rate, e.g. 0.99).
Tieleman and Hinton, 2012
Source: cs231n
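A minimal sketch (my own) contrasting the two accumulators.

```python
import numpy as np

def adagrad_step(x, grad, cache, learning_rate=1e-2, eps=1e-7):
    cache = cache + grad ** 2                        # historical sum of squares: only grows
    x = x - learning_rate * grad / (np.sqrt(cache) + eps)
    return x, cache

def rmsprop_step(x, grad, cache, learning_rate=1e-2, decay=0.99, eps=1e-7):
    cache = decay * cache + (1 - decay) * grad ** 2  # leaky average: step size no longer decays to zero
    x = x - learning_rate * grad / (np.sqrt(cache) + eps)
    return x, cache
```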
Adam
- Adam is sort of like RMSProp with Momentum: it keeps a first moment (momentum-style average of gradients) and a second moment (RMSProp-style average of squared gradients).
- Problem without bias correction: initially second_moment = 0 and beta2 = 0.999, so after the first iteration the second moment is still close to zero, which makes the very first update steps for x very large.
Kingma and Ba, "Adam: A method for stochastic optimization", ICLR 2015
Source: cs231n
Adam (with Bias Correction)
- Combines the Momentum (first moment) and AdaGrad/RMSProp (second moment) updates.
- Bias correction compensates for the fact that the first and second moment estimates start at zero.
- Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!
Kingma and Ba, "Adam: A method for stochastic optimization", ICLR 2015
Source: cs231n
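A minimal sketch (my own, following the standard formulation) of the full Adam update with bias correction.

```python
import numpy as np

def adam_step(x, grad, m, v, t, learning_rate=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration counter used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (RMSProp-style)
    m_hat = m / (1 - beta1 ** t)                 # bias correction: estimates start at zero
    v_hat = v / (1 - beta2 ** t)
    x = x - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

x = np.array([1.0, -2.0])
m, v = np.zeros_like(x), np.zeros_like(x)
for t in range(1, 4):
    x, m, v = adam_step(x, 2 * x, m, v, t)       # gradient of ||x||^2
```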
Learning Rate
- SGD, SGD+Momentum, AdaGrad, RMSProp, and Adam all have the learning rate as a hyperparameter.
- Q: Which one of these learning rates is best to use? (The original slides show loss curves for low, high, very high, and good learning rates.)
- In practice, decay the learning rate over time. Decay is more critical with SGD+Momentum and less common with Adam.
Source: cs231n
Optimizer and Learning Rate
In practice:
- Adam is a good default choice in most cases
- A learning rate with step decay is commonly used
More optimizers: http://ruder.io/optimizing-gradient-descent/
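A small sketch of step decay (my own; the drop interval and factor are illustrative assumptions).

```python
def step_decay_lr(base_lr, epoch, drop_every=30, drop_factor=0.1):
    """Multiply the learning rate by drop_factor every `drop_every` epochs."""
    return base_lr * (drop_factor ** (epoch // drop_every))

for epoch in [0, 29, 30, 60, 90]:
    print(epoch, step_decay_lr(1e-3, epoch))   # 1e-3, 1e-3, 1e-4, 1e-5, 1e-6
```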
Regularization
Image Source: https://stackoverflow.com/questions/44909134/how-to-avoid-overfitting-on-a-simple-feed-forward-network/44985765
Regularization
The total loss has two terms: a data loss (model predictions should match the training data) and a regularization term (prevent the model from doing too well on the training data):
L = (1/N) Σᵢ Lᵢ(f(xᵢ, W), yᵢ) + λR(W)
Why regularize?
- Express preferences over weights
- Make the model simple so it works on test data
- Improve optimization by adding curvature
Source: cs231n
Regularization
Common regularizers: L2 regularization R(W) = Σ W², L1 regularization R(W) = Σ |W|, Elastic net (L1 + L2).
Which W to consider? Given weights that produce the same predictions, L2 regularization likes to "spread out" the weights, preferring many small weights over a few large ones.
Source: cs231n
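A tiny sketch (my own) of adding an L2 penalty to a data loss and its effect on the gradient.

```python
import numpy as np

def loss_with_l2(data_loss, W, lam=1e-4):
    reg_loss = lam * np.sum(W ** 2)          # R(W) = sum of squared weights
    return data_loss + reg_loss

def grad_with_l2(data_grad, W, lam=1e-4):
    return data_grad + 2 * lam * W           # L2 adds a "weight decay" pull towards zero

W = np.array([[0.25, 0.25], [0.25, 0.25]])
print(loss_with_l2(1.0, W))
print(grad_with_l2(np.zeros_like(W), W))
```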
Dropout
Dropout
- In each forward pass, randomly set some neurons to zero.
- The probability of dropping is a hyperparameter; 0.5 is common.
Dropout: A simple way to prevent neural networks from overfitting [Srivastava et al., JMLR 2014]
Dropout
How can this possibly be a good idea?
- Dropout trains a large ensemble of models (that share parameters); each binary mask defines one sub-network.
- Intuition: successful conspiracies. 50 people are planning a conspiracy.
  Strategy A: plan one big conspiracy involving all 50 people. Likely to fail: all 50 people need to play their parts correctly.
  Strategy B: plan 10 conspiracies, each involving 5 people. Likely to succeed!
Source: cs231n & JB Huang
Dropout: Test Time
- At test time we want deterministic outputs, so we approximate the ensemble average.
- With "inverted dropout" we drop and scale (divide by the keep probability) at train time and don't do anything at test time.
Source: cs231n
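A minimal NumPy sketch (my own, following the inverted-dropout convention above).

```python
import numpy as np

def dropout_forward(h, p_drop=0.5, train=True):
    """Inverted dropout: scale at train time so test time needs no change."""
    if not train:
        return h                                      # do nothing at test time
    keep = 1.0 - p_drop
    mask = (np.random.rand(*h.shape) < keep) / keep   # drop and rescale the survivors
    return h * mask

h = np.ones((2, 4))
print(dropout_forward(h, p_drop=0.5, train=True))     # zeros and 2.0s, expectation stays 1
print(dropout_forward(h, train=False))                # unchanged at test time
```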
DropConnect
- Instead of zeroing activations, randomly drop individual connections (weights) during training.
Wan et al., "Regularization of Neural Networks using DropConnect", ICML 2013
Source: cs231n
Batch Normalization
Batch Normalization
- Want zero-mean, unit-variance activations? Consider a batch of activations at some layer. To make each dimension zero-mean and unit-variance, apply: x̂ = (x − E[x]) / sqrt(Var[x])
- Usually inserted after Fully Connected or Convolutional layers, and before the non-linearity.
- Problem: do we necessarily want a zero-mean, unit-variance input?
- Solution: normalize, and then allow the network to squash and shift the range if it wants to: y = γ·x̂ + β, where γ and β are learnable (the network can even recover the identity mapping).
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift [Ioffe and Szegedy, 2015]
Source: cs231n
Batch Normalization
- Improves gradient flow through the network
- Allows higher learning rates
- Reduces the strong dependence on initialization
- Acts as a form of regularization
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift [Ioffe and Szegedy, 2015]
Source: cs231n
Batch Normalization
Note: at test time the BatchNorm layer functions differently. The mean/std are not computed from the batch; instead, a single fixed empirical mean/std of activations, estimated during training (e.g. with running averages), is used.
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift [Ioffe and Szegedy, 2015]
Source: cs231n
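A minimal NumPy sketch (my own) of a batch-norm forward pass for a fully connected layer, with running statistics for test time.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      train=True, momentum=0.9, eps=1e-5):
    """x: [N, D] batch. gamma/beta: learnable scale and shift, shape [D]."""
    if train:
        mu, var = x.mean(axis=0), x.var(axis=0)
        # update running statistics for use at test time
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var           # fixed statistics at test time
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, running_mean, running_var

x = np.random.default_rng(0).standard_normal((32, 4)) * 3 + 5
out, rm, rv = batchnorm_forward(x, np.ones(4), np.zeros(4), np.zeros(4), np.ones(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ~0 and ~1 per dimension
```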
Batch Normalization: Recent Trends
- Layer Normalization: Ba, Kiros, and Hinton, "Layer Normalization", arXiv 2016
- Instance Normalization: Ulyanov et al., "Improved Texture Networks: Maximizing Quality and Diversity in Feed-forward Stylization and Texture Synthesis", CVPR 2017
- Group Normalization: Wu and He, "Group Normalization", arXiv 2018 (appeared 3/22/2018)
- Decorrelated Batch Normalization: Huang et al., "Decorrelated Batch Normalization", arXiv 2018 (appeared 4/23/2018)
Data Augmentation
Data Augmentation (Jittering)
- Horizontal flips
- Random crops and scales
Source: cs231n
Data Augmentation (Jittering)
Create virtual training samples; get creative for your problem!
- Horizontal flip
- Random crop
- Color casting
- Randomize contrast
- Randomize brightness
- Geometric distortion
- Rotation
- Photometric changes
Deep Image [Wu et al. 2015]
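A minimal NumPy sketch (my own) of two of the jitters listed above, random horizontal flip and random crop, applied to an image array of shape [H, W, 3].

```python
import numpy as np

rng = np.random.default_rng(0)

def random_horizontal_flip(img, p=0.5):
    return img[:, ::-1, :] if rng.random() < p else img

def random_crop(img, crop_h, crop_w):
    h, w, _ = img.shape
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w, :]

img = rng.integers(0, 256, size=(40, 40, 3), dtype=np.uint8)   # dummy image
aug = random_crop(random_horizontal_flip(img), 32, 32)
print(aug.shape)   # (32, 32, 3)
```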