Module_2

This Session

Training Aspects of CNN
- Activation Functions
- Dataset Preparation
- Data Preprocessing
- Weight Initialization

Activation Functions
Non-linearity Layer

Source: cs231n, Stanford University


Activation Functions: Sigmoid

sigmoid(x) = 1 / (1 + exp(-x)) squashes real numbers into the range [0, 1].

Problems:
- Sigmoids saturate and kill gradients: when the output is near 0 or 1, the local gradient is nearly zero, so almost no signal flows back through the neuron.
- Sigmoid outputs are not zero-centered: the inputs to the next layer are always positive, so the gradients on that layer's weights are always all positive or all negative (this is also why you want zero-mean data!).
- exp() is a bit compute expensive.

Source: cs231n, Stanford University
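Not from the slides: a minimal NumPy sketch of why saturation "kills" gradients; the sigmoid's local gradient sigmoid(x) * (1 - sigmoid(x)) collapses towards zero for large |x|.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # local gradient of the sigmoid

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(x))       # outputs squashed into (0, 1): never zero-centered
print(sigmoid_grad(x))  # ~4.5e-05 at |x| = 10: almost no gradient flows back
```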
Activation Functions: tanh

tanh(x) squashes real numbers into the range [-1, 1]. The tanh neuron is simply a scaled sigmoid neuron: tanh(x) = 2 * sigmoid(2x) - 1.

- Like the sigmoid neuron, its activations saturate.
- Unlike the sigmoid neuron, its output is zero-centered.
- In practice the tanh non-linearity is always preferred to the sigmoid non-linearity. [LeCun et al., 1991]

Source: http://cs231n.github.io
Activation Functions: ReLU

ReLU computes f(x) = max(0, x).

- ReLU converges about 6 times faster with stochastic gradient descent than sigmoid/tanh (Krizhevsky et al.).
- ReLU is simple compared to tanh/sigmoid, which involve expensive operations (exponentials, etc.).
- Dying ReLU problem: a large gradient flowing through a ReLU neuron can cause the weights to update in such a way that the neuron will never activate on any datapoint again.

[Krizhevsky et al., 2012]

Source: http://cs231n.github.io


Activation Functions: Leaky ReLU

Leaky ReLU keeps a small, non-zero slope (e.g. 0.01) in the negative region, so the neuron never fully "dies".

It has succeeded in some cases, but the results are not always consistent.

[Maas et al., 2013]

Source: http://cs231n.github.io
Activation Functions: Parametric ReLU (PReLU)

In PReLU, the slope in the negative region is treated as a parameter of each neuron and learnt from the data.

He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. IEEE International Conference on Computer Vision (ICCV).
Source: http://cs231n.github.io
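To make the ReLU family concrete, here is a minimal NumPy sketch (the 0.01 slope for Leaky ReLU and the per-neuron parameter `a` for PReLU are illustrative values, not prescriptions from the slides):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)             # f(x) = max(0, x)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)  # small fixed slope in the negative region

def prelu(x, a):
    # In PReLU the negative-region slope `a` is a learnable parameter
    # (e.g. one per channel), updated by backprop like any other weight.
    return np.where(x > 0, x, a * x)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))            # [ 0.     0.     0.    2. ]
print(leaky_relu(x))      # [-0.03  -0.005  0.    2. ]
print(prelu(x, a=0.25))   # [-0.75  -0.125  0.    2. ]
# Dying ReLU: if a neuron's pre-activations become negative for every input,
# relu() outputs 0 and its local gradient is 0, so its weights stop updating.
```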
Activation Functions: Maxout
The Maxout neuron (introduced by Goodfellow et al.) generalizes the ReLU and its leaky version.

The Maxout neuron computes the function max(w1^T x + b1, w2^T x + b2).

Both ReLU and Leaky ReLU are special cases of this form (for example, ReLU corresponds to w1 = 0 and b1 = 0).

Unlike ReLU neurons, it doubles the number of parameters per neuron.

[Goodfellow et al., 2013]

Source: http://cs231n.github.io
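A minimal NumPy sketch of a Maxout unit with two linear pieces (shapes and names are illustrative); it also checks the ReLU special case mentioned above:

```python
import numpy as np

def maxout(x, W1, b1, W2, b2):
    # Element-wise max of two affine functions of the input.
    return np.maximum(x @ W1 + b1, x @ W2 + b2)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 10))                      # batch of 4 inputs, 10 features
W1, b1 = rng.standard_normal((10, 5)), np.zeros(5)    # two sets of weights: 2x parameters
W2, b2 = rng.standard_normal((10, 5)), np.zeros(5)
print(maxout(x, W1, b1, W2, b2).shape)                # (4, 5)

# ReLU as a special case: with W1 = 0 and b1 = 0 the unit computes max(0, x @ W2 + b2).
print(np.allclose(maxout(x, np.zeros((10, 5)), np.zeros(5), W2, b2),
                  np.maximum(0, x @ W2 + b2)))        # True
```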
Activation Functions: ELU
Exponential Linear Unit (ELU)

- All the benefits of ReLU
- Negative saturation regime (compared with Leaky ReLU) adds some robustness to noise
- Computation requires exp()

Clevert, Djork-Arné, Thomas Unterthiner, and Sepp Hochreiter. "Fast and accurate deep network learning by exponential linear units (ELUs)." International Conference on Learning Representations (ICLR), 2016.
Activation Functions: SELU
Scaled Exponential Linear Unit (SELU)

SELU induces self-normalization: activations automatically converge towards zero mean and unit variance.

Klambauer, Günter, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. "Self-normalizing neural networks." In Advances in Neural Information Processing Systems (NIPS), 2017.
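A minimal NumPy sketch of ELU and SELU; the alpha/lambda constants are (approximately) the values reported in the SELU paper, quoted here from memory:

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for x > 0, alpha * (exp(x) - 1) otherwise: smoothly saturates towards -alpha
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, alpha=1.6732632423543772, lam=1.0507009873554805):
    # Scaled ELU; with these constants, activations are pushed towards
    # zero mean and unit variance (self-normalization, Klambauer et al. 2017).
    return lam * elu(x, alpha)

x = np.array([-5.0, -1.0, 0.0, 2.0])
print(elu(x))    # negative inputs saturate towards -1.0
print(selu(x))
```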
Activation Functions: Swish

Swish computes f(x) = x * sigmoid(beta * x), a self-gated activation; ReLU is a special case of Swish (as beta grows large).

[Figure in the original slides: CIFAR-10 accuracy comparison for Swish vs. other activations.]

Ramachandran et al. "Swish: a self-gated activation function." ICLR Workshops, 2018.
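A minimal sketch of Swish, assuming the self-gated form f(x) = x * sigmoid(beta * x); as beta grows, it approaches ReLU:

```python
import numpy as np

def swish(x, beta=1.0):
    return x / (1.0 + np.exp(-beta * x))   # x * sigmoid(beta * x)

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(swish(x))               # smooth and slightly non-monotonic near zero
print(swish(x, beta=10.0))    # already very close to ReLU: max(0, x)
```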


Activation Functions: abReLU
Average Biased Rectified Linear Unit (abReLU)

where A is the average of the input map.

abReLU ensures that neurons whose values exceed the average of all values in that layer are never dead.

Dubey, S.R. and Chakraborty, S., Average Biased ReLU Based CNN Descriptor for Improved Face
Retrieval. arXiv preprint arXiv:1804.02051, 2018.
Activation Functions: In Practice

- Use ReLU. Be careful with your learning rates.
- Try out Swish / Leaky ReLU / Maxout / ELU.
- Try out tanh, but don't expect much.
- Don't use sigmoid.
Dataset Preparation
Train/Val/Test sets
In General People Do: Train/Test
- Split data into train and test.
- Choose hyperparameters that work best on test data.

BAD: No idea how the algorithm will perform on new data.

K-Fold Validation
- Split data into folds.
- Try each fold as validation and average the results.

Useful for small datasets, but not used too frequently in deep learning.

Better Approach: Train/Val/Test sets
- Split data into train, val, and test.
- Choose hyperparameters on val and evaluate on test.

Division can be done based on the size of the dataset: roughly 10k examples or 10%, whichever is less, for each of the val and test sets; the rest goes into the train set.
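A minimal NumPy sketch of the split rule above (roughly 10k or 10%, whichever is less, for each of val and test); array names are illustrative:

```python
import numpy as np

def train_val_test_split(X, y, seed=0):
    n = len(X)
    holdout = min(10_000, n // 10)               # ~10k or 10%, whichever is less
    idx = np.random.default_rng(seed).permutation(n)
    val_idx, test_idx = idx[:holdout], idx[holdout:2 * holdout]
    train_idx = idx[2 * holdout:]                # the rest goes into the train set
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx]), (X[test_idx], y[test_idx])

X, y = np.arange(50_000).reshape(-1, 1), np.zeros(50_000)
train, val, test = train_val_test_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # 40000 5000 5000
```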
Data Preprocessing
Data Preprocessing

Common options: zero-center the data (subtract the mean), and optionally normalize (divide by the standard deviation).

If the inputs to a layer are always positive, the gradients on its weights within a single update are always all positive or all negative (this is also why you want zero-mean data!).

Source: cs231n, Stanford University
Data Preprocessing

In practice, for images only centering is preferred, e.g. consider CIFAR-10 with [32, 32, 3] images:
- Subtract the mean image (e.g. AlexNet)
  (mean image = [32, 32, 3] array)
- Subtract the per-channel mean (e.g. VGGNet, ResNet, etc.)
  (mean along each channel = 3 numbers)

Source: cs231n
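A minimal sketch of both centering options on a CIFAR-10-shaped batch; the statistics should be computed on the training set only and then reused at test time:

```python
import numpy as np

X_train = np.random.rand(1_000, 32, 32, 3).astype(np.float32)  # stand-in for CIFAR-10 images

# Option 1: subtract the mean image (AlexNet style) -> mean image is a [32, 32, 3] array
mean_image = X_train.mean(axis=0)
X_centered = X_train - mean_image

# Option 2: subtract the per-channel mean (VGGNet/ResNet style) -> just 3 numbers
channel_mean = X_train.mean(axis=(0, 1, 2))
X_centered_pc = X_train - channel_mean

print(mean_image.shape, channel_mean.shape)  # (32, 32, 3) (3,)
```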
Weight Initialization
Weight Initialization: Constant

Q: What happens when W = constant initialization is used?

- Every neuron will compute the same output and undergo the exact same parameter updates.
- There is no source of asymmetry between neurons if their weights are initialized to be the same.

Source: cs231n
Weight Initialization: Gaussian

First idea: small random numbers (Gaussian with zero mean and 1e-2 standard deviation).

Symmetry breaking: weights are different for different neurons.

Works ~okay for small networks, but causes problems in deeper networks: with such a small standard deviation, almost all activations become zero in the later layers -> gradient diminishing problem.

Increase the standard deviation to 1: almost all neurons become completely saturated, either -1 or +1 (with tanh), and the gradients will be all zero -> gradient diminishing problem.

Source: cs231n
Weight Initialization: Xavier [Glorot et al., 2010]

Calibrate the variances with 1/sqrt(fan_in), e.g. W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in).

Reasonable initialization. (The mathematical derivation assumes linear activations.)

Source: cs231n

Weight Initialization: Xavier Improved (He initialization)

ReLU zeroes out half of the activations, so compensate with an extra factor of 2: W = np.random.randn(fan_in, fan_out) * np.sqrt(2 / fan_in) [He et al., 2015].

Source: cs231n
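A minimal NumPy sketch, in the spirit of the experiments behind these slides, comparing how activation statistics behave across a deep stack of layers for small-Gaussian, Xavier, and He initialization (layer width, depth, and the tanh/ReLU choices are illustrative):

```python
import numpy as np

def layer_stats(init_fn, act=np.tanh, depth=10, width=500, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((1000, width))
    stds = []
    for _ in range(depth):
        W = init_fn(rng, width, width)
        x = act(x @ W)
        stds.append(x.std())      # spread of the activations at each layer
    return stds

small_gauss = lambda rng, fin, fout: 0.01 * rng.standard_normal((fin, fout))
xavier      = lambda rng, fin, fout: rng.standard_normal((fin, fout)) / np.sqrt(fin)
he          = lambda rng, fin, fout: rng.standard_normal((fin, fout)) * np.sqrt(2.0 / fin)

print(layer_stats(small_gauss)[-1])                          # ~0: activations collapse
print(layer_stats(xavier)[-1])                               # stays at a healthy scale (tanh)
print(layer_stats(he, act=lambda v: np.maximum(0, v))[-1])   # healthy scale with ReLU
```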
Proper initialization is an active area of research:

- Understanding the difficulty of training deep feedforward neural networks, Glorot and Bengio, 2010
- Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, Saxe et al., 2013
- Random walk initialization for training very deep feedforward networks, Sussillo and Abbott, 2014
- Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, He et al., 2015
- Data-dependent Initializations of Convolutional Neural Networks, Krähenbühl et al., 2015
- All you need is a good init, Mishkin and Matas, 2015

Source: cs231n
Things to Remember: Training CNN
- Activation functions: ReLU is common; Swish can be tried
- Data preparation: Train/Val/Test
- Data preprocessing: centering is common
- Weight initialization: Xavier Improved works well
Source: cs231n

Optimization
Mini-batch SGD

Loop:
1. Sample a batch of data
2. Forward prop it through the graph (network), get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient

Source: cs231n
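A runnable toy version of the four-step loop, using mini-batch SGD to fit a small linear model (the model and synthetic data are illustrative stand-ins for a CNN and real data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))
true_w = rng.standard_normal((20, 1))
y = X @ true_w + 0.1 * rng.standard_normal((1000, 1))

w = np.zeros((20, 1))                                # parameters
lr, batch_size = 1e-2, 64

for step in range(500):
    idx = rng.integers(0, len(X), batch_size)        # 1. sample a batch of data
    xb, yb = X[idx], y[idx]
    pred = xb @ w                                    # 2. forward prop, get the loss
    loss = np.mean((pred - yb) ** 2)
    grad = 2 * xb.T @ (pred - yb) / batch_size       # 3. backprop: gradient of the loss
    w -= lr * grad                                   # 4. update the parameters
print(loss, np.abs(w - true_w).max())                # loss near the noise floor, w close to true_w
```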
Stochastic Gradient Descent (SGD)
The procedure of repeatedly evaluating the gradient of the loss function and then performing a parameter update.

Vanilla (original) gradient descent update: x = x - learning_rate * dx

Source: cs231n
SGD: Problems

What if the loss changes quickly in one direction and slowly in another?
Very slow progress along the shallow dimension, jitter along the steep direction.

What if the loss function has a local minimum or saddle point?
Zero gradient: gradient descent gets stuck. Saddle points are much more common than local minima in high dimensions (Dauphin et al., "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization", 2014).

Our gradients come from minibatches, so they can be noisy!

Source: cs231n
SGD + Momentum

Momentum update:
v = rho * v + dx
x = x - learning_rate * v

- Builds up "velocity" as a running mean of gradients, giving a consistent gradient direction.
- rho acts like friction; typically rho = 0.9 or 0.99.

Source: cs231n
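A runnable toy example (not from the slides) of the two ideas above: an ill-conditioned quadratic where the loss changes quickly in one direction and slowly in another, and the momentum update from this slide applied to it (full gradients are used for simplicity):

```python
import numpy as np

A = np.diag([1.0, 50.0])          # loss changes slowly along dim 0, quickly along dim 1
loss = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

def run(use_momentum, steps=100, lr=0.02, rho=0.9):
    x, v = np.array([5.0, 5.0]), np.zeros(2)
    for _ in range(steps):
        if use_momentum:
            v = rho * v + grad(x)   # build up velocity along the consistent direction
            x = x - lr * v
        else:
            x = x - lr * grad(x)    # plain gradient descent
    return loss(x)

print(run(use_momentum=False))  # slow progress along the shallow dimension
print(run(use_momentum=True))   # momentum reaches a much lower loss in the same number of steps
```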
Nesterov Momentum

A "look ahead" variant of momentum: evaluate the gradient at the point the velocity would carry us to (x + rho * v) instead of at x, then update the velocity and position.

Sutskever et al., "On the importance of initialization and momentum in deep learning", ICML 2013.
Source: cs231n
AdaGrad

Adds element-wise scaling of the gradient based on the historical sum of squares in each dimension:

grad_squared += dx * dx
x -= learning_rate * dx / (sqrt(grad_squared) + 1e-7)

What happens to the step size over long time? The accumulated sum only grows, so the effective step keeps shrinking -> effective learning rate diminishing problem.

Duchi et al., "Adaptive subgradient methods for online learning and stochastic optimization", JMLR 2011.
Source: cs231n
RMSProp

RMSProp keeps a decaying running average of the squared gradients instead of AdaGrad's ever-growing sum:

grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
x -= learning_rate * dx / (sqrt(grad_squared) + 1e-7)

Tieleman and Hinton, 2012. Source: cs231n
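A minimal sketch contrasting the two updates above with a constant gradient, to show AdaGrad's shrinking effective step versus RMSProp's roughly constant one (values are illustrative):

```python
import numpy as np

def adagrad_step(x, dx, cache, lr=1e-2, eps=1e-7):
    cache += dx * dx                                # ever-growing sum of squared gradients
    return x - lr * dx / (np.sqrt(cache) + eps), cache

def rmsprop_step(x, dx, cache, lr=1e-2, decay=0.99, eps=1e-7):
    cache = decay * cache + (1 - decay) * dx * dx   # leaky (decaying) average instead
    return x - lr * dx / (np.sqrt(cache) + eps), cache

x_a = x_r = 1.0
cache_a = cache_r = 0.0
for t in range(1000):
    dx = 1.0                                        # constant gradient, for illustration only
    x_a, cache_a = adagrad_step(x_a, dx, cache_a)
    x_r, cache_r = rmsprop_step(x_r, dx, cache_r)
print(1.0 - x_a)  # AdaGrad moved ~0.6: its steps shrink like lr / sqrt(t)
print(1.0 - x_r)  # RMSProp moved ~10: its step stays near lr once the cache warms up
```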


Adam

Adam is sort of like RMSProp with momentum: it keeps an exponentially decaying first moment (mean) and second moment (uncentered variance) of the gradients.

Problem without bias correction:
Initially, second_moment = 0 and beta2 = 0.999, so after the first iteration second_moment is still close to zero, giving a very large step for the update of x.

Adam (with bias correction):
Bias correction compensates for the fact that the first and second moment estimates start at zero.

Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!

Kingma and Ba, "Adam: A method for stochastic optimization", ICLR 2015.
Source: cs231n
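A minimal sketch of Adam with bias correction as described above, plus a small calculation showing why the uncorrected first step is too large even for a tiny gradient (defaults follow the suggested beta1/beta2/learning rate):

```python
import numpy as np

def adam_step(x, dx, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * dx            # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * dx * dx       # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)                # bias correction: both moments start at zero
    v_hat = v / (1 - beta2 ** t)
    return x - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

dx = 0.01                                       # a tiny gradient on the first iteration
m1, v1 = 0.1 * dx, 0.001 * dx * dx              # uncorrected moments after step 1
print(1e-3 * m1 / np.sqrt(v1))                  # ~3e-3: about 3x the learning rate, regardless of |dx|
x, m, v = adam_step(0.0, dx, m=0.0, v=0.0, t=1)
print(x)                                        # ~-1e-3: the corrected step is the intended ~lr
```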
Learning Rate

SGD, SGD+Momentum, AdaGrad, RMSProp, and Adam all have the learning rate as a hyperparameter.

Q: Which one of these learning rates is best to use?
A: Decay the learning rate over time: start with a larger rate and reduce it according to a schedule (e.g. step decay).

Learning rate decay is more critical with SGD+Momentum and less common with Adam.

Source: cs231n
Optimizer and Learning Rate

In Practice:
- Adam is a good default choice in most cases.
- A learning rate with step decay is commonly used.

More optimizers: http://ruder.io/optimizing-gradient-descent/
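A minimal sketch of step decay for the learning rate (the drop factor and schedule are illustrative choices):

```python
def step_decay_lr(base_lr, epoch, drop=0.1, epochs_per_drop=30):
    # Multiply the learning rate by `drop` every `epochs_per_drop` epochs.
    return base_lr * (drop ** (epoch // epochs_per_drop))

for epoch in (0, 29, 30, 60, 90):
    print(epoch, step_decay_lr(1e-1, epoch))   # 0.1, 0.1, 0.01, 0.001, 0.0001
```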


Regularization

Image source: https://stackoverflow.com/questions/44909134/how-to-avoid-overfitting-on-a-simple-feed-forward-network/44985765
Regularization

Full loss = data loss + regularization loss:
L(W) = (1/N) * sum_i L_i(f(x_i, W), y_i) + lambda * R(W), where lambda is the regularization strength.

- Data loss: model predictions should match the training data.
- Regularization: prevent the model from doing too well on the training data.

Why regularize?
- Express preferences over weights
- Make the model simple so it works on test data
- Improve optimization by adding curvature

Source: cs231n
Regularization

In common use: L2 regularization R(W) = sum of W^2, L1 regularization R(W) = sum of |W|, and Elastic net (L1 + L2).

Which W to consider? Among weight vectors with the same data loss, L2 regularization likes to "spread out" the weights across all input dimensions rather than concentrate them on a few.

Source: cs231n
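A minimal sketch of the full loss with an L2 penalty, together with an illustrative "which W" example (the specific vectors are my own example, chosen in the spirit of the slide):

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])       # concentrated on one input dimension
w2 = np.array([0.25, 0.25, 0.25, 0.25])   # spread out over all input dimensions

# Both weight vectors give the same score, hence the same data loss...
print(w1 @ x, w2 @ x)                     # 1.0 1.0
# ...but the L2 penalty R(W) = sum(W**2) prefers the spread-out w2.
print(np.sum(w1**2), np.sum(w2**2))       # 1.0 0.25

def full_loss(data_loss, W, reg=1e-3):
    return data_loss + reg * np.sum(W * W)     # data loss + lambda * R(W)

print(full_loss(0.5, w1), full_loss(0.5, w2))  # w2 gives the lower regularized loss
```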
Dropout
Dropout
In each forward pass, randomly set some neurons to zero
Probability of dropping is a hyperparameter; 0.5 is common

Dropout: A simple way to prevent neural networks from overfitting [Srivastava JMLR 2014]
Dropout
How can this possibly be a good idea?
Dropout is training a large ensemble of models (that share parameters).

Intuition: successful conspiracies
- 50 people are planning a conspiracy.
- Strategy A: plan one big conspiracy involving all 50 people.
  Likely to fail: all 50 people need to play their parts correctly.
- Strategy B: plan 10 conspiracies, each involving 5 people.
  Likely to succeed!

Source: cs231n & JB Huang


Dropout: Test Time

Dropout must be compensated for at test time. With "inverted dropout", we drop and scale (by 1 / keep probability) at train time and don't do anything at test time.

Source: cs231n
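A minimal sketch of inverted dropout as described above: drop and scale at train time, do nothing at test time (the drop probability and array shape are illustrative):

```python
import numpy as np

drop_prob = 0.5                       # probability of dropping a unit (0.5 is common)
keep_prob = 1.0 - drop_prob

def dropout_forward(h, train=True):
    if not train:
        return h                      # test time: do nothing
    # Drop each unit with probability drop_prob and scale the survivors by 1/keep_prob,
    # so the expected activation at train time matches test time ("inverted dropout").
    mask = (np.random.rand(*h.shape) < keep_prob) / keep_prob
    return h * mask

h = np.ones((2, 4))
print(dropout_forward(h, train=True))    # a mix of 0.0 and 2.0 (scaled survivors)
print(dropout_forward(h, train=False))   # unchanged at test time
```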
DropConnect

Instead of zeroing activations, drop some of the connections (individual weights) at random.

Wan et al., "Regularization of Neural Networks using DropConnect", ICML 2013. Source: cs231n


Batch Normalization
Batch Normalization

You want zero-mean, unit-variance activations? Just make them so: consider a batch of activations at some layer. To make each dimension zero-mean and unit-variance, apply

x_hat = (x - E[x]) / sqrt(Var[x])

where the mean and variance are computed over the batch.

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift [Ioffe and Szegedy 2015]. Source: cs231n
Batch Normalization

Usually inserted after Fully Connected or Convolutional layers, and before the nonlinearity.

Problem: do we necessarily want a zero-mean, unit-variance input?

[Ioffe and Szegedy 2015]. Source: cs231n
Batch Normalization

Normalize:
x_hat = (x - E[x]) / sqrt(Var[x])

And then allow the network to squash the range if it wants to:
y = gamma * x_hat + beta

where gamma and beta are learnable parameters, so the network can even recover the identity mapping.

[Ioffe and Szegedy 2015]. Source: cs231n
Batch Normalization
- Improves gradient flow through the network
- Allows higher learning rates
- Reduces the strong dependence on initialization
- Acts as a form of regularization

[Ioffe and Szegedy 2015]. Source: cs231n
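A minimal sketch of the batch-norm forward pass at training time, following the normalize-then-rescale equations above (eps is the usual small constant for numerical stability):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                     # per-dimension mean over the batch
    var = x.var(axis=0)                     # per-dimension variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero-mean, unit-variance
    return gamma * x_hat + beta             # learnable rescale/shift

x = np.random.randn(128, 64) * 3.0 + 5.0    # a batch of 128 activations, 64 dimensions
out = batchnorm_forward(x, gamma=np.ones(64), beta=np.zeros(64))
print(out.mean(axis=0).round(3)[:4], out.std(axis=0).round(3)[:4])  # ~0 and ~1 per dimension
# At test time, mu and var would instead come from running averages kept during training.
```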
Batch Normalization
Note: at test time the BatchNorm layer functions differently:
- The mean/std are not computed from the batch.
- Instead, a single fixed empirical mean/std of activations from training is used (e.g. estimated during training with running averages).

[Ioffe and Szegedy 2015]. Source: cs231n
Batch Normalization: Recent Trends

Layer Normalization:
Ba, Kiros, and Hinton, arXiv 2016

Instance Normalization:
Ulyanov et al., Improved Texture Networks: Maximizing Quality and Diversity in Feed-forward Stylization and Texture Synthesis, CVPR 2017

Group Normalization:
Wu and He, arXiv 2018 (appeared 3/22/2018)

Decorrelated Batch Normalization:
Huang et al., arXiv 2018 (appeared 4/23/2018)
Data Augmentation
Data Augmentation (Jittering)

Create virtual training samples; get creative for your problem!
- Horizontal flips
- Random crops and scales
- Color casting
- Randomized contrast
- Randomized brightness
- Geometric distortion
- Rotation
- Photometric changes

Deep Image [Wu et al. 2015]
Source: cs231n
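A minimal NumPy sketch of two of the jittering operations listed above, horizontal flips and random crops (the crop size and image shape are illustrative):

```python
import numpy as np

def random_flip_and_crop(img, crop=28, rng=None):
    # img: H x W x C array. Randomly flip horizontally, then take a random crop.
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < 0.5:
        img = img[:, ::-1, :]                        # horizontal flip
    h, w, _ = img.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    return img[top:top + crop, left:left + crop, :]

img = np.random.rand(32, 32, 3)                      # CIFAR-10-sized image
print(random_flip_and_crop(img).shape)               # (28, 28, 3)
```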
