Winter1516 Lecture 5

[Figure: data cloud in input space, with an active ReLU (crossing the data) and a dead ReLU (outside the data)]

dead ReLU:
will never activate
=> never update
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 41 20 Jan 2016
[Figure: same data cloud, with an active ReLU and a dead ReLU]

dead ReLU:
will never activate
=> never update

=> people like to initialize ReLU neurons with slightly
positive biases (e.g. 0.01)
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 42 20 Jan 2016
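A minimal numpy sketch of that idea (not from the slides; layer sizes and data are made up for illustration): start ReLU units with a slightly positive bias so they begin in the active regime, and note that a unit that outputs zero for every input gets no gradient.

import numpy as np

D, H = 784, 100                      # hypothetical layer sizes
W = 0.01 * np.random.randn(D, H)     # small random weights
b = 0.01 * np.ones(H)                # slightly positive biases (e.g. 0.01)

x = np.random.randn(32, D)           # a fake minibatch
h = np.maximum(0, x.dot(W) + b)      # ReLU activations

# A "dead" unit outputs zero for every example: it receives no
# gradient through the ReLU and therefore never updates.
dead = (h > 0).mean(axis=0) == 0
print("fraction of dead units:", dead.mean())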
Activation Functions
[Maas et al., 2013]
[He et al., 2015]

Leaky ReLU
f(x) = max(0.01x, x)

- Does not saturate
- Computationally efficient
- Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
- will not “die”.
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 43 20 Jan 2016
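A small numpy sketch of the Leaky ReLU forward pass (illustrative only):

import numpy as np

def leaky_relu(x):
    # f(x) = max(0.01*x, x): identity for x > 0, a small slope of
    # 0.01 for x <= 0, so negative inputs still pass some gradient.
    return np.maximum(0.01 * x, x)

print(leaky_relu(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))
# [-0.02  -0.005  0.  0.5  2. ]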
Activation Functions
[Maas et al., 2013]
[He et al., 2015]

Leaky ReLU
f(x) = max(0.01x, x)

- Does not saturate
- Computationally efficient
- Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
- will not “die”.

Parametric Rectifier (PReLU)
f(x) = max(\alpha x, x)
backprop into \alpha (parameter)
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 44 20 Jan 2016
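A sketch of a PReLU forward/backward pass in numpy, assuming a single scalar \alpha shared across the layer (shapes and names are illustrative):

import numpy as np

def prelu_forward(x, alpha):
    # f(x) = max(alpha*x, x), with alpha a learned parameter
    out = np.where(x > 0, x, alpha * x)
    return out, (x, alpha)

def prelu_backward(dout, cache):
    x, alpha = cache
    dx = dout * np.where(x > 0, 1.0, alpha)
    # "backprop into alpha": alpha only affects the negative inputs
    dalpha = np.sum(dout * np.where(x > 0, 0.0, x))
    return dx, dalpha

out, cache = prelu_forward(np.random.randn(4, 5), alpha=0.25)
dx, dalpha = prelu_backward(np.ones_like(out), cache)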
Activation Functions
[Clevert et al., 2015]

Exponential Linear Units (ELU)
f(x) = x                      if x > 0
f(x) = \alpha (exp(x) - 1)    if x <= 0

- All benefits of ReLU
- Does not die
- Closer to zero mean outputs
- Computation requires exp()
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 45 20 Jan 2016
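A numpy sketch of the ELU forward pass (with the default \alpha = 1, as an illustration):

import numpy as np

def elu(x, alpha=1.0):
    # x for x > 0; alpha*(exp(x) - 1) for x <= 0, which saturates
    # smoothly to -alpha and pulls mean activations closer to zero.
    return np.where(x > 0, x, alpha * np.expm1(x))

print(elu(np.array([-3.0, -1.0, 0.0, 1.0, 3.0])))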
Maxout “Neuron”
[Goodfellow et al., 2013]

max(w_1^T x + b_1, w_2^T x + b_2)

- Does not have the basic form of dot product -> nonlinearity
- Generalizes ReLU and Leaky ReLU
- Linear Regime! Does not saturate! Does not die!

Problem: doubles the number of parameters/neuron :(
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 46 20 Jan 2016
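A sketch of a maxout layer with k = 2 linear pieces in numpy (sizes are illustrative); the two weight matrices are where the doubled parameter count comes from:

import numpy as np

N, D, H = 32, 100, 50                 # illustrative sizes
x = np.random.randn(N, D)

# Two independent affine maps per output unit (k = 2)
W1, b1 = 0.01 * np.random.randn(D, H), np.zeros(H)
W2, b2 = 0.01 * np.random.randn(D, H), np.zeros(H)

# Maxout: elementwise max of the two affine maps.
# With W1 = 0 and b1 = 0 this reduces to an ordinary ReLU.
h = np.maximum(x.dot(W1) + b1, x.dot(W2) + b2)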
TLDR: In practice:

- Use ReLU. Be careful with your learning rates
- Try out Leaky ReLU / Maxout / ELU
- Try out tanh but don’t expect much
- Don’t use sigmoid

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 47 20 Jan 2016
Data Preprocessing

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 48 20 Jan 2016
Step 1: Preprocess the data

(Assume X [N x D] is the data matrix, each example in a row)
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 49 20 Jan 2016
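A short numpy sketch of the usual zero-centering and normalization step for an [N x D] data matrix X (the fake data is just for illustration):

import numpy as np

X = np.random.randn(1000, 50) * 3.0 + 5.0   # fake [N x D] data

X -= np.mean(X, axis=0)   # zero-center each feature (column)
X /= np.std(X, axis=0)    # normalize each feature to unit variance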
Step 1: Preprocess the data
In practice, you may also see PCA and Whitening of the data

(after PCA: data has a diagonal covariance matrix)
(after whitening: the covariance matrix is the identity matrix)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 50 20 Jan 2016
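A hedged numpy sketch of what PCA and whitening mean here, assuming X is the zero-centered [N x D] data matrix (the 1e-5 is a small fudge factor to avoid division by zero):

import numpy as np

X = np.random.randn(1000, 50)           # fake data
X -= X.mean(axis=0)                     # zero-center first

cov = X.T.dot(X) / X.shape[0]           # [D x D] covariance matrix
U, S, V = np.linalg.svd(cov)

Xrot = X.dot(U)                         # decorrelate: diagonal covariance
Xwhite = Xrot / np.sqrt(S + 1e-5)       # whiten: ~identity covariance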
TLDR: In practice for Images: center only
e.g. consider CIFAR-10 example with [32,32,3] images
- Subtract the mean image (e.g. AlexNet)
(mean image = [32,32,3] array)
- Subtract per-channel mean (e.g. VGGNet)
(mean along each channel = 3 numbers)

Not common to normalize variance, to do PCA or whitening.

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 51 20 Jan 2016
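A numpy sketch contrasting the two options above (array shapes follow the CIFAR-10 example; the data is fake and the batch is small for illustration):

import numpy as np

X = np.random.rand(1000, 32, 32, 3).astype(np.float32)   # fake images

# AlexNet-style: subtract the mean image (a [32, 32, 3] array)
mean_image = X.mean(axis=0)
X_centered_image = X - mean_image

# VGGNet-style: subtract the per-channel mean (3 numbers)
mean_channel = X.mean(axis=(0, 1, 2))
X_centered_channel = X - mean_channel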
Weight Initialization

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 52 20 Jan 2016
- Q: what happens when W=0 init is used?

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 53 20 Jan 2016
- First idea: Small random numbers
(gaussian with zero mean and 1e-2 standard deviation)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 54 20 Jan 2016
- First idea: Small random numbers
(gaussian with zero mean and 1e-2 standard deviation)

Works ~okay for small networks, but can lead to
non-homogeneous distributions of activations
across the layers of a network.

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 55 20 Jan 2016
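The initialization described above is one line of numpy (layer sizes are illustrative):

import numpy as np

D, H = 500, 500                   # illustrative layer sizes

# small random numbers: zero-mean Gaussian, standard deviation 0.01
W = 0.01 * np.random.randn(D, H)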
Let's look at some activation statistics.

E.g. a 10-layer net with 500 neurons on each layer, using tanh
non-linearities, and initializing as described on the last slide.

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 56 20 Jan 2016
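A sketch of that experiment in numpy (the slide shows similar code): build the 10-layer tanh net with the 0.01-scaled Gaussian initialization and print the mean and standard deviation of the activations at each layer.

import numpy as np

D = np.random.randn(1000, 500)            # random input data
hidden_layer_sizes = [500] * 10
act = np.tanh

Hs = {}
H = D
for i, fan_out in enumerate(hidden_layer_sizes):
    fan_in = H.shape[1]
    W = 0.01 * np.random.randn(fan_in, fan_out)   # small random init
    H = act(H.dot(W))
    Hs[i] = H

for i in range(len(hidden_layer_sizes)):
    print("layer %d: mean %+.5f, std %.5f" % (i + 1, Hs[i].mean(), Hs[i].std()))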
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 57 20 Jan 2016
All activations become zero!

Q: Think about the backward pass. What do the gradients look like?

Hint: think about the backward pass for a W*X gate.

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 58 20 Jan 2016
*1.0 instead of *0.01

Almost all neurons completely saturated, either -1 or 1.
Gradients will be all zero.

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 59 20 Jan 2016
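The same sketch with the weight scale changed from 0.01 to 1.0 shows the saturation (illustrative):

import numpy as np

fan_in, fan_out = 500, 500

# Scaling by 1.0 instead of 0.01: tanh inputs become large, the units
# saturate at -1 or +1, and the local gradient (1 - tanh^2) is ~0.
W = 1.0 * np.random.randn(fan_in, fan_out)
h = np.tanh(np.random.randn(1000, fan_in).dot(W))
print("fraction with |h| > 0.99:", np.mean(np.abs(h) > 0.99))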
“Xavier initialization”
[Glorot et al., 2010]

Reasonable initialization.
(Mathematical derivation assumes linear activations)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 60 20 Jan 2016
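A numpy sketch of Xavier initialization as described above (layer sizes are illustrative):

import numpy as np

fan_in, fan_out = 500, 500        # illustrative layer sizes

# Xavier initialization: unit-Gaussian weights scaled by 1/sqrt(fan_in),
# so the variance of each layer's output roughly matches the variance
# of its input (the derivation assumes linear activations).
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)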
