Winter1516 Lecture54

but when using the ReLU nonlinearity it breaks: the activation distributions collapse toward zero in the deeper layers.

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 61 20 Jan 2016
He et al., 2015
(note the additional /2: the fan-in is divided by 2 to account for ReLU zeroing half the activations)
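A minimal numpy sketch of the two initializations being contrasted (the layer sizes here are assumptions for illustration, not taken from the slide):

import numpy as np

fan_in, fan_out = 512, 512   # assumed layer sizes, for illustration only

# Xavier initialization (Glorot & Bengio, 2010): variance ~ 1/fan_in
W_xavier = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

# He initialization (He et al., 2015): note the additional /2 on fan_in,
# i.e. variance ~ 2/fan_in, compensating for ReLU zeroing half the units
W_he = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)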

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 62 20 Jan 2016
Proper initialization is an active area of research…
Understanding the difficulty of training deep feedforward neural networks by Glorot and Bengio, 2010

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks by Saxe et al., 2013

Random walk initialization for training very deep feedforward networks by Sussillo and Abbott, 2014

Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification by He et al., 2015

Data-dependent Initializations of Convolutional Neural Networks by Krähenbühl et al., 2015

All you need is a good init by Mishkin and Matas, 2015


Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 64 20 Jan 2016
[Ioffe and Szegedy, 2015]
Batch Normalization
“you want unit gaussian activations? just make them so.”

Consider a batch of activations at some layer. To make each dimension unit gaussian, apply:

x̂^(k) = (x^(k) - E[x^(k)]) / sqrt(Var[x^(k)])

This is a vanilla differentiable function...

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 65 20 Jan 2016
[Ioffe and Szegedy, 2015]
Batch Normalization
“you want unit gaussian activations? just make them so.”

Consider a batch of activations at some layer: an N x D matrix X (N examples in the batch, D-dimensional activations).

1. Compute the empirical mean and variance independently for each dimension (i.e. per column, over the N examples in the batch).
2. Normalize:
   x̂^(k) = (x^(k) - E[x^(k)]) / sqrt(Var[x^(k)])
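A minimal numpy sketch of this step (the epsilon and the function name are assumptions added for illustration):

import numpy as np

def batchnorm_normalize(X, eps=1e-5):
    # X: [N x D] batch of activations; treat each dimension (column) independently
    mu = X.mean(axis=0)                      # empirical mean per dimension, shape [D]
    var = X.var(axis=0)                      # empirical variance per dimension, shape [D]
    X_hat = (X - mu) / np.sqrt(var + eps)    # approximately unit gaussian per dimension
    return X_hat, mu, var

# usage: a batch of 32 activations, 50 dimensions, arbitrarily shifted and scaled
X_hat, _, _ = batchnorm_normalize(3.0 * np.random.randn(32, 50) + 2.0)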
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 66 20 Jan 2016
[Ioffe and Szegedy, 2015]
Batch Normalization

Usually inserted after Fully Connected (or Convolutional, as we'll see soon) layers, and before the nonlinearity:

FC -> BN -> tanh -> FC -> BN -> tanh -> ...

Problem: do we necessarily want a unit gaussian input to a tanh layer?

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 67 20 Jan 2016
[Ioffe and Szegedy, 2015]
Batch Normalization
Normalize:
   x̂^(k) = (x^(k) - E[x^(k)]) / sqrt(Var[x^(k)])

And then allow the network to squash the range if it wants to:
   y^(k) = γ^(k) x̂^(k) + β^(k)

Note, the network can learn:
   γ^(k) = sqrt(Var[x^(k)]),   β^(k) = E[x^(k)]
to recover the identity mapping.
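A hedged sketch of the training-time forward pass with the learnable scale and shift (the function name and epsilon are assumptions; gamma and beta are the learnable parameters):

import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    # X: [N x D]; gamma, beta: [D] learnable parameters
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    X_hat = (X - mu) / np.sqrt(var + eps)    # normalize each dimension to ~unit gaussian
    return gamma * X_hat + beta              # squash/shift the range if the network wants to

# if the network learns gamma = sqrt(Var[x]) and beta = E[x],
# the output recovers (approximately) the identity mapping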

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 68 20 Jan 2016
[Ioffe and Szegedy, 2015]
Batch Normalization
- Improves gradient flow through the network
- Allows higher learning rates
- Reduces the strong dependence on initialization
- Acts as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 69 20 Jan 2016
[Ioffe and Szegedy, 2015]
Batch Normalization
Note: at test time the BatchNorm layer functions differently:

The mean/std are not computed based on the batch. Instead, a single fixed empirical mean/std of activations from training is used (e.g. these can be estimated during training with running averages).
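A sketch of how those fixed statistics might be maintained and used (the momentum value, dimensions, and names are assumptions, not from the slide):

import numpy as np

D = 50                        # assumed feature dimension
running_mu = np.zeros(D)      # maintained during training, one pair per BN layer
running_var = np.ones(D)
momentum = 0.9                # assumed decay for the running averages

def update_running_stats(mu, var):
    # called during training after computing the batch statistics mu, var
    global running_mu, running_var
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var

def batchnorm_test(X, gamma, beta, eps=1e-5):
    # at test time: use the fixed running statistics, not the current batch's
    X_hat = (X - running_mu) / np.sqrt(running_var + eps)
    return gamma * X_hat + beta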

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 70 20 Jan 2016
Babysitting the Learning Process

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 71 20 Jan 2016
Step 1: Preprocess the data

(Assume X [N x D] is the data matrix, each example in a row.)
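The slide's code is not preserved in this transcript; a typical zero-center / normalize step might look like the following sketch (the random X is only a stand-in for the real data):

import numpy as np

# stand-in data: N = 100 examples, D = 3072 features (e.g. flattened CIFAR-10 images)
X = np.random.randn(100, 3072)

X -= np.mean(X, axis=0)   # zero-center each feature
X /= np.std(X, axis=0)    # normalize each feature to unit variance (optional for images)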
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 72 20 Jan 2016
Step 2: Choose the architecture:
say we start with one hidden layer of 50 neurons:

input layer (CIFAR-10 images, 3072 numbers) -> hidden layer (50 hidden neurons) -> output layer (10 output neurons, one per class)
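The lecture's own code is not shown in this transcript; a comparable initialization of such a model, as a sketch (the function name and weight scale are assumptions):

import numpy as np

def init_two_layer_model(input_size=3072, hidden_size=50, output_size=10):
    # small random weights, zero biases
    model = {}
    model['W1'] = 0.0001 * np.random.randn(input_size, hidden_size)
    model['b1'] = np.zeros(hidden_size)
    model['W2'] = 0.0001 * np.random.randn(hidden_size, output_size)
    model['b2'] = np.zeros(output_size)
    return model

model = init_two_layer_model()   # CIFAR-10: 3072 inputs -> 50 hidden -> 10 classes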
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 73 20 Jan 2016
Double check that the loss is reasonable:

Disable regularization (the model returns the loss and the gradient for all parameters). The loss comes out to ~2.3, which is “correct” for 10 classes: a softmax classifier at initialization should give roughly -ln(1/10) ≈ 2.3.
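A quick way to see where the ~2.3 comes from (a sketch; the actual check on the slide calls the model's loss function with reg = 0.0):

import numpy as np

num_classes = 10
# at initialization the softmax spreads probability roughly uniformly over classes,
# so the expected data loss is -ln(1/num_classes)
expected_loss = -np.log(1.0 / num_classes)
print(expected_loss)   # ~2.3026 -- compare against the returned loss with regularization disabled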
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 74 20 Jan 2016
Double check that the loss is reasonable:

Crank up regularization: the loss went up, good. (sanity check: the regularization term adds to the loss)
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 75 20 Jan 2016
Let's try to train now…

Tip: Make sure that you can overfit a very small portion of the training data.

The code on the slide (not preserved in this transcript; a sketch follows below) does the following:
- take the first 20 examples from CIFAR-10
- turn off regularization (reg = 0.0)
- use simple vanilla ‘sgd’
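A hedged sketch of that experiment (the data here is a synthetic stand-in for the first 20 CIFAR-10 examples, and all hyperparameters are assumptions):

import numpy as np

np.random.seed(0)
X_tiny = np.random.randn(20, 3072)            # stand-in for 20 CIFAR-10 images (3072 numbers each)
y_tiny = np.random.randint(0, 10, size=20)    # stand-in labels

# two-layer net: 3072 -> 50 -> 10, He-style initialization
W1 = np.random.randn(3072, 50) * np.sqrt(2.0 / 3072); b1 = np.zeros(50)
W2 = np.random.randn(50, 10) * np.sqrt(2.0 / 50);     b2 = np.zeros(10)

lr, reg = 1e-2, 0.0                           # regularization turned off; lr is an assumed value

for it in range(1000):
    # forward: FC -> ReLU -> FC -> softmax loss
    h = np.maximum(0, X_tiny.dot(W1) + b1)
    scores = h.dot(W2) + b2
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(20), y_tiny]).mean()

    # backward (softmax + ReLU gradients)
    dscores = probs.copy()
    dscores[np.arange(20), y_tiny] -= 1
    dscores /= 20
    dW2 = h.T.dot(dscores); db2 = dscores.sum(axis=0)
    dh = dscores.dot(W2.T); dh[h <= 0] = 0
    dW1 = X_tiny.T.dot(dh); db1 = dh.sum(axis=0)

    # simple vanilla 'sgd' update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

train_acc = (probs.argmax(axis=1) == y_tiny).mean()
print(loss, train_acc)   # the loss should shrink and train accuracy approach 1.00;
                         # if the net cannot overfit these 20 examples, something is wrong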

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 76 20 Jan 2016
Let's try to train now…

Tip: Make sure that you can overfit a very small portion of the training data.

Very small loss, train accuracy 1.00, nice!
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 77 20 Jan 2016
Let's try to train now…

I like to start with small regularization and find a learning rate that makes the loss go down.

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 78 20 Jan 2016
Let's try to train now…

I like to start with small regularization and find a learning rate that makes the loss go down.

Loss barely changing

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 79 20 Jan 2016
Let's try to train now…

I like to start with small regularization and find a learning rate that makes the loss go down.

Loss barely changing: the loss is not going down, so the learning rate is probably too low.
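One way to see this effect in isolation, as a rough sketch (a tiny linear softmax classifier on synthetic data; the function, data, and learning-rate values are all assumptions for illustration):

import numpy as np

def final_loss_for(lr, num_iters=100):
    # hypothetical mini-experiment: linear softmax on 20 random examples,
    # used only to illustrate how the learning rate affects the loss
    rng = np.random.RandomState(0)
    X = rng.randn(20, 100)
    y = rng.randint(0, 10, size=20)
    W = 0.01 * rng.randn(100, 10)
    loss = None
    for _ in range(num_iters):
        scores = X.dot(W)
        scores -= scores.max(axis=1, keepdims=True)
        probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
        loss = -np.log(probs[np.arange(20), y]).mean()
        dscores = probs.copy()
        dscores[np.arange(20), y] -= 1
        W -= lr * X.T.dot(dscores) / 20
    return loss

for lr in [1e-6, 1e-3, 1e-1]:
    # with lr = 1e-6 the loss barely moves from ~2.3; larger rates make it go down
    print(lr, final_loss_for(lr))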

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 80 20 Jan 2016
