
Lecture 7:

Training Neural Networks,


Part I

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 1 April 20, 2021


Administrative: Project Proposal

Due yesterday, 4/19 on GradeScope

1 person per group needs to submit, but tag all group members

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 2 April 20, 2021


Personal announcement

Ranjay:
- I am defending my PhD on Friday, April 23rd at 1pm PST.
- Stanford CS defenses are public events.
- Join CS 547's seminar if you want to watch it.
- If you are unable to find the zoom link to watch it and want to,
send me an email by Thursday 3pm.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 3 April 20, 2021


Administrative: A2

A2 is out, due Wednesday April 30th, 11:59pm

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 4 April 20, 2021


Where we are now...

Computational graphs: x and W feed a multiply node that produces the scores s; the hinge loss and the regularization term R combine (+) into the total loss L.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 5 April 20, 2021


Where we are now...
Neural Networks
Linear score function: f = Wx
2-layer Neural Network: f = W2 max(0, W1 x)

x (3072) → W1 → h (100) → W2 → s (10)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 6 April 20, 2021


Where we are now...
Convolutional Neural Networks

Illustration of LeCun et al. 1998 from CS231n 2017 Lecture 1

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 7 April 20, 2021


Where we are now...
Convolutional Layer: convolve (slide) a 5x5x3 filter over all spatial locations of a 32x32x3 image to produce a 28x28x1 activation map.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 8 April 20, 2021


Where we are now...
Convolutional Layer
For example, if we had 6 5x5 filters, we'd get 6 separate activation maps, each of size 28x28.
We stack these up to get a “new image” of size 28x28x6!

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 9 April 20, 2021


Where we are now...
Learning network parameters through optimization

Landscape image is CC0 1.0 public domain


Walking man image is CC0 1.0 public domain

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 10 April 20, 2021


Where we are now...

Mini-batch SGD
Loop:
1. Sample a batch of data
2. Forward prop it through the graph
(network), get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 11 April 20, 2021


Where we are now...

Hardware + Software
PyTorch

TensorFlow

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 12 April 20, 2021


Today: Training Neural Networks

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 13 April 20, 2021


Overview
1. One time setup
activation functions, preprocessing, weight
initialization, regularization, gradient checking
2. Training dynamics
babysitting the learning process,
parameter updates, hyperparameter optimization
3. Evaluation
model ensembles, test-time augmentation, transfer
learning
Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 14 April 20, 2021
Part 1
- Activation Functions
- Data Preprocessing
- Weight Initialization
- Batch Normalization
- Transfer learning

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 15 April 20, 2021


Activation Functions

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 16 April 20, 2021


Activation Functions

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 17 April 20, 2021


Activation Functions
Sigmoid Leaky ReLU

tanh Maxout

ReLU ELU

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 18 April 20, 2021


Activation Functions
- Squashes numbers to range [0,1]
- Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

Sigmoid

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 19 April 20, 2021


Activation Functions
- Squashes numbers to range [0,1]
- Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

Sigmoid

3 problems:

1. Saturated neurons “kill” the gradients

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 20 April 20, 2021


x → sigmoid gate

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 21 April 20, 2021


x → sigmoid gate

What happens when x = -10?

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 22 April 20, 2021


x → sigmoid gate

What happens when x = -10?

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 23 April 20, 2021


x → sigmoid gate

What happens when x = -10?


What happens when x = 0?

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 24 April 20, 2021


x → sigmoid gate

What happens when x = -10?


What happens when x = 0?
What happens when x = 10?

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 25 April 20, 2021


x → sigmoid gate

What happens when x = -10?


What happens when x = 0?
What happens when x = 10?

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 26 April 20, 2021


x → sigmoid gate

Why is this a problem?

When the sigmoid saturates, its local gradient is (near) zero, so all the gradients flowing back will be zero and the weights will never change.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 27 April 20, 2021
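To make the saturation concrete, here is a minimal NumPy sketch (an illustration, not code from the slides) that evaluates the sigmoid and its local gradient sigmoid(x) * (1 - sigmoid(x)) at the three inputs above; at x = ±10 the local gradient is essentially zero, so almost no gradient flows back through the gate.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in (-10.0, 0.0, 10.0):
    s = sigmoid(x)
    local_grad = s * (1 - s)   # d(sigmoid)/dx
    print(f"x = {x:6.1f}   sigmoid(x) = {s:.6f}   local gradient = {local_grad:.6f}")

# At x = -10 and x = 10 the local gradient is ~4.5e-05, so whatever
# upstream gradient arrives is multiplied by ~0 on its way down.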


Activation Functions
- Squashes numbers to range [0,1]
- Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

Sigmoid

3 problems:

1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 28 April 20, 2021


Consider what happens when the input to a neuron is
always positive...

What can we say about the gradients on w?

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 29 April 20, 2021


Consider what happens when the input to a neuron is
always positive...

What can we say about the gradients on w?

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 30 April 20, 2021


Consider what happens when the input to a neuron is
always positive...

What can we say about the gradients on w?


We know that local gradient of sigmoid is always positive

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 31 April 20, 2021


Consider what happens when the input to a neuron is
always positive...

What can we say about the gradients on w?


We know that local gradient of sigmoid is always positive
We are assuming x is always positive

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 32 April 20, 2021


Consider what happens when the input to a neuron is
always positive...

What can we say about the gradients on w?


We know that local gradient of sigmoid is always positive
We are assuming x is always positive

So!! Sign of gradient for all wi is the same as the sign of upstream scalar gradient!

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 33 April 20, 2021


Consider what happens when the input to a neuron is always positive...

What can we say about the gradients on w?
Always all positive or all negative :(

The allowed gradient update directions are confined to two quadrants, so reaching a hypothetical optimal w vector requires an inefficient zig-zag path.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 34 April 20, 2021


Consider what happens when the input to a neuron is always positive...

What can we say about the gradients on w?
Always all positive or all negative :(
(For a single element! Minibatches help)

The allowed gradient update directions are confined to two quadrants, so reaching a hypothetical optimal w vector requires an inefficient zig-zag path.
Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 35 April 20, 2021
Activation Functions
- Squashes numbers to range [0,1]
- Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

Sigmoid

3 problems:

1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered
3. exp() is a bit computationally expensive

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 36 April 20, 2021


Activation Functions

- Squashes numbers to range [-1,1]


- zero centered (nice)
- still kills gradients when saturated :(

tanh(x)

[LeCun et al., 1991]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 37 April 20, 2021


- Computes f(x) = max(0,x)
Activation Functions
- Does not saturate (in +region)
- Very computationally efficient
- Converges much faster than
sigmoid/tanh in practice (e.g. 6x)

ReLU
(Rectified Linear Unit)
[Krizhevsky et al., 2012]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 38 April 20, 2021


- Computes f(x) = max(0,x)
Activation Functions
- Does not saturate (in +region)
- Very computationally efficient
- Converges much faster than
sigmoid/tanh in practice (e.g. 6x)

- Not zero-centered output


ReLU
(Rectified Linear Unit)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 39 April 20, 2021


- Computes f(x) = max(0,x)
Activation Functions
- Does not saturate (in +region)
- Very computationally efficient
- Converges much faster than
sigmoid/tanh in practice (e.g. 6x)

- Not zero-centered output


ReLU
(Rectified Linear Unit)

An annoyance:
hint: what is the gradient when x < 0?

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 40 April 20, 2021


x → ReLU gate

What happens when x = -10?


What happens when x = 0?
What happens when x = 10?

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 41 April 20, 2021
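For comparison, the same three inputs through a ReLU gate, as a small NumPy sketch (hypothetical helper names, using the common convention that the gradient at exactly x = 0 is taken to be 0): the local gradient is 1 for x > 0 and 0 for x < 0, which is what makes a neuron stuck in the negative regime “dead”.

import numpy as np

def relu_forward(x):
    return np.maximum(0, x)

def relu_backward(dout, x):
    # Gradient passes through unchanged where x > 0, and is zero elsewhere.
    return dout * (x > 0)

x = np.array([-10.0, 0.0, 10.0])
print(relu_forward(x))                # [ 0.  0. 10.]
print(relu_backward(np.ones(3), x))   # [0. 0. 1.]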


active ReLU: its inputs fall inside the data cloud
dead ReLU: its inputs fall outside the data cloud, so it will never activate => never update
Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 42 April 20, 2021
active ReLU: its inputs fall inside the data cloud
dead ReLU: its inputs fall outside the data cloud, so it will never activate => never update

=> people like to initialize ReLU neurons with slightly positive biases (e.g. 0.01)
Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 43 April 20, 2021
Activation Functions
[Maas et al., 2013] [He et al., 2015]

- Does not saturate


- Computationally efficient
- Converges much faster than
sigmoid/tanh in practice! (e.g. 6x)
- will not “die”.

Leaky ReLU: f(x) = max(0.01x, x)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 44 April 20, 2021


Activation Functions
[Maas et al., 2013] [He et al., 2015]

- Does not saturate


- Computationally efficient
- Converges much faster than
sigmoid/tanh in practice! (e.g. 6x)
- will not “die”.

Leaky ReLU: f(x) = max(0.01x, x)

Parametric Rectifier (PReLU): f(x) = max(αx, x)
backprop into α (parameter)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 45 April 20, 2021
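A minimal sketch of Leaky ReLU and PReLU in NumPy (hypothetical helper names; 0.01 is the usual default slope for the leaky variant). With PReLU the slope alpha is a learned parameter, so the backward pass also produces a gradient for alpha, matching “backprop into α” above.

import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def prelu_forward(x, alpha):
    return np.where(x > 0, x, alpha * x)

def prelu_backward(dout, x, alpha):
    dx = dout * np.where(x > 0, 1.0, alpha)
    dalpha = np.sum(dout * np.where(x > 0, 0.0, x))   # backprop into alpha
    return dx, dalpha

x = np.array([-2.0, -0.5, 3.0])
print(leaky_relu(x))                 # [-0.02  -0.005  3.   ]
print(prelu_forward(x, alpha=0.1))   # [-0.2   -0.05   3.   ]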


[Clevert et al., 2015]
Activation Functions
Exponential Linear Units (ELU)
- All benefits of ReLU
- Closer to zero mean outputs
- Negative saturation regime
compared with Leaky ReLU
adds some robustness to noise

- Computation requires exp()


(Alpha default = 1)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 46 April 20, 2021


[Klambauer et al. ICLR 2017]
Activation Functions
Scaled Exponential Linear Units (SELU)
- Scaled version of ELU that
works better for deep networks
- “Self-normalizing” property;
- Can train deep SELU networks
without BatchNorm
- (will discuss more later)

α = 1.6733, λ = 1.0507

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 47 April 20, 2021


[Goodfellow et al., 2013]
Maxout “Neuron”
- Does not have the basic form of dot product ->
nonlinearity
- Generalizes ReLU and Leaky ReLU
- Linear Regime! Does not saturate! Does not die!

Problem: doubles the number of parameters/neuron :(

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 48 April 20, 2021


[Ramachandran et al. 2018]
Activation Functions
Swish
- They trained a neural network
to generate and test out
different non-linearities.
- Swish outperformed all other
options for CIFAR-10 accuracy

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 49 April 20, 2021


TLDR: In practice:

- Use ReLU. Be careful with your learning rates


- Try out Leaky ReLU / Maxout / ELU / SELU
- To squeeze out some marginal gains
- Don’t use sigmoid or tanh

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 50 April 20, 2021


Data Preprocessing

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 51 April 20, 2021


Data Preprocessing

(Assume X [NxD] is data matrix, each example in a row)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 52 April 20, 2021


Remember: Consider what happens when the input to a neuron is always positive...

What can we say about the gradients on w?
Always all positive or all negative :(
(this is also why you want zero-mean data!)

The allowed gradient update directions are confined to two quadrants, so reaching a hypothetical optimal w vector requires an inefficient zig-zag path.
Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 53 April 20, 2021
Data Preprocessing

(Assume X [NxD] is data matrix, each example in a row)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 54 April 20, 2021
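In code, the zero-centering and normalization on this slide are two NumPy lines (a sketch, assuming X is the N x D data matrix described above):

import numpy as np

X = np.random.randn(100, 3072) * 5.0 + 2.0   # stand-in data matrix, N x D

X -= np.mean(X, axis=0)   # zero-center: subtract the per-dimension mean
X /= np.std(X, axis=0)    # normalize: divide by the per-dimension std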


Data Preprocessing
In practice, you may also see PCA and Whitening of the data

(decorrelated data: data has diagonal covariance matrix)   (whitened data: covariance matrix is the identity matrix)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 55 April 20, 2021
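A sketch of PCA and whitening on the same X (the standard recipe, not code from the lecture): rotate into the eigenbasis of the covariance matrix to decorrelate the data, then divide each dimension by the square root of its eigenvalue.

import numpy as np

X = np.random.randn(100, 50)
X -= np.mean(X, axis=0)               # PCA / whitening assume zero-mean data

cov = X.T @ X / X.shape[0]            # D x D covariance matrix
U, S, _ = np.linalg.svd(cov)          # eigenvectors U, eigenvalues S

X_rot = X @ U                         # decorrelated: diagonal covariance
X_white = X_rot / np.sqrt(S + 1e-5)   # whitened: ~identity covariance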


Data Preprocessing
Before normalization: classification loss very sensitive to changes in weight matrix; hard to optimize.
After normalization: less sensitive to small changes in weights; easier to optimize.

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - 56 April 20, 2021
TLDR: In practice for Images: center only
e.g. consider CIFAR-10 example with [32,32,3] images
- Subtract the mean image (e.g. AlexNet)
  (mean image = [32,32,3] array)
- Subtract per-channel mean (e.g. VGGNet)
  (mean along each channel = 3 numbers)
- Subtract per-channel mean and divide by per-channel std (e.g. ResNet)
  (mean and std along each channel = 3 numbers each)

Not common to do PCA or whitening

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 57 April 20, 2021
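The three image recipes above, as a NumPy sketch (assuming a training set stored as a float array imgs of shape [N, 32, 32, 3]):

import numpy as np

imgs = np.random.rand(50000, 32, 32, 3).astype(np.float32) * 255.0

# AlexNet-style: subtract the mean image (a [32, 32, 3] array)
mean_image = imgs.mean(axis=0)
centered = imgs - mean_image

# VGGNet-style: subtract the per-channel mean (3 numbers)
channel_mean = imgs.mean(axis=(0, 1, 2))
centered = imgs - channel_mean

# ResNet-style: per-channel mean and std (3 numbers each)
channel_std = imgs.std(axis=(0, 1, 2))
normalized = (imgs - channel_mean) / channel_std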


Weight Initialization

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 58 April 20, 2021


- Q: what happens when W=constant init is used?

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 59 April 20, 2021


- First idea: Small random numbers
(gaussian with zero mean and 1e-2 standard deviation)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 60 April 20, 2021


- First idea: Small random numbers
(gaussian with zero mean and 1e-2 standard deviation)

Works ~okay for small networks, but problems with deeper networks.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 61 April 20, 2021
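The “small random numbers” idea in one line of NumPy (a sketch; Din and Dout stand for a layer's input and output sizes):

import numpy as np

Din, Dout = 3072, 100
W = 0.01 * np.random.randn(Din, Dout)   # zero-mean Gaussian, std = 1e-2
b = np.zeros(Dout)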


Weight Initialization: Activation statistics
Forward pass for a 6-layer
net with hidden size 4096

What will happen to the activations for the last layer?

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 62 April 20, 2021
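A sketch of the experiment behind these plots (a reconstruction under the stated setup, not the lecture's exact script): a 6-layer tanh network with hidden size 4096 and small-random-number init, printing the mean and std of each layer's activations.

import numpy as np

dims = [4096] * 7                  # input plus 6 hidden layers of size 4096
x = np.random.randn(16, dims[0])   # a small batch of random inputs

for Din, Dout in zip(dims[:-1], dims[1:]):
    W = 0.01 * np.random.randn(Din, Dout)   # std = 0.01 initialization
    x = np.tanh(x @ W)
    print(f"mean {x.mean():+.6f}   std {x.std():.6f}")

# The std shrinks layer after layer: activations collapse toward zero.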


Weight Initialization: Activation statistics
Forward pass for a 6-layer net with hidden size 4096.
All activations tend to zero for deeper network layers.
Q: What do the gradients dL/dW look like?

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 63 April 20, 2021


Weight Initialization: Activation statistics
Forward pass for a 6-layer net with hidden size 4096.
All activations tend to zero for deeper network layers.
Q: What do the gradients dL/dW look like?
A: All zero, no learning =(

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 64 April 20, 2021


Weight Initialization: Activation statistics
Increase std of initial
weights from 0.01 to 0.05

What will happen to the activations for the last layer?

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 65 April 20, 2021


Weight Initialization: Activation statistics
Increase std of initial weights from 0.01 to 0.05.
All activations saturate.
Q: What do the gradients look like?

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 66 April 20, 2021


Weight Initialization: Activation statistics
Increase std of initial weights from 0.01 to 0.05.
All activations saturate.
Q: What do the gradients look like?
A: Local gradients all zero, no learning =(

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 67 April 20, 2021


Weight Initialization: “Xavier” Initialization
“Xavier” initialization:
std = 1/sqrt(Din)

Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 68 April 20, 2021
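The only change from the previous experiment is the scale of the weights; a one-line sketch of the “Xavier” rule:

import numpy as np

Din, Dout = 4096, 4096
W = np.random.randn(Din, Dout) / np.sqrt(Din)   # "Xavier": std = 1/sqrt(Din)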


Weight Initialization: “Xavier” Initialization
“Xavier” initialization: std = 1/sqrt(Din)
“Just right”: Activations are nicely scaled for all layers!

Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 69 April 20, 2021


Weight Initialization: “Xavier” Initialization
“Xavier” initialization: std = 1/sqrt(Din)
“Just right”: Activations are nicely scaled for all layers!

For conv layers, Din is filter_size^2 * input_channels

Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 70 April 20, 2021


Weight Initialization: “Xavier” Initialization
“Xavier” initialization: std = 1/sqrt(Din)
“Just right”: Activations are nicely scaled for all layers!

For conv layers, Din is filter_size^2 * input_channels

Let: y = x_1 w_1 + x_2 w_2 + ... + x_Din w_Din

Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 71 April 20, 2021


Weight Initialization: “Xavier” Initialization
“Xavier” initialization: std = 1/sqrt(Din)
“Just right”: Activations are nicely scaled for all layers!

For conv layers, Din is filter_size^2 * input_channels

Let: y = x_1 w_1 + x_2 w_2 + ... + x_Din w_Din
Assume: Var(x_1) = Var(x_2) = ... = Var(x_Din)

Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 72 April 20, 2021


Weight Initialization: “Xavier” Initialization
“Xavier” initialization: std = 1/sqrt(Din)
“Just right”: Activations are nicely scaled for all layers!

For conv layers, Din is filter_size^2 * input_channels

Let: y = x_1 w_1 + x_2 w_2 + ... + x_Din w_Din
Assume: Var(x_1) = Var(x_2) = ... = Var(x_Din)
We want: Var(y) = Var(x_i)

Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 73 April 20, 2021


Weight Initialization: “Xavier” Initialization
“Xavier” initialization: std = 1/sqrt(Din)
“Just right”: Activations are nicely scaled for all layers!

For conv layers, Din is filter_size^2 * input_channels

Let: y = x_1 w_1 + x_2 w_2 + ... + x_Din w_Din
Assume: Var(x_1) = Var(x_2) = ... = Var(x_Din)
We want: Var(y) = Var(x_i)

Var(y) = Var(x_1 w_1 + x_2 w_2 + ... + x_Din w_Din)   [substituting value of y]

Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 74 April 20, 2021


Weight Initialization: “Xavier” Initialization
“Xavier” initialization: std = 1/sqrt(Din)
“Just right”: Activations are nicely scaled for all layers!

For conv layers, Din is filter_size^2 * input_channels

Let: y = x_1 w_1 + x_2 w_2 + ... + x_Din w_Din
Assume: Var(x_1) = Var(x_2) = ... = Var(x_Din)
We want: Var(y) = Var(x_i)

Var(y) = Var(x_1 w_1 + x_2 w_2 + ... + x_Din w_Din)
       = Din Var(x_i w_i)   [assume all x_i, w_i are iid]

Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 75 April 20, 2021


Weight Initialization: “Xavier” Initialization
“Xavier” initialization: std = 1/sqrt(Din)
“Just right”: Activations are nicely scaled for all layers!

For conv layers, Din is filter_size^2 * input_channels

Let: y = x_1 w_1 + x_2 w_2 + ... + x_Din w_Din
Assume: Var(x_1) = Var(x_2) = ... = Var(x_Din)
We want: Var(y) = Var(x_i)

Var(y) = Var(x_1 w_1 + x_2 w_2 + ... + x_Din w_Din)
       = Din Var(x_i w_i)          [assume all x_i, w_i are iid]
       = Din Var(x_i) Var(w_i)     [assume all x_i, w_i are zero mean]

Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 76 April 20, 2021


Weight Initialization: “Xavier” Initialization
“Xavier” initialization: std = 1/sqrt(Din)
“Just right”: Activations are nicely scaled for all layers!

For conv layers, Din is filter_size^2 * input_channels

Let: y = x_1 w_1 + x_2 w_2 + ... + x_Din w_Din
Assume: Var(x_1) = Var(x_2) = ... = Var(x_Din)
We want: Var(y) = Var(x_i)

Var(y) = Var(x_1 w_1 + x_2 w_2 + ... + x_Din w_Din)
       = Din Var(x_i w_i)          [assume all x_i, w_i are iid]
       = Din Var(x_i) Var(w_i)     [assume all x_i, w_i are zero mean]

So, Var(y) = Var(x_i) only when Var(w_i) = 1/Din


Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 77 April 20, 2021


Weight Initialization: What about ReLU?
Change from tanh to ReLU

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 78 April 20, 2021


Weight Initialization: What about ReLU?
Change from tanh to ReLU.
Xavier assumes a zero-centered activation function.

Activations collapse to zero again, no learning =(

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 79 April 20, 2021


Weight Initialization: Kaiming / MSRA Initialization
ReLU correction: std = sqrt(2 / Din)
“Just right”: Activations are nicely scaled for all layers!

He et al, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, ICCV 2015

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 80 April 20, 2021
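The ReLU correction as a one-line sketch: doubling the variance compensates for ReLU zeroing out (roughly) half of its inputs.

import numpy as np

Din, Dout = 4096, 4096
W = np.random.randn(Din, Dout) * np.sqrt(2.0 / Din)   # Kaiming / MSRA init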


Proper initialization is an active area of research…
Understanding the difficulty of training deep feedforward neural networks
by Glorot and Bengio, 2010

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks by Saxe et al, 2013

Random walk initialization for training very deep feedforward networks by Sussillo and Abbott, 2014

Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification by He et al., 2015

Data-dependent Initializations of Convolutional Neural Networks by Krähenbühl et al., 2015

All you need is a good init, Mishkin and Matas, 2015

Fixup Initialization: Residual Learning Without Normalization, Zhang et al, 2019

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, Frankle and Carbin, 2019

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 81 April 20, 2021


Batch Normalization

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 82 April 20, 2021


[Ioffe and Szegedy, 2015]
Batch Normalization
“you want zero-mean unit-variance activations? just make them so.”

consider a batch of activations at some layer. To make each dimension zero-mean unit-variance, apply:

x_normalized = (x - E[x]) / sqrt(Var[x])

this is a vanilla differentiable function...

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 83 April 20, 2021


[Ioffe and Szegedy, 2015]
Batch Normalization

Input x: shape N x D
Per-channel mean 𝞵: shape D
Per-channel var 𝝈: shape D
Normalized x: (x - 𝞵)/𝝈, shape N x D

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - April 20, 2021
[Ioffe and Szegedy, 2015]
Batch Normalization

Input x: shape N x D
Per-channel mean 𝞵: shape D
Per-channel var 𝝈: shape D
Normalized x: (x - 𝞵)/𝝈, shape N x D

Problem: What if zero-mean, unit variance is too hard of a constraint?

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - April 20, 2021
[Ioffe and Szegedy, 2015]
Batch Normalization

Input x: shape N x D
Per-channel mean 𝞵: shape D
Per-channel var 𝝈: shape D
Normalized x: (x - 𝞵)/𝝈, shape N x D

Learnable scale and shift parameters ɣ, β: shape D
Output y = ɣ(x-𝞵)/𝝈 + β: shape N x D

Learning ɣ = 𝝈, β = 𝞵 will recover the identity function!

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - April 20, 2021
Batch Normalization: Test-Time

Input x: shape N x D
Per-channel mean 𝞵: shape D
Per-channel var 𝝈: shape D
Normalized x: (x - 𝞵)/𝝈, shape N x D

Learnable scale and shift parameters ɣ, β: shape D
Output y = ɣ(x-𝞵)/𝝈 + β: shape N x D

Problem: estimates of 𝞵 and 𝝈 depend on the minibatch; can't do this at test-time!

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - April 20, 2021
Batch Normalization: Test-Time

Input x: shape N x D
Per-channel mean 𝞵: shape D (at test time, use a (running) average of values seen during training)
Per-channel var 𝝈: shape D (at test time, use a (running) average of values seen during training)
Normalized x: (x - 𝞵)/𝝈, shape N x D

Learnable scale and shift parameters ɣ, β: shape D
Output y = ɣ(x-𝞵)/𝝈 + β: shape N x D

During testing batchnorm becomes a linear operator!
Can be fused with the previous fully-connected or conv layer

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - April 20, 2021
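A minimal NumPy sketch of batch normalization for a fully-connected layer under the shapes above (an illustration, not the lecture's code): at training time it normalizes with the batch statistics and updates running averages; at test time it reuses those running averages, so the whole operation is a fixed linear transform.

import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      train=True, eps=1e-5, momentum=0.9):
    # x: (N, D); gamma, beta, running_mean, running_var: (D,)
    if train:
        mu = x.mean(axis=0)     # per-channel mean, shape (D,)
        var = x.var(axis=0)     # per-channel variance, shape (D,)
        running_mean *= momentum
        running_mean += (1 - momentum) * mu    # updated in place
        running_var *= momentum
        running_var += (1 - momentum) * var
    else:
        mu, var = running_mean, running_var    # fixed statistics at test time
    x_hat = (x - mu) / np.sqrt(var + eps)      # normalized x, shape (N, D)
    return gamma * x_hat + beta                # learnable scale and shift

D = 100
x = np.random.randn(32, D) * 3 + 1
gamma, beta = np.ones(D), np.zeros(D)
run_mu, run_var = np.zeros(D), np.ones(D)
out = batchnorm_forward(x, gamma, beta, run_mu, run_var, train=True)
print(out.mean(), out.std())   # approximately 0 and 1 during training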
[Ioffe and Szegedy, 2015]
Batch Normalization

FC → BN → tanh → FC → BN → tanh → ...

Usually inserted after Fully Connected or Convolutional layers, and before nonlinearity.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 89 April 20, 2021


[Ioffe and Szegedy, 2015]
Batch Normalization

FC → BN → tanh → FC → BN → tanh → ...

- Makes deep networks much easier to train!
- Improves gradient flow
- Allows higher learning rates, faster convergence
- Networks become more robust to initialization
- Acts as regularization during training
- Zero overhead at test-time: can be fused with conv!
- Behaves differently during training and testing: this is a very common source of bugs!

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 90 April 20, 2021


Batch Normalization for ConvNets
Batch Normalization for fully-connected networks:
  x: N × D (normalize over N)
  𝞵,𝝈: 1 × D
  ɣ,β: 1 × D
  y = ɣ(x-𝞵)/𝝈+β

Batch Normalization for convolutional networks (Spatial Batchnorm, BatchNorm2D):
  x: N×C×H×W (normalize over N, H, W)
  𝞵,𝝈: 1×C×1×1
  ɣ,β: 1×C×1×1
  y = ɣ(x-𝞵)/𝝈+β

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - April 20, 2021
Layer Normalization

Batch Normalization for fully-connected networks:
  x: N × D (normalize over N)
  𝞵,𝝈: 1 × D
  ɣ,β: 1 × D
  y = ɣ(x-𝞵)/𝝈+β

Layer Normalization for fully-connected networks (same behavior at train and test! Can be used in recurrent networks):
  x: N × D (normalize over D)
  𝞵,𝝈: N × 1
  ɣ,β: 1 × D
  y = ɣ(x-𝞵)/𝝈+β
Ba, Kiros, and Hinton, “Layer Normalization”, arXiv 2016

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - April 20, 2021
Instance Normalization
Batch Normalization for convolutional networks:
  x: N×C×H×W (normalize over N, H, W)
  𝞵,𝝈: 1×C×1×1
  ɣ,β: 1×C×1×1
  y = ɣ(x-𝞵)/𝝈+β

Instance Normalization for convolutional networks (same behavior at train / test!):
  x: N×C×H×W (normalize over H, W)
  𝞵,𝝈: N×C×1×1
  ɣ,β: 1×C×1×1
  y = ɣ(x-𝞵)/𝝈+β
Ulyanov et al, Improved Texture Networks: Maximizing Quality and Diversity in Feed-forward Stylization and Texture Synthesis, CVPR 2017

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - April 20, 2021
Comparison of Normalization Layers

Wu and He, “Group Normalization”, ECCV 2018

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - April 20, 2021
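The four normalization layers differ only in which axes the statistics are averaged over; a NumPy sketch for an activation tensor x of shape (N, C, H, W) (G is a hypothetical group count that must divide C):

import numpy as np

N, C, H, W, G = 8, 16, 32, 32, 4
x = np.random.randn(N, C, H, W)

bn_mu = x.mean(axis=(0, 2, 3), keepdims=True)   # BatchNorm: over N, H, W -> (1, C, 1, 1)
ln_mu = x.mean(axis=(1, 2, 3), keepdims=True)   # LayerNorm: over C, H, W -> (N, 1, 1, 1)
in_mu = x.mean(axis=(2, 3), keepdims=True)      # InstanceNorm: over H, W -> (N, C, 1, 1)

xg = x.reshape(N, G, C // G, H, W)              # GroupNorm: over each group of channels
gn_mu = xg.mean(axis=(2, 3, 4), keepdims=True)  # -> (N, G, 1, 1, 1)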
Group Normalization

Wu and He, “Group Normalization”, ECCV 2018

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - April 20, 2021
Transfer learning

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - 96 April 20, 2021
“You need a lot of data if you want to train/use CNNs”

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - 97 April 20, 2021
“You need a lot of data if you want to train/use CNNs” (BUSTED)
Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - 98 April 20, 2021
Transfer Learning with CNNs

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - 99 April 20, 2021
Transfer Learning with CNNs

AlexNet:
64 x 3 x 11 x 11

(More on this in Lecture 13)


Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - 100 April 20, 2021
Transfer Learning with CNNs

Test image L2 Nearest neighbors in feature space

(More on this in Lecture 13)


Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - 101 April 20, 2021
Donahue et al, “DeCAF: A Deep Convolutional Activation
Feature for Generic Visual Recognition”, ICML 2014

Transfer Learning with CNNs Razavian et al, “CNN Features Off-the-Shelf: An


Astounding Baseline for Recognition”, CVPR Workshops
2014

1. Train on Imagenet
FC-1000
FC-4096
FC-4096

MaxPool
Conv-512
Conv-512

MaxPool
Conv-512
Conv-512

MaxPool
Conv-256
Conv-256

MaxPool
Conv-128
Conv-128

MaxPool
Conv-64
Conv-64

Image

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - 102 April 20, 2021
Donahue et al, “DeCAF: A Deep Convolutional Activation
Feature for Generic Visual Recognition”, ICML 2014

Transfer Learning with CNNs Razavian et al, “CNN Features Off-the-Shelf: An


Astounding Baseline for Recognition”, CVPR Workshops
2014

1. Train on Imagenet: the full network (Image → Conv-64, Conv-64, MaxPool, Conv-128, Conv-128, MaxPool, Conv-256, Conv-256, MaxPool, Conv-512, Conv-512, MaxPool, Conv-512, Conv-512, MaxPool, FC-4096, FC-4096, FC-1000).

2. Small Dataset (C classes): same network, but replace FC-1000 with FC-C. Reinitialize this last layer and train it; freeze all the layers below.

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - 103 April 20, 2021
Donahue et al, “DeCAF: A Deep Convolutional Activation
Feature for Generic Visual Recognition”, ICML 2014

Transfer Learning with CNNs Razavian et al, “CNN Features Off-the-Shelf: An


Astounding Baseline for Recognition”, CVPR Workshops
2014

1. Train on Imagenet: the full network (as above).

2. Small Dataset (C classes): replace FC-1000 with FC-C. Reinitialize this last layer and train it; freeze all the layers below.

Finetuned from AlexNet

Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - 104 April 20, 2021
Donahue et al, “DeCAF: A Deep Convolutional Activation
Feature for Generic Visual Recognition”, ICML 2014

Transfer Learning with CNNs Razavian et al, “CNN Features Off-the-Shelf: An


Astounding Baseline for Recognition”, CVPR Workshops
2014

1. Train on Imagenet: the full network (as above).

2. Small Dataset (C classes): replace FC-1000 with FC-C. Reinitialize this last layer and train it; freeze all the layers below.

3. Bigger dataset: with a bigger dataset, train more layers. Reinitialize FC-C and train the top FC layers as well; freeze only the lower conv layers. Use a lower learning rate when finetuning; 1/10 of the original LR is a good starting point.

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - 105 April 20, 2021
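A sketch of steps 2 and 3 in PyTorch, assuming a torchvision ResNet-18 as the pretrained backbone (the slide uses a VGG-style network; the recipe is the same): swap the last layer for a fresh C-way classifier, freeze everything else, then unfreeze more layers and fine-tune with a learning rate around 1/10 of the original.

import torch
import torch.nn as nn
import torchvision

C = 10                                                 # number of new classes
model = torchvision.models.resnet18(pretrained=True)   # 1. trained on ImageNet

# 2. Small dataset: freeze the pretrained layers, reinitialize the last layer.
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, C)   # new head, trained from scratch

# 3. Bigger dataset: unfreeze more layers and finetune with a smaller LR.
for p in model.layer4.parameters():
    p.requires_grad = True
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=1e-3, momentum=0.9)   # roughly 1/10 of a typical from-scratch LR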
The layers of the pretrained network (Image → Conv-64 → ... → FC-4096 → FC-1000) go from more generic (bottom) to more specific (top).

                      | very similar dataset | very different dataset
very little data      | ?                    | ?
quite a lot of data   | ?                    | ?

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - 106 April 20, 2021
The layers of the pretrained network (Image → Conv-64 → ... → FC-4096 → FC-1000) go from more generic (bottom) to more specific (top).

                      | very similar dataset               | very different dataset
very little data      | Use Linear Classifier on top layer | ?
quite a lot of data   | Finetune a few layers              | ?

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - 107 April 20, 2021
The layers of the pretrained network (Image → Conv-64 → ... → FC-4096 → FC-1000) go from more generic (bottom) to more specific (top).

                      | very similar dataset               | very different dataset
very little data      | Use Linear Classifier on top layer | You're in trouble... Try linear classifier from different stages
quite a lot of data   | Finetune a few layers              | Finetune a larger number of layers

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - 108 April 20, 2021
Transfer learning with CNNs is pervasive…
(it’s the norm, not an exception)
Object Detection (Fast R-CNN)
Image Captioning: CNN + RNN

Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015. Reproduced with permission.
Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015. Figure copyright IEEE, 2015. Reproduced for educational purposes.

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - 109 April 20, 2021
Transfer learning with CNNs is pervasive…
(it’s the norm, not an exception)
Object Detection (Fast R-CNN): CNN pretrained on ImageNet
Image Captioning: CNN + RNN

Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015. Reproduced with permission.
Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015. Figure copyright IEEE, 2015. Reproduced for educational purposes.

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - 110 April 20, 2021
Transfer learning with CNNs is pervasive…
(it’s the norm, not an exception)
Object Detection (Fast R-CNN): CNN pretrained on ImageNet
Image Captioning: CNN + RNN, with the CNN pretrained on ImageNet and word vectors pretrained with word2vec

Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015. Reproduced with permission.
Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015. Figure copyright IEEE, 2015. Reproduced for educational purposes.

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - 111 April 20, 2021
Transfer learning with CNNs is pervasive…
(it’s the norm, not an exception)

1. Train CNN on ImageNet
2. Fine-Tune (1) for object detection on Visual Genome
3. Train BERT language model on lots of text
4. Combine (2) and (3), train for joint image / language modeling
5. Fine-tune (4) for image captioning, visual question answering, etc.

Zhou et al, “Unified Vision-Language Pre-Training for Image Captioning and VQA” CVPR 2020 Krishna et al, “Visual genome: Connecting language and vision using crowdsourced dense image annotations” IJCV 2017
Figure copyright Luowei Zhou, 2020. Reproduced with permission. Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” ArXiv 2018

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - 112 April 20, 2021
Transfer learning with CNNs -
Architecture matters

We will discuss different architectures in detail in two lectures

Girshick, “The Generalized R-CNN Framework for Object Detection”, ICCV 2017 Tutorial on Instance-Level Visual Recognition

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - 113 April 20, 2021
Transfer learning with CNNs is pervasive…
But recent results show it might not always be necessary!
Training from scratch can work just as
well as training from a pretrained
ImageNet model for object detection

But it takes 2-3x as long to train.

They also find that collecting more data is better than finetuning on a related task.

He et al, “Rethinking ImageNet Pre-training”, ICCV 2019


Figure copyright Kaiming He, 2019. Reproduced with permission.

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - 114 April 20, 2021
Takeaway for your projects and beyond:

Source: AI & Deep Learning Memes For Back-propagated Poets

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - 115 April 20, 2021
Takeaway for your projects and beyond:
Have some dataset of interest but it has < ~1M images?

1. Find a very large dataset that has similar data, train a big ConvNet there
2. Transfer learn to your dataset
Deep learning frameworks provide a “Model Zoo” of pretrained
models so you don’t need to train your own

TensorFlow: https://github.com/tensorflow/models
PyTorch: https://github.com/pytorch/vision

Fei-Fei Li & Ranjay Krishna & Danfei Xu Lecture 7 - 116 April 20, 2021
Summary TLDRs
We looked in detail at:

- Activation Functions (use ReLU)


- Data Preprocessing (images: subtract mean)
- Weight Initialization (use Xavier/He init)
- Batch Normalization (use this!)
- Transfer learning (use this if you can!)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 117 April 20, 2021
Next time:
Training Neural Networks, Part 2
- Parameter update schemes
- Learning rate schedules
- Gradient checking
- Regularization (Dropout etc.)
- Babysitting learning
- Evaluation (Ensembles etc.)
- Hyperparameter Optimization
- Transfer learning / fine-tuning

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 7 - 118 April 20, 2021
