lecture_6_part_2
Where we are now...
CNN Architectures
Where we are now...
Learning network parameters through optimization
Where we are now...
Mini-batch SGD
Loop:
1. Sample a batch of data
2. Forward prop it through the graph (network), get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
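To make the loop concrete, here is a minimal PyTorch sketch of mini-batch SGD; the two-layer model and the random data are toy placeholders, not the course's code:

```python
import torch
import torch.nn as nn

X = torch.randn(1000, 32)            # fake inputs
y = torch.randint(0, 10, (1000,))    # fake labels

model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(100):
    idx = torch.randint(0, X.shape[0], (64,))   # 1. sample a batch of data
    scores = model(X[idx])                       # 2. forward prop it through the network...
    loss = loss_fn(scores, y[idx])               #    ...and get the loss
    optimizer.zero_grad()
    loss.backward()                              # 3. backprop to calculate the gradients
    optimizer.step()                             # 4. update the parameters using the gradient
```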
Today: Training Neural Networks
Overview
1. One-time setup: activation functions, preprocessing, weight initialization, regularization, gradient checking
Activation Functions
Sigmoid, tanh, ReLU, Leaky ReLU, ELU, Maxout
Activation Functions: Sigmoid
σ(x) = 1 / (1 + e^(-x))
- Squashes numbers to range [0,1]
- Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron
3 problems:
1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered
3. exp() is a bit compute expensive
[Figure: an input x passing through a sigmoid gate in a computational graph — when x saturates (very negative or very positive), the local gradient is near zero and the gradient is “killed”]
Activation Functions: tanh(x)
- Squashes numbers to range [-1,1]
- Zero-centered (nice)
- Still kills gradients when saturated :(
Activation Functions: ReLU (Rectified Linear Unit) [Krizhevsky et al., 2012]
- Computes f(x) = max(0, x)
- Does not saturate (in + region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)
[Figure: an input x passing through a ReLU gate]
[Figure: a data cloud with an “active ReLU” hyperplane that intersects the data and a “dead ReLU” that does not — a dead ReLU will never activate and therefore never update]
Activation Functions: Leaky ReLU [Maas et al., 2013] [He et al., 2015]
- f(x) = max(0.01x, x)
- Does not saturate
- Will not “die” like ReLU
Activation Functions: GELU (Gaussian Error Linear Unit) [Hendrycks et al., 2016]
- Computes f(x) = x*Φ(x), where Φ(x) is the CDF of the standard normal distribution
Figure sources: https://en.wikipedia.org/wiki/Normal_distribution, https://en.m.wikipedia.org/wiki/File:Cumulative_distribution_function_for_normal_distribution,_mean_0_and_sd_1.png, https://en.m.wikipedia.org/wiki/File:ReLU_and_GELU.svg
TLDR: In practice:
- Use ReLU. Be careful with your learning rates
- Try out Leaky ReLU / ELU / SELU / GELU if you need to squeeze out that last 0.1%
- Don’t use sigmoid or tanh
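For reference, a minimal NumPy sketch of the activation functions discussed above (the GELU line uses the common tanh approximation; none of this is the slides' code):

```python
import numpy as np

def sigmoid(x):                  # squashes to [0, 1]; saturates for large |x|
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                     # squashes to [-1, 1]; zero-centered but still saturates
    return np.tanh(x)

def relu(x):                     # max(0, x); does not saturate in the + region
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):   # small negative slope, so units do not "die"
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):           # smooth negative saturation regime
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def gelu(x):                     # x * Phi(x), tanh approximation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))
```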
Data Preprocessing
TLDR: In practice for images: center only
e.g. consider CIFAR-10 example with [32,32,3] images
- Subtract the mean image (e.g. AlexNet) (mean image = [32,32,3] array)
- Subtract per-channel mean (e.g. VGGNet) (mean along each channel = 3 numbers)
- Subtract per-channel mean and divide by per-channel std (e.g. ResNet and beyond) (mean and std along each channel = 3 numbers each)
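A small NumPy sketch of the three options on a batch of [N, 32, 32, 3] images (the random array is just a stand-in for CIFAR-10):

```python
import numpy as np

X_train = np.random.rand(5000, 32, 32, 3).astype(np.float32)  # stand-in for CIFAR-10

# AlexNet-style: subtract the mean image ([32, 32, 3] array)
mean_image = X_train.mean(axis=0)
X_centered = X_train - mean_image

# VGG-style: subtract the per-channel mean (3 numbers)
channel_mean = X_train.mean(axis=(0, 1, 2))
X_centered = X_train - channel_mean

# ResNet-style: subtract per-channel mean and divide by per-channel std
channel_std = X_train.std(axis=(0, 1, 2))
X_normalized = (X_train - channel_mean) / channel_std
```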
Weight Initialization
- Q: What happens when W = constant init is used?
- A: Every neuron computes the same output and gets the same gradient update, so all neurons stay identical — constant initialization fails to break symmetry.
- First idea: Small random numbers (Gaussian with zero mean and 1e-2 standard deviation)
- Works ~okay for small networks, but problems with deeper networks.
Weight Initialization: Activation statistics
Forward pass for a 6-layer net with hidden size 4096.
All activations tend to zero for deeper network layers.
Q: What do the gradients dL/dW look like?
A: All zero, no learning =(
Weight Initialization: Activation statistics
Increase std of initial weights from 0.01 to 0.05.
All activations saturate.
Q: What do the gradients look like?
A: Local gradients all zero, no learning =(
Weight Initialization: “Xavier” Initialization
“Xavier” initialization: std = 1/sqrt(Din)
“Just right”: Activations are nicely scaled for all layers!
Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTATS 2010
Weight Initialization: What about ReLU?
Change from tanh to ReLU: Xavier assumes a zero-centered activation function, so with ReLU the activations collapse toward zero again.
Weight Initialization: Kaiming / MSRA Initialization
ReLU correction: std = sqrt(2 / Din)
“Just right”: Activations are nicely scaled for all layers!
He et al, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, ICCV 2015
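The activation-statistics experiments above can be reproduced with a short NumPy forward pass; this sketch (not the slides' code) compares the small-random, Xavier, and Kaiming initializations on a 6-layer net with hidden size 4096:

```python
import numpy as np

def forward_stats(init, nonlin, num_layers=6, dim=4096):
    x = np.random.randn(16, dim)                      # a batch of 16 random inputs
    for layer in range(num_layers):
        W = init(dim, dim)
        x = nonlin(x.dot(W))
        print(f"layer {layer + 1}: std of activations = {x.std():.4f}")

tanh = np.tanh
relu = lambda a: np.maximum(0, a)

forward_stats(lambda Din, Dout: 0.01 * np.random.randn(Din, Dout), tanh)              # activations collapse toward zero
forward_stats(lambda Din, Dout: np.random.randn(Din, Dout) / np.sqrt(Din), tanh)      # Xavier: nicely scaled
forward_stats(lambda Din, Dout: np.random.randn(Din, Dout) * np.sqrt(2 / Din), relu)  # Kaiming: nicely scaled for ReLU
```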
Proper initialization is an ongoing area of research…
Understanding the difficulty of training deep feedforward neural networks, Glorot and Bengio, 2010
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, Saxe et al, 2013
Random walk initialization for training very deep feedforward networks, Sussillo and Abbott, 2014
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, Frankle and Carbin, 2019
Training vs. Testing Error
Beyond Training Error
Early Stopping: Always do this
[Plots: training loss vs. iteration, and train/val accuracy vs. iteration]
Stop training the model when accuracy on the validation set decreases.
Or train for a long time, but always keep track of the model snapshot that worked best on val.
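A minimal sketch of the “keep the best snapshot” variant; train_one_epoch, evaluate, and params are stand-ins for your real training code, not an actual API:

```python
import copy
import numpy as np

def train_one_epoch(params):   # stand-in for one epoch of training
    pass

def evaluate(params):          # stand-in for measuring accuracy on the validation set
    return np.random.rand()

params = {"W": np.random.randn(10, 10)}   # stand-in for the model's parameters
best_val_acc, best_snapshot = 0.0, None

for epoch in range(20):
    train_one_epoch(params)
    val_acc = evaluate(params)
    if val_acc > best_val_acc:             # keep the snapshot that worked best on val
        best_val_acc = val_acc
        best_snapshot = copy.deepcopy(params)

params = best_snapshot                     # use the best snapshot, not the last one
```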
Model Ensembles
How to improve single-model performance?
Regularization
Regularization: Add term to loss
L = (1/N) Σ_i L_i + λ R(W)
In common use:
- L2 regularization (weight decay): R(W) = Σ_k Σ_l W_{k,l}²
- L1 regularization: R(W) = Σ_k Σ_l |W_{k,l}|
- Elastic net (L1 + L2): R(W) = Σ_k Σ_l (β W_{k,l}² + |W_{k,l}|)
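In code, the L2 term is either added to the loss explicitly or applied as weight decay inside the optimizer; a PyTorch sketch (the tiny linear model, random batch, and λ value are arbitrary placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(32, 10)
lam = 1e-4                                   # regularization strength λ (a hyperparameter)

scores = model(torch.randn(8, 32))
data_loss = nn.CrossEntropyLoss()(scores, torch.randint(0, 10, (8,)))

# Option 1: add the L2 penalty to the data loss explicitly
l2 = sum((p ** 2).sum() for p in model.parameters())
loss = data_loss + lam * l2

# Option 2: let the optimizer apply L2-style weight decay during the update
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=lam)
```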
Regularization: Dropout
In each forward pass, randomly set some neurons to zero
Probability of dropping is a hyperparameter; 0.5 is common
Srivastava et al, “Dropout: A simple way to prevent neural networks from overfitting”, JMLR 2014
Regularization: Dropout
Example forward pass with a 3-layer network using dropout
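A NumPy sketch of such a forward pass (the layer sizes and random weights are placeholders, not the slide's code):

```python
import numpy as np

p = 0.5  # probability of keeping a unit active; higher = less dropout

W1, b1 = np.random.randn(4096, 1024) * 0.01, np.zeros(4096)
W2, b2 = np.random.randn(4096, 4096) * 0.01, np.zeros(4096)
W3, b3 = np.random.randn(10, 4096) * 0.01, np.zeros(10)

def train_step(x):
    H1 = np.maximum(0, W1.dot(x) + b1)
    U1 = np.random.rand(*H1.shape) < p   # first dropout mask
    H1 *= U1                             # drop!
    H2 = np.maximum(0, W2.dot(H1) + b2)
    U2 = np.random.rand(*H2.shape) < p   # second dropout mask
    H2 *= U2                             # drop!
    return W3.dot(H2) + b3

out = train_step(np.random.randn(1024))
```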
Regularization: Dropout
How can this possibly be a good idea?
Forces the network to have a redundant representation; prevents co-adaptation of features.
[Figure: features “has an ear”, “has a tail”, “is furry”, “has claws”, “mischievous look” feeding into a cat score, with some features randomly dropped (X)]
Regularization: Dropout
How can this possibly be a good idea?
Another interpretation: dropout is training a large ensemble of models (that share parameters); each binary mask is one model.
Dropout: Test time
Dropout makes the output random: y = f_W(x, z), where z is a random mask. At test time we want to “average out” the randomness, i.e. approximate the integral y = E_z[f(x, z)] = ∫ p(z) f(x, z) dz.
Consider a single neuron a with inputs x, y and weights w1, w2.
At test time we have: E[a] = w1·x + w2·y
During training we have: E[a] = ¼(w1·x + w2·y) + ¼(w1·x + 0·y) + ¼(0·x + w2·y) + ¼(0·x + 0·y) = ½(w1·x + w2·y)
So at test time, multiply the activation by the dropout probability (here p = 0.5) to match the training-time expectation.
Dropout Summary: drop (randomly zero) units in the forward pass during training; scale the activations at test time.
More common: “Inverted dropout”
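A sketch of inverted dropout, where the scaling by 1/p happens at training time so the test-time forward pass stays unchanged (toy two-layer net with random placeholder weights):

```python
import numpy as np

p = 0.5
W1, b1 = np.random.randn(4096, 1024) * 0.01, np.zeros(4096)
W2, b2 = np.random.randn(10, 4096) * 0.01, np.zeros(10)

def train_step(x):
    H1 = np.maximum(0, W1.dot(x) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p   # drop and scale by 1/p at train time
    H1 *= U1
    return W2.dot(H1) + b2

def predict(x):
    H1 = np.maximum(0, W1.dot(x) + b1)          # no extra scaling needed at test time
    return W2.dot(H1) + b2
```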
Regularization: A common pattern
Training: Add some kind of randomness
Testing: Average out randomness (sometimes approximate)
Example: Batch Normalization — training: normalize using stats from random minibatches; testing: use fixed stats to normalize.
Regularization: Data Augmentation
[Figure: load image and label (“cat”) → transform image → CNN → compute loss]
Data Augmentation
Horizontal Flips
Data Augmentation
Random crops and scales
Training: sample random crops / scales
ResNet:
1. Pick random L in range [256, 480]
2. Resize training image, short side = L
3. Sample random 224 x 224 patch
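A sketch of that training-time crop/scale sampling using Pillow and NumPy (the demo image at the bottom is random noise, just to make the snippet self-contained):

```python
import numpy as np
from PIL import Image

def random_resized_crop(img: Image.Image) -> np.ndarray:
    # 1. Pick random L in range [256, 480]
    L = np.random.randint(256, 481)
    # 2. Resize the training image so its short side equals L
    w, h = img.size
    scale = L / min(w, h)
    img = img.resize((int(round(w * scale)), int(round(h * scale))))
    # 3. Sample a random 224 x 224 patch
    W, H = img.size
    x0 = np.random.randint(0, W - 224 + 1)
    y0 = np.random.randint(0, H - 224 + 1)
    return np.asarray(img.crop((x0, y0, x0 + 224, y0 + 224)))

patch = random_resized_crop(Image.fromarray(np.uint8(np.random.rand(300, 400, 3) * 255)))
```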
Data Augmentation
Color Jitter
Simple: Randomize contrast and brightness
Data Augmentation
Get creative for your problem!
Examples of data augmentations:
- translation
- rotation
- stretching
- shearing
- lens distortions, … (go crazy)
Automatic Data Augmentation
Cubuk et al., “AutoAugment: Learning Augmentation Strategies from Data”, CVPR 2019
Regularization: Cutout
Training: Set random image regions to zero
Testing: Use full image
Examples:
Dropout
Batch Normalization
Data Augmentation
Cutout / Random Crop
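A minimal NumPy sketch of cutout — zero out one randomly placed square region per training image (the region size here is an arbitrary choice):

```python
import numpy as np

def cutout(img: np.ndarray, size: int = 8) -> np.ndarray:
    # img: [H, W, C]; zero out a size x size square at a random location (training only)
    H, W = img.shape[:2]
    y = np.random.randint(0, H)
    x = np.random.randint(0, W)
    out = img.copy()
    out[max(0, y - size // 2): y + size // 2, max(0, x - size // 2): x + size // 2] = 0
    return out

augmented = cutout(np.random.rand(32, 32, 3))
```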
Regularization - In practice
Training: Add random noise
Testing: Marginalize over the noise
Examples:
Dropout
Batch Normalization
Data Augmentation
Cutout / Random Crop
- Consider dropout for large fully-connected layers
- Batch normalization and data augmentation almost always a good idea
- Try cutout especially for small classification datasets
Choosing Hyperparameters
(without tons of GPUs)
Choosing Hyperparameters
Step 1: Check initial loss
Turn off weight decay and sanity-check the loss at initialization (e.g. log(C) for softmax with C classes).
Step 2: Overfit a small sample
Try to get 100% training accuracy on a small sample of training data (~5-10 minibatches); fiddle with architecture, learning rate, and weight initialization.
Step 3: Find LR that makes loss go down
Use the architecture from the previous step, use all training data, turn on small weight decay, find a learning rate that makes the loss drop significantly within ~100 iterations.
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer
Pick best models from Step 4, train them for longer (~10-20 epochs) with constant learning rate.
Step 6: Look at loss and accuracy curves
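A rough sketch of Step 3: try a few candidate learning rates and keep one whose loss drops clearly within ~100 iterations. The model and random data are toy placeholders, and the candidate LR list is arbitrary:

```python
import torch
import torch.nn as nn

X, y = torch.randn(512, 32), torch.randint(0, 10, (512,))

def make_model():
    return nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10))

for lr in [1e-1, 1e-2, 1e-3, 1e-4]:
    model, loss_fn = make_model(), nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=1e-4)  # small weight decay on
    start = end = None
    for it in range(100):
        idx = torch.randint(0, X.shape[0], (64,))
        loss = loss_fn(model(X[idx]), y[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
        if it == 0:
            start = loss.item()
        end = loss.item()
    print(f"lr={lr:.0e}: loss {start:.3f} -> {end:.3f}")   # pick an lr where the loss clearly drops
```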
[Plots: train and val accuracy vs. time]
- Accuracy still going up: you need to train longer.
- Huge train / val gap means overfitting! Increase regularization, get more data.
- No gap between train / val means underfitting: train longer, can use a bigger model.
Look at learning curves!
[Plots: training loss, and train / val accuracy]
Cross-validation
We develop “command centers” to visualize all our models training with different hyperparameters.
You can plot all your loss curves for different hyperparameters on a single plot
Don't look at accuracy or loss curves for too long!
Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer
Step 6: Look at loss and accuracy curves
Step 7: GOTO step 5
Random Search vs. Grid Search
“Random Search for Hyper-Parameter Optimization”, Bergstra and Bengio, 2012
[Figure: grid layout vs. random layout of hyperparameter samples over an important parameter (x-axis) and an unimportant parameter (y-axis)]
Illustration of Bergstra et al., 2012 by Shayne Longpre, copyright CS231n 2017
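A sketch of random search over learning rate and weight decay, sampling on a log scale; train_and_eval is a stand-in for “train ~1-5 epochs and return validation accuracy”, and the sampling ranges are arbitrary:

```python
import numpy as np

def train_and_eval(lr, weight_decay):
    # Stand-in: train briefly with these hyperparameters and return val accuracy.
    return np.random.rand()

results = []
for trial in range(20):
    lr = 10 ** np.random.uniform(-4, -1)   # sample the learning rate log-uniformly
    wd = 10 ** np.random.uniform(-6, -2)   # sample the weight decay log-uniformly
    results.append((train_and_eval(lr, wd), lr, wd))

for val_acc, lr, wd in sorted(results, reverse=True)[:5]:   # inspect the best trials
    print(f"val_acc={val_acc:.3f}  lr={lr:.2e}  wd={wd:.2e}")
```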
Summary TLDRs
We looked in detail at: activation functions (use ReLU), data preprocessing (center your data), weight initialization (use Xavier/Kaiming init), regularization (dropout, data augmentation, etc.), and how to choose hyperparameters.
In Lecture: Recap of Content + QA
Appendix – Slides from Previous Years of the Course
Activation Functions: Sigmoid
- Squashes numbers to range [0,1]
- Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron
3 problems:
Consider what happens when the input to a neuron is always positive...
For f(Σᵢ wᵢxᵢ + b), the local gradient on each wᵢ is xᵢ, which is always positive — so the sign of the gradient for all wᵢ is the same as the sign of the upstream scalar gradient!
Q: What can we say about the gradients on w?
A: Always all positive or all negative :( (For a single element! Minibatches help.)
[Figure: the two allowed gradient update directions in weight space relative to a hypothetical optimal w vector]
Activation Functions: Exponential Linear Units (ELU) [Clevert et al., 2015]
- f(x) = x if x > 0, α(exp(x) − 1) if x ≤ 0
- All benefits of ReLU
- Closer to zero-mean outputs
- Negative saturation regime (compared with Leaky ReLU) adds some robustness to noise
Activation Functions: Scaled Exponential Linear Units (SELU) [Klambauer et al., ICLR 2017]
- Scaled version of ELU that works better for deep networks
- “Self-normalizing” property; can train deep SELU networks without BatchNorm
Maxout “Neuron” [Goodfellow et al., 2013]
- Computes max(w₁ᵀx + b₁, w₂ᵀx + b₂)
- Generalizes ReLU and Leaky ReLU; operates in a linear regime, does not saturate, does not die
- Problem: doubles the number of parameters per neuron
Remember: Consider what happens when the input to a neuron is always positive...
Q: What can we say about the gradients on w?
A: Always all positive or all negative :( (this is also why you want zero-mean data!)
[Figure: the two allowed gradient update directions in weight space relative to a hypothetical optimal w vector]
Data Preprocessing
In practice, you may also see PCA and Whitening of the data
Data Preprocessing
Before normalization: classification loss very sensitive to changes in weight matrix; hard to optimize.
After normalization: less sensitive to small changes in weights; easier to optimize.
Xavier Initialization: Proof of Optimality
“Xavier” initialization: std = 1/sqrt(Din). “Just right”: activations are nicely scaled for all layers!
Let: y = x1w1 + x2w2 + ... + xDinwDin
Assume: Var(x1) = Var(x2) = … = Var(xDin), with the x's and w's independent, zero-mean, and identically distributed.
We want: Var(y) = Var(xi)
Then Var(y) = Var(x1w1 + ... + xDinwDin) = Din · Var(xi wi) = Din · Var(xi) · Var(wi),
so Var(y) = Var(xi) exactly when Var(wi) = 1/Din, i.e. std(wi) = 1/sqrt(Din).
Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTATS 2010
Data Augmentation
Color Jitter
Simple: Randomize contrast and brightness
More Complex:
1. Apply PCA to all [R, G, B] pixels in training set
2. Sample a “color offset” along principal component directions
3. Add offset to all pixels of a training image
Regularization: A common pattern
Training: Add random noise
Testing: Marginalize over the noise
Examples:
Dropout
Batch Normalization
Data Augmentation
Regularization: DropConnect
Training: Drop connections between neurons (set weights to 0)
Testing: Use all the connections
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Regularization: Fractional Pooling
Training: Use randomized pooling regions
Testing: Average predictions from several regions
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fractional Max Pooling
Regularization: Stochastic Depth
Training: Skip some layers in the network
Testing: Use all the layers
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fractional Max Pooling
Stochastic Depth
Regularization: Mixup
Training: Train on random blends of images
Testing: Use original images
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fractional Max Pooling
Stochastic Depth
Cutout / Random Crop
Mixup
Randomly blend the pixels of pairs of training images, e.g. 40% cat, 60% dog; the CNN target label is blended the same way (cat: 0.4, dog: 0.6).
Zhang et al, “mixup: Beyond Empirical Risk Minimization”, ICLR 2018
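A NumPy sketch of mixup on a single pair of examples; following Zhang et al., the blend weight is drawn from a Beta distribution (the α value and class indices below are arbitrary):

```python
import numpy as np

def mixup_pair(x1, y1, x2, y2, alpha=0.2):
    # y1, y2 are one-hot label vectors; lam is the blending weight
    lam = np.random.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2        # blend the pixels
    y = lam * y1 + (1 - lam) * y2        # blend the targets, e.g. 40% cat / 60% dog
    return x, y

cat_img, dog_img = np.random.rand(32, 32, 3), np.random.rand(32, 32, 3)
cat, dog = np.eye(10)[3], np.eye(10)[5]   # one-hot labels (arbitrary class indices)
x, y = mixup_pair(cat_img, cat, dog_img, dog)
```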
Transfer learning
“You need a lot of data if you want to train/use CNNs”?
Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014
1. Train on ImageNet
[Figure: VGG-style network, bottom to top — Image → Conv-64 ×2 → MaxPool → Conv-128 ×2 → MaxPool → Conv-256 ×2 → MaxPool → Conv-512 ×2 → MaxPool → Conv-512 ×2 → MaxPool → FC-4096 → FC-4096 → FC-1000]
Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014
2. Use the pretrained network on a small dataset: freeze the convolutional layers, reinitialize the final classifier layer, and train it.
3. With a bigger dataset, train more layers: reinitialize and fine-tune more of the top layers, not just the final classifier.
[Figure: copies of the VGG-style network with progressively more of the top layers reinitialized and trained]
Takeaway for your projects and beyond:
Have some dataset of interest but it has < ~1M images?
1. Find a very large dataset that has similar data, and train a big ConvNet there
2. Transfer learn to your dataset
Deep learning frameworks provide a “Model Zoo” of pretrained models so you don’t need to train your own:
TensorFlow: https://github.com/tensorflow/models
PyTorch: https://github.com/pytorch/vision
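A PyTorch/torchvision sketch of the small-dataset recipe — load a pretrained model, freeze the backbone, and train only a reinitialized final layer. It assumes a recent torchvision (the string-based weights argument), and num_classes is a placeholder:

```python
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1")  # pretrained on ImageNet

for param in model.parameters():          # freeze the pretrained backbone
    param.requires_grad = False

num_classes = 10                          # placeholder: C classes in your dataset
model.fc = nn.Linear(model.fc.in_features, num_classes)  # reinitialize the last layer

# Only the new layer's parameters get updated; use a lower LR if you later unfreeze more layers.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
```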
Summary
- Improve your training error:
- Optimizers
- Learning rate schedules
- Improve your test error:
- Regularization
- Choosing Hyperparameters