Optimization of Deep Neural Networks
Overview
Backpropagation and automatic differentiation allow us to optimize any function composed of differentiable blocks
⬣ No need to modify the learning algorithm!
⬣ The complexity of the function is only limited by computation and memory
[Figure: Input 𝑿 → Model → output 𝒑 → Loss Function 𝑳 = −𝐥𝐨𝐠 𝒑]
Depth is important:
⬣ Structure the model to represent an inherently compositional world
⬣ Theoretical evidence that it leads to parameter efficiency
⬣ Gentle dimensionality reduction (if done right)
[Figure: network with input layer, hidden layers 1–2, output layer]
Importance of Depth
There are still many design decisions that must be made:
⬣ Architecture
⬣ Data Considerations
⬣ Training and Optimization
⬣ Machine Learning Considerations
Architectural Considerations
[Figure: Input (Data) → Fully Connected Neural Network → Predictions]
Example Architectures
[Figure: two example architectures, one taking Data and one taking an Image as input, each producing Predictions]
Different architectures are suitable for different applications or types of input
Recurrent Neural Network
Example Architectures
As in traditional machine
learning, data is key:
⬣ Should we pre-process
the data?
Data Considerations
Even given a good neural network
architecture, we need a good optimization
algorithm to find good weights
⬣ What optimizer should we use?
Optimization Considerations
Machine Learning Considerations
⬣ Composition of non-linear layers enables complex transformations of the data

Sigmoid Function
⬣ Min: 0, Max: 1
⬣ Output is always positive
⬣ Saturates at both ends
⬣ Gradients vanish at both ends

Tanh Function
⬣ Min: -1, Max: 1
⬣ Centered
⬣ Saturates at both ends
⬣ Gradients vanish at both ends (derivative is always positive)

Rectified Linear Unit (ReLU)
⬣ Min: 0, Max: Infinity
⬣ Gradients
⬣ 𝟎 if 𝐱 ≤ 𝟎 (dead ReLU)

Leaky ReLU
⬣ 𝒉ℓ = 𝒎𝒂𝒙(𝜶𝒉ℓ−𝟏 , 𝒉ℓ−𝟏 )
⬣ No saturation
⬣ Gradients
⬣ No dead neuron
⬣ Learnable parameter 𝜶 (Parameterized ReLU)
Selecting a Non-Linearity
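As a concrete illustration (not from the slides), here is a minimal NumPy sketch of these non-linearities and their gradients; it shows the saturation of sigmoid/tanh and the dead-ReLU effect:

```python
# Minimal sketch of the non-linearities above and their gradients.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_sigmoid(x):
    s = sigmoid(x)
    return s * (1 - s)            # vanishes for large |x|, always positive

def grad_tanh(x):
    return 1 - np.tanh(x) ** 2    # vanishes for large |x|

def relu(x):
    return np.maximum(0.0, x)

def grad_relu(x):
    return (x > 0).astype(float)  # exactly 0 for x <= 0 ("dead" units)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

def grad_leaky_relu(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)  # never exactly zero

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(grad_sigmoid(x))   # ~0 at both ends (saturation)
print(grad_relu(x))      # 0 for the negative inputs
print(grad_leaky_relu(x))
```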
A Poor Initialization
Common approach is small normally distributed random numbers
⬣ E.g. 𝑵(𝝁, 𝝈) where 𝝁 = 𝟎, 𝝈 = 𝟎.𝟎𝟏
Gaussian/Normal Initialization
Deeper networks (with many layers) are more sensitive to initialization
⬣ With a deep network, activations (outputs of nodes) get smaller
⬣ Standard deviation reduces significantly
⬣ Leads to small updates – smaller values multiplied by upstream gradients
⬣ Larger initial values lead to saturation
[Figure: distribution of activation values of a network with tanh non-linearities, for increasingly deep layers. From "Understanding the difficulty of training deep feedforward neural networks." AISTATS, 2010.]
Xavier Initialization
In practice, simpler versions perform empirically well:
𝑵(𝟎, 𝟏) ∗ 𝟏/𝒏𝒋
"Delving Deep into Rectifiers:Surpassing Human-Level Performance on ImageNet Classification“, ICCV, 2015.
Preprocessing
⬣ We can try to come up with a layer that can normalize the data across
the neural network
⬣ Given: A mini-batch of data [𝑩×𝑫] where 𝑩 is batch size
⬣ Compute mean and variance for each dimension 𝒅
From: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Sergey Ioffe, Christian Szegedy
𝒙̂𝒊 = (𝒙𝒊 − 𝝁𝑩) / √(𝝈²𝑩 + 𝝐)
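A minimal NumPy sketch of the normalization step above for a [𝑩×𝑫] mini-batch; the learnable scale/shift parameters and the running statistics used at test time are omitted:

```python
# Training-time batch normalization statistics only (no gamma/beta, no
# running averages).
import numpy as np

def batchnorm_forward(x, eps=1e-5):
    mu = x.mean(axis=0)                  # per-dimension mean over the batch
    var = x.var(axis=0)                  # per-dimension variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return x_hat

x = np.random.randn(32, 64) * 5 + 3      # B=32, D=64, shifted/scaled data
x_hat = batchnorm_forward(x)
print(x_hat.mean(axis=0)[:3], x_hat.std(axis=0)[:3])  # ~0 mean, ~1 std per dim
```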
[Figure: Input → Linear Layer → BN → Non-Linearity]
Where to Apply BN
Generalization of BN
Resource:
⬣ ML Explained - Normalization
Optimizers
Deep learning involves complex,
compositional, non-linear functions
Loss Landscape
It used to be thought that the existence of local minima is the main issue in optimization
Loss Landscape
⬣ We use a subset of the data at each iteration to calculate the loss (& gradients)
𝑳 = (𝟏/𝑴) ∑𝒊 𝑳(𝒇(𝒙𝒊, 𝑾), 𝒚𝒊)
⬣ This is an unbiased
estimator but can have
high variance
Noisy Gradients
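A small illustrative sketch (with made-up per-example losses) of why the mini-batch estimate is unbiased but noisy:

```python
# The mean over a random subset matches the full-data loss in expectation,
# but individual estimates fluctuate (variance = gradient noise).
import numpy as np

rng = np.random.default_rng(0)
losses = rng.exponential(scale=1.0, size=10_000)   # stand-in per-example losses
full_loss = losses.mean()

M = 32                                             # mini-batch size
estimates = [rng.choice(losses, size=M, replace=False).mean() for _ in range(1000)]
print("full-data loss:   ", full_loss)
print("mean of estimates:", np.mean(estimates))    # close to full loss (unbiased)
print("std of estimates: ", np.std(estimates))     # the "noise" in the gradients
```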
Several loss surface geometries
are difficult for optimization
Update Velocity (starts as 0, 𝜷 = 𝟎.𝟗𝟗):
𝒗𝒊 = 𝜷𝒗𝒊−𝟏 + 𝝏𝑳/𝝏𝒘𝒊−𝟏
⬣ Generalizes SGD (𝜷 = 𝟎)
Adding Momentum
⬣ Velocity term is an exponential moving average of the gradient
𝒗𝒊 = 𝜷𝒗𝒊−𝟏 + 𝝏𝑳/𝝏𝒘𝒊−𝟏
𝒗𝒊 = 𝜷(𝜷𝒗𝒊−𝟐 + 𝝏𝑳/𝝏𝒘𝒊−𝟐) + 𝝏𝑳/𝝏𝒘𝒊−𝟏
   = 𝜷²𝒗𝒊−𝟐 + 𝜷 𝝏𝑳/𝝏𝒘𝒊−𝟐 + 𝝏𝑳/𝝏𝒘𝒊−𝟏
Update Velocity (starts as 0):
𝒗𝒊 = 𝜷𝒗𝒊−𝟏 − 𝜶 𝝏𝑳/𝝏𝒘𝒊−𝟏

Nesterov Momentum: evaluate the gradient at the look-ahead point 𝒘̂𝒊−𝟏 = 𝒘𝒊−𝟏 + 𝜷𝒗𝒊−𝟏:
𝒗𝒊 = 𝜷𝒗𝒊−𝟏 + 𝝏𝑳/𝝏𝒘̂𝒊−𝟏
𝒘𝒊 = 𝒘𝒊−𝟏 − 𝜶𝒗𝒊
Nesterov Momentum
Resource:
https://medium.com/the-artificial-impostor/sgd-implementation-in-pytorch-4115bcb9f02c
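A minimal sketch of the momentum update above on a toy quadratic loss (plain NumPy, not the PyTorch implementation discussed in the linked post):

```python
# SGD with momentum: velocity accumulates an exponential moving average of
# the gradients, then the weights move along the velocity.
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.01, beta=0.9):
    v = beta * v + grad      # update velocity
    w = w - lr * v           # update weights
    return w, v

# Toy loss L(w) = 0.5 * ||w||^2, so the gradient is simply w.
w = np.array([5.0, -3.0])
v = np.zeros_like(w)         # velocity starts as 0
for _ in range(200):
    grad = w
    w, v = sgd_momentum_step(w, v, grad)
print(w)                     # converges toward the minimum at the origin
```

In PyTorch this corresponds roughly to `torch.optim.SGD(params, lr=..., momentum=0.9)`, with `nesterov=True` for the Nesterov variant (PyTorch's exact formulation differs slightly, as the linked post explains).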
⬣ Various mathematical ways to characterize the loss landscape
Condition Number
Per-Parameter Learning Rate
Adagrad
Solution: Keep a moving average of squared gradients!
𝑮𝒊 = 𝜷𝑮𝒊−𝟏 + (𝟏 − 𝜷)(𝝏𝑳/𝝏𝒘𝒊−𝟏)²
𝒘𝒊 = 𝒘𝒊−𝟏 − 𝜶/√(𝑮𝒊 + 𝝐) ∗ 𝝏𝑳/𝝏𝒘𝒊−𝟏
Does not saturate the learning rate
RMSProp
Maintains both first and second moment statistics for gradients:
𝒗𝒊 = 𝜷𝟏𝒗𝒊−𝟏 + (𝟏 − 𝜷𝟏) 𝝏𝑳/𝝏𝒘𝒊−𝟏
𝒘𝒊 = 𝒘𝒊−𝟏 − 𝜶𝒗𝒊/√(𝑮𝒊 + 𝝐)
But unstable in the beginning (one or both of the moments will be tiny values)
Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015
Adam
Solution: Time-varying bias correction
𝒗𝒊 = 𝜷𝟏𝒗𝒊−𝟏 + (𝟏 − 𝜷𝟏) 𝝏𝑳/𝝏𝒘𝒊−𝟏
𝑮𝒊 = 𝜷𝟐𝑮𝒊−𝟏 + (𝟏 − 𝜷𝟐)(𝝏𝑳/𝝏𝒘𝒊−𝟏)²
Bias-corrected moments (from the Adam paper): 𝒗̂𝒊 = 𝒗𝒊/(𝟏 − 𝜷𝟏^𝒊), 𝑮̂𝒊 = 𝑮𝒊/(𝟏 − 𝜷𝟐^𝒊), with update 𝒘𝒊 = 𝒘𝒊−𝟏 − 𝜶𝒗̂𝒊/√(𝑮̂𝒊 + 𝝐)
Typically 𝜷𝟏 = 𝟎.𝟗, 𝜷𝟐 = 𝟎.𝟗𝟗𝟗
Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015
Adam
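A minimal NumPy sketch of the Adam update with the bias correction described above (following Kingma and Ba, 2015); the toy loss and hyper-parameter values are illustrative:

```python
# Adam: first/second moment estimates with time-varying bias correction.
import numpy as np

def adam_step(w, v, G, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * grad          # first moment
    G = beta2 * G + (1 - beta2) * grad ** 2     # second moment
    v_hat = v / (1 - beta1 ** t)                # bias correction (t starts at 1)
    G_hat = G / (1 - beta2 ** t)
    w = w - lr * v_hat / (np.sqrt(G_hat) + eps)
    return w, v, G

w = np.array([5.0, -3.0])
v, G = np.zeros_like(w), np.zeros_like(w)       # moments start as 0
for t in range(1, 2001):
    grad = w                                    # toy loss L(w) = 0.5 * ||w||^2
    w, v, G = adam_step(w, v, G, grad, t)
print(w)                                        # approaches the origin
```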
Optimizers behave differently
depending on landscape
Behavior of Optimizers
First order optimization methods have learning rates
Theoretical results rely on an annealed learning rate
[Figure: training loss curves]
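The slides mention annealed learning rates; below is a small sketch of two common schedules (step decay and cosine annealing) written as plain functions, with illustrative constants that are not from the lecture:

```python
# Two common learning-rate annealing schedules.
import math

def step_decay(lr0, epoch, drop=0.1, every=30):
    # multiply the learning rate by `drop` every `every` epochs
    return lr0 * (drop ** (epoch // every))

def cosine_anneal(lr0, epoch, total_epochs):
    # smoothly decay from lr0 toward 0 over the course of training
    return 0.5 * lr0 * (1 + math.cos(math.pi * epoch / total_epochs))

for e in [0, 15, 30, 60, 90]:
    print(e, step_decay(0.1, e), round(cosine_anneal(0.1, e, 90), 4))
```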
L1 Regularization
𝑳 = |𝒚 − 𝑾𝒙𝒊|² + 𝝀|𝑾|
where |𝑾| is element-wise
Example regularizations:
⬣ L1/L2 on weights (encourage small values)
⬣ L2: 𝑳 = |𝒚 − 𝑾𝒙𝒊|² + 𝝀|𝑾|² (weight decay)
⬣ Elastic L1/L2: 𝑳 = |𝒚 − 𝑾𝒙𝒊|² + 𝜶|𝑾|² + 𝜷|𝑾|
Regularization
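A minimal PyTorch-style sketch of adding the L2 (weight decay) and L1 penalties above to a loss; the tiny linear model and coefficients are placeholders:

```python
# Add L2 and L1 penalties on the weights to a base loss.
import torch
import torch.nn as nn

def regularized_loss(base_loss, model, l2=1e-4, l1=0.0):
    reg = torch.tensor(0.0)
    for p in model.parameters():
        reg = reg + l2 * p.pow(2).sum() + l1 * p.abs().sum()
    return base_loss + reg

model = nn.Linear(10, 1)                     # stand-in model
x, y = torch.randn(8, 10), torch.randn(8, 1)
base = nn.functional.mse_loss(model(x), y)
loss = regularized_loss(base, model, l2=1e-4, l1=1e-5)
loss.backward()

# L2-only weight decay is also built into most optimizers, e.g.:
# torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
```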
[Figure: fully connected network with input layer, hidden layers 1–2, output layer]
Problem: Network can learn to rely strongly on a few features that work really well
⬣ May cause overfitting if not representative of test data
From: Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al.
Dropout Regularization
⬣ In practice, implement with a mask calculated each iteration
⬣ During testing, no nodes are dropped
[Figure: network with dropped hidden nodes; activations (𝒂𝟏𝟏, 𝒂𝟐𝟏, 𝒂𝟑𝟏, 𝒂𝟒𝟏) multiplied element-wise by a binary mask (𝟎, 𝟏, 𝟎, 𝟏)]
From: Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al.
Dropout Implementation
⬣ During training, each node has an expected 𝒑 ∗ 𝒇𝒂𝒏_𝒊𝒏 active input nodes
⬣ During test, all nodes are activated
⬣ Principle: Always try to have similar train and test-time input/output distributions!
Solution: During test time, scale outputs (or equivalently weights) by 𝒑
⬣ i.e. 𝑾𝒕𝒆𝒔𝒕 = 𝒑𝑾
⬣ Alternative: Scale by 𝟏/𝒑 at train time instead
From: Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al.
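A minimal NumPy sketch of dropout with the two equivalent scaling choices described above (scale by 𝒑 at test time, or by 𝟏/𝒑 at train time); 𝒑 here is the keep probability:

```python
# Dropout forward pass with a fresh mask each call.
import numpy as np

def dropout_forward(h, p=0.5, train=True, inverted=True):
    if not train:
        # test time: no nodes dropped; scale by p unless inverted dropout was used
        return h if inverted else p * h
    mask = (np.random.rand(*h.shape) < p).astype(h.dtype)  # keep with prob p
    out = h * mask
    if inverted:
        out = out / p        # scale by 1/p at train time instead
    return out

h = np.random.randn(4, 8)
print(dropout_forward(h, p=0.5, train=True))
print(dropout_forward(h, p=0.5, train=False))
```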
Interpretation: Dropout is training a large ensemble of networks (with shared parameters):
⬣ Each configuration is a network
⬣ Most are trained with 1 or 2 mini-
batches of data
From: Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al.
CutMix
Yun et al., CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features
Random Crop
From https://mxnet.apache.org/versions/1.5.0/tutorials/gluon/data_augmentation.html
Color Jitter
We can apply generic affine
transformations:
⬣ Translation
⬣ Rotation
⬣ Scale
⬣ Shear
Geometric Transformations
We can combine these transformations to add even more variety!
From https://mxnet.apache.org/versions/1.5.0/tutorials/gluon/data_augmentation.html
Combining Transformations
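A sketch of chaining such augmentations with torchvision's transforms API (the specific parameter values below are illustrative, not taken from the lecture or the linked tutorial):

```python
# Compose random crop, color jitter, and affine transformations.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomCrop(224, padding=16),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1),
                            scale=(0.9, 1.1), shear=10),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
# augmented = augment(pil_image)   # apply to a PIL image during training
```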
[Figure: CowMix — unlabelled images are mixed with a masked noise image using a CowMask m (mask proportion p, mean), producing a mixed image]
CowMix
From French et al., “Milking CowMask for Semi-Supervised Image Classification”
Other Variations
The Process of Training Neural Networks
⬣ Training deep neural networks is an art form!
⬣ Lots of things matter (together) – the key is to find a combination that works
⬣ Key principle: Monitoring everything to understand what is going on!
⬣ Loss and accuracy curves
⬣ Gradient statistics/characteristics
⬣ Other aspects of the computation graph
Sanity Checking
Change in loss indicates speed of learning:
⬣ Tiny loss change -> too small of a learning rate
⬣ Loss (and then weights) turn to NaNs -> too high of a learning rate
Other bugs can also cause this, e.g.:
⬣ Divide by zero
⬣ Forgetting the log!
In PyTorch, use autograd's detect anomaly to debug (see the sketch below)
[Figure: loss curves for "Learning Rate Too Low" and "Learning Rate Too High"]
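A small usage sketch of the anomaly detection mentioned above; running it makes the backward pass raise an error naming the operation whose gradient is NaN:

```python
# torch.autograd.detect_anomaly() checks backward outputs for NaNs and points
# back to the offending forward operation.
import torch

x = torch.tensor([-1.0, 4.0], requires_grad=True)

with torch.autograd.detect_anomaly():
    y = torch.sqrt(x)          # sqrt of a negative value -> NaN
    loss = y.sum()
    loss.backward()            # raises, naming the op with the NaN gradient
```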
Overfitting
Many hyper-parameters to tune!
⬣ Learning rate, weight decay crucial
⬣ Momentum, others more stable
⬣ Always tune hyper-parameters; even a good idea will fail un-tuned!
⬣ Start with coarser search:
⬣ E.g. learning rate of {0.1, 0.05, 0.03, 0.01, 0.003, 0.001, 0.0005, 0.0001}
⬣ Perform finer search around good values
Automated methods are OK, but intuition (or random) can do well given enough of a tuning budget
[Figure: Grid Layout vs. Random Layout over an important and an unimportant parameter. From Bergstra et al., "Random Search for Hyper-Parameter Optimization", JMLR, 2012.]
Hyper-Parameter Tuning
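A minimal sketch of random hyper-parameter search in the spirit of the slide: sample learning rate and weight decay log-uniformly and keep the best configuration; `train_and_evaluate` is a placeholder for an actual training run:

```python
# Random search over a log-uniform range of learning rate and weight decay.
import math
import random

def sample_config():
    return {
        "lr": 10 ** random.uniform(-4, -1),            # 1e-4 .. 1e-1
        "weight_decay": 10 ** random.uniform(-6, -3),  # 1e-6 .. 1e-3
    }

def random_search(train_and_evaluate, budget=20):
    best_cfg, best_val = None, -math.inf
    for _ in range(budget):
        cfg = sample_config()
        val = train_and_evaluate(**cfg)   # e.g. validation accuracy
        if val > best_val:
            best_cfg, best_val = cfg, val
    return best_cfg, best_val

# Dummy objective standing in for a real training run (peaks near lr = 1e-2):
dummy = lambda lr, weight_decay: -abs(math.log10(lr) + 2)
print(random_search(dummy, budget=20))
```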
Inter-dependence of Hyperparameters
Note that hyper-parameters and even module
selection are interdependent!
Examples:
⬣ Batch norm and dropout may not be needed together (and sometimes the combination is worse)
⬣ The learning rate should be changed
proportionally to batch size – increase
the learning rate for larger batch sizes
⬣ One interpretation: Gradients are
more reliable/smoother
Note that we are optimizing a loss function:
𝑳 = −𝐥𝐨𝐠 𝑷(𝒀 = 𝒚𝒊 | 𝑿 = 𝒙𝒊)
[Figure: relevant elements, false negatives, true negatives. From https://en.wikipedia.org/wiki/Receiver_operating_characteristic]
Resources