
Optimization of Deep Neural Networks
Overview
Backpropagation (automatic differentiation) allows us to optimize any function composed of differentiable blocks
⬣ No need to modify the learning algorithm!
⬣ The complexity of the function is only limited by computation and memory

[Diagram: Input X → Model → output p → Loss L = −log p]

The Power of Deep Learning


A network with two or more hidden layers is often considered a deep model

Depth is important:
⬣ Structure the model to represent an inherently compositional world
⬣ Theoretical evidence that it leads to parameter efficiency
⬣ Gentle dimensionality reduction (if done right)

[Figure: fully connected network with an input layer, hidden layers 1 and 2, and an output layer]

Importance of Depth
There are still many design decisions that must be made:
⬣ Architecture
⬣ Data Considerations
⬣ Training and Optimization
⬣ Machine Learning Considerations

Designing Deep Neural Networks


We must design the neural network architecture:

⬣ What modules (layers) should we use?

⬣ How should they be connected together?

⬣ Can we use our domain knowledge to add architectural biases?

Architectural Considerations
[Figure: example architectures – a fully connected neural network on vector data, a convolutional neural network on images, and a recurrent neural network on sequences, each mapping input to predictions]

Different architectures are suitable for different applications or types of input

Example Architectures
As in traditional machine learning, data is key:
⬣ Should we pre-process the data?
⬣ Should we normalize it?
⬣ Can we augment our data by adding noise or other perturbations?

Data Considerations
Even given a good neural network architecture, we need a good optimization algorithm to find good weights
⬣ What optimizer should we use? Different optimizers make different weight updates depending on the gradients
⬣ How should we initialize the weights?
⬣ What regularizers should we use?
⬣ What loss function is appropriate?

[Figure: loss surface with local minima and an optimizer trajectory]

Optimization Considerations
Machine Learning Considerations

The practice of machine learning is complex: for your particular application you have to trade off all of the considerations together
⬣ Trade-off between model capacity (e.g. measured by # of parameters) and amount of data
⬣ Adding appropriate biases based on knowledge of the domain
Architectural Considerations

Determining what modules to use, and how to connect them, is part of the architectural design
⬣ Guided by the type of data used and its characteristics
    ⬣ Understanding your data is always the first step!
⬣ Lots of data types (modalities) already have good architectures
    ⬣ Start with what others have discovered!
⬣ The flow of gradients is one of the key principles to use when analyzing layers

Designing the Architecture


⬣ Combination of linear and non-linear layers
⬣ A combination of only linear layers has the same representational power as one linear layer:

    w_1^T (w_2^T (w_3^T x)) = w_4^T x

⬣ Non-linear layers are crucial
    ⬣ Composition of non-linear layers enables complex transformations of the data

[Diagram: u = w^T x → p = 1/(1 + e^{−u}) → L = −log p]

Linear and Non-Linear Modules


Several aspects that we can analyze:
⬣ Min/Max
⬣ Correspondence between input & output statistics
⬣ Gradients
    ⬣ At initialization (e.g. small values)
    ⬣ At extremes
⬣ Computational complexity

Analysis of Non-Linear Function


Sigmoid: σ(x) = 1/(1 + e^{−x}), applied element-wise as h_ℓ = σ(h_{ℓ−1})

⬣ Min: 0, Max: 1
⬣ Output always positive
⬣ Saturates at both ends
⬣ Gradients
    ⬣ Vanish at both ends
    ⬣ Always positive
⬣ Computation: exponential term

Gradient flow through the layer (chain rule):

    ∂L/∂h_{ℓ−1} = (∂L/∂h_ℓ)(∂h_ℓ/∂h_{ℓ−1})        ∂L/∂W = (∂L/∂h_ℓ)(∂h_ℓ/∂W)

[Figure: sigmoid and its derivative]

Sigmoid Function
tanh: h_ℓ = tanh(h_{ℓ−1})

⬣ Min: −1, Max: 1
⬣ Centered
⬣ Saturates at both ends
⬣ Gradients
    ⬣ Vanish at both ends
    ⬣ Always positive
⬣ Still somewhat computationally heavy

[Figure: tanh and its derivative]

Tanh Function
ReLU: h_ℓ = max(0, h_{ℓ−1})

⬣ Min: 0, Max: infinity
⬣ Output always positive
⬣ No saturation on the positive end!
⬣ Gradients
    ⬣ 0 if x ≤ 0 (dead ReLU)
    ⬣ Constant otherwise (does not vanish)
⬣ Cheap to compute (max)

Rectified Linear Unit


Leaky ReLU: h_ℓ = max(α·h_{ℓ−1}, h_{ℓ−1})

⬣ Min: −infinity, Max: infinity
⬣ The slope α can be a learnable parameter!
⬣ No saturation
⬣ Gradients: no dead neurons
⬣ Still cheap to compute

Leaky ReLU
Selecting a Non-Linearity

Which non-linearity should you select?
⬣ Unfortunately, no one activation function is best for all applications
⬣ ReLU is the most common starting point
    ⬣ Sometimes leaky ReLU can make a big difference
⬣ Sigmoid is typically avoided unless clamping to values in [0, 1] is needed
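As a small illustrative sketch (not part of the slides), the saturation behavior described above can be checked numerically in PyTorch by probing gradients at extreme inputs:

```python
import torch

# Probe each activation at small and large inputs to see which gradients vanish.
for name, act in [("sigmoid", torch.sigmoid),
                  ("tanh", torch.tanh),
                  ("relu", torch.relu)]:
    x = torch.tensor([-10.0, 0.1, 10.0], requires_grad=True)
    act(x).sum().backward()
    print(name, x.grad)  # sigmoid/tanh gradients ~0 at +/-10; ReLU gradient stays 1 for x > 0
```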
Initialization
Initializing the Parameters
The parameters of our model must be initialized to something
⬣ Initialization is extremely important!
    ⬣ It determines how the statistics of outputs (given inputs) behave
    ⬣ It determines how well gradients flow in the beginning of training (important)
    ⬣ It could limit use of the full capacity of the model if done improperly
⬣ Initialization that is close to a good (local) minimum will converge faster and to a better solution
Initializing all weights to a constant value leads to a degenerate solution!
⬣ What happens to the weight updates when w_i = c for all i?
⬣ Each node has the same input from previous layers, so the gradients will be the same
⬣ As a result, all weights will be updated to the exact same values

[Figure: fully connected network with an input layer, hidden layers 1 and 2, and an output layer]

A Poor Initialization
A common approach is small normally distributed random numbers
⬣ E.g. w ~ N(μ, σ) where μ = 0, σ = 0.01
⬣ Small weights are preferred since no feature/input has prior importance
⬣ Keeps the model within the linear region of most activation functions

Gaussian/Normal Initialization
Deeper networks (with many layers) are more sensitive to initialization
⬣ With a deep network, activations (outputs of nodes) get smaller
    ⬣ The standard deviation reduces significantly with depth
⬣ This leads to small updates – smaller values multiplied by upstream gradients
⬣ Larger initial values instead lead to saturation

[Figure: distribution of activation values of a network with tanh non-linearities, for increasingly deep layers. From "Understanding the difficulty of training deep feedforward neural networks", AISTATS, 2010.]

Limitation of Small Weights


Ideally, we'd like to maintain the variance at the output to be similar to that of the input!
⬣ This condition leads to a simple initialization rule, sampling from a uniform distribution:

    w ~ Uniform( −√(6 / (n_j + n_{j+1})),  +√(6 / (n_j + n_{j+1})) )

⬣ where n_j is the fan-in (number of input nodes) and n_{j+1} is the fan-out (number of output nodes)

[Figure: distribution of activation values of a network with tanh non-linearities, for increasingly deep layers. From "Understanding the difficulty of training deep feedforward neural networks", AISTATS, 2010.]

Xavier Initialization
In practice, simpler versions perform empirically well:

    w ~ N(0, 1) · √(1 / n_j)

⬣ This analysis holds for tanh or similar activations.
⬣ A similar analysis for ReLU activations leads to:

    w ~ N(0, 1) · √(1 / (n_j / 2))

"Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification", ICCV, 2015.

(Simpler) Xavier and Xavier2 Initialization
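A minimal sketch of these two rules for a fully connected layer in PyTorch (torch.nn.init also provides xavier_uniform_ and kaiming_normal_, which implement the same ideas):

```python
import math
import torch
import torch.nn as nn

def init_linear(layer: nn.Linear, activation: str = "tanh") -> None:
    """Initialize weights so the output variance roughly matches the input variance."""
    fan_in = layer.in_features
    if activation == "relu":
        std = math.sqrt(2.0 / fan_in)   # ReLU-adjusted ("Xavier2"/He-style) rule
    else:
        std = math.sqrt(1.0 / fan_in)   # simpler Xavier rule for tanh-like units
    nn.init.normal_(layer.weight, mean=0.0, std=std)
    nn.init.zeros_(layer.bias)

layer = nn.Linear(256, 128)
init_linear(layer, activation="relu")
```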


Summary

Key takeaway: Initialization matters!
⬣ It determines the activation (output) statistics, and therefore the gradient statistics
⬣ If gradients are small, no learning will occur and no improvement is possible!
⬣ It is important to reason about output/gradient statistics and analyze them for new layers and architectures
Normalization, Preprocessing, and Augmentation
Importance of Data

In deep learning, data drives the learning of features and the classifier

⬣ Its characteristics are therefore extremely important

⬣ Always understand your data!

⬣ The relationship between output statistics, layers such as non-linearities, and gradients is important
Just like initialization, normalization can improve gradient flow and learning

Typically, normalization methods apply:
⬣ Subtract the mean, divide by the standard deviation (most common)
    ⬣ This can be done per dimension
⬣ Whitening, e.g. through Principal Component Analysis (PCA) (not common)

[Figures: data after subtracting the mean and dividing by the standard deviation; data after whitening. From slides by Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n]

Preprocessing
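A small sketch of the most common option (per-dimension standardization), assuming the training data is an [N × D] tensor; the training-set statistics are reused for validation/test data:

```python
import torch

def standardize(train_x: torch.Tensor, eps: float = 1e-8):
    """Per-dimension zero-mean, unit-variance normalization using training statistics."""
    mean = train_x.mean(dim=0, keepdim=True)
    std = train_x.std(dim=0, keepdim=True)
    normalize = lambda x: (x - mean) / (std + eps)
    return normalize(train_x), normalize  # reuse `normalize` on validation/test splits

train_x = torch.randn(1000, 20) * 5.0 + 3.0   # synthetic, un-normalized data
train_norm, normalize = standardize(train_x)
```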
⬣ We can try to come up with a layer that can normalize the data across the neural network
⬣ Given: a mini-batch of data [B×D] where B is the batch size and D is the number of dimensions
⬣ Compute the mean and variance for each dimension d

From: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Sergey Ioffe, Christian Szegedy

Making Normalization a Layer


Normalize the data:

    x̂_i = (x_i − μ_B) / √(σ²_B + ε)

⬣ Note: this part does not involve any new parameters

From: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Sergey Ioffe, Christian Szegedy

Normalizing the Data


⬣ We can give the model flexibility through learnable parameters γ (scale) and β (shift):

    y_i = γ x̂_i + β

⬣ The network can learn to not normalize if necessary!

⬣ This layer is called a Batch Normalization (BN) layer

From: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Sergey Ioffe, Christian Szegedy

Learnable Scaling and Offset
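A minimal training-time sketch of such a layer with learnable γ and β (names are illustrative; PyTorch's built-in nn.BatchNorm1d additionally tracks running statistics for inference):

```python
import torch
import torch.nn as nn

class ToyBatchNorm(nn.Module):
    """Per-dimension batch normalization with learnable scale/shift (training-time math only)."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))   # scale
        self.beta = nn.Parameter(torch.zeros(dim))   # shift
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [B, D]
        mu = x.mean(dim=0)
        var = x.var(dim=0, unbiased=False)
        x_hat = (x - mu) / torch.sqrt(var + self.eps)     # normalize per dimension
        return self.gamma * x_hat + self.beta             # learnable scale and shift

bn = ToyBatchNorm(64)
out = bn(torch.randn(32, 64))
```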


Some Complexities of BN
⬣ During inference, stored means/variances calculated on the training set are used
⬣ Sufficient batch sizes must be used to get stable per-batch estimates during training
    ⬣ This is especially an issue when using multi-GPU or multi-machine training
    ⬣ Use torch.nn.SyncBatchNorm to estimate batch statistics in these settings
⬣ Normalization is especially important before non-linearities!
    ⬣ Very low/high values (un-normalized/imbalanced data) cause saturation

[Figure: Input → Linear Layer → BN → Non-Linearity]

Where to Apply BN
Generalization of BN

There are many variations of batch normalization
⬣ See the Convolutional Neural Network lectures for an example

Resource:
⬣ ML Explained - Normalization
Optimizers
Deep learning involves complex, compositional, non-linear functions

The loss landscape is extremely non-convex as a result

There is little direct theory and a lot of intuition/rules of thumb instead

⬣ Some insight can be gained via theory for simpler cases (e.g. convex settings)

Loss Landscape
It used to be thought that the existence of local minima is the main issue in optimization

There are other, more impactful issues:
⬣ Noisy gradient estimates
⬣ Saddle points
⬣ Ill-conditioned loss surface

[Figure: saddle point. From: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, Dauphin et al., 2014.]

Loss Landscape
⬣ We use a subset of the data at each iteration to calculate the loss (and gradients):

    L = (1/M) Σ_i L(f(x_i, W), y_i)

⬣ This is an unbiased estimator but can have high variance

⬣ This results in noisy steps in gradient descent

Noisy Gradients
Several loss surface geometries are difficult for optimization

Several types of minima: local minima, plateaus, saddle points

Saddle points are those where the gradients along orthogonal directions are zero
⬣ But they disagree (it's a min for one direction, a max for another)

[Figure: plateau and saddle point]

Loss Surface Geometry


⬣ Gradient descent takes a step in the steepest descent direction (negative gradient):

    w_i = w_{i−1} − α ∂L/∂w_{i−1}

⬣ Intuitive idea: imagine a ball rolling down the loss surface, and use momentum to pass flat surfaces

    v_i = β v_{i−1} + ∂L/∂w_{i−1}        (update velocity; starts at 0, β = 0.99)
    w_i = w_{i−1} − α v_i                (update weights)

⬣ Generalizes SGD (β = 0)

Adding Momentum
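A minimal sketch of this velocity/weight update on raw tensors (in practice torch.optim.SGD with momentum=... implements an equivalent rule):

```python
import torch

def sgd_momentum_step(w, grad, v, lr=0.01, beta=0.9):
    """One SGD+momentum step: accumulate an exponential average of gradients, then step."""
    v = beta * v + grad          # update velocity
    w = w - lr * v               # update weights
    return w, v

w = torch.randn(10)
v = torch.zeros_like(w)          # velocity starts at 0
grad = 2 * w                     # e.g. gradient of sum(w**2)
w, v = sgd_momentum_step(w, grad, v)
```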
⬣ The velocity term is an exponential moving average of the gradient:

    v_i = β v_{i−1} + ∂L/∂w_{i−1}
        = β (β v_{i−2} + ∂L/∂w_{i−2}) + ∂L/∂w_{i−1}
        = β² v_{i−2} + β ∂L/∂w_{i−2} + ∂L/∂w_{i−1}

⬣ There is a general class of accelerated gradient methods, with some theoretical analysis (under assumptions)

Accelerated Descent Methods


Equivalent formulation:

    v_i = β v_{i−1} − α ∂L/∂w_{i−1}        (update velocity; starts at 0)
    w_i = w_{i−1} + v_i                    (update weights)

Equivalent Momentum Update


Key idea: rather than combining the velocity with the current gradient, go along the velocity first and then calculate the gradient at the new point
⬣ We know the velocity is probably a reasonable direction

    ŵ_{i−1} = w_{i−1} − β v_{i−1}
    v_i = β v_{i−1} + ∂L/∂ŵ_{i−1}
    w_i = w_{i−1} − α v_i

[Figure: velocity and new gradient vectors]

Nesterov Momentum
Nesterov Momentum

Note there are several equivalent formulations across deep learning frameworks!

Resource:
https://medium.com/the-artificial-impostor/sgd-implementation-in-pytorch-4115bcb9f02c
⬣ There are various mathematical ways to characterize the loss landscape

⬣ If you liked Jacobians… meet the Hessian (second-order information, vs. the first-order gradient)

⬣ The Hessian gives us information about the curvature of the loss surface

Hessian and Loss Curvature


The condition number is the ratio of the largest and smallest eigenvalues
⬣ It tells us how different the curvature is along different dimensions

If this is high, SGD will make big steps in some dimensions and small steps in other dimensions

Second-order optimization methods divide steps by the curvature, but are expensive to compute

Condition Number
Per-Parameter Learning Rate

Idea: have a dynamic learning rate for each weight

Several flavors of optimization algorithms:
⬣ RMSProp
⬣ Adagrad
⬣ Adam
⬣ …

SGD can achieve similar results in many cases, but with much more tuning
Idea: use gradient statistics to reduce the learning rate across iterations

    G_i = G_{i−1} + (∂L/∂w_{i−1})²
    w_i = w_{i−1} − (α / (√G_i + ε)) ∂L/∂w_{i−1}

⬣ Denominator: sum of squared gradients over iterations
⬣ Directions with high curvature will have higher gradients, and the learning rate will reduce
⬣ As gradients are accumulated, the learning rate will go to zero

Duchi et al., "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization"

Adagrad
Solution: keep a moving average of the squared gradients!

    G_i = β G_{i−1} + (1 − β)(∂L/∂w_{i−1})²
    w_i = w_{i−1} − (α / (√G_i + ε)) ∂L/∂w_{i−1}

⬣ This does not saturate the learning rate

RMSProp first introduced in: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

RMSProp
⬣ Adam combines ideas from the above algorithms
⬣ It maintains both first and second moment statistics for the gradients:

    v_i = β₁ v_{i−1} + (1 − β₁) ∂L/∂w_{i−1}
    G_i = β₂ G_{i−1} + (1 − β₂)(∂L/∂w_{i−1})²
    w_i = w_{i−1} − α v_i / (√G_i + ε)

⬣ But this is unstable in the beginning (one or both of the moments will be tiny values)

Kingma and Ba, "Adam: A method for stochastic optimization", ICLR 2015

Adam
Solution: time-varying bias correction

    v_i = β₁ v_{i−1} + (1 − β₁) ∂L/∂w_{i−1}
    G_i = β₂ G_{i−1} + (1 − β₂)(∂L/∂w_{i−1})²
    v̂_i = v_i / (1 − β₁ᵗ)        Ĝ_i = G_i / (1 − β₂ᵗ)
    w_i = w_{i−1} − α v̂_i / (√Ĝ_i + ε)

⬣ Typically β₁ = 0.9, β₂ = 0.999
⬣ So early on, v̂_i is a small number divided by (1 − 0.9 = 0.1), resulting in more reasonable values (and Ĝ_i similarly larger)

Kingma and Ba, "Adam: A method for stochastic optimization", ICLR 2015

Adam
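A minimal single-tensor sketch mirroring the equations above (torch.optim.Adam is the standard implementation; the function and variable names here are illustrative):

```python
import torch

def adam_step(w, grad, v, G, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: first/second moment moving averages with bias correction."""
    v = beta1 * v + (1 - beta1) * grad          # first moment
    G = beta2 * G + (1 - beta2) * grad ** 2     # second moment
    v_hat = v / (1 - beta1 ** t)                # bias correction
    G_hat = G / (1 - beta2 ** t)
    w = w - lr * v_hat / (G_hat.sqrt() + eps)
    return w, v, G

w = torch.randn(10)
v, G = torch.zeros_like(w), torch.zeros_like(w)
for t in range(1, 101):                          # t starts at 1 for bias correction
    grad = 2 * w                                 # gradient of sum(w**2)
    w, v, G = adam_step(w, grad, v, G, t)
```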
Optimizers behave differently depending on the landscape
⬣ Different behaviors such as overshooting, stagnating, etc.
⬣ Plain SGD+Momentum can generalize better than adaptive methods, but requires more tuning
    ⬣ See: Luo et al., Adaptive Gradient Methods with Dynamic Bound of Learning Rate, ICLR 2019

From: https://mlfromscratch.com/optimizers-explained/#/

Behavior of Optimizers
First-order optimization methods have learning rates

Theoretical results rely on an annealed learning rate

Several schedules are typical:
⬣ Graduate student!
⬣ Step scheduler
⬣ Exponential scheduler
⬣ Cosine scheduler

[Figure: training loss curve. From: Leslie Smith, "Cyclical Learning Rates for Training Neural Networks"]

Learning Rate Schedules
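A short sketch of the step and cosine schedules using PyTorch's built-in schedulers (the model here is only a placeholder):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Step schedule: multiply the learning rate by 0.1 every 30 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
# Alternative cosine schedule:
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... one epoch of training: optimizer.zero_grad(); loss.backward(); optimizer.step() ...
    scheduler.step()                                      # anneal the learning rate once per epoch
```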


Regularization

Many standard regularization methods still apply!

L1 regularization:

    L = |y − Wx_i|² + λ|W|,   where |W| sums the absolute values of the weights element-wise

Example regularizations:
⬣ L1/L2 on weights (encourage small values)
⬣ L2: L = |y − Wx_i|² + λ|W|² (weight decay)
⬣ Elastic L1/L2: L = |y − Wx_i|² + α|W|² + β|W|

Regularization
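A small sketch of both options in PyTorch: L2 via the optimizer's weight_decay argument, and an L1 penalty added to the loss by hand (the model, data, and λ value are hypothetical):

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 1)
# L2 regularization (weight decay) is built into most optimizers.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

x, y = torch.randn(32, 20), torch.randn(32, 1)
lam = 1e-5                                   # L1 strength (illustrative value)
loss = nn.functional.mse_loss(model(x), y)
loss = loss + lam * sum(p.abs().sum() for p in model.parameters())  # add L1 penalty
optimizer.zero_grad()
loss.backward()
optimizer.step()
```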
[Figure: fully connected network with an input layer, hidden layers 1 and 2, and an output layer]

Problem: the network can learn to rely strongly on a few features that work really well
⬣ This may cause overfitting if those features are not representative of the test data

From: Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al.

Preventing Co-Adapted Features


An idea: for each node, keep its output with probability p (e.g. p = 0.5)
⬣ Activations of deactivated nodes are essentially zero
⬣ Choose whether to mask out each particular node at each iteration

[Figure: fully connected network with some hidden nodes dropped]

From: Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al.

Dropout Regularization
⬣ In practice, implement dropout with a mask calculated each iteration, e.g. multiplying the activations [a₁, a₂, a₃, a₄] element-wise by a binary mask [0, 1, 0, 1]
⬣ During testing, no nodes are dropped

From: Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al.

Dropout Implementation
⬣ During training, each node receives input from an expected p · fan_in active nodes
⬣ During test, all nodes are activated
⬣ Principle: always try to have similar train and test-time input/output distributions!

Solution: during test time, scale the outputs (or equivalently the weights) by p
⬣ i.e. W_test = p · W
⬣ Alternative: scale by 1/p at train time instead (see the sketch below)

From: Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al.

Inference with Dropout
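A minimal sketch of the train-time-scaling alternative ("inverted dropout"); torch.nn.Dropout behaves the same way, with its argument being the drop probability 1 − p:

```python
import torch

def dropout(x: torch.Tensor, p_keep: float = 0.5, train: bool = True) -> torch.Tensor:
    """Inverted dropout: mask activations at train time and rescale by 1/p_keep."""
    if not train:
        return x                                   # no dropout (and no scaling) at test time
    mask = (torch.rand_like(x) < p_keep).float()   # keep each unit with probability p_keep
    return x * mask / p_keep                       # rescale so expected activation matches test time

h = torch.randn(32, 128)
h_train = dropout(h, p_keep=0.5, train=True)
h_test = dropout(h, p_keep=0.5, train=False)
```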


Interpretation 1: the model should not rely too heavily on particular features
⬣ If it does, it has probability 1 − p of losing that feature in any iteration

Interpretation 2: we are training 2ⁿ networks:
⬣ Each dropout configuration is a different network
⬣ Most are trained with only 1 or 2 mini-batches of data

[Figure: fully connected network with an input layer, hidden layers 1 and 2, and an output layer]

From: Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al.

Why Dropout Works


Data Augmentation

Data augmentation – performing a range of transformations to the data
⬣ This essentially "increases" your dataset
⬣ Transformations should not change the meaning of the data (or the label has to be changed as well)

Simple example: image flipping

Data Augmentation: Motivation


Random crop
⬣ Take different crops during training
⬣ Can be used during inference too!

CutMix: Yun et al., CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features

Random Crop
Color Jitter

[Examples from https://mxnet.apache.org/versions/1.5.0/tutorials/gluon/data_augmentation.html]

Color Jitter
We can apply generic affine transformations:
⬣ Translation
⬣ Rotation
⬣ Scale
⬣ Shear

Geometric Transformations
We can combine these transformations to add even more variety!

[Examples from https://mxnet.apache.org/versions/1.5.0/tutorials/gluon/data_augmentation.html]

Combining Transformations
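A short sketch of such a combined pipeline using torchvision.transforms (assuming torchvision is available; the specific parameter values are illustrative, not from the slides):

```python
import torchvision.transforms as T

# A typical training-time augmentation pipeline combining the transformations above.
train_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                                 # image flipping
    T.RandomResizedCrop(224),                                      # random crop
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),   # color jitter
    T.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1), shear=10),
    T.ToTensor(),
])
# Usage: pass `train_transform` to a dataset, e.g. ImageFolder(root, transform=train_transform).
```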
[Figure: CowMix – unlabelled images are mixed (with noise or with another unlabelled image) according to a CowMask with mask proportion p, producing masked images for semi-supervised training]

From French et al., "Milking CowMask for Semi-Supervised Image Classification"

Other Variations
The Process of Training Neural Networks

⬣ Training deep neural networks is an art form!
⬣ Lots of things matter (together) – the key is to find a combination that works
⬣ Key principle: monitor everything to understand what is going on!
    ⬣ Loss and accuracy curves
    ⬣ Gradient statistics/characteristics
    ⬣ Other aspects of the computation graph

[Figure: loss surface with local minima and an optimizer trajectory]

The Process of Training


Proper Methodology

Always start with proper methodology!
⬣ It is not uncommon even in published papers to get this wrong

Separate data into training, validation, and test sets
⬣ Do not look at test set performance until you have decided on everything (including hyper-parameters)

Use cross-validation to decide on hyper-parameters if the amount of data is an issue
Check the bounds of your loss function
⬣ E.g. cross-entropy ranges over [0, ∞)
⬣ Check the initial loss at small random weight values
    ⬣ E.g. −log(p) for cross-entropy, where p = 0.5 for a balanced two-class problem

Another example: start without regularization and make sure the loss goes up when it is added

Key principle: simplify the dataset to make sure your model can properly (over)fit before applying regularization

[Figure: validation loss curve]

Sanity Checking
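A tiny sketch of that initial-loss check, using a hypothetical untrained C-class classifier (the expected cross-entropy at random initialization is about −log(1/C)):

```python
import math
import torch
import torch.nn as nn

num_classes = 10
model = nn.Linear(128, num_classes)                 # hypothetical untrained classifier
x = torch.randn(64, 128)
y = torch.randint(0, num_classes, (64,))

loss = nn.functional.cross_entropy(model(x), y)
print(f"initial loss {loss.item():.3f}, expected ~{math.log(num_classes):.3f}")
```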
The change in loss indicates the speed of learning:
⬣ Tiny loss change -> too small a learning rate
⬣ Loss (and then weights) turning to NaNs -> too high a learning rate

Other bugs can also cause this, e.g.:
⬣ Divide by zero
⬣ Forgetting the log!

In PyTorch, use autograd's anomaly detection (torch.autograd.set_detect_anomaly) to debug

[Figure: loss curves for a learning rate that is too low and one that is too high]

Loss and Not a Number (NaN)


⬣ Classic machine learning signs of under/overfitting still apply!
⬣ Over-fitting: validation loss/accuracy starts to get worse after a while
⬣ Under-fitting: validation loss very close to the training loss, or both are high
⬣ Note: you can have higher training loss than validation loss!
    ⬣ Validation loss has no regularization
    ⬣ Validation loss is typically measured at the end of an epoch

[Figure: training and validation loss curves]

Overfitting
Many hyper-parameters to tune!
⬣ Learning rate and weight decay are crucial
⬣ Momentum and others are more stable
⬣ Always tune hyper-parameters; even a good idea will fail un-tuned!

Start with a coarser search:
⬣ E.g. a learning rate of {0.1, 0.05, 0.03, 0.01, 0.003, 0.001, 0.0005, 0.0001}
⬣ Perform a finer search around good values

Automated methods are OK, but intuition (or random search) can do well given enough of a tuning budget

[Figure: grid layout vs. random layout over an important and an unimportant parameter. From: Bergstra et al., "Random Search for Hyper-Parameter Optimization", JMLR, 2012]

Hyper-Parameter Tuning
Inter-dependence of Hyper-parameters

Note that hyper-parameters and even module selection are interdependent!

Examples:
⬣ Batch norm and dropout may not be needed together (and sometimes the combination is worse)
⬣ The learning rate should be changed proportionally to the batch size – increase the learning rate for larger batch sizes
    ⬣ One interpretation: gradients are more reliable/smoother
Note that we are optimizing a loss function

What we actually care about is typically different metrics that we can't differentiate:
⬣ Accuracy
⬣ Precision/recall
⬣ Other specialized metrics

The relationship between the two can be complex!

[Figure: relevant vs. selected elements, showing true/false positives and negatives. From https://en.wikipedia.org/wiki/Precision_and_recall]

Relationship Between Loss and Other Metrics


⬣ Example: cross-entropy loss

    L = −log P(Y = y_i | X = x_i)

⬣ Accuracy is measured based on:

    argmax_i P(Y = y_i | X = x_i)

⬣ Since the correct class score only has to be slightly higher than the others, we can have flat loss curves but increasing accuracy!

[Figure: loss and accuracy curves]

Simple Example: Cross-Entropy and Accuracy


⬣ TPR/FPR – i.e. receiver operating characteristic (ROC) – curves represent the inherent tradeoff between the number of positive predictions and the correctness of predictions
⬣ Precision/recall curves are similar, but plot precision against recall instead
⬣ Definitions
    ⬣ True Positive Rate: TPR = tp / (tp + fn)
    ⬣ False Positive Rate: FPR = fp / (fp + tn)
    ⬣ Accuracy = (tp + tn) / (tp + tn + fp + fn)

[Figure: ROC curve. From https://en.wikipedia.org/wiki/Receiver_operating_characteristic]

Example: Precision/Recall or ROC Curves


⬣ Precision/recall curves represent the inherent tradeoff between the number of positive predictions and the correctness of predictions
⬣ Definitions
    ⬣ True Positive Rate: TPR = tp / (tp + fn)
    ⬣ False Positive Rate: FPR = fp / (fp + tn)
    ⬣ Accuracy = (tp + tn) / (tp + tn + fp + fn)
⬣ We can obtain a curve by varying the (probability) threshold
    ⬣ The area under the curve (AUC) is a common single-number metric to summarize it
⬣ The mapping between these curves and the loss is not simple!

[Figure: ROC curve. From https://en.wikipedia.org/wiki/Receiver_operating_characteristic]

Example: Precision/Recall or ROC Curves
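A short sketch of sweeping the threshold to obtain TPR/FPR points from predicted probabilities (the data here is synthetic; scikit-learn's roc_curve and roc_auc_score provide the same functionality):

```python
import torch

def roc_points(probs: torch.Tensor, labels: torch.Tensor, num_thresholds: int = 100):
    """Sweep a probability threshold and return (FPR, TPR) pairs for an ROC curve."""
    points = []
    for thr in torch.linspace(0, 1, num_thresholds):
        preds = (probs >= thr).long()
        tp = ((preds == 1) & (labels == 1)).sum().item()
        fp = ((preds == 1) & (labels == 0)).sum().item()
        fn = ((preds == 0) & (labels == 1)).sum().item()
        tn = ((preds == 0) & (labels == 0)).sum().item()
        tpr = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        points.append((fpr, tpr))
    return points

probs = torch.rand(1000)                       # synthetic predicted probabilities
labels = (torch.rand(1000) < probs).long()     # synthetic binary ground truth
curve = roc_points(probs, labels)
```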


Resource:
⬣ A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay, Leslie N. Smith

[Figure: loss surface with local minima and an optimizer trajectory]

Resources
