0% found this document useful (0 votes)

13 views84 pages

Optimization of Deep Networks

The document discusses the optimization of deep neural networks, emphasizing the importance of architecture, data considerations, and optimization algorithms. It highlights the significance of depth in neural networks and the need for careful initialization of parameters to ensure effective learning. Additionally, it covers normalization techniques and the complexities of loss landscapes in deep learning optimization.

Uploaded by

Carlos Souza

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

13 views84 pages

Optimization of Deep Networks

Uploaded by

Carlos Souza

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 84

Optimization

of Deep
Neural
Networks
Overview
Backpropagation, and automatic differentiation, allows us to optimize any
function composed of differentiable blocks
⬣ No need to modify the learning algorithm!
⬣ The complexity of the function is only limited by computation and memory

𝒑 𝑳
𝑿 −𝐥𝐨𝐠 𝒑
Model
Input Loss Function

The Power of Deep Learning

A network with two or more hidden
layers is often considered a deep
model

Depth is important:
⬣ Structure the model to represent
an inherently compositional world output
⬣ Theoretical evidence that it leads
layer
input
to parameter efficiency layer hidden hidden
layer 1 layer 2
⬣ Gentle dimensionality reduction
(if done right)

Importance of Depth
There are still many design ?
decisions that must be made:
⬣ Architecture

⬣ Data Considerations
⬣ Training and
Optimization
Local
⬣ Machine Learning Minima
Considerations

Designing Deep Neural Networks

We must design the neural network
architecture:

⬣ What modules (layers) should ?

we use?

⬣ How should they be connected

together?

⬣ Can we use our domain

knowledge to add architectural
biases?

Architectural Considerations
Input
Predictions
Data

Fully Connected
Neural Network

Example Architectures
Input Input
Predictions Predictions
Data Image

Fully Connected Convolutional Neural

Neural Network Networks

Example Architectures
Input Input
Predictions Predictions
Data Image

Fully Connected Convolutional Neural

Neural Network Networks

Different architectures
are suitable for different
applications or types of
input
Recurrent Neural Network

Example Architectures
As in traditional machine
learning, data is key:
⬣ Should we pre-process
the data?

⬣ Should we normalize it?

⬣ Can we augment our data
by adding noise or other
perturbations?

Data Considerations
Even given a good neural network
architecture, we need a good optimization
algorithm to find good weights
Local
⬣ What optimizer should we use? Minima

⬣ Different optimizers make different

weight updates depending on the
gradients
⬣ How should we initialize the weights?
Optimizer
⬣ What regularizers should we use?
Trajectory
⬣ What loss function is appropriate?

Optimization Considerations
Machine Learning
Considerations

The practice of machine learning

is complex: For your particular
application you have to trade off all
of the considerations together
⬣ Trade-off between model
capacity (e.g. measured by # of
parameters) and amount of data
⬣ Adding appropriate biases
based on knowledge of the
domain
Architectural
Considerations
Determining what modules to use, and how to
connect them is part of the architectural
design
⬣ Guided by the type of data used and its
characteristics ?
⬣ Understanding your data is always the
first step!
⬣ Lots of data types (modalities) already
have good architectures
⬣ Start with what others have
discovered!
⬣ The flow of gradients is one of the key
principles to use when analyzing layers

Designing the Architecture

⬣ Combination of linear and
non-linear layers 𝒘𝑻𝟏 (𝒘𝑻𝟐 (𝒘𝑻𝟑 𝒙)) = 𝒘𝑻𝟒 x

⬣ Combination of only linear

𝒖 𝒑 𝑳
layers has same 𝟏
𝒘𝑻 𝒙 −𝐥𝐨𝐠 𝒑
representational power as one 𝟏 + 𝒆"𝒖
linear layer

⬣ Non-linear layers are crucial

⬣ Composition of non-linear
layers enables complex
transformations of the
data

Linear and Non-Linear Modules

Several aspects that we can analyze:
⬣ Min/Max
⬣ Correspondence between input &
output statistics
⬣ Gradients
⬣ At initialization (e.g. small
values)
⬣ At extremes
⬣ Computational complexity

Analysis of Non-Linear Function

⬣ Min: 0, Max: 1 Sigmoid

⬣ Output always positive

Derivative
⬣ Saturates at both ends

⬣ Gradients

⬣ Vanishes at both end 𝒉ℓ = 𝝈 (𝒉ℓ"𝟏 )

⬣ Always positive 𝟏
𝝈 𝒙 = 𝝏𝑳 𝝏𝑳
𝟏 + 𝒆"𝒙 𝝏𝑳 𝝏𝒉ℓ
⬣ Computation: Exponential 𝝏𝒉ℓ"𝟏
𝝏𝑾
term 𝝏𝑳 𝝏𝑳 𝝏𝒉ℓ
=
𝝏𝑾 𝝏𝒉ℓ 𝝏𝑾

Sigmoid Function
⬣ Min: -1, Max: 1

⬣ Centered
tanh
Derivative
⬣ Saturates at both ends

⬣ Gradients

⬣ Vanishes at both end

⬣ Always positive

⬣ Still somewhat 𝒉ℓ = 𝒕𝒂𝒏𝒉(𝒉ℓ"𝟏 )

computationally heavy

Tanh Function
⬣ Min: 0, Max: Infinity

⬣ Output always positive

⬣ No saturation on positive end!

⬣ Gradients

⬣ 𝟎 if 𝐱 ≤ 𝟎 (dead ReLU)

⬣ Constant otherwise (does

not vanish)

⬣ Cheap to compute (max) 𝒉ℓ = 𝒎𝒂𝒙(𝟎, 𝒉ℓ"𝟏 )

Rectified Linear Unit

⬣ Min: -Infinity, Max: Infinity

⬣ Learnable parameter!

⬣ No saturation
θ
⬣ Gradients

⬣ No dead neuron

⬣ Still cheap to compute

𝒉ℓ = 𝒎𝒂𝒙(𝜶𝒉ℓ"𝟏 , 𝒉ℓ"𝟏 )

Leaky ReLU
Selecting a Non-Linearity

Which non-linearity should you

select?
⬣ Unfortunately, no one activation
function is best for all applications
⬣ ReLU is most common starting
point
⬣ Sometimes leaky ReLU can
make a big difference
⬣ Sigmoid is typically avoided
unless clamping to values from
[0,1] is needed
Initialization
Initializing the Parameters
The parameters of our model must be
initialized to something
⬣ Initialization is extremely important!
⬣ Determined how statistics of outputs
(given inputs) behave
⬣ Determines how well gradients flow in
the beginning of training (important)
⬣ Could limit use of full capacity of the
model if done improperly
⬣ Initialization that is close to a good (local)
minima will converge faster and to a better
solution
Initializing values to a constant value leads to a degenerate solution!

⬣ What happens to the

weight updates? 𝒘𝒊 = 𝒄 ∀𝒊

⬣ Each node has the same

input from previous layers
so gradients will be the
same output
layer
input
hidden hidden
⬣ As a results, all weights layer
layer 1 layer 2
will be updated to the
same exact values

A Poor Initialization
Common approach is small normally distributed random numbers

⬣ E.g. 𝑵 𝝁, 𝝈 𝒘𝒉𝒆𝒓𝒆 𝝁 = 𝟎, 𝝈 = 𝟎. 𝟎𝟏

⬣ Small weights are preferred since

no feature/input has prior
importance

⬣ Keeps the model within the linear

region of most activation
functions

Gaussian/Normal Initialization
Deeper networks (with many layers) are more sensitive to
initialization
⬣ With a deep network,
activations (outputs of
nodes) get smaller
⬣ Standard deviation reduces
significantly
Distribution of activation values
⬣ Leads to small updates – of a network with tanh non-
smaller values multiplied by linearities, for increasingly deep
upstream gradients layers
⬣ Larger initial values lead to From "Understanding the difficulty of training deep
saturation feedforward neural networks." AISTATS, 2010.

Limitation of Small Weights

Ideally, we’d like to maintain the variance at the output to be similar
to that of input!
⬣ This condition leads to a
simple initialization rule,
sampling from uniform
distribution:
𝟔 𝟔
Uniform − ,+
𝒏𝒋 &𝒏𝒋%𝟏 𝒏𝒋 &𝒏𝒋%𝟏 Distribution of activation values
of a network with tanh non-
⬣ Where 𝒏𝒋 is fan-in
linearities, for increasingly deep
(number of input nodes) layers
and 𝒏𝒋&𝟏 is fan-out
(number of output nodes) From "Understanding the difficulty of training deep
feedforward neural networks." AISTATS, 2010.

Xavier Initialization
In practice, simpler versions perform empirically well:

𝟏
N 𝟎, 𝟏 ∗
𝒏𝒋

⬣ This analysis holds for tanh or similar activations.

⬣ Similar analysis for ReLU activations leads to:

𝟏
𝑵 𝟎, 𝟏 ∗
𝒏𝒋 /𝟐

"Delving Deep into Rectifiers:Surpassing Human-Level Performance on ImageNet Classification“, ICCV, 2015.

(Simpler) Xavier and Xavier2 Initialization

Summary

Key takeaway: Initialization matters!

⬣ Determines the activation (output)
statistics, and therefore gradient
statistics
⬣ If gradients are small, no learning
will occur and no improvement is
possible!
⬣ Important to reason about
output/gradient statistics and
analyze them for new layers and
architectures
Normalization,
Preprocessing,
and
Augmentation
Importance of Data

In deep learning, data drives

learning of features and classifier

⬣ Its characteristics are therefore

extremely important

⬣ Always understand your data!

⬣ Relationship between output

statistics, layers such as non-
linearities, and gradients is
important
Just like initialization, normalization can
improve gradient flow and learning

Typically normalization methods apply:

⬣ Subtract mean, divide by standard Data after subtracting mean,

deviation (most common) dividing by standard deviation

⬣ This can be done per dimension

⬣ Whitening, e.g. through Principle

Component Analysis (PCA) (not
common) Data after whitening
Figure from slides by Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Preprocessing
⬣ We can try to come up with a layer that can normalize the data across
the neural network
⬣ Given: A mini-batch of data [𝑩×𝑫] where 𝑩 is batch size
⬣ Compute mean and variance for each dimension 𝒅

From: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Sergey Ioffe, Christian Szegedy

Making Normalization a Layer

Normalize data

𝒙𝒊 − 𝝁𝑩
G𝒊 =
𝒙
𝝈𝟐𝑩 + 𝝐

Note: This part

does not involve
new parameters

From: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Sergey Ioffe, Christian Szegedy

Normalizing the Data

⬣ We can give the model
flexibility through
learnable parameters
𝜸 (scale) and 𝜷 (shift)

⬣ Network can learn to not

normalize if necessary!

⬣ This layer is called a

Batch Normalization
(BN) layer

From: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Sergey Ioffe, Christian Szegedy

Learnable Scaling and Offset

Some Complexities of BN
During inference, stored
mean/variances calculated on training
set are used
Sufficient batch sizes must be used to
get stable per-batch estimates during
training
⬣ This is especially an issue when
using multi-GPU or multi-machine
training
⬣ Use torch.nn.SyncBatchNorm to
estimate batch statistics in these
settings
Normalization especially important before
non-linearities!
⬣ Very low/high values (un-
normalized/imbalanced data) cause
saturation

Input

Linear BN Non-
Layer Linearity

Where to Apply BN
Generalization of BN

There are many variations of batch

normalization

⬣ See Convolutional Neural

Network lectures for an example

Resource:

⬣ ML Explained - Normalization
Optimizers
Deep learning involves complex,
compositional, non-linear functions

The loss landscape is extremely non-

convex as a result

There is little direct theory and a lot of

intuition/rules of thumbs instead

⬣ Some insight can be gained via

theory for simpler cases (e.g.
convex settings)

Loss Landscape
It used to be thought that
existence of local minima is
the main issue in optimization

There are other more

impactful issues:
⬣ Noisy gradient estimates
Saddle Point
⬣ Saddle points
⬣ Ill-conditioned loss surface From: Identifying and attacking the saddle point problem in high-
dimensional non-convex optimization, Dauphi et al., 2014.

Loss Landscape
⬣ We use a subset of the
data at each iteration to 𝟏
calculate the loss (& 𝑳 = 3 𝑳 (𝒇 𝒙𝒊 , 𝑾 , 𝒚𝒊 )
𝑴
gradients)

⬣ This is an unbiased
estimator but can have
high variance

⬣ This results in noisy steps

in gradient descent

Noisy Gradients
Several loss surface geometries
are difficult for optimization

Several types of minima: Local

minima, plateaus, saddle points
Plateau
Saddle points are those where the
gradient of orthogonal directions
are zero
⬣ But they disagree (it’s min for
one, max for another) Saddle Point

Loss Surface Geometry

⬣ Gradient descent takes a step in the
steepest direction (negative gradient) 𝝏𝑳
𝒘𝒊 = 𝒘𝒊"𝟏 − 𝜶
𝝏𝒘𝒊
⬣ Intuitive idea: Imagine a ball rolling
down loss surface, and use
momentum to pass flat surfaces

𝝏𝑳 Update Velocity
𝒗𝒊 = 𝜷𝒗𝒊"𝟏 + (starts as 0, 𝜷 = 𝟎. 𝟗𝟗)
𝝏𝒘𝒊"𝟏

𝒘𝒊 = 𝒘𝒊"𝟏 − 𝜶𝒗𝒊 Update Weights

⬣ Generalizes SGD (𝜷 = 𝟎)

Adding Momentum
⬣ Velocity term is an exponential moving average of the gradient
𝝏𝑳
𝒗𝒊 = 𝜷𝒗𝒊&𝟏 +
𝝏𝒘𝒊&𝟏

𝝏𝑳 𝝏𝑳
𝒗𝒊 = 𝜷(𝜷 𝒗𝒊&𝟐 + )+
𝝏𝒘𝒊&𝟐 𝝏𝒘𝒊&𝟏

𝟐
𝝏𝑳 𝝏𝑳
= 𝜷 𝒗𝒊&𝟐 + 𝜷 +
𝝏𝒘𝒊&𝟐 𝝏𝒘𝒊&𝟏

⬣ There is a general class of accelerated gradient methods, with

some theoretical analysis (under assumptions)

Accelerated Descent Methods

Equivalent formulation:

𝝏𝑳 Update Velocity
𝒗𝒊 = 𝜷𝒗𝒊"𝟏 − 𝜶
𝝏𝒘𝒊"𝟏 (starts as 0)

𝒘𝒊 = 𝒘𝒊"𝟏 + 𝒗𝒊 Update Weights

Equivalent Momentum Update

Key idea: Rather than combining velocity
with current gradient, go along velocity
first and then calculate gradient at new
point Velocity
New Gradient
⬣ We know velocity is probably a
reasonable direction
𝒘
G 𝒊"𝟏 = 𝒘𝒊"𝟏 − 𝜷𝒗𝒊"𝟏

𝝏𝑳
𝒗𝒊 = 𝜷𝒗𝒊"𝟏 +
𝝏𝒘
G 𝒊"𝟏

𝒘𝒊 = 𝒘𝒊"𝟏 − 𝜶 𝒗𝒊

Nesterov Momentum
Nesterov Momentum

Note there are several equivalent

formulations across deep learning
frameworks!

Resource:
https://medium.com/the-artificial-
impostor/sgd-implementation-in-
pytorch-4115bcb9f02c
⬣ Various mathematical ways to
characterize the loss landscape

⬣ If you liked Jacobians… meet the Second

order

First
order

⬣ Gives us information about the

curvature of the loss surface

Hessian and Loss Curvature

Condition number is the ratio of
the largest and smallest eigenvalue
⬣ Tells us how different the
curvature is along different
dimensions

If this is high, SGD will make big

steps in some dimensions and
small steps in other dimension

Second-order optimization methods

divide steps by curvature, but
expensive to compute

Condition Number
Per-Parameter Learning Rate

Idea: Have a dynamic learning rate

for each weight
Several flavors of optimization
algorithms:
⬣ RMSProp
⬣ Adagrad
⬣ Adam
⬣ …
SGD can achieve similar results in
many cases but with much more
tuning
Idea: Use gradient statistics
to reduce learning rate across 𝝏𝑳 𝟐
iterations 𝑮𝒊 = 𝑮𝒊"𝟏 +
𝝏𝒘𝒊"𝟏
𝜶 𝝏𝑳
Denominator: Sum up 𝒘𝒊 = 𝒘𝒊"𝟏 −
𝑮𝒊 + 𝝐 𝝏𝒘𝒊"𝟏
gradients over iterations
As gradients are
Directions with high accumulated learning
curvature will have higher rate will go to zero
gradients, and learning rate
will reduce Duchi, et al., “Adaptive Subgradient Methods for Online
Learning and Stochastic Optimization”

Adagrad
Solution: Keep a moving 𝝏𝑳 𝟐
average of squared 𝑮𝒊 = 𝜷𝑮𝒊"𝟏 + 𝟏 − 𝜷
𝝏𝒘𝒊"𝟏
gradients!
𝜶 𝝏𝑳
𝒘𝒊 = 𝒘𝒊"𝟏 −
𝑮𝒊 + 𝝐 𝝏𝒘𝒊"𝟏
Does not saturate the
learning rate

RMSProp first introduced in: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

RMSProp
𝝏𝑳
𝒗𝒊 = 𝜷𝟏 𝒗𝒊"𝟏 + 𝟏 − 𝜷𝟏
𝝏𝒘𝒊"𝟏

Combines ideas from 𝝏𝑳 𝟐

above algorithms 𝑮𝒊 = 𝜷𝟐 𝑮𝒊"𝟏 + 𝟏 − 𝜷𝟐
𝝏𝒘𝒊"𝟏

𝜶 𝒗𝒊
Maintains both first 𝒘𝒊 = 𝒘𝒊"𝟏 −
𝑮𝒊 + 𝝐
and second moment
statistics for gradients But unstable in the beginning
(one or both of moments will be
tiny values)

Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015

Adam
𝝏𝑳
Solution: Time-varying bias 𝒗𝒊 = 𝜷𝟏 𝒗𝒊"𝟏 + 𝟏 − 𝜷𝟏
𝝏𝒘𝒊"𝟏
correction
𝟐
𝝏𝑳
𝑮𝒊 = 𝜷𝟐 𝑮𝒊"𝟏 + 𝟏 − 𝜷𝟐
Typically 𝜷𝟏 = 𝟎. 𝟗, 𝜷𝟐 = 𝟎. 𝟗𝟗𝟗 𝝏𝒘𝒊"𝟏

So 𝒗G𝒊 will be small number 𝒗𝒊 𝑮𝒊

𝒗G𝒊 = Q𝒊 =
𝑮
divided by (1-0.9=0.1) resulting 𝟏 − 𝜷𝒕𝟏 𝟏 − 𝜷𝒕𝟐
in more reasonable values (and
P 𝒊 larger) G𝒊
𝜶𝒗
𝑮 𝒘𝒊 = 𝒘𝒊"𝟏 −
Q𝒊 + 𝝐
𝑮

Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015

Adam
Optimizers behave differently
depending on landscape

Different behaviors such as

overshooting, stagnating, etc.

Plain SGD+Momentum can

generalize better than adaptive
methods, but requires more tuning
⬣ See: Luo et al., Adaptive
Gradient Methods with
Dynamic Bound of Learning
Rate, ICLR 2019
From: https://mlfromscratch.com/optimizers-explained/#/

Behavior of Optimizers
First order optimization methods have
learning rates
Training
Theoretical results rely on annealed Loss
learning rate

Several schedules that are typical:

⬣ Graduate student!
⬣ Step scheduler
⬣ Exponential scheduler
⬣ Cosine scheduler
From: Leslie Smith, “Cyclical Learning Rates for Training Neural Networks”

Learning Rate Schedules

Regularization
Many standard regularization methods still apply!

L1 Regularization

𝑳 = |𝒚 − 𝑾𝒙𝒊 |𝟐 + 𝝀|𝑾|
where |𝑾| is element-wise

Example regularizations:
⬣ L1/L2 on weights (encourage small values)
⬣ L2: 𝑳 = 𝒚 − 𝑾𝒙𝒊 |𝟐 + 𝝀 𝑾|𝟐 (weight decay)
⬣ Elastic L1/L2: 𝒚 − 𝑾𝒙𝒊 |𝟐 + 𝜶 𝑾|𝟐 + 𝜷|𝑾|

Regularization
output
layer

input
layer hidden hidden
layer 1 layer 2

Problem: Network can learn to rely strong on a few features that work
really well
⬣ May cause overfitting if not representative of test data

From: Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al.

Preventing Co-Adapted Features

𝒑 = 𝟎. 𝟓

output
layer

input
layer hidden hidden
layer 1 layer 2

An idea: For each node, keep its output with probability p

⬣ Activations of deactivated nodes are essentially zero
Choose whether to mask out a particular node each iteration
From: Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al.

Dropout Regularization
⬣ In practice, implement
with a mask calculated output
layer
each iteration input
layer hidden hidden
layer 1 layer 2

⬣ During testing, no
nodes are dropped 𝒂𝟏𝟏 𝟎
𝒂𝟐𝟏
𝒂𝟑𝟏 ⋅ 𝟏
𝟎
𝒂𝟒𝟏 𝟏
From: Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al.

Dropout Implementation
⬣ During training, each node has an
expected 𝒑 ∗ 𝒇𝒂𝒏_𝒊𝒏 nodes
⬣ During test all nodes are activated
⬣ Principle: Always try to have
similar train and test-time output
input/output distributions! layer
input
layer hidden hidden
layer 1 layer 2
Solution: During test time, scale
outputs (or equivalently weights) by 𝒑
⬣ i.e. 𝑾𝒕𝒆𝒔𝒕 = 𝒑𝑾 𝒂𝟏𝟏 𝟎
𝒂𝟐𝟏
𝒂𝟑𝟏 ⋅ 𝟏
𝟎
𝟏 𝒂𝟒𝟏 𝟏
⬣ Alternative: Scale by 𝒑
at train time
From: Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al.

Inference with Dropout

Interpretation 1: The model should
not rely too heavily on particular
features
⬣ If it does, it has probability 𝟏 − 𝒑
output
of losing that feature in an layer
iteration input
layer hidden hidden
layer 1 layer 2

From: Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al.

Why Dropout Works

networks:
⬣ Each configuration is a network
⬣ Most are trained with 1 or 2 mini-
batches of data
From: Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al.

Why Dropout Works

Data
Augmentation
Data augmentation – Performing a range of transformations to
the data
⬣ This essentially “increases” your dataset
⬣ Transformations should not change meaning of the data (or
label has to be changed as well)
Simple example: Image Flipping

Data Augmentation: Motivation

Random crop
⬣ Take different crops during training
⬣ Can be used during inference too!

CutMix
Yun et al., CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features

Random Crop
Color Jitter

From https://mxnet.apache.org/versions/1.5.0/tutorials/gluon/data_augmentation.html

Color Jitter
We can apply generic affine
transformations:
⬣ Translation

⬣ Rotation

⬣ Scale
⬣ Shear

Geometric Transformations
We can combine these transformations to add even more variety!

From https://mxnet.apache.org/versions/1.5.0/tutorials/gluon/data_augmentation.html

Combining Transformations
Unlabelled
Image 𝑥,#
Unlabelled
Image 𝑥,

Mask proportion p
Mean
CowMask m

mix mix

Masked
CowMask m Masked
Noise Image 𝑥,!
Image 𝑥*

Unlabelled
Image 𝑥,"
CowMix
From French et al., “Milking CowMask for Semi-Supervised Image Classification”

Other Variations
The Process
of Training
Neural
Networks
⬣ Training deep neural networks is an art
form!
⬣ Lots of things matter (together) – the key Local
Minima
is to find a combination that works
⬣ Key principle: Monitoring everything to
understand what is going on!
⬣ Loss and accuracy curves
⬣ Gradient statistics/characteristics Optimizer
⬣ Other aspects of computation graph Trajectory

The Process of Training

Proper Methodology

Always start with proper methodology!

⬣ Not uncommon even in published papers
to get this wrong
Separate data into: Training, validation, test
set
⬣ Do not look at test set performance until
you have decided on everything (including
hyper-parameters)
Use cross-validation to decide on hyper-
parameters if amount of data is an issue
Check the bounds of your loss function
⬣ E.g. cross-entropy ranges from [𝟎, ∞]
⬣ Check initial loss at small random weight
values
⬣ E.g. −𝐥𝐨𝐠(𝒑) for cross-entropy,
where 𝒑 = 𝟎. 𝟓
Another example: Start without
regularization and make sure loss goes up
when added
Validation Loss
Key Principle: Simplify the dataset to make
sure your model can properly (over)-fit
before applying regularization

Sanity Checking
Change in loss indicates speed of
learning: Learning
Rate
⬣ Tiny loss change -> too small of a Too Low
learning rate
⬣ Loss (and then weights) turn to NaNs ->
too high of a learning rate Learning
Rate
Other bugs can also cause this, e.g.: Too High
⬣ Divide by zero
⬣ Forgetting the log!
In pytorch, use autograd’s detect
anomaly to debug

Loss and Not a Number (NaN)

⬣ Classic machine learning signs of
under/overfitting still apply!
Validations
⬣ Over-fitting: Validation loss/accuracy starts to
Training
get worse after a while

⬣ Under-fitting: Validation loss very close to Loss

training loss, or both are high

⬣ Note: You can have higher training loss!

⬣ Validation loss has no regularization

⬣ Validation loss is typically measured at

the end of an epoch Loss

Overfitting
Many hyper-parameters to tune! Grid Layout Random Layout
⬣ Learning rate, weight decay
crucial

Unimportant

Unimportant
parameter

parameter
⬣ Momentum, others more stable
⬣ Always tune hyper-parameters;
even a good idea will fail un-
Important Important
tuned! parameter parameter
Start with coarser search: From: Bergstra et al., “Random Search for Hyper-Parameter Optimization”,

⬣
JMLR, 2012
E.g. learning rate of {0.1, 0.05,
0.03, 0.01, 0.003, 0.001, 0.0005,
0.0001} Automated methods are OK, but
⬣ Perform finer search around good intuition (or random) can do well given
values enough of a tuning budget

Hyper-Parameter Tuning
Inter-dependence of Hyperparameters
Note that hyper-parameters and even module
selection are interdependent!
Examples:
⬣ Batch norm and dropout maybe not be
needed together (and sometimes the
combination is worse)
⬣ The learning rate should be changed
proportionally to batch size – increase
the learning rate for larger batch sizes
⬣ One interpretation: Gradients are
more reliable/smoother
relevant elements
Note that we are optimizing a loss
function false negatives true negatives

What we actually care about is

typically different metrics that we
can’t differentiate:
true positives false positives
⬣ Accuracy
⬣ Precision/recall
⬣ Other specialized metrics
The relationship between the two
selected elements
can be complex!
From https://en.wikipedia.org/wiki/Precision_and_recall

Relationship Between Loss and Other Metrics

⬣ Example: Cross entropy loss

𝑳 = −𝒍𝒐𝒈 𝑷(𝒀 = 𝒚𝒊 |𝑿 = 𝒙𝒊 )

⬣ Accuracy is measured based on:

Loss
𝒂𝒓𝒈𝒎𝒂𝒙𝒊 (𝑷 𝒀 = 𝒚𝒊 𝑿 = 𝒙𝒊 )

⬣ Since the correct class score only has

to be slightly higher, we can have flat
loss curves but increasing
accuracy! Accuracy

Simple Example: Cross-Entropy and Accuracy

⬣ TPR/FPR curves represent the inherent
tradeoff between number of positive predictions
and correctness of predictions
⬣ Receiver operating characteristic (ROC)
curves similar, plot precision/recall instead
⬣ Definitions
𝒕𝒑
⬣ True Positive Rate: 𝑻𝑷𝑹 =
𝒕𝒑0𝒇𝒏
𝒇𝒑
⬣ False Positive Rate: 𝑭𝑷𝑹 =
𝒇𝒑0𝒕𝒏
𝒕𝒑0𝒕𝒏
⬣ 𝑨𝒄𝒄𝒖𝒓𝒂𝒄𝒚 = 𝒕𝒑0𝒕𝒏0𝒇𝒑0𝒇𝒏

From
https://en.wikipedia.org/wiki/Receiver_operating_
characteristic

Example: Precision/Recall or ROC Curves

⬣ Precision/Recall curves represent the
inherent tradeoff between number of positive
predictions and correctness of predictions
⬣ Definitions
𝒕𝒑
⬣ True Positive Rate: 𝑻𝑷𝑹 =
𝒕𝒑0𝒇𝒏
𝒇𝒑
⬣ False Positive Rate: 𝑭𝑷𝑹 = 𝒇𝒑0𝒕𝒏
𝒕𝒑0𝒕𝒏
⬣ 𝑨𝒄𝒄𝒖𝒓𝒂𝒄𝒚 = 𝒕𝒑0𝒕𝒏0𝒇𝒑0𝒇𝒏
⬣ We can obtain a curve by varying the
(probability) threshold:
⬣ Area under the curve (AUC) common From
https://en.wikipedia.org/wiki/Receiver_operating_
single-number metric to summarize characteristic

⬣ Mapping between this and loss is not simple!

Example: Precision/Recall or ROC Curves

Resource: Local
Minima
⬣ A disciplined approach to
neural network hyper-
parameters: Part 1 --
learning rate, batch size,
momentum, and weight
decay, Leslie N. Smith
Optimizer
Trajectory

Resources

Deep Learning Hand Book 2024
No ratings yet
Deep Learning Hand Book 2024
185 pages
Deep Learning Turorial PDF
No ratings yet
Deep Learning Turorial PDF
301 pages
Fundamentals of Deep Learning
No ratings yet
Fundamentals of Deep Learning
26 pages
Fixing Neural Network Course 2 1659759284
No ratings yet
Fixing Neural Network Course 2 1659759284
30 pages
LecML - 3 NN
No ratings yet
LecML - 3 NN
33 pages
L7 Lecture Image - classification.DNN v4
No ratings yet
L7 Lecture Image - classification.DNN v4
61 pages
Unit 1
No ratings yet
Unit 1
16 pages
IoT - Lecture 11
No ratings yet
IoT - Lecture 11
58 pages
Learning Algorithm
No ratings yet
Learning Algorithm
100 pages
Unit-5 AI ETC
No ratings yet
Unit-5 AI ETC
64 pages
Slides 11
No ratings yet
Slides 11
48 pages
DL Mod2
No ratings yet
DL Mod2
45 pages
ML Lec 10 Neural Networks
No ratings yet
ML Lec 10 Neural Networks
87 pages
Deep Learning Andrew NG
100% (3)
Deep Learning Andrew NG
173 pages
Unit 2 - ML
No ratings yet
Unit 2 - ML
18 pages
Deep Neural Network
No ratings yet
Deep Neural Network
60 pages
1.1 Introduction
No ratings yet
1.1 Introduction
73 pages
A Imprimer 4
No ratings yet
A Imprimer 4
4 pages
Introduction Deep Eng
No ratings yet
Introduction Deep Eng
50 pages
Lesson 3 Artificial Neural Network
No ratings yet
Lesson 3 Artificial Neural Network
77 pages
Deep Learning UNIT-II Part1
No ratings yet
Deep Learning UNIT-II Part1
48 pages
Notes On Introduction To Deep Learning
No ratings yet
Notes On Introduction To Deep Learning
19 pages
Lecture W15ab
No ratings yet
Lecture W15ab
44 pages
Lecture 5-6
No ratings yet
Lecture 5-6
45 pages
Lect 12 - Deep Feed Forward NN - Review
No ratings yet
Lect 12 - Deep Feed Forward NN - Review
93 pages
CS460 - Deep Learning - W02 & W03
No ratings yet
CS460 - Deep Learning - W02 & W03
44 pages
Unit 2 Deep Learning and Neural Networks
No ratings yet
Unit 2 Deep Learning and Neural Networks
38 pages
Deep Learning
No ratings yet
Deep Learning
299 pages
Neural Networks - 2
No ratings yet
Neural Networks - 2
79 pages
CERN Deep Learning and Vision
No ratings yet
CERN Deep Learning and Vision
72 pages
ACI Structural Journal May 2023 V. 120 No. 3
No ratings yet
ACI Structural Journal May 2023 V. 120 No. 3
258 pages
Ai - W7L13
No ratings yet
Ai - W7L13
46 pages
Deep Learing
No ratings yet
Deep Learing
37 pages
Unit 5 (Second Half)
No ratings yet
Unit 5 (Second Half)
10 pages
Unit 3
No ratings yet
Unit 3
110 pages
2.game AI 1
No ratings yet
2.game AI 1
268 pages
ML Unit 4
No ratings yet
ML Unit 4
23 pages
Deep Learning
100% (2)
Deep Learning
49 pages
Training Neural
No ratings yet
Training Neural
16 pages
Chapter 9
No ratings yet
Chapter 9
73 pages
Ch22 Presn PDF
No ratings yet
Ch22 Presn PDF
34 pages
cst414 - Deep Learning
No ratings yet
cst414 - Deep Learning
34 pages
L4 Training Neural Networks en
No ratings yet
L4 Training Neural Networks en
48 pages
Lecture 09 Slides - After
No ratings yet
Lecture 09 Slides - After
57 pages
Unit V
No ratings yet
Unit V
26 pages
Unit I
No ratings yet
Unit I
90 pages
Data Mining: Practical Machine Learning Tools and Techniques
No ratings yet
Data Mining: Practical Machine Learning Tools and Techniques
123 pages
Ca 3 DL
No ratings yet
Ca 3 DL
6 pages
Liver Tumor Segmentation Thesis
No ratings yet
Liver Tumor Segmentation Thesis
62 pages
Unit 2.1
No ratings yet
Unit 2.1
37 pages
3.convolutional Networks and Sequence Modeling
No ratings yet
3.convolutional Networks and Sequence Modeling
19 pages
Notes DL-1
No ratings yet
Notes DL-1
10 pages
Unit Ii DNN
No ratings yet
Unit Ii DNN
24 pages
Pure Optimization
No ratings yet
Pure Optimization
23 pages
Kagan Lecture2
No ratings yet
Kagan Lecture2
118 pages
Unit-2 Improving-Deep-Neural-Networks
No ratings yet
Unit-2 Improving-Deep-Neural-Networks
18 pages
Practical Aspects of Deep Learning PI
No ratings yet
Practical Aspects of Deep Learning PI
46 pages
Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization
No ratings yet
Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization
1 page
A Survey of Deep Convolutional Neural Networks Applied For Prediction of Plant Leaf Diseases
No ratings yet
A Survey of Deep Convolutional Neural Networks Applied For Prediction of Plant Leaf Diseases
35 pages
CS 224D: Deep Learning For NLP: Lecture Notes: Part III Spring 2015
No ratings yet
CS 224D: Deep Learning For NLP: Lecture Notes: Part III Spring 2015
14 pages
CS 224D: Deep Learning For NLP: Lecture Notes: Part III Spring 2016
No ratings yet
CS 224D: Deep Learning For NLP: Lecture Notes: Part III Spring 2016
14 pages
Training Neural Networks
No ratings yet
Training Neural Networks
109 pages
Deep Learning Computer Vision NLP
No ratings yet
Deep Learning Computer Vision NLP
140 pages
Introduction To Neural Network
No ratings yet
Introduction To Neural Network
20 pages
Deep Learning Answers
No ratings yet
Deep Learning Answers
36 pages
Density Estimation Using Real NVP
No ratings yet
Density Estimation Using Real NVP
32 pages
Ijst 2021 1266
No ratings yet
Ijst 2021 1266
15 pages
On Calibration of Modern Neural Networks
No ratings yet
On Calibration of Modern Neural Networks
14 pages
批处理标准化如何帮助优化（李宏毅教授视频推荐）
No ratings yet
批处理标准化如何帮助优化（李宏毅教授视频推荐）
26 pages
BNU Net A Novel Deep Learning Approach For LV MRI Analysis in Short
No ratings yet
BNU Net A Novel Deep Learning Approach For LV MRI Analysis in Short
6 pages
How To Get Pavement Distress Detection Ready For Deep Learning A Systematic Approach
No ratings yet
How To Get Pavement Distress Detection Ready For Deep Learning A Systematic Approach
9 pages
CNN Based Automatic Detection of PV Cell Defects...
No ratings yet
CNN Based Automatic Detection of PV Cell Defects...
15 pages
Audio-Visual Sentiment Analysis For Learning Emotional Arcs in Movies
No ratings yet
Audio-Visual Sentiment Analysis For Learning Emotional Arcs in Movies
6 pages
Handwritten Bengali Alphabets, Compound Characters and Numerals Recognition Using CNN-based Approach
No ratings yet
Handwritten Bengali Alphabets, Compound Characters and Numerals Recognition Using CNN-based Approach
19 pages
Accelerating Binarized Neural Networks Comparison of FPGA CPU GPU and ASIC
No ratings yet
Accelerating Binarized Neural Networks Comparison of FPGA CPU GPU and ASIC
8 pages
Breast Cancer Classification-IEEE Paper
No ratings yet
Breast Cancer Classification-IEEE Paper
17 pages
CNNs Pytorch
No ratings yet
CNNs Pytorch
19 pages
Zero Initialization Initializi
No ratings yet
Zero Initialization Initializi
14 pages
Beyond A Gaussian Denoiser: Residual Learning of Deep CNN For Image Denoising
No ratings yet
Beyond A Gaussian Denoiser: Residual Learning of Deep CNN For Image Denoising
14 pages
DL Unit-1
No ratings yet
DL Unit-1
25 pages
Deep Learning (All in One)
No ratings yet
Deep Learning (All in One)
23 pages
Exploring Simple Siamese Representation Learning
No ratings yet
Exploring Simple Siamese Representation Learning
10 pages
Systematic Evaluation of Convolution Neural Network Advances On The Imagenet-2017
No ratings yet
Systematic Evaluation of Convolution Neural Network Advances On The Imagenet-2017
9 pages
Retina Net
No ratings yet
Retina Net
2 pages
SP18 Practice Midterm
No ratings yet
SP18 Practice Midterm
5 pages
Techniques and Tools for Artificial Intelligence. Neural Networks via R and PYTHON
From Everand
Techniques and Tools for Artificial Intelligence. Neural Networks via R and PYTHON
César Pérez López
No ratings yet
DEEP LEARNING TECHNIQUES: CLUSTER ANALYSIS and PATTERN RECOGNITION with NEURAL NETWORKS. Examples with MATLAB
From Everand
DEEP LEARNING TECHNIQUES: CLUSTER ANALYSIS and PATTERN RECOGNITION with NEURAL NETWORKS. Examples with MATLAB
César Pérez López
No ratings yet
DATA MINING and MACHINE LEARNING. CLASSIFICATION PREDICTIVE TECHNIQUES: NAIVE BAYES, NEAREST NEIGHBORS and NEURAL NETWORKS: Examples with MATLAB
From Everand
DATA MINING and MACHINE LEARNING. CLASSIFICATION PREDICTIVE TECHNIQUES: NAIVE BAYES, NEAREST NEIGHBORS and NEURAL NETWORKS: Examples with MATLAB
César Pérez López
No ratings yet
The Numpy Pocketbook: Essentials on the Go
From Everand
The Numpy Pocketbook: Essentials on the Go
Silas Meadowlark
No ratings yet
Backpropagation: Fundamentals and Applications for Preparing Data for Training in Deep Learning
From Everand
Backpropagation: Fundamentals and Applications for Preparing Data for Training in Deep Learning
Fouad Sabry
No ratings yet