
Regularization and Optimization of

Backpropagation

Keith L. Downing

The Norwegian University of Science and Technology (NTNU)


Trondheim, Norway
[email protected]

December 1, 2020



Outline

Regularization
Optimization



Regularization

Definition of Regularization
Reduction of testing error while maintaining a low training error.
Excessive training does reduce training error, but often at the expense
of higher testing error.
The network essentially memorizes the training cases, which hinders
its ability to generalize and handle new cases (e.g., the test set).
Exemplifies tradeoff between bias and variance: an NN with low bias
(i.e. training error) has trouble reducing variance (i.e. test error).
Bias(θ̃m ) = E(θ̃m ) − θ
Where θ is the parameterization of the true generating function, and θ̃m
is the parameterization of an estimator based on the sample, m. E.g.
θ̃m are weights of an NN after training on sample set m.
Var(θ̃m ) = degree to which the estimator’s results change with other
data samples (e.g., the test set) from same data generator.
→ Regularization = attempt to combat the bias-variance tradeoff: to
produce a θ̃m with low bias and low variance.



Types of Regularization

Parameter Norm Penalization


Data Augmentation
Multitask Learning
Early Stopping
Sparse Representations
Ensemble Learning
Dropout
Adversarial Training



Parameter Norm Penalization
Some parameter sets, such as NN weights, achieve over-fitting when many
parameters (weights) have a high absolute value. So incorporate this into the
loss function.
L̃(θ, C) = L(θ, C) + αΩ(θ)
L = loss function; L̃ = regularized loss function; C = cases; θ = parameters of
the estimator (e.g. weights of the NN); Ω = penalty function; α = penalty
scaling factor

L2 Regularization
Ω₂(θ) = ½ ∑wᵢ∈θ wᵢ²
Sum of squared weights across the whole neural network.

L1 Regularization
Ω₁(θ) = ∑wᵢ∈θ |wᵢ|
Sum of absolute values of all weights across the whole neural network.
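
As a concrete illustration, here is a minimal NumPy sketch of these two penalties (the helper name and argument layout are my own, not from the slides):

import numpy as np

def regularized_loss(base_loss, weights, alpha, norm="L2"):
    """Return L-tilde = L + alpha * Omega(theta) for a list of weight arrays."""
    if norm == "L2":
        penalty = 0.5 * sum(np.sum(w ** 2) for w in weights)  # Omega_2
    else:
        penalty = sum(np.sum(np.abs(w)) for w in weights)     # Omega_1
    return base_loss + alpha * penalty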



L1 -vs- L2 Regularization

Though very similar, they can have different effects.

In cases where most weights have a small absolute value (as during the
initial phases of typical training runs), L1 imposes a much stiffer penalty:
|wᵢ| > wᵢ².
Hence, L1 can have a sparsifying effect: it drives many weights to zero.
Comparing penalty derivatives (which contribute to ∂L̃/∂wᵢ gradients):
∂Ω₂(θ)/∂wᵢ = ∂/∂wᵢ ( ½ ∑wₖ∈θ wₖ² ) = wᵢ
∂Ω₁(θ)/∂wᵢ = ∂/∂wᵢ ( ∑wₖ∈θ |wₖ| ) = sign(wᵢ) ∈ {−1, 0, 1}
So the L2 penalty scales linearly with wᵢ, giving a more stable update
scheme. L1 provides a more constant gradient that can be large
compared to wᵢ (when |wᵢ| < 1).
This also contributes to L1's sparsifying effect upon θ.
Sparsification supports feature selection in Machine Learning.



Dataset Augmentation

It is easier to overfit small data sets. By training on larger sets,


generalization naturally increases.
But we may not have many cases.
Sometimes we can manufacture additional cases.

Approaches to Manufacturing New Data Cases


1 Modify the features in ways (small or large) that we know will not
change the target classification. E.g. rotate or translate an image.
2 Add (small) amounts of noise to the features and assume that this does
not change the target. E.g., add noise to a vector of sensor readings,
stock trends, customer behaviors, patient measurements, etc.

Another problem is noisy data: the targets are wrong.


Solution: Add (small) noise to all target vectors and use softmax on
network outputs. This prevents overfitting to bad cases and thus
improves generalization.
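
A minimal NumPy sketch of the noise-based approach (function name, noise level sigma, and copy count are illustrative assumptions):

import numpy as np

def augment_with_noise(features, targets, copies=3, sigma=0.01, rng=None):
    """Expand a dataset with noisy copies of the features, assuming the
    perturbations are too small to change the targets."""
    rng = rng or np.random.default_rng()
    noisy = [features + rng.normal(0.0, sigma, features.shape)
             for _ in range(copies)]
    aug_x = np.concatenate([features] + noisy)
    aug_y = np.concatenate([targets] * (copies + 1))
    return aug_x, aug_y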



Multitask Learning
Representations learned for task 1 can be reused for task 2, thus
requiring fewer cases and less training to learn task 2.
By forcing the net to perform multiple tasks, its shared hidden layer (H0)
has general-purpose representations.

[Figure: a shared hidden layer H0 builds general-purpose representations
from the Input; specialized layers H1, H2, H3 build task-specific
representations feeding the outputs Out-1 and Out-2. H0 is a general
detector of patterns in the data, unrelated to any one task.]
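
A minimal NumPy sketch of this architecture (weight names and shapes are illustrative): both task losses backpropagate through the shared weights, which pushes H0 toward general-purpose features.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def multitask_forward(x, W_shared, W_task1, W_task2):
    """One shared hidden layer (H0) feeding two task-specific heads."""
    h0 = relu(x @ W_shared)   # shared, general-purpose representation
    out1 = h0 @ W_task1       # specialized output for task 1
    out2 = h0 @ W_task2       # specialized output for task 2
    return out1, out2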



Early Stopping
Divide data into training, validation and test sets.
Check the error on the validation set every K epochs, but do not learn
(i.e. modify weights) from these cases.
Stop training when the validation error rises.
Easy, non-obtrusive form of regularization: no need to change the
network, the loss function, etc. Very common.

[Figure: the Data Set is split into Training, Validation, and Testing
subsets. A plot of Error vs. Epochs shows training error falling steadily
while validation error falls and then rises; training should stop where
the validation error begins to rise.]
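
A minimal sketch of the loop (train_step and validation_error are hypothetical callables; real versions typically also checkpoint the best weights and tolerate a few bad checks before stopping):

def train_with_early_stopping(train_step, validation_error, max_epochs, K=5):
    """Train until the validation error starts rising. train_step() runs one
    epoch of weight updates; validation_error() evaluates the validation
    set WITHOUT modifying any weights."""
    best_err, best_epoch = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_step()
        if epoch % K == 0:
            err = validation_error()
            if err >= best_err:
                break                          # validation error rose: stop
            best_err, best_epoch = err, epoch  # checkpoint weights here
    return best_epoch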



Sparse Representations
Instead of penalizing parameters (i.e., weights) in the loss function,
penalize hidden-layer activations.
Ω(H) where H = hidden layer(s) and Ω can be L1 , L2 or others.
Sparser reps → better pattern separation among classes → better
generalization.

[Figure: two contrasting networks, showing layer weights and activations
as negative, positive, or zero. Penalizing weights yields sparse
parameters but a dense representation; penalizing activations yields
dense parameters but a sparse representation.]
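
A minimal sketch, assuming the hidden-layer activations for a batch are already collected in an array (the helper name is my own):

import numpy as np

def activation_penalty(H, alpha, norm="L1"):
    """Penalize hidden activations H rather than weights. The L1 version
    pushes many activations toward zero, i.e., a sparse representation."""
    omega = np.sum(np.abs(H)) if norm == "L1" else 0.5 * np.sum(H ** 2)
    return alpha * omega   # add this to the ordinary loss L(theta, C)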



Ensemble Learning
Train multiple classifiers on different versions of the data set.
Each has different, but uncorrelated (key point) errors.
Use voting to classify test cases.
Each classifier has high variance, but the ensemble does not.

[Figure: the original data set is resampled into diverse training sets;
on a new test case, three classifiers vote "Diamond", "Cross", "Diamond",
so the ensemble answers "Diamond".]

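A minimal voting sketch (each model is a hypothetical callable that returns a class label for a case):

import numpy as np

def ensemble_predict(models, x):
    """Majority vote across the ensemble; if the individual errors are
    uncorrelated, the vote has lower variance than any single model."""
    votes = [m(x) for m in models]
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]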


Dropout
[Figure: two training cases (Case 33, Case 34) with different random
binary dropout masks over the non-output nodes; output nodes are never
dropped.]

For each case, randomly select nodes at all levels (except output)
whose outputs will be clamped to 0.
Dropout prob range: (0.2, 0.5) - lower for input than hidden layers.
Similar to bagging, with each model = a submodel of entire NN.
Due to missing non-output nodes (but fixed-length outputs and targets)
during training, each connection has more significance and thus needs
(and achieves by learning) a higher magnitude.
Testing involves all nodes.
Scaling: prior to testing, multiply all weights by (1 − pₕ), where
pₕ = probability of dropping hidden nodes.
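
A minimal NumPy sketch of both phases (names are my own; scaling the activations by (1 − p_drop) at test time has the same effect on the next layer as scaling its incoming weights, as described above):

import numpy as np

def dropout_forward(h, p_drop, training, rng=None):
    """Training: clamp each node's output to 0 with probability p_drop.
    Testing: keep every node but scale activations by (1 - p_drop)."""
    if training:
        rng = rng or np.random.default_rng()
        mask = (rng.random(h.shape) >= p_drop).astype(h.dtype)
        return h * mask
    return h * (1.0 - p_drop)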



Adversarial Training

[Figure: feature space (Feature 1 x Feature 2) with true class regions
(boxes) and the NN's learned regions (clouds) for Classes 1-3. An old
case is mutated into a new case by moving far along feature 2, where
∂L/∂f₂ is high, and little along feature 1, where ∂L/∂f₁ is low, pushing
it across the learned class boundary.]

Create new cases by mutating old cases in ways that maximize the
chance that the net will misclassify the new case.
Calc ∂L/∂fᵢ for the loss func (L) and each input feature fᵢ.
∀i: fᵢ ← fᵢ + λ ∂L/∂fᵢ - make the largest moves along the dimensions
most likely to increase loss/error; λ is a very small positive constant.
Train on the new cases → Increased generality.
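
A toy NumPy sketch of this rule, using a one-layer logistic classifier so that the input gradient has the closed form ∂L/∂f = (y − target)·w for cross-entropy loss (the model and names are my own, not from the slides):

import numpy as np

def adversarial_case(f, target, w, b, lam=0.01):
    """Mutate one input vector f so as to INCREASE the loss of the
    classifier y = sigmoid(w.f + b), per f_i <- f_i + lambda * dL/df_i."""
    y = 1.0 / (1.0 + np.exp(-(w @ f + b)))   # network output
    grad_f = (y - target) * w                # dL/df for cross-entropy loss
    return f + lam * grad_f                  # step where |dL/df_i| is largest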



Optimization

Classic Optimization Goal: Reduce training error


Classic Machine Learning Goal: Produce a general
classifier.
→ Low error on training and test set.
→ Finding the global minimum in the error (loss) landscape
is less important; a local minimum may even be a better
basis for good performance on the test set.

Techniques to Improve Backprop Training


Momentum
Adaptive Learning Rates
Batch Normalization
Higher Order Derivatives



Smart Search

Optimization → Making smart moves in the (K+1)-dimensional
error (loss) landscape; K = number of parameters (weights + biases).
Ideally, this allows fast movement to the global error minimum.

Δwᵢ = −λ ∂(Loss)/∂wᵢ

Techniques for Smart Movement


λ: Make the step size appropriate for the texture (smoothness,
bumpiness) of the current region of the landscape.
∂(Loss)/∂wᵢ: Avoid domination by the current (local) gradient by
including previous gradients in the calculation of Δwᵢ.



Overshoot

[Figure: gradient descent in a valley. Steps ∆w(t), ∆w(t+1), ∆w(t+2)
overshoot back and forth across the minimum, versus a single
appropriately sized ∆w.]

When current search state is in a bowl (canyon) of the


search landscape, following gradients can lead to
oscillations if the step size is too big.
Step size is λ (learning rate): Δw = −λ ∂L/∂w.



Inefficient Oscillations

[Figure: a canyon-shaped loss surface, Loss(L), plotted over weights
w1 and w2.]

Following the (weaker) gradient along w1 leads to the minimum loss, but
the w2 gradients, up and down the sides of the cylinder (canyon), cause
excess lateral motion.
Since the w1 and w2 gradients are independent, search should still
move downhill quickly (in this smooth landscape).
But in a more rugged landscape, the lateral movement could visit
locations where the w1 gradient is untrue to the general trend. E.g. a
little dent or bump on the side of the cylinder could have gradients
indicating that increasing w1 would help reduce Loss (L).



Dangers of Oscillation in a Rugged Landscape

[Figure: descent along a main trend in the landscape. A small bump
yields a local gradient whose recommended move points away from the best
move, so search ends up off the trend rather than where it should.]

How do we prevent local gradients from dominating the search?



Momentum: Combatting Local Minima

[Figure: the new step ∆w(t) is the vector sum of a momentum term
α∆w(t−1) and a gradient term −λ dE/dw; on an Error curve, this combined
step carries the search through a local minimum.]

∆wᵢⱼ(t) = −λ ∂E/∂wᵢⱼ + α ∆wᵢⱼ(t − 1)

Without momentum, −λ ∂E/∂w leads to a local minimum.
Momentum in High Dimensions

[Figure: a 2-D landscape with high- and low-error regions; with
momentum, the search trajectory passes through a local minimum and
continues toward the global minimum.]



Momentum in Stochastic Gradient Descent (SGD)

λ = learning rate; θ = weights + biases.
α = momentum factor (typical values: 0.5, 0.9, 0.99)

The main SGD Loop (with Standard Momentum)



Calc gradients: gₜ ← ∂L/∂θ |θₜ
Update velocity: vₜ₊₁ ← αvₜ − λgₜ
Update weights and biases: θₜ₊₁ ← θₜ + vₜ₊₁

The main SGD Loop (with Nesterov Momentum)



Calc gradients: g ← ∂L/∂θ |(θₜ + αvₜ)
Update velocity: vₜ₊₁ ← αvₜ − λg
Update weights and biases: θₜ₊₁ ← θₜ + vₜ₊₁

Key difference: Nesterov evaluates the gradients at the look-ahead point given by the previous velocity.
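
A minimal sketch of one update under either scheme (grad_fn is a hypothetical callable returning ∂L/∂θ at the point it is given; theta and v may be NumPy arrays):

def sgd_momentum_step(theta, v, grad_fn, lam=0.01, alpha=0.9, nesterov=False):
    """Standard momentum evaluates the gradient at theta; Nesterov
    evaluates it at the look-ahead point theta + alpha * v."""
    g = grad_fn(theta + alpha * v) if nesterov else grad_fn(theta)
    v_next = alpha * v - lam * g      # update velocity
    return theta + v_next, v_next     # update parameters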



Adaptive Learning Rates
These methods have one learning rate per parameter (i.e. weight and bias).

Batch Gradient Descent


Running all training cases before updating any weights.
Delta-bar-delta Algorithm (Jacobs, 1988)
z = any system parameter.
While sign(∂Loss/∂z) is constant, learning rate ↑.
When sign(∂Loss/∂z) changes, learning rate ↓.

Stochastic Gradient Descent (SGD)


Running a minibatch, then updating weights.
Scale each learning rate inversely by accumulated gradient
magnitudes over time. So learning rates tend to decrease over time;
the question is often: How quickly? Caveat: some of the gradient
accumulators do a weighted average that may decrease over time.
More high-magnitude (pos or neg) gradients → learning rate ↓ quicker.
Popular variants: AdaGrad, RMSProp, Adam



Basic Framework for Adaptive SGD

1. λ = a tensor of learning rates, one per parameter.
2. Init each rate in λ to the same value.
3. Init the global gradient accumulator (R) to 0.
4. Minibatch Training Loop:
   Sample a minibatch, M.
   Reset the local gradient accumulator: G ← ZeroTensor.
   For each case (c) in M:
      Run c through the network; compute loss and gradients (g).
      G ← G + g (for all parameters)
   R ← f_accum(R, G)
   Δθ = −f_scale(R, λ) ⊙ G
   where θ = weights and biases, and ⊙ = element-wise multiply:
   each gradient gets its own scale factor.

Key differences between optimizers: f_accum and f_scale
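
A minimal NumPy sketch of this framework (names are my own; grad_fn(theta, c) is a hypothetical callable returning the gradient tensor for one case):

import numpy as np

def adaptive_sgd_epoch(theta, R, minibatches, grad_fn, f_accum, f_scale, lam):
    """One pass of the generic loop: accumulate per-case gradients into G,
    fold G into the global accumulator R, then take an element-wise
    scaled step. f_accum and f_scale distinguish the optimizers."""
    for M in minibatches:
        G = np.zeros_like(theta)              # local gradient accumulator
        for c in M:
            G += grad_fn(theta, c)
        R = f_accum(R, G)
        theta = theta - f_scale(R, lam) * G   # element-wise scaled step
    return theta, R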



Specializations
AdaGrad (Duchi et al., 2011)
f_accum(R, G) = R + G ⊙ G (square each gradient in G)
f_scale(R, λ) = λ / (δ + √R) (where δ is very small, e.g. 10⁻⁷)

RMSProp (Hinton, 2012)
f_accum(R, G) = ρR + (1 − ρ) G ⊙ G (where ρ = decay rate)
f_scale(R, λ) = λ / √(δ + R) (where δ is very small, e.g. 10⁻⁶)
ρ controls the weight given to older gradients.

Adam (adaptive moments) (Kingma and Ba, 2014)
f_accum(R, G) = [ρR + (1 − ρ) G ⊙ G] / (1 − ρ^T) (where ρ = decay rate, T = timestep)
f_scale(R, λ) = λS / (δ + √R) (δ = 10⁻⁸)
where S ← [φS + (1 − φ)G] / (1 − φ^T) (S = first-moment estimate, φ = its decay rate)

RMSProp and Adam are the most popular.
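
As a sketch, the first two specializations drop straight into the framework above as plain functions (Adam is omitted because it also needs the timestep T and the first-moment estimate S as extra state):

import numpy as np

def adagrad_accum(R, G):
    return R + G * G                        # running sum of squared gradients

def adagrad_scale(R, lam, delta=1e-7):
    return lam / (delta + np.sqrt(R))

def rmsprop_accum(R, G, rho=0.9):
    return rho * R + (1.0 - rho) * G * G    # decaying average of squared gradients

def rmsprop_scale(R, lam, delta=1e-6):
    return lam / np.sqrt(delta + R)

# e.g.: theta, R = adaptive_sgd_epoch(theta, R, batches, grad_fn,
#                                     rmsprop_accum, rmsprop_scale, lam=0.001)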


Momentum in Adam

AdaGrad and RMSProp use momentum only indirectly, via the
2nd-order moment estimate (R).
The S term in the Adam optimizer is a 1st-order moment
estimate: a more direct use of previous gradients to affect Δθ.
It implements momentum, producing a modified value of the gradient G.
Since S (and thus G) is included in f_scale, we do not need G
in the final calculation of Δθ, which, in Adam, is:

Δθ = −f_scale(R, λ)



Batch Normalization
Normalize all activations in a layer w.r.t. minibatch averages.

Training Phase
Hidden layer (h) of size n; Minibatch of size m.
hᵢₖ = output of the kth neuron for case i of the minibatch.
Calculate averages and standard deviations (per neuron):

μₖ = (1/m) ∑ᵢ₌₁ᵐ hᵢₖ

σₖ = √( δ + (1/m) ∑ᵢ₌₁ᵐ (hᵢₖ − μₖ)² )

δ = small constant (e.g. 10⁻⁷) to avoid divide by 0.

Scale activations, ∀i, k:

ĥᵢₖ = (hᵢₖ − μₖ) / σₖ

Testing Phase: Calc ĥᵢₖ using μₖ and σₖ from ALL training data.
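
A minimal NumPy sketch of the training-phase computation (array layout is an assumption: rows = the m cases, columns = the n neurons):

import numpy as np

def batch_norm_train(H, delta=1e-7):
    """Normalize minibatch activations H (shape m x n) per neuron."""
    mu = H.mean(axis=0)                                    # mu_k
    sigma = np.sqrt(delta + ((H - mu) ** 2).mean(axis=0))  # sigma_k
    return (H - mu) / sigma                                # h-hat_ik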
Backpropagation with Batch Normalization (BN)

[Figure: Input feeds Hidden Layer L, a Batch Norm Layer, Hidden Layer
L+1, another Batch Norm Layer, then Output; the forward pass and
backpropagation both traverse this same stack.]

Gradients ∂Loss/∂w are easily calculated across batch-norm layers.
BN can be used either a) after F_act or b) between the Σ of inputs and F_act.



Why Is BN A Popular Optimizer?
Normalization → many small outputs → less saturation of output
functions → fewer vanishing gradients.
Although sigmoid and tanh also have small outputs, the BN layer
maintains significant gradients.
Higher learning rates can be used with BN.
Even sigmoids and tanhs can be used in hidden layers with BN.
BN Handles a Covariate Shift
Covariate Shift → Distribution of features differs between
training and test data.
Problem: Train on one ”type” of inputs, test on another. E.g.
pictures of dogs under different lighting.
Testing goes poorly unless training and test data both
scaled to same distribution.
Each hidden layer receives ”features” from the previous
layer. It’s harder to learn when those features are always
changing due to a) different data, and b) learning (which
changes upstream weights and outputs).



Why Is BN a Useful Regularizer?

In scaling the inputs and other activations, the use of


minibatch statistics (avgs and variances) instead of those
for the full batch introduces noise into the data (and
processing).
This prevents over-fitting in ways similar to both dataset
augmentation and dropout.



Second Derivatives: Quantifying Curvature
[Figure: a loss curve L(w) with segments labelled by the signs of dL/dw
and d²L/dw²: regions of zero, negative, and positive curvature.]
When sign(∂L/∂w) = sign(∂²L/∂w²), the current slope gets more extreme.
For gradient-descent learning: ∂²L/∂w² < 0 (> 0) → More (Less) descent
per Δw than estimated by the standard gradient, ∂L/∂w.



Approximations using 1st and 2nd Derivatives

For a point (p0 ) in search space for which we know F(p0 ), use 1st and
2nd derivative of F at p0 to estimate F(p) for a nearby point p.
Knowing F(p) affects decision of moving to (or toward) p from p0 .
Using the 2nd-order Taylor Expansion of F to approximate F(p):
F(p) ≈ F(p₀) + F′(p₀)(p − p₀) + ½ F″(p₀)(p − p₀)²
Extend this to a multivariable function such as L (the loss function),
which computes a scalar L(θ) when given the tensor of variables, θ:
L(θ) ≈ L(θ₀) + (θ − θ₀)ᵀ (∂L/∂θ)|θ₀ + ½ (θ − θ₀)ᵀ (∂²L/∂θ²)|θ₀ (θ − θ₀)

∂L/∂θ = Jacobian, and ∂²L/∂θ² = Hessian.
Knowing L(θ ) affects decision of moving to (or toward) θ from θ0 .



The Hessian Matrix

A matrix (H) of all second derivatives of a function, f, with


respect to all parameters.
For gradient descent learning, f = the loss function (L) and the
parameters are weights (and biases).

H(L)(w)ᵢⱼ = ∂²L / (∂wᵢ ∂wⱼ)

Wherever the second derivatives are continuous, H is symmetric:

∂²L/(∂wᵢ ∂wⱼ) = ∂²L/(∂wⱼ ∂wᵢ)

Verify this symmetry: Let L(x, y) = 4x²y + 3y³x², and then
compute ∂²L/(∂x∂y) and ∂²L/(∂y∂x).
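
Working the example out: ∂L/∂x = 8xy + 6y³x, so ∂²L/(∂y∂x) = 8x + 18y²x; and ∂L/∂y = 4x² + 9y²x², so ∂²L/(∂x∂y) = 8x + 18y²x. The two mixed partials agree, as symmetry requires.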



The Hessian Matrix

H(L)(w) =

⎡ ∂²L/∂w₁²      ∂²L/∂w₁∂w₂    ···   ∂²L/∂w₁∂wₙ ⎤
⎢ ∂²L/∂w₂∂w₁   ∂²L/∂w₂²       ···   ∂²L/∂w₂∂wₙ ⎥
⎢     ...            ...       ...       ...     ⎥
⎣ ∂²L/∂wₙ∂w₁   ∂²L/∂wₙ∂w₂    ···   ∂²L/∂wₙ²    ⎦



Function Estimates via Jacobian and Hessian

[Figure: contours of F(x, y); point q lies at a displacement (Δx, Δy)
from point p.]

Given: point p, point q, F(p), the Jacobian(J) and Hessian(H) at p.


Goal: Estimate F(q)
Approach: Combine Δp = [Δx, Δy]ᵀ with J and H.

The Jacobian's Contribution to ΔF(x, y)

            ⎡ ∂F/∂x ⎤
[Δx, Δy] •  ⎣ ∂F/∂y ⎦ₚ ≈ ΔF(x, y)|p→q



The Hessian's Contribution to ΔF(x, y)

Hessian • Δp ≈ Jacobian:

⎡ ∂²F/∂x²    ∂²F/∂x∂y ⎤    ⎡ Δx ⎤     ⎡ ∂F/∂x ⎤
⎣ ∂²F/∂y∂x   ∂²F/∂y²  ⎦ₚ • ⎣ Δy ⎦  ≈  ⎣ ∂F/∂y ⎦ₚ

Δpᵀ • Hessian • Δp ≈ ΔF(x, y):

            ⎡ ∂²F/∂x²    ∂²F/∂x∂y ⎤    ⎡ Δx ⎤
[Δx, Δy] •  ⎣ ∂²F/∂y∂x   ∂²F/∂y²  ⎦ₚ • ⎣ Δy ⎦

            ⎡ ∂F/∂x ⎤
≈ [Δx, Δy] • ⎣ ∂F/∂y ⎦ₚ ≈ ΔF(x, y)|p→q
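A small NumPy sketch of these estimates (the quadratic test function is my own; I keep the ½ factor from the Taylor-expansion slide, which this slide's shorthand omits):

import numpy as np

def estimate_delta_F(J_p, H_p, dp):
    """Second-order Taylor estimate of F(q) - F(p) for the step dp = q - p,
    given the Jacobian (gradient) J_p and Hessian H_p of F at p."""
    return J_p @ dp + 0.5 * dp @ H_p @ dp

# Example with F(x, y) = x^2 + 3xy at p = (1, 2):
J = np.array([8.0, 3.0])                 # [2x + 3y, 3x] evaluated at p
H = np.array([[2.0, 3.0],
              [3.0, 0.0]])               # constant Hessian of this F
dp = np.array([0.1, -0.2])               # q - p
print(estimate_delta_F(J, H, dp))        # 0.15, exact since F is quadratic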



Eigenvalues of The Hessian Matrix

For any unit vector, v, the 2nd derivative of L in that direction is vᵀHv.


Since the Hessian is real and symmetric, it decomposes into an
eigenvector basis and a set of real eigenvalues.
For each eigenvector-eigenvalue pair (vᵢ, κᵢ), where each vᵢ is a unit
vector, the 2nd derivative of L in the direction vᵢ is κᵢ, since:

vᵢᵀHvᵢ = vᵢᵀκᵢvᵢ = vᵢᵀvᵢκᵢ = (1)κᵢ

Hvᵢ = κᵢvᵢ by definition of eigenvectors and eigenvalues.
vᵢᵀvᵢ = 1 since vᵢ is a unit vector.
The max (min) eigenvalue = the max (min) second derivative along any
of the eigenvectors. These indicate directions of high positive and
negative curvature, along with flatter directions, all depending upon the
signs and magnitudes of the eigenvalues.



The Condition Number of the Hessian

Eigenvalues of the Hessian provide a quick check as to


how dramatic and varied the curvatures of the loss function
are at any point in parameter space.
Condition number of a matrix = ratio of magnitudes of max
and min eigenvalues, κ:

maxᵢ,ⱼ |κᵢ| / |κⱼ|

Ill-Conditioning: High condition number (at a location in


search space) → difficult gradient-descent search.
Curvatures vary greatly in different directions, so the
shared learning rate (which affects movement in all
dimensions) may cause too big a step in some directions,
and too small in others.
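
A quick NumPy illustration (the matrix is an arbitrary symmetric example): eigh gives the real eigenvalues and orthonormal eigenvectors of a symmetric Hessian, vᵀHv recovers κ for an eigenvector, and the eigenvalue-magnitude ratio gives the condition number.

import numpy as np

H = np.array([[4.0, 1.0],
              [1.0, 0.5]])            # a symmetric Hessian at some point

kappas, V = np.linalg.eigh(H)         # real eigenvalues; eigenvectors in columns
v = V[:, 0]                           # a unit eigenvector
print(v @ H @ v, kappas[0])           # v^T H v equals its eigenvalue kappa

cond = np.abs(kappas).max() / np.abs(kappas).min()
print("condition number:", cond)      # high ratio -> ill-conditioned region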



Optimal Step Sizes based on Hessian Eigenvalues

When derivative of loss function is negative along eigenvector vi :


κi ≈ 0 → landscape has constant slope → take a normal-sized
step along vi , since the gradient is an accurate representation of
the surrounding area.
κi > 0 → landscape is curving upward → only take a small step
along vi , since a normal step could end up in a region of
increased error/loss.
κi < 0 → landscape is curving downward→ take a large step
along vi to descend quickly.
The proximity of the ith dimensional axis to each such eigenvector
indicates the proper step to take along dimension i: the proper Δwᵢ.

