Regularization and Optimization of Backpropagation
Keith L. Downing
December 1, 2020
Definition of Regularization
Reduction of testing error while maintaining a low training error.
Excessive training does reduce training error, but often at the expense
of higher testing error.
The network essentially memorizes the training cases, which hinders
its ability to generalize and handle new cases (e.g., the test set).
This exemplifies the tradeoff between bias and variance: an NN with low bias
(i.e., low training error) has trouble reducing variance (i.e., test error).
Bias(θ̃_m) = E(θ̃_m) − θ

where θ parameterizes the true generating function, and θ̃_m parameterizes an
estimator based on the sample m; e.g., θ̃_m are the weights of an NN after
training on sample set m.
Var(θ̃_m) = the degree to which the estimator's results change with other
data samples (e.g., the test set) from the same data generator.
Regularization is thus an attempt to combat the bias-variance tradeoff: to
produce a θ̃_m with low bias and low variance.
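These definitions can be checked numerically. A minimal sketch, where the generating distribution (Gaussian), the estimator (the sample mean), and the sample size m are illustrative assumptions, not from the slides: the sample mean has bias near 0 and variance near 1/m.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0      # true parameter θ of the (assumed) generating distribution
m = 50           # size of each training sample

# Fit the estimator (here: the sample mean) to many independent samples
estimates = np.array([rng.normal(theta, 1.0, m).mean() for _ in range(10_000)])

bias = estimates.mean() - theta   # Bias(θ̃_m) = E(θ̃_m) − θ : near 0
variance = estimates.var()        # Var(θ̃_m) across samples : near 1/m = 0.02
```

The estimator is unbiased, but its variance shrinks only as the sample grows, which is the tension regularization tries to manage for the far more flexible NN estimator.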
L2 Regularization

Ω2(θ) = (1/2) ∑_{wi ∈ θ} wi²

Sum of squared weights across the whole neural network.
L1 Regularization

Ω1(θ) = ∑_{wi ∈ θ} |wi|

Sum of absolute values of all weights across the whole neural network.
In cases where most weights have a small absolute value (as during the
initial phases of typical training runs), L1 imposes a much stiffer penalty:
|wi| > wi² whenever |wi| < 1.
Hence, L1 can have a sparsifying effect: it drives many weights to zero.
Comparing the penalty derivatives (which contribute to the ∂L̃/∂wi gradients):

∂Ω2(θ)/∂wi = ∂/∂wi [(1/2) ∑_{wk ∈ θ} wk²] = wi

∂Ω1(θ)/∂wi = ∂/∂wi ∑_{wk ∈ θ} |wk| = sign(wi) ∈ {−1, 0, 1}

So the L2 penalty gradient scales linearly with wi, giving a more stable
update scheme. L1 provides a constant-magnitude gradient that can be large
compared to wi (when |wi| < 1).
This also contributes to L1's sparsifying effect upon θ.
Sparsification supports feature selection in Machine Learning.
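The two penalties and their gradients can be sketched directly (the example weight vector is an illustrative assumption):

```python
import numpy as np

def l2_penalty(weights):
    """Ω2(θ) = ½ Σ wi²; gradient wi penalizes large weights most."""
    return 0.5 * np.sum(weights ** 2), weights            # (penalty, gradient)

def l1_penalty(weights):
    """Ω1(θ) = Σ |wi|; gradient sign(wi) has constant magnitude."""
    return np.sum(np.abs(weights)), np.sign(weights)

w = np.array([-0.5, 0.0, 0.1, 2.0])
p2, g2 = l2_penalty(w)   # gradient equals w itself
p1, g1 = l1_penalty(w)   # gradient is [-1, 0, 1, 1]: same push on 0.1 as on 2.0
```

Note how the L1 gradient pushes the tiny weight 0.1 toward zero just as hard as the large weight 2.0, which is exactly the sparsifying effect described above.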
[Figure: a shared hidden layer (H0) learns general representations of
patterns in the input data, unrelated to any one task, while higher layers
(H1, H2, H3) and outputs (Out-1, Out-2, Actions) hold specialized,
task-specific representations.]
[Figure: training error falls steadily with epochs, while validation/testing
error falls and then rises again; training should stop at the
validation-error minimum.]
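The early-stopping idea in the figure can be sketched as a loop that watches the validation error; `train_epoch` and `validate` are assumed user-supplied callables, and the `patience` parameter is an illustrative convention, not from the slides:

```python
def early_stop_train(train_epoch, validate, patience=5, max_epochs=200):
    """Stop when validation error has not improved for `patience` epochs."""
    best_val, best_epoch, wait = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_epoch()
        val = validate()
        if val < best_val:            # validation error still improving
            best_val, best_epoch, wait = val, epoch, 0
        else:
            wait += 1
            if wait >= patience:      # "should stop training here"
                break
    return best_epoch, best_val

# Synthetic U-shaped validation curve standing in for a real run
vals = iter([1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2])
best_epoch, best_val = early_stop_train(lambda: None, lambda: next(vals))
```

In practice one would also checkpoint the weights at each new best epoch and restore them after stopping.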
[Figure: dropout applied to a layered network classifying "Diamond" vs
"Cross" input patterns; for each training case, randomly chosen input and
hidden nodes have their activations clamped to zero, but output nodes are
never dropped.]
For each case, randomly select nodes at all levels (except output) whose
outputs will be clamped to 0.
Typical dropout probability range: (0.2, 0.5), with lower values for input
than for hidden layers.
Similar to bagging, with each model being a submodel of the entire NN.
Because non-output nodes go missing during training (while outputs and
targets retain fixed length), each connection carries more significance and
thus needs (and achieves, by learning) a higher magnitude.
Testing involves all nodes.
Scaling: prior to testing, multiply all weights by (1 − p_h), where p_h =
probability of dropping hidden nodes.
[Figure: feature space (Feature 1 × Feature 2) with true class regions
(boxes) for Classes 1-3 and the NN's learned regions (clouds); an old case is
mutated along the feature with the high loss gradient (dL/df2 = high,
dL/df1 = low) to create a new case that the net misclassifies.]
Create new cases by mutating old cases in ways that maximize the
chance that the net will misclassify the new case.
Calculate ∂L/∂fi for the loss function (L) and each input feature fi, then
mutate along the features with the largest gradients.
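A minimal sketch of such a mutation, stepping each feature one ε in the direction sign(∂L/∂fi) that increases the loss fastest. The toy logistic model, its weights, and the step size ε are assumptions for illustration; the slides do not specify a model:

```python
import numpy as np

# Toy differentiable model: logistic regression, so ∂L/∂f is analytic
w, b = np.array([2.0, -1.0]), 0.0

def loss(f, y):
    """Cross-entropy loss for a single case (f, y)."""
    p = 1 / (1 + np.exp(-(w @ f + b)))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def dloss_df(f, y):
    """Gradient of the loss w.r.t. the input features: (p − y) · w."""
    p = 1 / (1 + np.exp(-(w @ f + b)))
    return (p - y) * w

f_old, y = np.array([1.0, 0.5]), 1
eps = 0.5
# Mutate the old case one eps-step up the loss gradient (sign trick)
f_new = f_old + eps * np.sign(dloss_df(f_old, y))
# loss(f_new, y) now exceeds loss(f_old, y): a harder, adversarial case
```

Adding such mutated cases to the training set is the regularizing move: the net is forced to get them right too.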
Δwi = −λ ∂(Loss)/∂wi
[Figure: a canyon-shaped loss surface Loss(L) over weights w1 and w2; the
descent gradient points mostly across the canyon, while the appropriate Δw
points along it.]
Following the (weaker) gradient along w1 leads to the minimum loss, but
the w2 gradients, up and down the sides of the cylinder (canyon), cause
excess lateral motion.
Since the w1 and w2 gradients are independent, search should still
move downhill quickly (in this smooth landscape).
But in a more rugged landscape, the lateral movement could visit
locations where the w1 gradient is untrue to the general trend. E.g. a
little dent or bump on the side of the cylinder could have gradients
indicating that increasing w1 would help reduce Loss (L).
[Figure: an error surface whose main trend leads downhill past a small bump;
without momentum, search ends up at the bump, when it should end up farther
along the trend. Inset: Δw(t) is the vector sum of the gradient step
−λ dE/dw and the momentum term αΔw(t−1).]
Δwij(t) = −λ ∂E/∂wij + α Δwij(t−1)

Without momentum, λ ∂E/∂w alone leads to a local minimum.
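A minimal sketch of this update rule on a 1-D quadratic error E(w) = w², whose gradient is 2w (the λ, α, starting point, and iteration count are illustrative choices):

```python
def momentum_step(w, grad, velocity, lam=0.1, alpha=0.9):
    """Δw(t) = −λ ∂E/∂w + α Δw(t−1): current gradient plus a memory term."""
    velocity = -lam * grad + alpha * velocity   # Δw(t)
    return w + velocity, velocity

w, v = 5.0, 0.0            # start far from the minimum at w = 0
for _ in range(200):
    w, v = momentum_step(w, 2 * w, v)          # ∂E/∂w = 2w
# w has converged close to 0; the velocity term smooths the oscillations
```

On the canyon landscape above, the same memory term averages out the alternating lateral w2 gradients while the consistent w1 gradients accumulate.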
Momentum in High Dimensions

Δθ = −1 × f_scale(R, λ)
Training Phase

Hidden layer (h) of size n; minibatch of size m.
h_ik = output of the kth neuron for case i of the minibatch.
Calculate averages and standard deviations (per neuron):

μk = (1/m) ∑_{i=1}^{m} h_ik

σk = √( δ + (1/m) ∑_{i=1}^{m} [h_ik − μk]² )

Testing Phase: calculate ĥ_ik using μk and σk computed from ALL training data.
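The per-neuron statistics and the normalized activations ĥ_ik can be sketched as follows (the tiny minibatch is an illustrative assumption; δ guards against division by zero):

```python
import numpy as np

def batchnorm_stats(H, delta=1e-5):
    """Per-neuron mean and std over a minibatch H of shape (m, n)."""
    mu = H.mean(axis=0)                               # μ_k = (1/m) Σ_i h_ik
    sigma = np.sqrt(delta + ((H - mu) ** 2).mean(axis=0))
    return mu, sigma

H = np.array([[1.0, 2.0],
              [3.0, 6.0]])                            # m=2 cases, n=2 neurons
mu, sigma = batchnorm_stats(H)
H_hat = (H - mu) / sigma     # ĥ_ik: zero mean, unit std within the minibatch
```

At test time the same normalization would be applied, but with μk and σk accumulated over the full training data rather than per minibatch.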
Backpropagation with Batch Normalization (BN)

Gradients ∂Loss/∂w are easily calculated across batch-norm layers.
BN can be used either a) after F_act or b) between the Σ of inputs and F_act.
[Figure: one-dimensional loss curves illustrating d²L/dw² = 0, d²L/dw² < 0,
and d²L/dw² > 0.]

When sign(∂L/∂w) = sign(∂²L/∂w²), the current slope gets more extreme.
For gradient descent learning: ∂²L/∂w² < 0 (> 0) → more (less) descent
per Δw than estimated by the standard gradient, ∂L/∂w.
For a point (p0 ) in search space for which we know F(p0 ), use 1st and
2nd derivative of F at p0 to estimate F(p) for a nearby point p.
Knowing F(p) affects decision of moving to (or toward) p from p0 .
Using the 2nd-order Taylor expansion of F to approximate F(p):

F(p) ≈ F(p0) + F′(p0)(p − p0) + (1/2) F″(p0)(p − p0)²
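A worked numeric check of this expansion, using F(p) = p³ as an assumed example function with known derivatives:

```python
def taylor2(F, dF, d2F, p0, p):
    """2nd-order Taylor estimate of F(p) from F and its derivatives at p0."""
    return F(p0) + dF(p0) * (p - p0) + 0.5 * d2F(p0) * (p - p0) ** 2

F   = lambda p: p ** 3        # example function (an assumption)
dF  = lambda p: 3 * p ** 2    # F′
d2F = lambda p: 6 * p         # F″

p0, p = 1.0, 1.1
approx = taylor2(F, dF, d2F, p0, p)   # 1 + 3(0.1) + 3(0.1)² = 1.33
exact = F(p)                          # 1.331; the gap is the O((p−p0)³) term
```

The approximation error shrinks cubically as p approaches p0, which is why the expansion is informative only for nearby points.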
Extend this to a multivariable function such as L (the loss function), which
computes a scalar L(θ) when given the tensor of variables θ:

L(θ) ≈ L(θ0) + (θ − θ0)ᵀ [∂L/∂θ]_{θ0} + (1/2) (θ − θ0)ᵀ [∂²L/∂θ²]_{θ0} (θ − θ0)

where ∂L/∂θ = the Jacobian and ∂²L/∂θ² = the Hessian.
Knowing L(θ ) affects decision of moving to (or toward) θ from θ0 .
H(L)(w)ij = ∂²L / (∂wi ∂wj)

When the second partial derivatives are continuous, the order of
differentiation does not matter:

∂²L / (∂wi ∂wj) = ∂²L / (∂wj ∂wi)

so the Hessian is symmetric:

H(L)(w) =
⎡ ∂²L/∂w1²       ∂²L/(∂w1 ∂w2)  ···  ∂²L/(∂w1 ∂wn) ⎤
⎢ ∂²L/(∂w2 ∂w1)  ∂²L/∂w2²       ···  ∂²L/(∂w2 ∂wn) ⎥
⎢       ⋮              ⋮          ⋱        ⋮        ⎥
⎣ ∂²L/(∂wn ∂w1)  ∂²L/(∂wn ∂w2)  ···  ∂²L/∂wn²      ⎦
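A finite-difference sketch of this matrix, checked against a toy loss with known second derivatives (the loss function and evaluation point are assumptions for illustration):

```python
import numpy as np

def hessian(L, w, eps=1e-4):
    """Finite-difference estimate of H_ij = ∂²L / (∂w_i ∂w_j)."""
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            wpp = w.copy(); wpp[i] += eps; wpp[j] += eps
            wpm = w.copy(); wpm[i] += eps; wpm[j] -= eps
            wmp = w.copy(); wmp[i] -= eps; wmp[j] += eps
            wmm = w.copy(); wmm[i] -= eps; wmm[j] -= eps
            H[i, j] = (L(wpp) - L(wpm) - L(wmp) + L(wmm)) / (4 * eps ** 2)
    return H

L = lambda w: w[0] ** 2 * w[1] + w[1] ** 3    # smooth toy loss (assumed)
H = hessian(L, np.array([1.0, 2.0]))
# analytic Hessian at (1, 2): [[2*w1, 2*w0], [2*w0, 6*w1]] = [[4, 2], [2, 12]]
```

The numeric result matches the analytic matrix and is symmetric, as the slide's equality of mixed partials predicts.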
[Figure: contours of F(x, y), with a step (Δx, Δy) from point p to point q.]

[Δx, Δy] · [∂F/∂x, ∂F/∂y]ᵀ|_p ≈ ΔF(x, y)|p→q
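This first-order estimate of the change in F can be checked numerically; the scalar field, point p, and step (Δx, Δy) below are illustrative assumptions:

```python
import numpy as np

F = lambda x, y: x ** 2 + 3 * y              # example scalar field (assumed)
grad = lambda x, y: np.array([2 * x, 3.0])   # [∂F/∂x, ∂F/∂y]

p = np.array([1.0, 1.0])
step = np.array([0.01, -0.02])               # (Δx, Δy) from p to q
q = p + step

est = step @ grad(*p)                        # [Δx, Δy] · ∇F|_p ≈ ΔF|p→q
actual = F(*q) - F(*p)                       # true change in F
```

The dot product with the gradient at p predicts the true change to first order; the residual comes from the curvature (Hessian) terms the estimate ignores.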