
Regularization and Normalization

The document discusses techniques for regularization and normalization in neural networks, including normalization methods like Min-Max and Standardization, and strategies to mitigate overfitting such as data augmentation, L1 and L2 regularization, and dropout. It emphasizes the importance of scaling data to ensure stability during training and outlines how to implement weight decay in optimization algorithms like SGD and Adam. Additionally, it highlights the limitations of dropout in convolutional layers and introduces alternatives like DropConnect.


Regularization and Normalization

Normalization and Standardization

• Normalization (Min-Max): scales data to have values between 0 and 1.

    x' = (x − min) / (max − min)

Example: [44, 23, 56], min = 23, max = 56
44 → (44 − 23) / (56 − 23) = 0.63
23 → (23 − 23) / (56 − 23) = 0
56 → (56 − 23) / (56 − 23) = 1

• Standardization: rescales the data to have a mean of 0 and a standard deviation of 1.

    x' = (x − mean) / std

Example: [44, 23, 56]
Mean: (44 + 23 + 56) / 3 = 41
Standard deviation: 13.63
44 → (44 − 41) / 13.63 = 0.22
23 → (23 − 41) / 13.63 = −1.32
56 → (56 − 41) / 13.63 = 1.10
But why do we need to do this in the first place?
• If we don't, features with higher values will dominate those with lower values, making the lower-valued features nearly useless in the dataset. The network then becomes unstable, the exploding gradient problem may occur, and training slows down. Therefore, all our data should be on the same scale.
Overfitting
• A situation where your network has only learned the training data and cannot cope with the test data: it performs well only on the training data.
• If you get a high accuracy on the training data but a low accuracy on the test data
→ your model has overfitted.

[Figure: underfitting vs. a good fit vs. overfitting]

How do we know when our model is overfitting?
• When we see a high training accuracy but a low validation/test accuracy.

[Figure: accuracy vs. epoch — training accuracy keeps rising while validation accuracy starts degrading]
How Can We Reduce Overfitting?
• Without adding any regularization:
- Train on more data
- Use data augmentation
- Use early stopping: stop training when the validation error starts increasing, or when it is no longer improving

[Figure: accuracy vs. epoch — stop training at the point where validation accuracy stops improving]
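The early-stopping strategy can be sketched as a small training-loop wrapper (the names are illustrative; `train_step` and `validate` stand in for your own training and validation routines):

```python
def train_with_early_stopping(train_step, validate, max_epochs=100, patience=5):
    """Stop training when validation error has not improved for `patience` epochs."""
    best_val_error = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_step()
        val_error = validate()
        if val_error < best_val_error:
            best_val_error = val_error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break   # validation error stopped improving: stop training here
    return best_val_error
```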
Data Augmentation
• In data augmentation, we increase the training data by utilizing the images we already have and creating different versions of them (e.g. flipping, rotating, cropping, zooming, shifting, etc.). The network is then trained on the original dataset together with the augmented data, so it performs better since it has seen different appearances of the same images.
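A minimal NumPy sketch of augmentation by flipping and rotating (real pipelines typically also crop, zoom, and shift, often via a library; the function name is illustrative):

```python
import numpy as np

def augment(image):
    """Return simple augmented variants of an (H, W) image array:
    horizontal flip, vertical flip, and a 90-degree rotation."""
    return [
        np.fliplr(image),   # horizontal flip
        np.flipud(image),   # vertical flip
        np.rot90(image),    # 90-degree rotation
    ]

image = np.arange(9).reshape(3, 3)
augmented = augment(image)
# training set = original images + augmented copies
```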
L1 Regularization
• L1 is a learned feature-selection technique (machine learning term: embedded feature selection) which causes non-relevant features to have a small weight (near or equal to zero), so that they don't affect the prediction output. Irrelevant (and too many!) features cause the network to overfit.
• Other machine learning models that use embedded feature selection include decision trees and random forests.

Example: if x₁ is an irrelevant feature, L1 drives its weight w₁ to 0:

    x₁w₁ + x₂w₂ + x₃w₃  →  x₁(0) + x₂w₂ + x₃w₃  =  x₂w₂ + x₃w₃

Non-relevant features: features that are not correlated with the output (independent of the output).

L1 Regularization adds the sum of the absolute values of the weights as a penalty term to the loss function. It is used for feature selection (sparsity) and shrinkage of the weights (making them smaller in value).

    Loss = L_data + λ · Σᵢ |wᵢ|        (L_data: the data loss, e.g. XE, MSE, CL, etc.)

L2 Regularization adds the sum of the squares of the weights as a penalty term to the loss function. It does not introduce sparsity and is used only for shrinkage of the weights (making them smaller in value).

    Loss = L_data + λ · Σᵢ wᵢ²         (L_data: the data loss, e.g. XE, MSE, CL, etc.)

λ is very important! A good value is 0.5.

If set to 1, all weights become 0 → no learning.
If set to 0, the penalty term goes away → no regularization effect.
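A small sketch of how the penalty term is added to the data loss (the function name and the `kind` switch are illustrative):

```python
import numpy as np

def regularized_loss(data_loss, weights, lam, kind="l2"):
    """Add an L1 or L2 penalty term to a data loss (e.g. MSE, cross-entropy)."""
    w = np.concatenate([np.ravel(p) for p in weights])
    if kind == "l1":
        penalty = lam * np.sum(np.abs(w))   # sum of absolute values: encourages sparsity
    else:
        penalty = lam * np.sum(w ** 2)      # sum of squares: shrinks weights
    return data_loss + penalty

weights = [np.array([1.0, -2.0])]
print(regularized_loss(1.0, weights, lam=0.5, kind="l1"))  # 1.0 + 0.5*(1+2) = 2.5
print(regularized_loss(1.0, weights, lam=0.5, kind="l2"))  # 1.0 + 0.5*(1+4) = 3.5
```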
Weight Decay and Regularization

• The weight update rule is given as:

    w = w − η · dẼ/dw

• Adding the term (λ/2)·w² to the cost function E gives Ẽ = E + (λ/2)·w², and the derivative of the new term is:

    d/dw [(λ/2)·w²] = (λ/2)·(2w) = λw

• Taking this derivative and substituting it into the weight update rule, we end up with the update rule:

    dẼ/dw = dE/dw + λw

    w = w − η · (dE/dw + λw)

This new term coming from the regularization causes the weight to decay in proportion to its size.
So we subtract a little portion of the weight at each step, hence the name "decay". We can directly modify the weight update rule rather than modifying the loss function: we don't want to add more computations by modifying the loss when there is an easier way.
• So when we add the L2 term to the loss function, it is denoted L2 regularization. When we modify the weight update rule directly, it is denoted weight decay. For vanilla SGD, the two are the same.
• But they aren't the same when adding momentum or using Adam: L2 regularization and weight decay become different.

Why? When using momentum with L2 regularization, the moving average becomes:

    moving_avg = alpha * moving_avg + (1-alpha) * (w.grad + wd*w)   # wd*w: weight decay term

Then we update the weights:

    w = w - lr * moving_avg

→ the part linked to the regularization that will be taken from w is lr * (1-alpha) * wd * w, plus a combination of the previous weights that were already in moving_avg.

When using weight decay:

    moving_avg = alpha * moving_avg + (1-alpha) * w.grad
    w = w - lr * moving_avg - lr * wd * w

We can see that the part subtracted from w linked to regularization isn't the same in the two methods.

See https://www.fast.ai/2018/07/02/adam-weight-decay/
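The two update rules can be compared numerically in plain Python (the step functions and hyperparameter values are illustrative). After two steps with the same gradient, the weights already differ, because the L2 version carries old wd*w contributions inside the moving average:

```python
def l2_momentum_step(w, grad, moving_avg, lr=0.1, alpha=0.9, wd=0.01):
    """L2 regularization: the wd*w term enters the moving average."""
    moving_avg = alpha * moving_avg + (1 - alpha) * (grad + wd * w)
    return w - lr * moving_avg, moving_avg

def weight_decay_step(w, grad, moving_avg, lr=0.1, alpha=0.9, wd=0.01):
    """Decoupled weight decay: the moving average sees only the raw gradient."""
    moving_avg = alpha * moving_avg + (1 - alpha) * grad
    return w - lr * moving_avg - lr * wd * w, moving_avg

w1 = w2 = 1.0
m1 = m2 = 0.0
for grad in [0.5, 0.5]:               # two steps with the same gradient
    w1, m1 = l2_momentum_step(w1, grad, m1)
    w2, m2 = weight_decay_step(w2, grad, m2)
# w1 and w2 are no longer equal after the second step
```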
Decoupling weight decay: adding weight decay to momentum and Adam
Simply add the weight decay term after the parameter update, as in the original definition.

The SGD with momentum and weight decay (SGDW) update appends the decoupled weight-decay term −ηλθₜ to the parameter update.

Similarly, for Adam with weight decay (AdamW), we obtain the same decoupled term −ηλθₜ appended after the Adam step.

Summary
Just adding the square of the weights to the loss function is not the correct way of using L2 regularization/weight decay with Adam, since that will interact with the m and v parameters in strange ways. Instead, we want to decay the weights in a manner that doesn't interact with the m/v parameters. This is equivalent to adding the square of the weights to the loss with plain (non-momentum) SGD. So just add weight decay at the end (fixed version):

    if group["weight_decay"] > 0.0:
        p.data.add_(p.data, alpha=(-group["lr"] * group["weight_decay"]))

[Figure: training curves comparing "Using L2" vs. "Using Weight Decay"]
So did it yield good results?
• The authors show that this substantially improves Adam's generalization performance and allows it to compete with SGD with momentum on image classification datasets.
• Using a lower β₂ value (0.99 or 0.9 vs. the default 0.999), which controls the contribution of the exponential moving average of past squared gradients in Adam, worked better.
Dropout
• During training of the network, neurons tend to heavily depend on each other, hence the
power of each neuron is lost and the model might overfit.
• Dropout forces a neural network to learn more robust features that are useful in conjunction
with many different random subsets of the other neurons.
• Dropout roughly doubles the number of iterations required to converge. However, training
time for each epoch is less, since some neurons are disabled.
In the Training Phase: For a specific hidden layer, for each training sample, for each iteration,
randomly disable a fraction, p, of neurons (and corresponding activations).

Result:
Neurons no longer depend on each other, and hence the network can perform well on other data. The network is able to learn several independent correlations in the data. Therefore, overfitting is reduced.

Srivastava, Nitish, et al. ”Dropout: a simple way to prevent neural networks from overfitting”, JMLR 2014
A specific neuron might always learn to
depend on one of the previous neurons
(one feature) in order to extract
information. When we remove that
feature and not let the neuron depend
on it anymore, it will be forced to learn
the information needed from the other
features. Thus, at testing time, it can
perform better on unseen data since it
has learned how to extract information
from features that are not present at
testing time.
Consider p to be the probability that a neuron is kept (higher p = less dropout).

At training time: drop the neuron activation outputs (consider p = 0.5):

    Activations:                  After dropout:
    0.2  0.3  0.9  0.7  0.4  0.1      0    0.3  0.9  0    0.4  0.1
    0.4  0.7  0.4  1.1  0.6 -0.2      0    0    0    1.1  0    0
    0.9  0.1  0.1  1.2  0.1 -0.7      0    0.1  0.1  1.2  0   -0.7
    0.6  0.6  0.8  0.7  0.6 -0.4      0.6  0    0.8  0    0.6  0

At test time dropout is not applied (all inputs are considered), so we want the outputs of neurons at test time to match their expected outputs at training time. Otherwise testing would be very unusual, since the network would see larger values than during training (when some activations were zeroed out).

    E[neuron output with dropout] = p·x + (1 − p)·0 = p·x
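A NumPy sketch of standard dropout with test-time scaling (function names are illustrative; p is the keep probability as above):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(x, p):
    """Training: keep each activation with probability p, zero out the rest."""
    mask = rng.random(x.shape) < p
    return x * mask

def dropout_test(x, p):
    """Testing: keep all activations but scale by p, so the expected
    magnitude matches what the network saw during training."""
    return x * p

x = np.array([0.2, 0.3, 0.9, 0.7, 0.4, 0.1])
train_out = dropout_train(x, p=0.5)   # random: roughly half the entries zeroed
test_out = dropout_test(x, p=0.5)     # [0.1, 0.15, 0.45, 0.35, 0.2, 0.05]
```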


So at test time, we must scale each activation x by p to get the same expected result:

    Test-time activations:            Scaled by p = 0.5:
    0.2  0.3  0.9  0.7  0.4  0.1      0.2(0.5)  0.3(0.5)  0.9(0.5)  0.7(0.5)  0.4(0.5)  0.1(0.5)
    0.4  0.7  0.4  1.1  0.6 -0.2      0.4(0.5)  0.7(0.5)  0.4(0.5)  1.1(0.5)  0.6(0.5) -0.2(0.5)
    0.9  0.1  0.1  1.2  0.1 -0.7      0.9(0.5)  0.1(0.5)  0.1(0.5)  1.2(0.5)  0.1(0.5) -0.7(0.5)
    0.6  0.6  0.8  0.7  0.6 -0.4      0.6(0.5)  0.6(0.5)  0.8(0.5)  0.7(0.5)  0.6(0.5) -0.4(0.5)


Another way to do this is Inverted Dropout, which performs the scaling at training time, leaving the forward pass at test time untouched.
At training time, divide the kept activations by the keeping probability p:

    Activations:                      After inverted dropout (p = 0.5):
    0.2  0.3  0.9  0.7  0.4  0.1      0        0.3/0.5  0.9/0.5  0        0.4/0.5  0.1/0.5
    0.4  0.7  0.4  1.1  0.6 -0.2      0        0        0        1.1/0.5  0        0
    0.9  0.1  0.1  1.2  0.1 -0.7      0        0.1/0.5  0.1/0.5  1.2/0.5  0       -0.7/0.5
    0.6  0.6  0.8  0.7  0.6 -0.4      0.6/0.5  0        0.8/0.5  0        0.6/0.5  0
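A minimal NumPy sketch of inverted dropout (illustrative name): dividing the surviving activations by p keeps the expected output equal to x, so the test-time forward pass is just the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout_train(x, p):
    """Training: drop with probability 1-p and divide survivors by p,
    so E[output] = x and no scaling is needed at test time."""
    mask = rng.random(x.shape) < p
    return x * mask / p

# At test time the layer is the identity: no scaling needed.
x = np.ones(100000)
out = inverted_dropout_train(x, p=0.5)
print(out.mean())  # ≈ 1.0: the expected value is preserved
```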


Drop Connect (Weight Drop)
• The only difference from Dropout is that we drop the weights (set them to zero) rather than dropping the neuron outputs.

[Figure: Dropout drops unit activations; DropConnect drops individual connections (weights)]
Dropout in CNN
• Dropout is less effective for convolutional layers, where features are correlated
spatially. When the features are correlated, even with dropout, information about
the input can still be sent to the next layer, which causes the networks to overfit.
Therefore, dropping out activations at random is not effective in removing
semantic information because nearby activations contain closely related
information.

Example: when applying max pooling:

    0 0 0            0 0 0
    6 5 4  → Drop →  6 0 0  → Max Pool → 6
    5 4 3            5 4 0

The maximum value (6) is still selected and propagated to the next layer.

    0 9 0            0 0 0
    8 8 0  → Drop →  8 8 0  → Max Pool → 8
    8 0 0            8 0 0

The maximum (9) is dropped, but a correlated feature (8) is selected instead. Important information is still being selected and propagated to the next layer.
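The two max-pooling cases above can be checked numerically (a toy sketch with hand-picked 3×3 patches and drop masks):

```python
import numpy as np

def max_pool(patch):
    """Max pooling over a single patch."""
    return patch.max()

# Case 1: the maximum (6) survives the random drop, so the same
# information reaches the next layer.
patch = np.array([[0, 0, 0],
                  [6, 5, 4],
                  [5, 4, 3]])
mask = np.array([[1, 1, 1],
                 [1, 0, 0],
                 [1, 1, 0]])
print(max_pool(patch * mask))    # 6

# Case 2: the maximum (9) is dropped, but a spatially correlated
# value (8) is selected instead, so information still propagates.
patch2 = np.array([[0, 9, 0],
                   [8, 8, 0],
                   [8, 0, 0]])
mask2 = np.array([[1, 0, 1],
                  [1, 1, 1],
                  [1, 1, 1]])
print(max_pool(patch2 * mask2))  # 8
```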
DropBlock
DropBlock is a simple method similar to dropout. Its main difference from dropout is that it drops contiguous regions from a feature map of a layer instead of dropping independent random units. Dropping contiguous regions can remove certain semantic information (e.g., head or feet) and consequently forces the remaining units to learn features for classifying the input image.

[Figure: Dropout vs. DropBlock — the shaded regions mark the activation units that contain semantic information in the input image]
Hyperparameters for DropBlock
Block size: the size of the block to be dropped from the feature map. Note that if block_size = 1, DropBlock is equivalent to Dropout.

γ: controls the number of features to drop.

To account for the fact that every zero entry in the mask will be expanded by block_size², we divide the dropout probability by block_size²:

    γ = drop_prob / block_size²
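A simplified NumPy sketch of DropBlock mask construction (illustrative names; the full method also restricts seed points to a valid region and renormalizes the surviving activations):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropblock_mask(height, width, drop_prob, block_size):
    """Sample seed points with rate gamma = drop_prob / block_size**2,
    then zero a block_size x block_size square around each seed."""
    gamma = drop_prob / block_size ** 2
    mask = np.ones((height, width))
    seeds = rng.random((height, width)) < gamma
    half = block_size // 2
    for i, j in zip(*np.nonzero(seeds)):
        mask[max(0, i - half):i + half + 1, max(0, j - half):j + half + 1] = 0.0
    return mask

mask = dropblock_mask(16, 16, drop_prob=0.1, block_size=3)
# With block_size = 1, this reduces to an ordinary dropout mask.
```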
Algorithm
[Figure: the sampled binary mask is multiplied with the feature map]