Regularization and Normalization
Normalization and Standardization
• Standardization: z = (x - μ) / σ, where μ is the mean and σ is the standard deviation.
• Normalization (min-max scaling): x' = (x - min) / (max - min), which rescales values into the range [0, 1] (e.g., a feature ranging from 1 to 500 is mapped to 0 to 1).
Example: [44, 23, 56]
44 → (44 - 23) / (56 - 23) = 0.63
23 → (23 - 23) / (56 - 23) = 0
56 → (56 - 23) / (56 - 23) = 1
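A minimal numpy sketch of both operations, using the example values above (variable names are illustrative):

import numpy as np

x = np.array([44.0, 23.0, 56.0])

# Min-max normalization: rescales values into the [0, 1] range.
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)                      # [0.636..., 0.0, 1.0]

# Standardization: subtract the mean, divide by the standard deviation.
x_std = (x - x.mean()) / x.std()
print(x_std.mean(), x_std.std())   # ~0.0 and 1.0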
Figure: training and validation accuracy vs. epoch; the validation accuracy starts degrading (overfitting).
How Can We Reduce Overfitting?
• Without adding any regularization:
- Train on more data
- Use Data Augmentation
- Use Early Stopping: stop training when the validation error starts increasing, or when it is no longer improving (a minimal sketch follows the figure below).
Figure: training and validation accuracy vs. epoch, with the point marked where training should stop.
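A minimal early-stopping loop in Python (PyTorch-style); train_one_epoch and evaluate are hypothetical helpers, and the model, data loaders, max_epochs, and the patience of 5 epochs are illustrative assumptions:

import copy

best_val_error = float("inf")
patience, epochs_without_improvement = 5, 0      # illustrative patience value

for epoch in range(max_epochs):                  # max_epochs assumed to exist
    train_one_epoch(model, train_loader)         # hypothetical helper
    val_error = evaluate(model, val_loader)      # hypothetical helper

    if val_error < best_val_error:
        best_val_error = val_error
        epochs_without_improvement = 0
        best_weights = copy.deepcopy(model.state_dict())   # remember the best model
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                                # validation error stopped improving

model.load_state_dict(best_weights)              # restore the best weights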
Data Augmentation
• In data augmentation, we increase the training data by utilizing the images we already have and creating different versions of them (e.g., flipping, rotating, cropping, zooming, shifting, etc.). The network is then trained on the original dataset together with the augmented data, so it performs better since it has seen different appearances of the same images.
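One possible augmentation pipeline, sketched with torchvision transforms (the specific transforms and parameter values are illustrative, not prescribed by the slide):

from torchvision import transforms

# Each epoch the network sees a randomly flipped / rotated / cropped
# version of the same training images.
train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])

# Hypothetical usage with an ImageFolder-style dataset:
# train_set = torchvision.datasets.ImageFolder("data/train", transform=train_transforms)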
L1 Regularization
• L1 is a learned feature selection technique (machine learning term: an embedded feature selection technique) which causes non-relevant features to have a small weight (near or equal to zero), so that they don't affect the prediction output. Irrelevant (and too many!) features cause the network to overfit.
• Other machine learning models that use embedded feature selection
techniques include decision trees and random forests.
Example: if x1 is an irrelevant feature, L1 drives its weight toward 0:
x1·w1 + x2·w2 + x3·w3 → x1·(0) + x2·w2 + x3·w3 = x2·w2 + x3·w3
Non-relevant features: features that are not correlated with the output (independent of the output).
L1 Regularization: adds the sum of the absolute values of the weights as a penalty term to the loss function. It is used for feature selection (sparsity) and shrinkage of the weights (making them smaller in value).
Loss = example loss (XE, MSE, CL, …etc.) + λ · Σᵢ |wᵢ|
L2 Regularization: adds the sum of the squares of the weights as a penalty term to the loss function. It does not introduce sparsity and is used only for shrinkage of the weights (making them smaller in value).
Loss = example loss (XE, MSE, CL, …etc.) + λ · Σᵢ wᵢ²
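A minimal PyTorch sketch of adding the two penalty terms to an example loss (the toy model, data, and λ value are illustrative):

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)                      # toy model
x, y = torch.randn(8, 10), torch.randn(8, 1)  # toy data

base_loss = nn.functional.mse_loss(model(x), y)   # example loss (MSE here)

lam = 1e-3   # regularization strength (lambda), illustrative value
l1_penalty = sum(p.abs().sum() for p in model.parameters())    # sum of |w|
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())   # sum of w^2

loss_l1 = base_loss + lam * l1_penalty   # encourages sparsity (feature selection) + shrinkage
loss_l2 = base_loss + lam * l2_penalty   # shrinkage only, no sparsity
loss_l1.backward()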
Why? If using momentum, the moving average becomes (when using L2 regularization):
moving_avg = alpha * moving_avg + (1 - alpha) * (w.grad + wd * w)
where w.grad is the gradient of the weight and wd * w is the decay term. Then we update the weights:
w = w - lr * moving_avg
so the part linked to the regularization that will be taken from w is lr * (1 - alpha) * wd * w, plus a combination of the previous weights that were already in moving_avg.
We can see that the part subtracted from w linked to the regularization isn't the same in the two methods.
See https://www.fast.ai/2018/07/02/adam-weight-decay/
Decoupling weight decay: Adding Weight Decay to Momentum and Adam
Simply apply the weight decay term directly in the parameter update, as in the original definition of weight decay:
w = w - lr * moving_avg - lr * wd * w
(a single-step numeric comparison is sketched below)
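A one-step numeric sketch contrasting the two updates for a single weight, following the formulas above (all values are illustrative):

lr, wd, alpha = 0.1, 0.01, 0.9      # learning rate, weight decay, momentum factor
w, grad, moving_avg = 1.0, 0.5, 0.0

# (a) L2 regularization: the decay term wd*w is folded into the gradient,
#     so it is also mixed into the moving average.
avg_l2 = alpha * moving_avg + (1 - alpha) * (grad + wd * w)
w_l2 = w - lr * avg_l2

# (b) Decoupled weight decay: the moving average uses the plain gradient,
#     and the decay is applied directly in the parameter update.
avg_decoupled = alpha * moving_avg + (1 - alpha) * grad
w_decoupled = w - lr * avg_decoupled - lr * wd * w

print(w_l2, w_decoupled)   # 0.9949 vs. 0.994: the subtracted decay term differs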
Dropout
Result:
Neurons no longer depend on each other, and hence can perform well on other data. The network is able to learn several independent correlations in the data. Therefore, overfitting is reduced.
Srivastava, Nitish, et al. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", JMLR 2014
A specific neuron might always learn to depend on one of the previous neurons (one feature) in order to extract information. When we remove that feature and no longer let the neuron depend on it, it is forced to learn the information it needs from the other features. Thus, at testing time, it can perform better on unseen data, since it has learned how to extract information from several features rather than relying on a single one.
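In practice, deep learning frameworks handle the train/test difference automatically; a brief PyTorch sketch (the layer sizes and p are illustrative, and note that nn.Dropout's p is the drop probability, not the keep probability p used in the example that follows):

import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Sequential(nn.Linear(6, 6), nn.ReLU(), nn.Dropout(p=0.5))  # p here = drop probability
x = torch.randn(1, 6)

layer.train()        # training mode: activations are randomly zeroed (and rescaled)
out_train = layer(x)

layer.eval()         # evaluation mode: dropout is disabled
out_test = layer(x)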
Consider p to be the probability that a neuron is kept (higher p = less dropout).
At training time: drop the neurons' activated outputs (consider p = 0.5), e.g.:
[0.2, 0.3, 0.9, 0.7, 0.4, 0.1] → [0, 0.3, 0.9, 0, 0.4, 0.1]
[0.9, 0.1, 0.1, 1.2, 0.1, -0.7] → [0, 0.1, 0.1, 1.2, 0, -0.7]
At test time dropout is not applied (all inputs are considered), so we want the outputs of the neurons at test time to match their expected outputs at training time. Otherwise testing becomes very unusual, since there will be high values while during training the network saw smaller values (because we zeroed some out). Two equivalent fixes:
- Scale the outputs at test time by p: [0.9, 0.1, 0.1, 1.2, 0.1, -0.7] → [0.9(0.5), 0.1(0.5), 0.1(0.5), 1.2(0.5), 0.1(0.5), -0.7(0.5)]
- Inverted dropout: divide the kept activations by p already at training time: [0, 0.1/0.5, 0.1/0.5, 1.2/0.5, 0, -0.7/0.5] (see the sketch below)
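A numpy sketch of the inverted-dropout variant, using the activation vector from the example above:

import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                          # probability of keeping a neuron
a = np.array([0.9, 0.1, 0.1, 1.2, 0.1, -0.7])    # activations from the example

# Training time: zero out neurons with probability 1 - p and divide the
# survivors by p, so the expected output already matches test time.
mask = rng.random(a.shape) < p
a_train = (a * mask) / p

# Test time: no dropout, activations are used as-is.
a_test = a

print(a_train)
print(a_test)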
Figure: Dropout (dropping neuron outputs) vs. DropConnect (dropping individual connections/weights).
Dropout in CNN
• Dropout is less effective for convolutional layers, where features are correlated spatially. When the features are correlated, even with dropout, information about the input can still be sent to the next layer, which causes the network to overfit. Therefore, dropping out activations at random is not effective in removing semantic information, because nearby activations contain closely related information.
Figure: when the maximum activation (9) is dropped, max pooling still selects a correlated feature (8) rather than the maximum 9, so important information is still being selected and propagated to the next layer.
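A tiny numpy illustration of the figure above, using the activation values shown there:

import numpy as np

col = np.array([0.0, 9.0, 8.0, 8.0, 0.0, 0.0])   # activations in one pooling region
dropped = col.copy()
dropped[1] = 0.0                                  # dropout happens to remove the maximum (9)

print(col.max(), dropped.max())                   # 9.0 vs. 8.0: a correlated feature (8)
                                                  # still reaches the next layer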
DropBlock
DropBlock is a simple method similar to dropout. Its main difference from dropout is that it drops contiguous regions from a feature map of a layer instead of dropping out independent random units. Dropping contiguous regions can remove certain semantic information (e.g., head or feet) and consequently enforces the remaining units to learn features for classifying the input image.
Figure: Dropout (independent random units dropped) vs. DropBlock (contiguous blocks dropped) on a feature map.
Block size and drop rate: γ = drop_prob / block_size², where γ is the probability of sampling a block center to drop.
Algorithm (figure): generating the block mask to be multiplied with the feature map.
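A simplified numpy sketch of building such a mask (it uses the simplified γ above; the full DropBlock algorithm also restricts block centers to the valid region of the feature map):

import numpy as np

rng = np.random.default_rng(0)

def dropblock_mask(feat_h, feat_w, drop_prob=0.1, block_size=3):
    gamma = drop_prob / (block_size ** 2)           # probability of sampling a block center
    centers = rng.random((feat_h, feat_w)) < gamma  # Bernoulli(gamma) seed points
    mask = np.ones((feat_h, feat_w))
    half = block_size // 2
    for i, j in zip(*np.nonzero(centers)):
        # zero out a block_size x block_size region around each sampled center
        mask[max(0, i - half): i + half + 1, max(0, j - half): j + half + 1] = 0.0
    return mask

feature_map = rng.random((8, 8))                    # toy feature map
mask = dropblock_mask(8, 8)
dropped = feature_map * mask                        # mask multiplied with the feature map
dropped *= mask.size / max(mask.sum(), 1.0)         # rescale to keep the expected activation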