
Regularization and Normalization

The document discusses techniques for regularization and normalization in neural networks, including normalization methods like Min-Max and Standardization, and strategies to mitigate overfitting such as data augmentation, L1 and L2 regularization, and dropout. It emphasizes the importance of scaling data to ensure stability during training and outlines how to implement weight decay in optimization algorithms like SGD and Adam. Additionally, it highlights the limitations of dropout in convolutional layers and introduces alternatives like DropConnect.


Regularization and Normalization

Normalization and Standardization

• Normalization (Min-Max): scales data to have values between 0 and 1.

    x' = (x − min) / (max − min)

Example: [44, 23, 56], min = 23, max = 56
44 → (44 − 23) / (56 − 23) = 0.63
23 → (23 − 23) / (56 − 23) = 0
56 → (56 − 23) / (56 − 23) = 1

• Standardization: rescales the data to have a mean of 0 and a standard deviation of 1.

    x' = (x − mean) / std

Example: [44, 23, 56]
Mean: (44 + 23 + 56) / 3 = 41
Standard deviation: 13.63
44 → (44 − 41) / 13.63 = 0.22
23 → (23 − 41) / 13.63 = −1.32
56 → (56 − 41) / 13.63 = 1.10
But why do we need to do this in the first place?
• If we don't, features with higher values will dominate those with lower values, making the lower-valued features nearly useless in the dataset. The network then becomes unstable, the exploding gradient problem may occur, and training slows down. Therefore, all our data should be on the same scale.
Overfitting
• A situation where your network has only learned the training data and cannot cope with the test data: it performs well only on the training data.
• If you get a high accuracy on the training data but a low accuracy on the test data
→ your model has overfitted.

[Figure: underfitting vs. a good fit vs. overfitting]

How do we know when our model is overfitting?
• When we see a high training accuracy but a low validation/test accuracy.

[Figure: accuracy vs. epoch — training accuracy keeps rising while validation accuracy starts degrading]
How Can We Reduce Overfitting?
• Without adding any regularization:
- Train on more data
- Use data augmentation
- Use early stopping: stop training when the validation error starts increasing, or when it is no longer improving

[Figure: accuracy vs. epoch — stop training at the point where validation accuracy stops improving]
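The early-stopping strategy can be sketched as a small training-loop wrapper (the names are illustrative; `train_step` and `validate` stand in for your own training and validation routines):

```python
def train_with_early_stopping(train_step, validate, max_epochs=100, patience=5):
    """Stop training when validation error has not improved for `patience` epochs."""
    best_val_error = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_step()
        val_error = validate()
        if val_error < best_val_error:
            best_val_error = val_error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break   # validation error stopped improving: stop training here
    return best_val_error
```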
Data Augmentation
• In data augmentation, we increase the training data by utilizing the images we already have and creating different versions of them (e.g. flipping, rotating, cropping, zooming, shifting, etc.). The network is then trained on the original dataset together with the augmented data, so it performs better since it has seen different appearances of the same images.
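A minimal NumPy sketch of augmentation by flipping and rotating (real pipelines typically also crop, zoom, and shift, often via a library; the function name is illustrative):

```python
import numpy as np

def augment(image):
    """Return simple augmented variants of an (H, W) image array:
    horizontal flip, vertical flip, and a 90-degree rotation."""
    return [
        np.fliplr(image),   # horizontal flip
        np.flipud(image),   # vertical flip
        np.rot90(image),    # 90-degree rotation
    ]

image = np.arange(9).reshape(3, 3)
augmented = augment(image)
# training set = original images + augmented copies
```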
L1 Regularization
• L1 is a learned feature-selection technique (machine learning term: embedded feature selection) which causes non-relevant features to have a small weight (near or equal to zero), so that they don't affect the prediction output. Irrelevant (and too many!) features cause the network to overfit.
• Other machine learning models that use embedded feature selection include decision trees and random forests.

Example: if x₁ is an irrelevant feature, L1 drives its weight w₁ to 0:

    x₁w₁ + x₂w₂ + x₃w₃  →  x₁(0) + x₂w₂ + x₃w₃  =  x₂w₂ + x₃w₃

Non-relevant features: features that are not correlated with the output (independent of the output).

L1 Regularization adds the sum of the absolute values of the weights as a penalty term to the loss function. It is used for feature selection (sparsity) and shrinkage of the weights (making them smaller in value).

    Loss = L_data + λ · Σᵢ |wᵢ|        (L_data: the data loss, e.g. XE, MSE, CL, etc.)

L2 Regularization adds the sum of the squares of the weights as a penalty term to the loss function. It does not introduce sparsity and is used only for shrinkage of the weights (making them smaller in value).

    Loss = L_data + λ · Σᵢ wᵢ²         (L_data: the data loss, e.g. XE, MSE, CL, etc.)

λ is very important! A good value is 0.5.

If set to 1, all weights become 0 → no learning.
If set to 0, the penalty term goes away → no regularization effect.
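A small sketch of how the penalty term is added to the data loss (the function name and the `kind` switch are illustrative):

```python
import numpy as np

def regularized_loss(data_loss, weights, lam, kind="l2"):
    """Add an L1 or L2 penalty term to a data loss (e.g. MSE, cross-entropy)."""
    w = np.concatenate([np.ravel(p) for p in weights])
    if kind == "l1":
        penalty = lam * np.sum(np.abs(w))   # sum of absolute values: encourages sparsity
    else:
        penalty = lam * np.sum(w ** 2)      # sum of squares: shrinks weights
    return data_loss + penalty

weights = [np.array([1.0, -2.0])]
print(regularized_loss(1.0, weights, lam=0.5, kind="l1"))  # 1.0 + 0.5*(1+2) = 2.5
print(regularized_loss(1.0, weights, lam=0.5, kind="l2"))  # 1.0 + 0.5*(1+4) = 3.5
```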
Weight Decay and Regularization

• The weight update rule is given as:

    w = w − η · dẼ/dw

• Adding the term (λ/2)·w² to the cost function E gives Ẽ = E + (λ/2)·w², and the derivative of the new term is:

    d/dw [(λ/2)·w²] = (λ/2)·(2w) = λw

• Taking this derivative and substituting it into the weight update rule, we end up with the update rule:

    dẼ/dw = dE/dw + λw

    w = w − η · (dE/dw + λw)

This new term coming from the regularization causes the weight to decay in proportion to its size.
So we subtract a little portion of the weight at each step, hence the name "decay". We can directly modify the weight update rule rather than modifying the loss function: we don't want to add more computations by modifying the loss when there is an easier way.
• So when we add the L2 term to the loss function, it is denoted L2 regularization. When we modify the weight update rule directly, it is denoted weight decay. For vanilla SGD, the two are the same.
• But they aren't the same when adding momentum or using Adam: L2 regularization and weight decay become different.

Why? When using momentum with L2 regularization, the moving average becomes:

    moving_avg = alpha * moving_avg + (1-alpha) * (w.grad + wd*w)   # wd*w: weight decay term

Then we update the weights:

    w = w - lr * moving_avg

→ the part linked to the regularization that will be taken from w is lr * (1-alpha) * wd * w, plus a combination of the previous weights that were already in moving_avg.

When using weight decay:

    moving_avg = alpha * moving_avg + (1-alpha) * w.grad
    w = w - lr * moving_avg - lr * wd * w

We can see that the part subtracted from w linked to regularization isn't the same in the two methods.

See https://www.fast.ai/2018/07/02/adam-weight-decay/
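The two update rules can be compared numerically in plain Python (the step functions and hyperparameter values are illustrative). After two steps with the same gradient, the weights already differ, because the L2 version carries old wd*w contributions inside the moving average:

```python
def l2_momentum_step(w, grad, moving_avg, lr=0.1, alpha=0.9, wd=0.01):
    """L2 regularization: the wd*w term enters the moving average."""
    moving_avg = alpha * moving_avg + (1 - alpha) * (grad + wd * w)
    return w - lr * moving_avg, moving_avg

def weight_decay_step(w, grad, moving_avg, lr=0.1, alpha=0.9, wd=0.01):
    """Decoupled weight decay: the moving average sees only the raw gradient."""
    moving_avg = alpha * moving_avg + (1 - alpha) * grad
    return w - lr * moving_avg - lr * wd * w, moving_avg

w1 = w2 = 1.0
m1 = m2 = 0.0
for grad in [0.5, 0.5]:               # two steps with the same gradient
    w1, m1 = l2_momentum_step(w1, grad, m1)
    w2, m2 = weight_decay_step(w2, grad, m2)
# w1 and w2 are no longer equal after the second step
```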
Decoupling weight decay: adding weight decay to momentum and Adam
Simply add the weight decay term after the parameter update, as in the original definition.

The SGD with momentum and weight decay (SGDW) update appends the decoupled weight-decay term −ηλθₜ to the parameter update.

Similarly, for Adam with weight decay (AdamW), we obtain the same decoupled term −ηλθₜ appended after the Adam step.

Summary
Just adding the square of the weights to the loss function is not the correct way of using L2 regularization/weight decay with Adam, since that will interact with the m and v parameters in strange ways. Instead, we want to decay the weights in a manner that doesn't interact with the m/v parameters. This is equivalent to adding the square of the weights to the loss with plain (non-momentum) SGD. So just add weight decay at the end (fixed version):

    if group["weight_decay"] > 0.0:
        p.data.add_(p.data, alpha=(-group["lr"] * group["weight_decay"]))

[Figure: training curves comparing "Using L2" vs. "Using Weight Decay"]
So did it yield good results?
• The authors show that this substantially improves Adam's generalization performance and allows it to compete with SGD with momentum on image classification datasets.
• Using a lower β₂ value (0.99 or 0.9 vs. the default 0.999), which controls the contribution of the exponential moving average of past squared gradients in Adam, worked better.
Dropout
• During training of the network, neurons tend to heavily depend on each other, hence the
power of each neuron is lost and the model might overfit.
• Dropout forces a neural network to learn more robust features that are useful in conjunction
with many different random subsets of the other neurons.
• Dropout roughly doubles the number of iterations required to converge. However, training
time for each epoch is less, since some neurons are disabled.
In the Training Phase: For a specific hidden layer, for each training sample, for each iteration,
randomly disable a fraction, p, of neurons (and corresponding activations).

Result:
Neurons no longer depend on each other, and hence the network can perform well on other data. The network is able to learn several independent correlations in the data. Therefore, overfitting is reduced.

Srivastava, Nitish, et al. ”Dropout: a simple way to prevent neural networks from overfitting”, JMLR 2014
A specific neuron might always learn to
depend on one of the previous neurons
(one feature) in order to extract
information. When we remove that
feature and not let the neuron depend
on it anymore, it will be forced to learn
the information needed from the other
features. Thus, at testing time, it can
perform better on unseen data since it
has learned how to extract information
from features that are not present at
testing time.
Consider p to be the probability that a neuron is kept (higher p = less dropout).

At training time: drop the neuron activation outputs (consider p = 0.5):

    Activations:                  After dropout:
    0.2  0.3  0.9  0.7  0.4  0.1      0    0.3  0.9  0    0.4  0.1
    0.4  0.7  0.4  1.1  0.6 -0.2      0    0    0    1.1  0    0
    0.9  0.1  0.1  1.2  0.1 -0.7      0    0.1  0.1  1.2  0   -0.7
    0.6  0.6  0.8  0.7  0.6 -0.4      0.6  0    0.8  0    0.6  0

At test time dropout is not applied (all inputs are considered), so we want the outputs of neurons at test time to match their expected outputs at training time. Otherwise testing would be very unusual, since the network would see larger values than during training (when some activations were zeroed out).

    E[neuron output with dropout] = p·x + (1 − p)·0 = p·x
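A NumPy sketch of standard dropout with test-time scaling (function names are illustrative; p is the keep probability as above):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(x, p):
    """Training: keep each activation with probability p, zero out the rest."""
    mask = rng.random(x.shape) < p
    return x * mask

def dropout_test(x, p):
    """Testing: keep all activations but scale by p, so the expected
    magnitude matches what the network saw during training."""
    return x * p

x = np.array([0.2, 0.3, 0.9, 0.7, 0.4, 0.1])
train_out = dropout_train(x, p=0.5)   # random: roughly half the entries zeroed
test_out = dropout_test(x, p=0.5)     # [0.1, 0.15, 0.45, 0.35, 0.2, 0.05]
```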


So at test time, we must scale each activation x by p to get the same expected result:

    Test-time activations:            Scaled by p = 0.5:
    0.2  0.3  0.9  0.7  0.4  0.1      0.2(0.5)  0.3(0.5)  0.9(0.5)  0.7(0.5)  0.4(0.5)  0.1(0.5)
    0.4  0.7  0.4  1.1  0.6 -0.2      0.4(0.5)  0.7(0.5)  0.4(0.5)  1.1(0.5)  0.6(0.5) -0.2(0.5)
    0.9  0.1  0.1  1.2  0.1 -0.7      0.9(0.5)  0.1(0.5)  0.1(0.5)  1.2(0.5)  0.1(0.5) -0.7(0.5)
    0.6  0.6  0.8  0.7  0.6 -0.4      0.6(0.5)  0.6(0.5)  0.8(0.5)  0.7(0.5)  0.6(0.5) -0.4(0.5)


Another way to do this is Inverted Dropout, which performs the scaling at training time, leaving the forward pass at test time untouched.
At training time, divide the kept activations by the keeping probability p:

    Activations:                      After inverted dropout (p = 0.5):
    0.2  0.3  0.9  0.7  0.4  0.1      0        0.3/0.5  0.9/0.5  0        0.4/0.5  0.1/0.5
    0.4  0.7  0.4  1.1  0.6 -0.2      0        0        0        1.1/0.5  0        0
    0.9  0.1  0.1  1.2  0.1 -0.7      0        0.1/0.5  0.1/0.5  1.2/0.5  0       -0.7/0.5
    0.6  0.6  0.8  0.7  0.6 -0.4      0.6/0.5  0        0.8/0.5  0        0.6/0.5  0
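A minimal NumPy sketch of inverted dropout (illustrative name): dividing the surviving activations by p keeps the expected output equal to x, so the test-time forward pass is just the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout_train(x, p):
    """Training: drop with probability 1-p and divide survivors by p,
    so E[output] = x and no scaling is needed at test time."""
    mask = rng.random(x.shape) < p
    return x * mask / p

# At test time the layer is the identity: no scaling needed.
x = np.ones(100000)
out = inverted_dropout_train(x, p=0.5)
print(out.mean())  # ≈ 1.0: the expected value is preserved
```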


Drop Connect (Weight Drop)
• The only difference from Dropout is that we drop the weights (set them to zero) rather than dropping the neuron outputs.

[Figure: Dropout drops unit activations; DropConnect drops individual connections (weights)]
Dropout in CNN
• Dropout is less effective for convolutional layers, where features are correlated
spatially. When the features are correlated, even with dropout, information about
the input can still be sent to the next layer, which causes the networks to overfit.
Therefore, dropping out activations at random is not effective in removing
semantic information because nearby activations contain closely related
information.

Example: when applying max pooling:

    0 0 0            0 0 0
    6 5 4  → Drop →  6 0 0  → Max Pool → 6
    5 4 3            5 4 0

The maximum value (6) is still selected and propagated to the next layer.

    0 9 0            0 0 0
    8 8 0  → Drop →  8 8 0  → Max Pool → 8
    8 0 0            8 0 0

The maximum (9) is dropped, but a correlated feature (8) is selected instead. Important information is still being selected and propagated to the next layer.
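The two max-pooling cases above can be checked numerically (a toy sketch with hand-picked 3×3 patches and drop masks):

```python
import numpy as np

def max_pool(patch):
    """Max pooling over a single patch."""
    return patch.max()

# Case 1: the maximum (6) survives the random drop, so the same
# information reaches the next layer.
patch = np.array([[0, 0, 0],
                  [6, 5, 4],
                  [5, 4, 3]])
mask = np.array([[1, 1, 1],
                 [1, 0, 0],
                 [1, 1, 0]])
print(max_pool(patch * mask))    # 6

# Case 2: the maximum (9) is dropped, but a spatially correlated
# value (8) is selected instead, so information still propagates.
patch2 = np.array([[0, 9, 0],
                   [8, 8, 0],
                   [8, 0, 0]])
mask2 = np.array([[1, 0, 1],
                  [1, 1, 1],
                  [1, 1, 1]])
print(max_pool(patch2 * mask2))  # 8
```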
DropBlock
DropBlock is a simple method similar to dropout. Its main difference from dropout is that it drops contiguous regions from a feature map of a layer instead of dropping independent random units. Dropping contiguous regions can remove certain semantic information (e.g., head or feet) and consequently forces the remaining units to learn features for classifying the input image.

[Figure: Dropout vs. DropBlock — the shaded regions mark the activation units that contain semantic information in the input image]
Hyperparameters for DropBlock
Block size: the size of the block to be dropped from the feature map. Note that if block_size = 1, DropBlock is equivalent to Dropout.

γ: controls the number of features to drop.

To account for the fact that every zero entry in the mask will be expanded by block_size², we divide the dropout probability by block_size²:

    γ = drop_prob / block_size²
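A simplified NumPy sketch of DropBlock mask construction (illustrative names; the full method also restricts seed points to a valid region and renormalizes the surviving activations):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropblock_mask(height, width, drop_prob, block_size):
    """Sample seed points with rate gamma = drop_prob / block_size**2,
    then zero a block_size x block_size square around each seed."""
    gamma = drop_prob / block_size ** 2
    mask = np.ones((height, width))
    seeds = rng.random((height, width)) < gamma
    half = block_size // 2
    for i, j in zip(*np.nonzero(seeds)):
        mask[max(0, i - half):i + half + 1, max(0, j - half):j + half + 1] = 0.0
    return mask

mask = dropblock_mask(16, 16, drop_prob=0.1, block_size=3)
# With block_size = 1, this reduces to an ordinary dropout mask.
```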
Algorithm
[Figure: the sampled binary mask is multiplied with the feature map]