
Regularization

• Data Augmentation
• Dropout
• Early Stopping
• Ensembling
• Injecting Noise
• L1 Regularization
• L2 Regularization

What is overfitting?
From Wikipedia, overfitting is:
"The production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably."
What is Regularization?
Regularization is a set of techniques for combating overfitting and improving training.

Data Augmentation
Having more data (datasets / samples) is the best way to get better, more consistent estimators (ML models). In the real world, getting a large volume of useful data for training a model is cumbersome, and labelling is an extremely tedious task.
Labelling often requires substantial manual annotation. For example, to create a better image classifier we can use MTurk and involve more manpower to generate a dataset, or run surveys on social media asking people to participate and generate one. These processes can yield good datasets, but they are difficult to carry out and expensive. Having a small dataset leads to the well-known overfitting problem.
Data Augmentation is one of the regularization techniques that addresses this problem. The concept is simple: the technique generates new training data from the given original dataset. Dataset augmentation provides a cheap and easy way to increase the amount of your training data.
This technique can be used for both NLP and CV. In CV we can use techniques like jitter, PCA-based color augmentation, and flipping. Similarly, in NLP we can use techniques like synonym replacement, random insertion, random deletion, and word embeddings.
It is worth knowing that Keras provides an ImageDataGenerator class for data augmentation.
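A minimal sketch of that API (the parameter values below are arbitrary illustrations, not recommendations, and x_train / y_train are assumed NumPy arrays):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Each transform is an assumption about which augmentations help;
# tune the ranges for your own dataset
datagen = ImageDataGenerator(
    rotation_range=15,       # random rotations up to 15 degrees
    width_shift_range=0.1,   # random horizontal shifts
    height_shift_range=0.1,  # random vertical shifts
    horizontal_flip=True,    # random left-right flips
)

# x_train, y_train are assumed to exist:
# for x_batch, y_batch in datagen.flow(x_train, y_train, batch_size=32):
#     ...train on the augmented batch...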
Sample code for random deletion

import random

def random_deletion(words, p):
    """Randomly delete words from the sentence with probability p."""
    # Obviously, if there's only one word, don't delete it
    if len(words) == 1:
        return words

    # Randomly delete words with probability p
    new_words = []
    for word in words:
        r = random.uniform(0, 1)
        if r > p:
            new_words.append(word)

    # If you end up deleting all words, just return a random word
    if len(new_words) == 0:
        rand_int = random.randint(0, len(words) - 1)
        return [words[rand_int]]

    return new_words
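For example (the sentence is an arbitrary illustration; the output varies from run to run):

words = "data augmentation is a cheap way to get more data".split()
print(random_deletion(words, 0.2))
# e.g. ['data', 'augmentation', 'is', 'cheap', 'way', 'to', 'get', 'more', 'data']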

Furthermore, when comparing two machine learning algorithms, train both on either the augmented or the non-augmented dataset. Otherwise, no fair decision can be made about which algorithm performed better.
Further reading

• NLP Data Augmentation
• CV Data Augmentation
• Regularization

Dropout
What is Dropout?
Dropout is a regularization technique for reducing overfitting in neural networks by preventing complex co-adaptations on training data.
Dropout is a technique where randomly selected neurons are ignored during training: they are "dropped out" randomly. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and any weight updates are not applied to the neuron on the backward pass.
Simply put, it is the process of ignoring some of the neurons in a particular forward or backward pass.
Dropout can be easily implemented by randomly selecting nodes to be dropped out with a given probability (e.g. 0.1) in each weight update cycle.
Most importantly, dropout is only used during the training of a model and is not used when evaluating the model.
(image from https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf)

import numpy as np

A = np.arange(20).reshape((5, 4))

print("Given input: ")
print(A)

def dropout(X, drop_probability):
    keep_probability = 1 - drop_probability
    # Keep each element with probability keep_probability
    mask = np.random.uniform(0, 1.0, X.shape) < keep_probability
    if keep_probability > 0.0:
        # "Inverted dropout": scale the kept values so the expected
        # activation is unchanged between training and evaluation
        scale = 1 / keep_probability
    else:
        scale = 0.0
    return mask * X * scale

print("\n After Dropout: ")
print(dropout(A, 0.5))

Output from the above code:

Given input:
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]
[16 17 18 19]]

After Dropout:
[[ 0. 2. 0. 0.]
[ 8. 0. 0. 14.]
[16. 18. 0. 22.]
[24. 0. 0. 0.]
[32. 34. 36. 0.]]
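Deep learning frameworks also provide dropout as a layer. A minimal hedged sketch with Keras (the layer sizes and input shape are arbitrary choices):

from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(128, activation="relu"),
    # Randomly zeroes 50% of the activations during training;
    # automatically disabled at inference time
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1, activation="sigmoid"),
])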
Further reading

• Dropout: https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf

Early Stopping
One of the biggest problems in training a neural network is deciding how long to train the model.
Training too little will lead to underfitting on the train and test sets. Training too much will overfit the training set and give poor results on the test set.
The challenge is to train the network long enough that it is capable of learning the mapping from inputs to outputs, but not so long that it overfits the training data.
One possible solution is to treat the number of training epochs as a hyperparameter and train the model multiple times with different values, then select the number of epochs that results in the best accuracy on a holdout dataset. But this requires multiple models to be trained and discarded.

In a typical plot of training and validation error against epochs, the model starts overfitting after some epoch 't'; this shows up as an increasing gap between the training and the validation error.
An alternative technique to prevent overfitting is to use the validation error to decide when to stop. This approach is called Early Stopping.
While building the model, it is evaluated on the holdout validation dataset after each epoch. If the performance of the model on the validation dataset starts to degrade (e.g. loss begins to increase or accuracy begins to decrease), then the training process is stopped.
A Python implementation of early stopping:
import numpy as np

def early_stopping(theta0, train_data, valid_data, n=1, p=100):
    """The early stopping meta-algorithm for determining the best amount of time to train.
    REF: Algorithm 7.1 in the Deep Learning book.

    Parameters:
        theta0: Network; the initial network.
        train_data: tuple (x_train, y_train); the training input and output sets.
        valid_data: tuple (x_valid, y_valid); the validation input and output sets.
        n: int; number of training steps between evaluations.
        p: int; "patience", the number of evaluations to tolerate a worsening validation error.

    Returns:
        theta_prime: Network; the best network found.
        i_prime: int; the number of training steps for the best network.
        v: float; the validation error of the best network.
    """
    x_train, y_train = train_data
    x_valid, y_valid = valid_data

    # Initialize variables
    theta = theta0.clone()        # The active network
    i = 0                         # The number of training steps taken
    j = 0                         # The number of evaluations since the last improvement
    v = np.inf                    # The best validation error observed thus far
    theta_prime = theta.clone()   # The best network found thus far
    i_prime = i                   # The step index of theta_prime

    while j < p:
        # Update theta by running the training algorithm for n steps
        for _ in range(n):
            theta.train(x_train, y_train)

        # Update values
        i += n
        v_new = theta.error(x_valid, y_valid)

        # If the validation error improved, reset the patience counter,
        # save the network, and update the best error value
        if v_new < v:
            j = 0
            theta_prime = theta.clone()
            i_prime = i
            v = v_new
        # Otherwise, use up one unit of patience
        else:
            j += 1

    return theta_prime, i_prime, v
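Frameworks also provide early stopping out of the box. A minimal hedged sketch with the Keras EarlyStopping callback (the monitor and patience values are arbitrary choices; the fit call is commented out because the model and data are assumed):

from tensorflow import keras

# Stop when the validation loss has not improved for 10 epochs,
# and restore the weights from the best epoch seen
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=10,
    restore_best_weights=True,
)

# model.fit(x_train, y_train, validation_data=(x_valid, y_valid),
#           epochs=1000, callbacks=[early_stop])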

Further reading

• Regularization

Ensembling
Ensemble methods combine several machine learning techniques into one predictive model. There are a
few different methods for ensembling, but the two most common are:
Bagging

• Bagging stands for bootstrap aggregation. One way to reduce the variance of an estimate is to
average together multiple estimates.
• It trains a large number of "strong" learners in parallel.
• A strong learner is a model that's relatively unconstrained.
• Bagging then combines all the strong learners together in order to "smooth out" their predictions.
Boosting

• Boosting refers to a family of algorithms that are able to convert weak learners to strong learners.
• Each one in the sequence focuses on learning from the mistakes of the one before it.
• Boosting then combines all the weak learners into a single strong learner.
Bagging uses complex base models and tries to "smooth out" their predictions, while boosting uses simple
base models and tries to "boost" their aggregate complexity.
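A hedged sketch of both with scikit-learn (the estimator choices, hyperparameters, and synthetic data are arbitrary examples, not recommendations):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only
X, y = make_classification(n_samples=500, random_state=0)

# Bagging: many relatively unconstrained ("strong") deep trees,
# trained in parallel on bootstrap samples and averaged together
bagging = BaggingClassifier(DecisionTreeClassifier(max_depth=None), n_estimators=100)
bagging.fit(X, y)

# Boosting: a sequence of "weak" learners, each focusing on
# the mistakes of the ones before it
boosting = AdaBoostClassifier(n_estimators=100)
boosting.fit(X, y)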

Injecting Noise
Noise is often introduced to the inputs as a dataset augmentation strategy. When we have a small dataset
the network may effectively memorize the training dataset. Instead of learning a general mapping from
inputs to outputs, the model may learn the specific input examples and their associated outputs. One
approach for improving generalization error and improving the structure of the mapping problem is to add
random noise.
Adding noise means that the network is less able to memorize training samples because they are
changing all of the time, resulting in smaller network weights and a more robust network that has lower
generalization error.
Noise is only added during training. No noise is added during the evaluation of the model or when the
model is used to make predictions on new data.
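A minimal sketch of input-noise injection with NumPy (the noise scale stddev is an arbitrary choice and should be tuned):

import numpy as np

def add_gaussian_noise(x_batch, stddev=0.1):
    # Perturb each input feature with zero-mean Gaussian noise;
    # apply during training only, never during evaluation
    noise = np.random.normal(loc=0.0, scale=stddev, size=x_batch.shape)
    return x_batch + noise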
Random noise can be added to other parts of the network during training. Some examples include:
Noise Injection on Weights

• Noise added to weights can be interpreted as a more traditional form of regularization.
• In other words, it pushes the model to be relatively insensitive to small variations in the weights, finding points that are not merely minima, but minima surrounded by flat regions.
Noise Injection on Outputs

• In real-world datasets, we can expect some amount of mistakes in the output labels. One way to remedy this is to explicitly model the noise on the labels.
• An example of noise injection on outputs is label smoothing; see the sketch after this list.
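A minimal label-smoothing sketch (assuming one-hot NumPy labels; the smoothing factor eps is an arbitrary choice):

import numpy as np

def smooth_labels(one_hot, eps=0.1):
    # Soften hard 0/1 targets: the true class gets 1 - eps and the
    # remaining probability mass eps is spread uniformly over all k classes
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / k

labels = np.eye(3)[[0, 2, 1]]        # three one-hot labels over 3 classes
print(smooth_labels(labels, 0.1))    # rows like [0.933, 0.033, 0.033]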
Further reading

• Regularization

L1 Regularization
A regression model that uses the L1 regularization technique is called Lasso Regression.
Mathematical formula for L1 Regularization.
Let's define a model to see how L1 regularization works. For simplicity, we define a simple linear regression model Y with n independent variables. In this model, W represents the weights and b represents the bias:
W = (w_1, w_2, \ldots, w_n)
X = (x_1, x_2, \ldots, x_n)

and the predicted result is

\hat{Y} = w_1 x_1 + w_2 x_2 + \ldots + w_n x_n + b

The following formula calculates the error without a regularization term:

Loss = Error(Y, \hat{Y})

The following formula calculates the error with the L1 regularization term:

Loss = Error(Y, \hat{Y}) + \lambda \sum_{i=1}^{n} |w_i|

Note
Here, if the value of lambda is zero, the above loss function reduces to ordinary least squares, whereas a very large value drives the coefficients (weights) to zero and hence the model under-fits.

One thing to note is that |w| is differentiable everywhere except at w = 0, as shown below:

\frac{\mathrm{d}|w|}{\mathrm{d}w} = \begin{cases} 1 & w > 0 \\ -1 & w < 0 \end{cases}

To understand the Note above, let's substitute this into the weight update of a gradient descent optimizer:

w_{new} = w - \eta \frac{\partial L_1}{\partial w}

When we apply the L1 loss in the formula above, it becomes:

w_{new} = w - \eta \left( \frac{\partial \, Error(Y, \hat{Y})}{\partial w} + \lambda \frac{\mathrm{d}|w|}{\mathrm{d}w} \right) = \begin{cases} w - \eta \left( \frac{\partial \, Error(Y, \hat{Y})}{\partial w} + \lambda \right) & w > 0 \\ w - \eta \left( \frac{\partial \, Error(Y, \hat{Y})}{\partial w} - \lambda \right) & w < 0 \end{cases}
From the above formula:

• If w is positive, the regularization term (λ > 0) pushes w to be less positive, by subtracting ηλ from w.
• If w is negative, the regularization term pushes w to be less negative, by adding ηλ to w.

Hence this has the effect of pushing w towards 0.
Simple python implementation
import numpy as np

def update_weights_with_l1_regularization(features, targets, weights, lr, lambda_):
    '''
    Features: (200, 3)
    Targets:  (200, 1)
    Weights:  (3, 1)
    Note: "lambda" is a reserved word in Python, so the penalty strength
    is named lambda_ here. predict() is assumed to be defined elsewhere.
    '''
    predictions = predict(features, weights)

    # Extract our features
    x1 = features[:, 0]
    x2 = features[:, 1]
    x3 = features[:, 2]

    # Flatten to shape (200,) so the products below are element-wise
    error = (targets - predictions).ravel()

    # Use element-wise multiplication (*) to simultaneously
    # calculate the derivative of the error for each weight
    d_w1 = -x1 * error
    d_w2 = -x2 * error
    d_w3 = -x3 * error

    # Take a gradient step, then apply the L1 penalty: subtract
    # lr * lambda_ when the weight is positive, add it when negative
    # (the subgradient of |w|), pushing each weight towards zero
    weights[0][0] = (weights[0][0] - lr * np.mean(d_w1) - lr * lambda_) if weights[0][0] > 0 else (weights[0][0] - lr * np.mean(d_w1) + lr * lambda_)
    weights[1][0] = (weights[1][0] - lr * np.mean(d_w2) - lr * lambda_) if weights[1][0] > 0 else (weights[1][0] - lr * np.mean(d_w2) + lr * lambda_)
    weights[2][0] = (weights[2][0] - lr * np.mean(d_w3) - lr * lambda_) if weights[2][0] > 0 else (weights[2][0] - lr * np.mean(d_w3) + lr * lambda_)

    return weights

Use Case
L1 regularization (or a variant of this concept) is the model of choice when the number of features is high, since it provides sparse solutions. We also gain a computational advantage, because features with zero coefficients can simply be ignored.
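For reference, scikit-learn exposes L1-regularized linear regression as Lasso. A minimal sketch (the synthetic data and the alpha value are arbitrary choices):

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 50 features, only 5 of which are informative
X, y = make_regression(n_samples=200, n_features=50, n_informative=5, random_state=0)

# alpha plays the role of lambda; larger values push more coefficients to exactly zero
model = Lasso(alpha=0.1).fit(X, y)
print((model.coef_ != 0).sum(), "non-zero coefficients out of", X.shape[1])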
Further reading

• Linear Regression

L2 Regularization
A regression model that uses the L2 regularization technique is called Ridge Regression. The main difference between L1 and L2 regularization is that L2 regularization uses the squared magnitude of the coefficients as the penalty term in the loss function.
Mathematical formula for L2 Regularization.
Let's define a model to see how L2 regularization works. For simplicity, we define a simple linear regression model Y with n independent variables. In this model, W represents the weights and b represents the bias:
W = (w_1, w_2, \ldots, w_n)
X = (x_1, x_2, \ldots, x_n)

and the predicted result is

\hat{Y} = w_1 x_1 + w_2 x_2 + \ldots + w_n x_n + b

The following formula calculates the error without a regularization term:

Loss = Error(Y, \hat{Y})

The following formula calculates the error with the L2 regularization term:

Loss = Error(Y, \hat{Y}) + \lambda \sum_{i=1}^{n} w_i^2
Note
Here, if lambda is zero we get back ordinary least squares. However, if lambda is very large then the penalty term dominates, the weights shrink too far towards zero, and the model under-fits.

To understand the Note above, let's substitute this into the weight update of a gradient descent optimizer:

w_{new} = w - \eta \frac{\partial L_2}{\partial w}

When we apply the L2 loss in the formula above, it becomes:

w_{new} = w - \eta \left( \frac{\partial \, Error(Y, \hat{Y})}{\partial w} + 2 \lambda w \right)
Simple python implementation

import numpy as np

def update_weights_with_l2_regularization(features, targets, weights, lr, lambda_):
    '''
    Features: (200, 3)
    Targets:  (200, 1)
    Weights:  (3, 1)
    Note: "lambda" is a reserved word in Python, so the penalty strength
    is named lambda_ here. predict() is assumed to be defined elsewhere.
    '''
    predictions = predict(features, weights)

    # Extract our features
    x1 = features[:, 0]
    x2 = features[:, 1]
    x3 = features[:, 2]

    # Flatten to shape (200,) so the products below are element-wise
    error = (targets - predictions).ravel()

    # Use element-wise multiplication (*) to simultaneously
    # calculate the derivative of the error for each weight
    d_w1 = -x1 * error
    d_w2 = -x2 * error
    d_w3 = -x3 * error

    # Take a gradient step on the error term plus the gradient of the
    # L2 penalty, 2 * lambda_ * w, which shrinks each weight towards zero
    weights[0][0] = weights[0][0] - lr * (np.mean(d_w1) + 2 * lambda_ * weights[0][0])
    weights[1][0] = weights[1][0] - lr * (np.mean(d_w2) + 2 * lambda_ * weights[1][0])
    weights[2][0] = weights[2][0] - lr * (np.mean(d_w3) + 2 * lambda_ * weights[2][0])

    return weights

Use Case
L2 regularization can address the multicollinearity problem by constraining the coefficient norm and keeping all the variables. It can be used to estimate predictor importance and to penalize predictors that are not important. One issue with collinearity is that the variance of the parameter estimates becomes huge. In cases where the number of features is greater than the number of observations, the matrix used in OLS may not be invertible, but Ridge Regression enables this matrix to be inverted.
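A minimal scikit-learn sketch of this use case (synthetic data with more features than observations; the alpha value is an arbitrary choice):

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# More features (100) than observations (50): plain OLS is ill-posed here
X, y = make_regression(n_samples=50, n_features=100, random_state=0)

# alpha plays the role of lambda; all coefficients shrink but stay non-zero
model = Ridge(alpha=1.0).fit(X, y)
print(model.coef_.shape)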
Further reading

• Ridge Regression
References
1. http://www.deeplearningbook.org/contents/regularization.html
