
Regularization in Deep Learning

L1, L2, and Dropout


Recap: Overfitting
• Overfitting refers to the phenomenon where a neural network models the training data very well but fails when it sees new data from the same problem domain. Overfitting is caused by noise in the training data that the neural network picks up during training and learns as an underlying concept of the data.
Overfitting
• This learned noise, however, is unique to each training set. As soon as the model sees new data from the same problem domain that does not contain this noise, the performance of the neural network gets much worse.
• “Why does the neural network pick up that noise in the first place?”
• The reason for this is that the complexity of the network is too high. A fit of a neural network with higher complexity is shown in the image on the right‐hand side.
Graph 1. Model with a good fit and high variance
Overfitting
• The model with higher complexity is able to pick up and learn patterns (noise) in the data that are just caused by some random fluctuation or error. The network would be able to model each data sample of the distribution one by one, while not recognizing the true function that describes the distribution.
• New arbitrary samples generated with the true function would have a high distance to the fit of the model. We also say that the model has a high variance.
• On the other hand, the lower complexity network on the left side models the distribution much better by not trying too hard to model each data pattern individually.
Overfitting
• In practice, overfitting causes the neural network model to perform very well during training, but the performance gets much worse at inference time when faced with brand new data.
Regularization

• Regularization refers to a set of different techniques that lower the complexity of a neural network model during training and thus prevent overfitting.
• There are three very popular and efficient regularization techniques, called L1, L2, and dropout, which we are going to discuss in the following.
L2 Regularization

• L2 regularization is the most common of all regularization techniques and is also commonly known as weight decay or Ridge Regression.
• During L2 regularization, the loss function of the neural network is extended by a so‐called regularization term, which is called here Ω.
L2 Regularization

• The regularization term Ω is defined as the Euclidean norm (or L2 norm) of the weight matrices, which is the sum over all squared values of a weight matrix.
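In symbols (notation assumed, not taken verbatim from the slide):
Ω = Σ w²   (the sum runs over every entry w of the weight matrices)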
L2 Regularization

• The regularization term is weighted by the scalar alpha divided by two and added to the regular loss function that is chosen for the current task. This leads to a new expression for the loss function:
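In symbols (notation assumed; L is the original loss for the current task):
L_new = L + (α/2) · Ω = L + (α/2) · Σ w²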
Gradient Descent during L2
• Alpha is sometimes called the regularization rate and is an additional hyperparameter we introduce into the neural network. Simply speaking, alpha determines how much we regularize our model.
• In the next step we can compute the gradient of the new loss function and put the gradient into the update rule for the weights:
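A standard form of this update (the learning rate symbol η is assumed, not from the slide):
w ← w − η · (∂L/∂w + α · w)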
Gradient Descent during L2
• Some reformulations of the update rule lead to an expression which looks very much like the update rule for the weights during regular gradient descent:
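In symbols (again assuming learning rate η):
w ← (1 − η·α) · w − η · ∂L/∂w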
• The only difference is that by adding the regularization term we introduce an additional subtraction from the current weights (the first term in the equation).
• In other words, independent of the gradient of the loss function, we make our weights a little bit smaller each time an update is performed.
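To make this weight-decay view concrete, here is a minimal sketch of a single update step in Python (the function name, learning rate, and example values are illustrative assumptions, not from the slides):

```python
import numpy as np

def l2_regularized_step(W, grad_loss, lr=0.1, alpha=0.01):
    """One gradient step with L2 regularization (weight decay).

    Implements the reformulated update rule above:
    W <- (1 - lr * alpha) * W - lr * dL/dW
    """
    return (1.0 - lr * alpha) * W - lr * grad_loss

# Toy example: the weights shrink slightly on every step,
# independently of the gradient of the original loss.
W = np.array([0.5, -1.2, 3.0])
grad = np.array([0.1, -0.2, 0.05])
print(l2_regularized_step(W, grad))
```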
L1 Regularization

• In the case of L1 regularization (also known as Lasso regression), we simply use another regularization term Ω. This term is the sum of the absolute values of the weight parameters in a weight matrix:
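In symbols (matching the notation assumed above):
Ω = Σ |w|   (summed over every entry w of the weight matrices)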
L1 Regularization

• The derivative of the new loss function leads to the following expression, which is the sum of the gradient of the old loss function and the sign of a weight value times alpha:
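Written out per weight (notation assumed; sign(w) is +1 for positive and −1 for negative weights):
∂L_new/∂w = ∂L/∂w + α · sign(w)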
Why do L1 and L2 Regularizations work?
• To answer this question, consider the plots of the absolute value (|w|) and square (w²) functions, where red represents the operation performed during L1 regularization and blue the operation performed during L2 regularization.
Why do L1 and L2 Regularizations work?
• In the case of L2 regularization, weight parameters decrease, but do not necessarily become zero, since the curve becomes flat near zero.
• On the other hand, during L1 regularization the weights are always forced all the way towards zero.
• In the case of L2, you can think of solving an equation where the sum of squared weight values is equal to or less than a value s. Here s is a constant that exists for each possible value of the regularization rate α. For just two weight values W1 and W2 this equation would look as follows:
W1² + W2² ≤ s
• L1 regularization, in turn, can be thought of as an equation where the sum of the absolute values of the weights is less than or equal to a value s. This would look like the following expression:
|W1| + |W2| ≤ s
Visualization
• The introduced equations for L1 and L2 regularizations are
constraint functions, which we can visualize:

The left image shows the constraint function (green area) for the L1 regularization and the right image shows the constraint function for the L2 regularization. The red ellipses are contours of the loss function that is used during gradient descent. In the center of the contours there is a set of optimal weights for which the loss function has a global minimum.
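For readers who want to reproduce this picture, here is a rough sketch in Python (assuming matplotlib and numpy; the value of s and the loss contours are made up purely for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

s = 1.0
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))

# L1 constraint region |W1| + |W2| <= s: a diamond with corners on the axes.
ax1.fill([s, 0, -s, 0], [0, s, 0, -s], color="green", alpha=0.5)
ax1.set_title("L1: |W1| + |W2| <= s")

# L2 constraint region W1^2 + W2^2 <= s: a disk, no corners.
theta = np.linspace(0, 2 * np.pi, 200)
ax2.fill(np.sqrt(s) * np.cos(theta), np.sqrt(s) * np.sin(theta), color="green", alpha=0.5)
ax2.set_title("L2: W1^2 + W2^2 <= s")

# Illustrative loss contours (red ellipses) around an arbitrary optimum.
w1, w2 = np.meshgrid(np.linspace(-2, 2, 300), np.linspace(-2, 2, 300))
for ax in (ax1, ax2):
    ax.contour(w1, w2, (w1 - 1.2) ** 2 + 2 * (w2 - 0.8) ** 2, levels=5, colors="red")
    ax.set_xlabel("W1")
    ax.set_ylabel("W2")
    ax.set_aspect("equal")

plt.show()
```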
Visualization
• In the case of L1 and L2 regularization, the estimates of W1 and W2 are given by the first point where an ellipse intersects the green constraint area.
• Since L2 regularization has a circular constraint area, the intersection will not generally occur on an axis, and thus the estimates for W1 and W2 will be exclusively non‐zero.
• In the case of L1, the constraint area has a diamond shape with corners, and thus the contours of the loss function will often intersect the constraint region at an axis. When this occurs, one of the estimates (W1 or W2) will be zero.
• In a high-dimensional space, many of the weight parameters will equal zero simultaneously.
What does Regularization achieve?

• Performing L2 regularization encourages the weight values towards zero (but not exactly zero).
• Performing L1 regularization encourages the weight values to be exactly zero.
• Intuitively speaking, smaller weights reduce the impact of the hidden neurons. In that case, those hidden neurons become negligible and the overall complexity of the neural network gets reduced.
What does Regularization achieve?

• As mentioned earlier: less complex models typically avoid modeling noise in the data, and therefore there is no overfitting.
• But you have to be careful when choosing the regularization rate α. The goal is to strike the right balance between low complexity of the model and accuracy.
• If your alpha value is too high, your model will be simple, but you run the risk of underfitting your data. Your model won't learn enough about the training data to make useful predictions.
• If your alpha value is too low, your model will be more complex, and you run the risk of overfitting your data. Your model will learn too much about the particularities of the training data and won't be able to generalize to new data.
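In most deep-learning frameworks this trade-off is controlled through a single hyperparameter. As a rough sketch, assuming PyTorch (where L2 regularization is exposed as the weight_decay argument of the optimizer; the model and values below are placeholders):

```python
import torch

# Minimal sketch: alpha is passed as weight_decay, which applies
# L2 regularization / weight decay to all trainable parameters.
model = torch.nn.Linear(10, 1)       # stand-in for any network
alpha = 1e-3                         # regularization rate (tune on validation data)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=alpha)
# alpha too high -> overly simple model (underfitting);
# alpha too low  -> overly complex model (overfitting).
```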
