
MIT Art, Design and Technology University

MIT School of Computing, Pune


21BTCS031 – Deep Learning & Neural Networks

Class - L.Y. CORE (SEM-I)

Unit - II Deep Networks


Dr. Anant Kaulage
Dr. Sunita Parinam
Dr. Mayura Shelke
Dr. Aditya Pai
AY 2024-2025 SEM-I
Regularization

Unit II
Introduction

• A central problem in machine learning is how to make an algorithm that will perform well not just on the training data, but also on new inputs.
• Many strategies used in machine learning are explicitly designed to reduce the test error, possibly at the expense of increased training error.
• These strategies are known collectively as regularization: "any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error."
Intuition

• The loss function is the sum of squared differences between the actual values and the predicted values.
• When we penalize the weights θ_3 and θ_4 and make them very small, close to zero, those terms become negligible, which helps simplify the model.
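As a concrete sketch (the degree-4 polynomial and the penalty weight 1000 below are illustrative assumptions, not values taken from the slides), the penalized squared-error loss can be written as:

```python
import numpy as np

def penalized_sse(theta, x, y, penalty=1000.0):
    """Sum of squared errors for a degree-4 polynomial model,
    plus a penalty that pushes theta[3] and theta[4] toward zero."""
    # h(x) = t0 + t1*x + t2*x^2 + t3*x^3 + t4*x^4
    y_hat = sum(theta[i] * x**i for i in range(5))
    sse = np.sum((y - y_hat) ** 2)                # data-fit term
    reg = penalty * (theta[3]**2 + theta[4]**2)   # shrinks high-order terms
    return sse + reg
```

Minimizing this loss forces θ_3 and θ_4 toward zero, leaving an (almost) quadratic model.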
Parameter Norm Penalties

• Many regularization approaches are based on limiting the capacity of models, such as neural networks, linear regression, or logistic regression, by adding a parameter norm penalty Ω(θ) to the objective function J.
• The regularized objective function J̃ is

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)

where α ∈ [0, ∞) is a hyperparameter that weights the contribution of the norm penalty term Ω relative to the standard objective function J.
Parameter Norm Penalties

• We typically choose a parameter norm penalty Ω that penalizes only the weights at each layer and leaves the biases unregularized.
• The biases typically require less data to fit accurately than the weights, and regularizing the bias parameters can introduce a significant amount of underfitting.
• It is sometimes desirable to use a separate penalty with a different α coefficient for each layer of the network, as in the sketch below.
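A minimal PyTorch-style sketch of this convention (the model and the hyperparameter values are placeholders): weight decay is applied to the weight matrices only, while biases are left unregularized.

```python
import torch

def make_optimizer(model, lr=0.1, alpha=1e-4):
    """SGD in which the L2 penalty (weight_decay) is applied to
    weight parameters only; bias parameters are unregularized."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        (no_decay if name.endswith("bias") else decay).append(param)
    return torch.optim.SGD(
        [{"params": decay, "weight_decay": alpha},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr,
    )
```

A per-layer α could be implemented the same way, with one parameter group per layer.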
L2 Parameter Regularization

• The L2 parameter norm penalty is commonly known as weight decay.
• This regularization strategy drives the weights closer to the origin by adding the regularization term Ω(θ) = ½||w||₂² to the objective function.
• It is also known as ridge regression or Tikhonov regularization.
• To study the behavior of weight decay, consider the gradient of the regularized objective function (assume no bias parameter, so θ is just w):

J̃(w; X, y) = (α/2) wᵀw + J(w; X, y)
L2 Parameter Regularization

• The corresponding parameter gradient is

∇_w J̃(w; X, y) = αw + ∇_w J(w; X, y)

• A single gradient step with learning rate ε updates the weights as

w ← w − ε(αw + ∇_w J(w; X, y))

which can be rewritten as

w ← (1 − εα)w − ε∇_w J(w; X, y)

• The addition of weight decay has modified the learning rule to multiplicatively shrink the weight vector by a constant factor (1 − εα) on each step, just before performing the usual gradient update. A numpy sketch follows.
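A numpy sketch of this update rule (the toy objective below is an illustrative assumption):

```python
import numpy as np

def weight_decay_step(w, grad_J, eps=0.1, alpha=0.01):
    """One gradient step on the regularized objective:
    w <- (1 - eps*alpha) * w - eps * grad_J(w)."""
    return (1.0 - eps * alpha) * w - eps * grad_J(w)

# Toy objective J(w) = 0.5 * ||w - c||^2, so grad_J(w) = w - c
c = np.array([1.0, -2.0])
w = np.zeros(2)
for _ in range(100):
    w = weight_decay_step(w, lambda w: w - c)
print(w)  # close to c, but pulled slightly toward the origin: c / (1 + alpha)
```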
L2 Parameter Regularization

• Consider a quadratic approximation to the objective function in the neighborhood of the value of the weights that obtains minimal unregularized training cost, w∗ = arg min_w J(w).
• If the objective function is truly quadratic, the approximation Ĵ is exact and is given by

Ĵ(θ) = J(w∗) + ½(w − w∗)ᵀH(w − w∗)

where H is the Hessian matrix of J with respect to w, evaluated at w∗. (There is no first-order term because w∗ is a minimum, where the gradient vanishes.)
L2 Parameter Regularization

• The minimum of Ĵ occurs where its gradient ∇_w Ĵ(w) = H(w − w∗) is equal to zero.
• To study the effect of weight decay, we modify this equation by adding the weight decay gradient. Using w̃ to represent the location of the minimum of the regularized version of Ĵ, and solving:

αw̃ + H(w̃ − w∗) = 0
(H + αI)w̃ = Hw∗
w̃ = (H + αI)⁻¹Hw∗

• As α approaches 0, the regularized solution w̃ approaches w∗.
L2 Parameter Regularization

• What happens as α grows?
• Because H is real and symmetric, we can decompose it into a diagonal matrix Λ and an orthonormal basis of eigenvectors Q, such that

H = QΛQᵀ
L2 Parameter Regularization

• Applying this decomposition to the regularized solution w̃ = (H + αI)⁻¹Hw∗ gives

w̃ = (QΛQᵀ + αI)⁻¹QΛQᵀw∗ = Q(Λ + αI)⁻¹ΛQᵀw∗
L2 Parameter Regularization

• We see that the effect of weight decay is to rescale w∗ along the axes defined by the eigenvectors of H.
• Specifically, the component of w∗ that is aligned with the i-th eigenvector of H is rescaled by a factor of λᵢ/(λᵢ + α).
1. The weight vector w∗ is transformed into w̃.
2. All of its components shrink, but some shrink more than others: components along eigenvectors with small eigenvalues (directions of low curvature) are shrunk the most.
3. This ensures that only important features are given high weights (see the numeric check below).
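A quick numpy check of this rescaling (the symmetric H and the vector w∗ are random, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
H = A @ A.T                       # symmetric (positive semidefinite) Hessian
w_star = rng.standard_normal(3)
alpha = 0.5

# Direct solution of (H + alpha*I) w~ = H w*
w_tilde = np.linalg.solve(H + alpha * np.eye(3), H @ w_star)

# Same result via the eigendecomposition H = Q diag(lam) Q^T:
lam, Q = np.linalg.eigh(H)
# each component along eigenvector i is rescaled by lam_i / (lam_i + alpha)
w_tilde_eig = Q @ ((lam / (lam + alpha)) * (Q.T @ w_star))

print(np.allclose(w_tilde, w_tilde_eig))  # True
```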
L1 Regularization

• L1 regularization on the model parameter w is defined as the sum of absolute values of the individual parameters:

Ω(θ) = ||w||₁ = Σᵢ |wᵢ|

• As with L2 weight decay, L1 weight decay controls the strength of the regularization by scaling the penalty Ω using a positive hyperparameter α.
• The regularized objective function J̃(w; X, y) is given by

J̃(w; X, y) = α||w||₁ + J(w; X, y)
L1 Regularization

• The corresponding gradient is

∇_w J̃(w; X, y) = α sign(w) + ∇_w J(w; X, y)

where sign(w) is simply the sign of w applied element-wise.
• In comparison to L2 regularization, L1 regularization results in a solution that is more sparse.
• "Sparse" solutions, with many parameters set to zero:
● can be more interpretable
● can require less memory and less computation
● might generalize better (but also often not!)
L1 Regularization

• Like L2 regularization, we penalize weights with large magnitudes.
• However, the solutions are qualitatively different: with L1 regularization, some of the parameters will often be exactly zero (see the sketch below).
• Why L1?
● The L1 regularizer is popular because it gives sparse solutions and it is convex.
● If the error function is also convex, it is possible to find the global optimum.
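One way to see where the exact zeros come from: under a quadratic approximation with a diagonal Hessian (the analysis used in Goodfellow et al.), the L1-regularized minimum is a soft-thresholded version of w∗. A numpy sketch (the Hessian diagonal and α below are illustrative):

```python
import numpy as np

def l1_solution(w_star, h_diag, alpha):
    """Soft-thresholding: w_i = sign(w*_i) * max(|w*_i| - alpha/H_ii, 0).
    Components whose unregularized optimum is small enough become
    exactly zero, which is the source of L1's sparsity."""
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / h_diag, 0.0)

w_star = np.array([0.05, -0.3, 2.0])
print(l1_solution(w_star, h_diag=np.ones(3), alpha=0.1))
# -> [ 0.  -0.2  1.9]  (the smallest component is zeroed out)
```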
L2 vs L1

• L1 penalizes the sum of absolute values of the weights; L2 penalizes the sum of squared weights.
• L1 has a sparse solution; L2 has a non-sparse solution.
• L1 can have multiple solutions; L2 has one solution.
• L1 has built-in feature selection; L2 has no feature selection.
• L1 is robust to outliers; L2 is not robust to outliers.
• L1 generates models that are simple and interpretable but cannot learn complex patterns; L2 gives better predictions when the output variable is a function of all input features and is able to learn complex data patterns.
Data Augmentation

• The best way to make a machine learning model generalize better is to train it on more data.
• Data augmentation is an especially effective technique for a specific classification problem: object recognition.
• One must be careful not to apply transformations that would change the correct class.
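A minimal numpy sketch of label-preserving transforms for object recognition (the specific transforms are illustrative; a horizontal flip keeps a cat a cat, but the same flip would change the correct class in character recognition, e.g. 'b' vs. 'd'):

```python
import numpy as np

def augment(image, rng):
    """Randomly apply label-preserving transforms to an HxWxC image."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]        # horizontal flip
    shift = rng.integers(-2, 3)          # small horizontal translation
    return np.roll(image, shift, axis=1)
```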
Noise Robustness

• Noise can be applied to the inputs as a dataset augmentation strategy.
• Noise applied to the weights can also be interpreted as equivalent (under some assumptions) to a more traditional form of regularization.
• Consider the regression setting, where we wish to train a function ŷ(x) that maps a set of features x to a scalar using the least-squares cost function between the model predictions ŷ(x) and the true values y:

J = E_{p(x,y)} [(ŷ(x) − y)²]
Noise Robustness
• We can show that for a simple input-output neural network, adding Gaussian noise to the input is equivalent to weight decay (L2 regularization); a numeric check follows.
• This can also be viewed as data augmentation.
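A numeric sanity check, assuming a linear model ŷ(x) = wᵀx for illustration: averaging the squared error over input noise ε ~ N(0, σ²I) adds exactly σ²||w||² to the cost, which is an L2 (weight decay) penalty.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, 0.7, -1.2])
y, sigma = 1.0, 0.1

# Monte Carlo estimate of E[(w^T(x + eps) - y)^2] with eps ~ N(0, sigma^2 I)
eps = sigma * rng.standard_normal((1_000_000, 3))
noisy_cost = np.mean(((x + eps) @ w - y) ** 2)

# Closed form: clean squared error + sigma^2 * ||w||^2 (the weight decay term)
clean_cost = (x @ w - y) ** 2 + sigma**2 * np.sum(w**2)
print(noisy_cost, clean_cost)  # approximately equal
```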
Injecting Noise to Output

• Most datasets have some number of mistakes in the y labels.
• It can be harmful to maximize log p(y | x) when y is a mistake.
• One way to prevent this is to explicitly model the noise on the labels: we can assume that, for some small constant ε, the training-set label y is correct with probability 1 − ε, and that otherwise any of the other possible labels might be correct.
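A common concrete version of this idea is label smoothing: replace the hard 0 and 1 targets of a k-class softmax with ε/(k − 1) and 1 − ε. A minimal sketch:

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Label smoothing for k classes: the correct class gets target
    probability 1 - eps; the others share eps equally."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + (1.0 - one_hot) * (eps / (k - 1))

y = np.eye(4)[2]          # hard one-hot target for class 2
print(smooth_labels(y))   # [0.0333..., 0.0333..., 0.9, 0.0333...]
```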
Early Stopping

• When training large models with sufficient representational capacity to overfit the task, we often observe that training error decreases steadily over time, but validation set error begins to rise again.
• Early stopping therefore returns the parameters from the point in training with the lowest validation set error, rather than the final parameters.
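A patience-based sketch of the usual procedure (train_step, val_error, and the patience value are assumed placeholders, not a fixed API):

```python
import copy

def train_with_early_stopping(model, train_step, val_error, patience=5):
    """Stop once validation error has not improved for `patience`
    evaluations, and return the best parameters seen so far."""
    best_err, best_model, bad_epochs = float("inf"), None, 0
    while bad_epochs < patience:
        train_step(model)            # one epoch of training
        err = val_error(model)       # evaluate on the validation set
        if err < best_err:
            best_err, bad_epochs = err, 0
            best_model = copy.deepcopy(model)   # remember the best weights
        else:
            bad_epochs += 1
    return best_model, best_err
```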
Ensemble Methods
• Model averaging (a bagging ensemble) typically helps.
• However, training several large neural networks to form an ensemble is prohibitively expensive:
● Option 1: Train several neural networks having different architectures (obviously expensive).
● Option 2: Train multiple instances of the same network using different training samples (again expensive).
• Even if we manage to train with Option 1 or Option 2, combining several models at test time is infeasible in real-time applications.
Dropout

• Dropout is a technique that addresses both of these issues.
• Effectively, it allows training several neural networks without any significant computational overhead.
• It also gives an efficient, approximate way of combining exponentially many different neural networks.
Dropout

• Dropout refers to dropping out units.
• Temporarily remove a node and all its incoming/outgoing connections, resulting in a thinned network (see the sketch below).
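A minimal numpy sketch of (inverted) dropout at training time (the keep probability is illustrative):

```python
import numpy as np

def dropout_forward(h, p_keep=0.5, rng=None):
    """Inverted dropout: zero each unit with probability 1 - p_keep and
    rescale by 1/p_keep, so no rescaling is needed at test time."""
    rng = rng or np.random.default_rng()
    mask = (rng.random(h.shape) < p_keep) / p_keep
    return h * mask     # a randomly 'thinned' layer activation
```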
SUMMARY
