
Unit 2

Regularization
• Regularization: Overview, Parameter Norm Penalties,
Norm Penalties as Constrained Optimization, Regularization and
Underconstrained Problems, Data Augmentation, Noise Robustness,
Batch Normalization, Semi-Supervised Learning, Multi-Task
Learning, Early Stopping, Parameter Tying and Parameter Sharing,
Sparse Representations, Bagging, Dropout. Tuning Neural Networks,
Hyperparameters
Regularization

• Definition: Regularization is one of the most
important concepts in deep learning. It is a technique
to prevent the model from overfitting by adding extra
information to it.
Regularization
• Generalization error (test error): Sometimes a deep learning model performs well on
the training data but does not perform well on the test data, i.e. the model is
not able to predict the output when it deals with unseen data because it has fit
the noise in the training data; such a model is called overfitted.

• At the left end of the graph, training error and
generalization error are both high. This is the underfitting
regime.

• As we increase capacity, training error decreases, but the gap between training
and generalization error increases. Eventually, the size of this gap outweighs
the decrease in training error, and we enter the overfitting regime, where
capacity is too large, above the optimal capacity.
Regularization (Overview)
• A trained deep learning model falls into one of three regimes:
1. It excludes the true data-generating process (underfitting, which induces bias), or
2. It matches the true data-generating process (the desired "just fit"), or
3. It includes the true generating process but also many other possible generating
processes (the overfitting regime, where variance rather than bias dominates the
estimation error).
• The goal of regularization is to take a model from the third regime into the
second regime.
• The best-fitting model (in the sense of minimizing generalization error) is a
large model that has been regularized appropriately.
• In a regularization technique, we reduce the magnitude of the
features while keeping the same number of features.
Overfitting occurs when our deep learning model tries to cover all
the data points, or more than the required data points, present in the
given dataset. The overfitted model has low bias and high variance.

Ques: How do we avoid overfitting in a model?
Ans: Regularization
Regularization Techniques
Parameter Norm Penalties
• The parameter norm penalty approaches limit the capacity of
neural network models by adding a parameter norm penalty Ω(θ) to the
objective function J:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)

• where α ∈ [0, ∞) is a hyperparameter that weights the relative contribution of
the norm penalty term Ω.
• We choose a parameter norm penalty Ω that penalizes only the weights
of the affine transformation at each layer and leaves the biases unregularized,
since biases typically require less data to fit accurately than the weights.
• Different choices for the parameter norm Ω can result in different solutions
being preferred.
• Using a separate penalty with a different α coefficient for each layer is
sometimes desirable, but because it can be expensive to search for the correct
value of multiple hyperparameters, it is still reasonable to use the same weight
decay at all layers just to reduce the size of the search space.
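The penalized objective above can be sketched in a few lines of numpy. This is a minimal illustration, not a library API; the function names are our own:

```python
import numpy as np

def l2_penalty(w):
    # Omega(theta) = (1/2) * ||w||_2^2, applied to the weights only
    # (biases are left unregularized, as noted above)
    return 0.5 * np.sum(w ** 2)

def penalized_objective(base_loss, w, alpha):
    # J~(theta; X, y) = J(theta; X, y) + alpha * Omega(theta)
    return base_loss + alpha * l2_penalty(w)

w = np.array([1.0, -2.0, 0.5])
print(penalized_objective(0.3, w, alpha=0.1))  # 0.3 + 0.1 * 2.625 = 0.5625
```

Setting alpha = 0 recovers the unregularized objective; larger alpha weights the penalty more heavily relative to the task loss.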
Parameter Norm Penalties
1. L2 Parameter Regularization (ridge regression or Tikhonov regularization):
This regularization strategy drives the weights closer to the origin by
adding a regularization term Ω(θ) = ½‖w‖²₂ to the objective function.
L2 Parameter Regularization
• The addition of the weight decay term has modified the learning rule
to multiplicatively shrink the weight vector by a constant factor on
each step, just before performing the usual gradient update.
• The solid ellipses represent contours of equal
value of the unregularized objective.
• The dotted circles represent contours of equal
value of the L2 regularizer.
• At the point w̃, these competing objectives reach
an equilibrium.
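The multiplicative shrinkage described above can be seen directly in the update rule. A minimal sketch, assuming Ω = ½‖w‖² and plain gradient descent (the function name is ours):

```python
import numpy as np

def weight_decay_step(w, grad_J, lr, alpha):
    # Gradient of the penalized objective is grad_J + alpha * w, so
    # w <- w - lr * (grad_J + alpha * w) = (1 - lr * alpha) * w - lr * grad_J
    # i.e. the weight vector is shrunk by the constant factor (1 - lr * alpha)
    # on each step, just before the usual gradient update.
    return (1.0 - lr * alpha) * w - lr * grad_J

w = np.array([1.0, -2.0])
w_next = weight_decay_step(w, grad_J=np.zeros(2), lr=0.1, alpha=0.5)
# with a zero task gradient the weights simply shrink: [0.95, -1.9]
```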
L1 Regularization
• Formally, L1 regularization on the model parameters w is defined as:

Ω(θ) = ‖w‖₁ = Σᵢ |wᵢ|

• We can see that the regularization contribution to the gradient no longer
scales linearly with each wᵢ; instead it is a constant factor with a sign equal to
sign(wᵢ).
• In comparison to L2 regularization, L1 regularization results in a solution that
is more sparse. Sparsity in this context refers to the fact that some
parameters have an optimal value of zero.
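The difference between the two penalty gradients can be illustrated directly; a small sketch with our own function names:

```python
import numpy as np

def l2_grad(w, alpha):
    # L2 contribution alpha * w_i: shrinks large weights proportionally,
    # so the push toward zero fades as a weight gets small
    return alpha * w

def l1_grad(w, alpha):
    # L1 (sub)gradient alpha * sign(w_i): a constant-magnitude push toward
    # zero for every nonzero weight, which is what produces exact zeros
    return alpha * np.sign(w)

w = np.array([2.0, -0.1, 0.0])
print(l2_grad(w, 0.5))  # [ 1.   -0.05  0.  ]
print(l1_grad(w, 0.5))  # [ 0.5  -0.5   0.  ]
```

Note that for the small weight -0.1 the L2 push is tiny (-0.05) while the L1 push is still 0.5 in magnitude, which is why L1 drives such weights all the way to zero.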
Norm Penalties as Constrained
Optimization
• To minimize a function subject to constraints, a generalized Lagrange function,
consisting of the original objective function plus a set of penalties can be
constructed.
• If we wanted to constrain Ω(θ) to be less than some constant k, we could construct a
generalized Lagrange function:

L(θ, α; X, y) = J(θ; X, y) + α(Ω(θ) − k)
Norm Penalties as Constrained
Optimization
• Solving this problem requires modifying both θ and α. Many different
procedures are possible; in all of them, α must increase whenever Ω(θ) > k and
decrease whenever Ω(θ) < k.
• We can fix α at its optimal value α* and view the problem as just a function of θ:

θ* = argmin_θ J(θ; X, y) + α*Ω(θ)

• We can thus think of the parameter norm penalty as imposing a constraint on
the weights.
Norm Penalties as Constrained
Optimization
• How α influences the weights:
• If Ω is the L2 norm, the weights are constrained to lie in an L2 ball.
• If Ω is the L1 norm, the weights are constrained to lie in a region of
limited L1 norm.
• Usually, we do not know the size of the constraint region that we impose
by using weight decay with coefficient α*, because the value of
α* does not directly tell us the value of k.
- A larger α results in a smaller constraint region.
- A smaller α results in a larger constraint region.
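The effect of α on the size of the implicit constraint region can be checked numerically. A sketch using the ridge closed-form solution on synthetic data (function names and data are ours, purely for illustration):

```python
import numpy as np

def ridge_weights(X, y, alpha):
    # Closed-form ridge solution: w = (X^T X + alpha * I)^-1 X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)

# Larger alpha acts like a tighter constraint region: ||w|| shrinks
norms = [np.linalg.norm(ridge_weights(X, y, a)) for a in (0.01, 1.0, 100.0)]
```

Printing `norms` shows a strictly decreasing sequence, matching the statement that a larger α corresponds to a smaller constraint region on the weights.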
Norm Penalties as Constrained
Optimization
• Reprojection
• Sometimes we may wish to use explicit constraints rather than
penalties.
- We can modify SGD to take a step downhill on J(θ) and then project θ
back to the nearest point that satisfies Ω(θ) < k.
- This is useful when we have an idea of what value of k is appropriate and
we do not want to spend time searching for the value of α that
corresponds to this k.
• Rationale for explicit constraints/reprojection:
1. Dead weights
2. Stability
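The projected-SGD step described above can be sketched as follows, for an L2-ball constraint (function names are ours):

```python
import numpy as np

def project_l2_ball(w, k):
    # Nearest point satisfying ||w||_2 <= k: if w is outside the ball,
    # rescale it back onto the surface; otherwise leave it unchanged
    norm = np.linalg.norm(w)
    return w if norm <= k else w * (k / norm)

def projected_sgd_step(w, grad_J, lr, k):
    # Take a plain downhill step on J(theta), then reproject theta
    # so the constraint Omega(theta) <= k holds after every update
    return project_l2_ball(w - lr * grad_J, k)

w = np.array([3.0, 4.0])            # ||w|| = 5
print(project_l2_ball(w, 1.0))      # [0.6 0.8]
```

Because the projection runs after every step, the weight norm can never exceed k, which is what cuts off the positive feedback loop between large weights and large gradients discussed below.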
Norm Penalties as Constrained
Optimization
• Eliminating dead weights
• One reason to use explicit constraints and reprojection rather than
enforcing constraints with penalties:
- Penalties can cause nonconvex optimization procedures to get stuck in
local minima corresponding to small θ.
- This manifests as training with dead units.
• Explicit constraints implemented by reprojection can work much better
because they do not encourage the weights to approach the origin.
Norm Penalties as Constrained
Optimization
• Stability of Optimization
• Explicit constraints with reprojection can be useful because
these impose some stability on the optimization procedure.
• When using high learning rates, it is possible to encounter a
positive feedback loop in which large weights induce
large gradients, which then induce a large update of the weights;
this can lead to numerical overflow.
• Explicit constraints with reprojection prevent this feedback loop
from continuing to increase the magnitude of the weights without
bound.
Regularization and Underconstrained
Problems
• Underconstrained logistic regression
• Solution for Underconstrained: Iterative
