MIT Art Design and Technology University
MIT School of Computing, Pune
21BTCS031 – Deep Learning & Neural Networks
Class - L.Y. CORE (SEM-I)
Unit - II Deep Networks
Dr. Anant Kaulage, Dr. Sunita Parinam, Dr. Mayura Shelke, Dr. Aditya Pai
AY 2024-2025 SEM-I

Regularization
Unit II Introduction
A central problem in machine learning is how to make an algorithm that will perform well not just on the training data, but also on new inputs. Strategies used in machine learning are explicitly designed to reduce the test error, possibly at the expense of increased training error. These strategies are known collectively as regularization: "any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error."

Intuition
The loss function is the sum of squared differences between the actual values and the predicted values:

J(θ) = Σᵢ (yᵢ − ŷᵢ)²
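As a quick numpy illustration of this loss (the arrays y and y_hat are toy values invented for the example):

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0])        # actual values (toy data)
y_hat = np.array([1.1, 1.9, 3.2])    # model predictions (toy data)
loss = np.sum((y - y_hat) ** 2)      # sum of squared differences
print(loss)
```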
When we penalize the weights θ₃ and θ₄ and make them very small, close to zero, those terms become negligible, which helps simplify the model.

Parameter Norm Penalties
Many regularization approaches are based on limiting the capacity of models, such as neural networks, linear regression, or logistic regression, by adding a parameter norm penalty Ω(θ) to the objective function J. The regularized objective function J̃ is:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)

where α ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty term Ω relative to the standard objective function J.
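A minimal sketch of this composition; `loss_fn`, `omega`, and the example `mse`/`l2` helpers below are hypothetical stand-ins, not a fixed API:

```python
import numpy as np

def regularized_objective(theta, X, y, loss_fn, omega, alpha):
    """J~(theta; X, y) = J(theta; X, y) + alpha * Omega(theta)."""
    return loss_fn(theta, X, y) + alpha * omega(theta)

# Example with a hypothetical mean-squared-error loss and L2 penalty:
mse = lambda theta, X, y: np.mean((X @ theta - y) ** 2)
l2 = lambda theta: 0.5 * np.sum(theta ** 2)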
We typically choose a parameter norm penalty Ω that penalizes only the weights at each layer and leaves the biases unregularized. The biases typically require less data to fit accurately than the weights, and regularizing the bias parameters can introduce a significant amount of underfitting. It is sometimes desirable to use a separate penalty with a different α coefficient for each layer of the network.

L2 Parameter Regularization
The L2 parameter norm penalty is commonly known as weight decay. This regularization strategy drives the weights closer to the origin by adding a regularization term Ω(θ) = ½||w||₂² to the objective function. It is also known as ridge regression or Tikhonov regularization. To study the behavior of weight decay, consider the regularized objective function and its gradient; assume no bias parameter, so θ is just w:

J̃(w; X, y) = (α/2) wᵀw + J(w; X, y)
with the corresponding parameter gradient

∇_w J̃(w; X, y) = αw + ∇_w J(w; X, y)

To take a single gradient step to update the weights:

w ← w − ε(αw + ∇_w J(w; X, y))

Written another way, the update is:

w ← (1 − εα)w − ε ∇_w J(w; X, y)

The weight decay term has modified the learning rule to multiplicatively shrink the weight vector by the constant factor (1 − εα) on each step.
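A minimal sketch of this update rule, assuming a hypothetical callable `grad_J` that returns ∇_w J(w):

```python
import numpy as np

def weight_decay_step(w, grad_J, alpha, eps):
    """One gradient step on the L2-regularized objective:
    w <- (1 - eps * alpha) * w - eps * grad_J(w)."""
    return (1.0 - eps * alpha) * w - eps * grad_J(w)
```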
Consider a quadratic approximation to the objective function in the neighborhood of the value of the weights that obtains minimal unregularized training cost, w* = arg min_w J(w). If the objective function is truly quadratic, the approximation Ĵ is given by

Ĵ(w) = J(w*) + ½ (w − w*)ᵀ H (w − w*)

where H is the Hessian matrix of J with respect to w, evaluated at w*. There is no first-order term because w* is a minimum, where the gradient vanishes.
To study the effect of weight decay, we modify this equation by adding the weight decay gradient. We can now solve for the minimum of the regularized version of Ĵ. We use the variable w̃ to represent the location of the minimum:

αw̃ + H(w̃ − w*) = 0
(H + αI) w̃ = H w*
w̃ = (H + αI)⁻¹ H w*

Because H is real and symmetric, we can decompose it into a diagonal matrix Λ and an orthonormal basis of eigenvectors Q, such that H = QΛQᵀ. Applying the decomposition gives

w̃ = Q(Λ + αI)⁻¹ Λ Qᵀ w*
We see that the effect of weight decay is to rescale w* along the axes defined by the eigenvectors of H. Specifically, the component of w* that is aligned with the i-th eigenvector of H is rescaled by a factor of λᵢ/(λᵢ + α), as checked numerically in the sketch below.
1. The weight vector w* is rotated to w̃.
2. All of its elements shrink, but some shrink more than others.
3. This ensures that only important features are given high weights.
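A toy numpy check of this rescaling; the Hessian H, minimum w*, and α are made-up values for the example:

```python
import numpy as np

# Toy 2-D check of w~ = Q (Lambda + alpha*I)^-1 Lambda Q^T w*.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # Hessian at w* (real, symmetric)
w_star = np.array([1.0, -2.0])      # unregularized minimum (made up)
alpha = 0.5

lam, Q = np.linalg.eigh(H)          # H = Q diag(lam) Q^T
shrink = lam / (lam + alpha)        # per-eigendirection factor lam_i / (lam_i + alpha)
w_tilde = Q @ (shrink * (Q.T @ w_star))

# Agrees with the direct solution of (H + alpha*I) w~ = H w*.
assert np.allclose(w_tilde, np.linalg.solve(H + alpha * np.eye(2), H @ w_star))
```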
L1 Regularization

L1 regularization on the model parameter w is defined as the sum of absolute values of the individual parameters:

Ω(θ) = ||w||₁ = Σᵢ |wᵢ|

As with L2 weight decay, L1 weight decay controls the strength of the regularization by scaling the penalty Ω with a positive hyperparameter α. The regularized objective function J̃(w; X, y) is given by

J̃(w; X, y) = α||w||₁ + J(w; X, y)
with the corresponding gradient

∇_w J̃(w; X, y) = α sign(w) + ∇_w J(w; X, y)

where sign(w) is simply the sign of w applied element-wise; a subgradient step built on this is sketched after the list below. In comparison to L2 regularization, L1 regularization results in a solution that is more sparse. "Sparse" solutions, with many parameters set to zero:
● can be more interpretable
● can require less memory and less computation
● might generalize better (but also often not!)
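A minimal subgradient-step sketch for this objective (`grad_J` is a hypothetical callable); in practice, proximal/soft-thresholding updates are often preferred because they produce exact zeros:

```python
import numpy as np

def l1_subgradient_step(w, grad_J, alpha, eps):
    """One subgradient step on J~(w) = alpha * ||w||_1 + J(w):
    w <- w - eps * (alpha * sign(w) + grad_J(w))."""
    return w - eps * (alpha * np.sign(w) + grad_J(w))
```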
Like L2 regularization, we penalize weights with large magnitudes. However, the solutions are qualitatively different: with L1 regularization, some of the parameters will often be exactly zero.
Why L1?
● The L1 regularizer is popular because it gives sparse solutions and it is convex.
● If the error function is also convex, it is possible to find the global optimum.

L2 VS L1
L1:
• penalizes the sum of absolute values of weights
• has a sparse solution
• has multiple solutions
• has built-in feature selection
• is robust to outliers
• generates models that are simple and interpretable but cannot learn complex patterns

L2:
• penalizes the sum of squared weights
• has a non-sparse solution
• has one solution
• has no feature selection
• is not robust to outliers
• gives better prediction when the output variable is a function of all input features
• is able to learn complex data patterns

Data Augmentation
The best way to make a machine learning model generalize better is to train it on more data. Data augmentation is a particularly effective technique for a specific classification problem: object recognition. One must be careful not to apply transformations that would change the correct class.
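A toy sketch of label-preserving transforms for images, assuming (H, W, C) arrays; the flip and shift used here are illustrative, and transforms must be chosen per task so the correct class is unchanged:

```python
import numpy as np

def augment(image, rng):
    """Label-preserving augmentations for object recognition:
    random horizontal flip plus a small horizontal shift."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]          # horizontal flip
    shift = int(rng.integers(-2, 3))       # shift by up to 2 pixels
    return np.roll(image, shift, axis=1)

rng = np.random.default_rng(0)
augmented = augment(np.zeros((32, 32, 3)), rng)
```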
Noise Robustness

Noise can be applied to the inputs as a dataset augmentation strategy. Noise applied to the weights can also be interpreted as equivalent (under some assumptions) to a more traditional form of regularization. Consider the regression setting, where we wish to train a function ŷ(x) that maps a set of features x to a scalar, using the least-squares cost function between the model predictions ŷ(x) and the true values y:

J = E_{p(x,y)} [(ŷ(x) − y)²]

We can show that for a simple input-output neural network, adding Gaussian noise to the input is equivalent to weight decay (L2 regularization). It can also be viewed as data augmentation.
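A minimal sketch of injecting input noise during training (`sigma` sets the noise scale):

```python
import numpy as np

def noisy_inputs(X, sigma, rng):
    """Add zero-mean Gaussian noise (scale sigma) to the inputs;
    for a simple linear model under least squares this acts like
    L2 weight decay, and it can also be viewed as data augmentation."""
    return X + sigma * rng.standard_normal(X.shape)
```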
Injecting Noise to Output

Most datasets have some number of mistakes in the y labels. It can be harmful to maximize log p(y | x) when y is a mistake. One way to prevent this is to explicitly model the noise on the labels: we can assume that, for some small constant ε, the training set label y is correct with probability 1 − ε, and otherwise any of the other possible labels might be correct.
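A minimal sketch of this label-smoothing scheme for one-hot targets:

```python
import numpy as np

def smooth_labels(y_onehot, eps):
    """Replace the hard targets 0 and 1 with eps/(k-1) and 1-eps,
    where k is the number of classes."""
    k = y_onehot.shape[-1]
    return y_onehot * (1.0 - eps) + (1.0 - y_onehot) * eps / (k - 1)

# e.g. smooth_labels(np.eye(3), 0.1) turns [1, 0, 0] into [0.9, 0.05, 0.05]
```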
Early Stopping

When training large models with sufficient representational capacity to overfit the task, we often observe that training error decreases steadily over time, but validation set error begins to rise again.
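A minimal sketch of the early-stopping loop, with hypothetical `train_epoch` and `validate` callables standing in for the actual training code:

```python
import numpy as np

def fit_with_early_stopping(train_epoch, validate, patience=5, max_epochs=200):
    """Stop when validation loss has not improved for `patience` epochs."""
    best_loss, best_epoch, since_best = np.inf, 0, 0
    for epoch in range(max_epochs):
        train_epoch()
        val_loss = validate()
        if val_loss < best_loss:
            best_loss, best_epoch, since_best = val_loss, epoch, 0
            # in practice, checkpoint the parameters here
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_epoch, best_loss
```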
Ensemble Methods

Model averaging (bagging ensembles) typically always helps, but training several large neural networks to form an ensemble is prohibitively expensive.
Option 1: Train several neural networks with different architectures (obviously expensive).
Option 2: Train multiple instances of the same network using different training samples (again expensive).
Even if we manage to train with option 1 or option 2, combining several models at test time is infeasible in real-time applications.
Dropout

Dropout is a technique that addresses both of these issues. Effectively, it allows training several neural networks without any significant computational overhead. It also gives an efficient approximate way of combining exponentially many different neural networks.
Dropout refers to dropping out units: temporarily removing a node and all its incoming/outgoing connections, resulting in a thinned network.
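A minimal sketch of the standard (inverted) dropout mask; integration into a full training loop is omitted:

```python
import numpy as np

def dropout(h, p_drop, rng, training=True):
    """Inverted dropout: during training, zero each unit with probability
    p_drop and rescale survivors by 1/(1 - p_drop) so the expected
    activation is unchanged; at test time the input passes through."""
    if not training:
        return h
    mask = rng.random(h.shape) >= p_drop   # keep with probability 1 - p_drop
    return h * mask / (1.0 - p_drop)
```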
SUMMARY