Unit4 DL Final
_________________________________________________________________________________
REGULARIZATION
In Machine Learning, and more so in Deep Learning, overfitting is a major issue that occurs
during training. A model is considered to be overfitting the training data when the training error keeps
decreasing but the test error (or the generalisation error) starts increasing. At this point we tend to
believe that the model is memorising the training data rather than generalising to unseen data.
Linear models such as linear regression and logistic regression allow simple, straightforward
and effective regularization strategies. The idea here is to limit the capacity (the space of functions the
model can represent) of the model by adding a parameter norm penalty, Ω(θ), to the objective
function, J:
J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)
Here, α ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty term Ω
relative to the standard objective function J. Setting α to 0 results in no regularization, while larger values
of α correspond to more regularization.
One of the simplest and most common parameter norm penalties is L² parameter regularization,
commonly known as weight decay. Here, we obtain the regularized objective by adding the following
penalty term to the objective function:
Ω(θ) = ½‖w‖₂² = ½ wᵀw,  so that  J̃(w; X, y) = J(w; X, y) + (α/2) wᵀw
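As a minimal illustration (a sketch with made-up data; the names X, y, alpha and lr are illustrative, not from the text), weight decay simply adds α·w to the gradient of the original objective:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                       # made-up design matrix
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])        # made-up targets

alpha = 0.1      # weight decay coefficient
lr = 0.01        # learning rate
w = np.zeros(5)

for _ in range(2000):
    grad_J = X.T @ (X @ w - y) / len(y)   # gradient of the unregularized objective J(w)
    w -= lr * (grad_J + alpha * w)        # gradient of (alpha/2)*||w||^2 is alpha*w

print(w)   # shrunk towards zero relative to the unregularized least-squares solution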
Applying the second-order Taylor-series approximation of J (ignoring all terms of order greater than 2
in the Taylor-series expansion) at the point w* (where J(w; X, y) assumes its minimum value, i.e.,
∇J(w*) = 0), we get the following expression (the first-order gradient term vanishes):
Ĵ(w) = J(w*) + ½ (w − w*)ᵀ H (w − w*)
where H is the Hessian of J evaluated at w*. Then ∇Ĵ(w) = H(w − w*), since the first term is a constant
and the derivative of (w − w*)ᵀ H (w − w*) with respect to w is 2H(w − w*). The overall gradient of the
regularized objective (gradient of Ĵ plus gradient of αΩ(θ)) becomes:
αw̃ + H(w̃ − w*) = 0,  so that  w̃ = (H + αI)⁻¹ H w*
As α approaches 0, w̃ comes closer to w*. Since H is real and symmetric, it can be decomposed
into a diagonal matrix Λ of eigenvalues and an orthonormal basis of eigenvectors Q; that is, H = QΛQᵀ.
Substituting this decomposition gives w̃ = Q(Λ + αI)⁻¹ΛQᵀw*, so the value of w* is rescaled along the
eigenvectors of H: the component along eigenvector i is rescaled by λᵢ/(λᵢ + α), where λᵢ represents the
eigenvalue corresponding to that eigenvector.
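As a quick numerical check (a sketch with a made-up Hessian H and minimizer w*, not taken from the text), we can verify that (H + αI)⁻¹Hw* indeed rescales the components of w* along each eigenvector of H by λᵢ/(λᵢ + α):

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
H = A @ A.T + np.eye(4)                  # made-up symmetric positive-definite Hessian
w_star = np.array([1.0, -2.0, 0.5, 3.0]) # made-up unregularized minimizer
alpha = 0.5

w_tilde = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)   # (H + alpha I)^(-1) H w*

lam, Q = np.linalg.eigh(H)                                     # H = Q diag(lam) Q^T
rescaled = Q @ np.diag(lam / (lam + alpha)) @ Q.T @ w_star     # shrink each eigen-component

print(np.allclose(w_tilde, rescaled))   # True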
To see its application to machine learning, we can look at linear regression. The objective
function there is exactly quadratic (the sum of squared errors (Xw − y)ᵀ(Xw − y)), so the analysis
above applies directly, with L² regularization shrinking the weights along the eigenvectors of XᵀX.
Another common choice is L¹ regularization. Here, the parameter norm penalty is given by
Ω(θ) = ‖w‖₁, i.e., the sum of absolute values of the individual parameters. This makes the gradient of the
overall objective function:
∇J̃(w; X, y) = αsign(w) + ∇J(w; X, y)
Now, the term sign(w) creates some difficulty, as the gradient no longer scales linearly with w.
This leads to a few complexities in arriving at the optimal solution. Assuming a diagonal Hessian, the
solution for each parameter takes the form:
wᵢ = sign(w*ᵢ) max{ |w*ᵢ| − α/Hᵢ,ᵢ , 0 }
The interpretation of the max term is that the regularized weight should not cross zero: either |w*ᵢ| is
large enough that wᵢ is merely shifted towards zero by α/Hᵢ,ᵢ, or wᵢ is pushed exactly to zero, since the
absolute value function is not differentiable at zero.
Thus, L¹ regularization has the property of sparsity, which is its fundamental distinguishing feature
from L². Hence, L¹ regularization is used for feature selection, as in LASSO (Least Absolute Shrinkage
and Selection Operator).
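A small illustrative comparison (using scikit-learn's Lasso and Ridge on synthetic data; the dataset and penalty strengths are made up) of the sparsity induced by the L¹ penalty versus the L² penalty:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[[2, 7]] = [3.0, -1.5]                 # only two of the ten features matter
y = X @ true_w + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

print(np.round(lasso.coef_, 3))      # most coefficients are exactly zero (sparse)
print(np.round(ridge.coef_, 3))      # coefficients are shrunk but generally nonzero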
We know that to minimize any function under some constraints, we can construct a
generalized Lagrange function consisting of the original objective function along with a set of
penalties. Each penalty is a product between a coefficient, called a Karush-Kuhn-Tucker (KKT)
multiplier, and a function representing whether the constraint is satisfied. Suppose we wanted
Ω(θ) < k; then we could construct the following Lagrangian:
L(θ, α; X, y) = J(θ; X, y) + α(Ω(θ) − k)
This is now similar to the parameter norm penalty regularized objective function as both of
them encourage lower values of the norm. Thus, parameter norm penalties can be seen as imposing a
constraint: L² regularization, for example, confines the weights to an L² ball. A larger α implies a smaller
constraint region, since it pushes the norm to lower values (a smaller ball radius), and vice versa.
The idea of using explicit constraints rather than penalties matters for several reasons. Large penalties
might cause non-convex optimization algorithms to get stuck in local minima with small values of θ,
leading to the formation of so-called dead units, as the weights entering and leaving them are too small
to have an impact. Constraints don't force the weights to be near zero; they only require the weights to
remain inside a constraint region.
These are units that do not contribute much to the behavior of the function learned by the
network because the weights going into or out of them are all very small. When training with a
penalty on the norm of the weights, these configurations can be locally optimal, even if it is possible
to significantly reduce J by making the weights larger. Explicit constraints implemented by
re-projection can work much better in these cases because they do not encourage the weights to approach
the origin. Explicit constraints implemented by re-projection only have an effect when the weights
become large and attempt to leave the constraint region.
Another reason is that constraints induce higher stability. With high learning rates, a large
weight can produce a large gradient, which in turn produces an even larger weight; iterated, this can lead
to numerical overflow in the value of θ. Constraints, along with reprojection (onto the corresponding
ball), prevent the weights from becoming too large, thus maintaining stability.
Finally, explicit constraints with reprojection can be useful because they impose some stability on the
optimization procedure. When using high learning rates, it is possible to encounter a positive feedback
loop in which large weights induce large gradients which then induce a large update to the weights. If
these updates consistently increase the size of the weights, then θ rapidly moves away from
the origin until numerical overflow occurs. Explicit constraints with reprojection prevent this
feedback loop from continuing to increase the magnitude of the weights without bound. Hinton et al.
(2012c) recommend using constraints combined with a high learning rate to allow rapid exploration of
parameter space while maintaining some stability.
A final suggestion made by Hinton was to restrict the individual column norms of the weight
matrix rather than the Frobenius norm of the entire weight matrix, so as to prevent any hidden unit
from having a large weight. The idea here is that if we restrict the Frobenius norm, it doesn’t guarantee
that the individual weights would be small, just their norm. So, we might have large weights being
compensated by extremely small weights to make the overall norm small. Restricting each hidden unit
individually gives us the required guarantee.
In some cases, regularization is also needed for a problem to be properly defined. Linear problems
such as linear regression have closed-form solutions only when the relevant matrix (XᵀX) is invertible;
many forms of regularization correspond to inverting XᵀX + αI instead, which is guaranteed to be
invertible. It is also possible for a problem with no closed-form solution to be underdetermined. An example is
logistic regression applied to a problem where the classes are linearly separable. If a weight vector w
is able to achieve perfect classification, then 2w will also achieve perfect classification and higher
likelihood. An iterative optimization procedure like stochastic gradient descent will continually
increase the magnitude of w and, in theory, will never halt. In practice, a numerical implementation of
gradient descent will eventually reach sufficiently large weights to cause numerical overflow, at
which point its behavior will depend on how the programmer has decided to handle values that are not
real numbers.
Most forms of regularization are able to guarantee the convergence of iterative methods
applied to underdetermined problems. For example, weight decay will cause gradient descent to quit
increasing the magnitude of the weights when the slope of the likelihood is equal to the weight decay
coefficient.
The idea of using regularization to solve underdetermined problems extends beyond machine
learning. The same idea is useful for several basic linear algebra problems.
Regularization can also be used to solve underdetermined linear algebra problems. For example, one
definition of the Moore-Penrose pseudoinverse X⁺ of a matrix X is:
X⁺ = lim_{α→0} (XᵀX + αI)⁻¹Xᵀ
which can be interpreted as performing linear regression with weight decay and letting the regularization
coefficient shrink to zero.
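A small numerical sketch of this definition (the matrix is made up, and a small α stands in for the limit α → 0):

import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])     # wide matrix: X^T X is singular, the problem is underdetermined

alpha = 1e-8                        # a small alpha standing in for the limit alpha -> 0
approx_pinv = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T)

print(np.allclose(approx_pinv, np.linalg.pinv(X), atol=1e-5))   # True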
4. DATASET AUGMENTATION
The best way to make a machine learning model generalize better is to train it on more data.
Of course, in practice, the amount of data we have is limited. One way to get around this problem is to
create fake data and add it to the training set.
For some machine learning tasks, it is reasonably straightforward to create new fake data. This
approach is easiest for classification. A classifier needs to take a complicated, high dimensional input
x and summarize it with a single category identity y. This means that the main task facing a classifier
is to be invariant to a wide variety of transformations. We can generate new (x, y) pairs easily just by
transforming the x inputs in our training set.
Although data augmentation is not strictly a regularization method, it certainly has its place here.
This approach is not as readily applicable to many other tasks. For example, it is difficult to
generate new fake data for a density estimation task unless we have already solved the density
estimation problem. Dataset augmentation has been a particularly effective technique for a specific
classification problem: object recognition. Images are high dimensional and include an enormous
variety of factors of variation, many of which can be easily simulated. Operations like translating the
training images a few pixels in each direction can often greatly improve generalization, even if the
model has already been designed to be partially translation invariant by using the convolution and
pooling techniques.
One must be careful not to apply transformations that would change the correct class. For example,
optical character recognition tasks require recognizing the difference between ‘b’ and ‘d’ and the
difference between ‘6’ and ‘9’, so horizontal flips and 180◦ rotations are not appropriate ways of
augmenting datasets for these tasks.
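A minimal sketch of label-preserving augmentation in plain NumPy (the image, the shift range, and the decision to omit flips are illustrative; flips are left out because, as noted above, they can change the correct class for character data):

import numpy as np

def augment(image, rng, max_shift=2):
    # Translate the image by a few pixels in each direction (with wrap-around, for simplicity).
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(np.roll(image, dy, axis=0), dx, axis=1)

rng = np.random.default_rng(0)
image = rng.random((28, 28))                            # stand-in for a training image
new_examples = [augment(image, rng) for _ in range(5)]  # five new (x, y) pairs with the same label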
There are also transformations that we would like our classifiers to be invariant to, but which
are not easy to perform. For example, out-of-plane rotation can not be implemented as a simple
geometric operation on the input pixels. Dataset augmentation is effective for speech recognition tasks
as well. Injecting noise in the input to a neural network (Sietsma and Dow, 1991) can also be seen as a
form of data augmentation. For many classification and even some regression tasks, the task should
still be possible to solve even if small random noise is added to the input. Neural networks prove not
to be very robust to noise, however (Tang and Eliasmith, 2010). One way to improve the robustness
of neural networks is simply to train them with random noise applied to their inputs. Input noise
injection is part of some unsupervised learning algorithms such as the denoising autoencoder
(Vincent et al., 2008). Noise injection also works when the noise is applied to the hidden units, which
can be seen as doing dataset augmentation at multiple levels of abstraction. Poole et al. (2014)
recently showed that this approach can be highly effective provided that the magnitude of the noise is
carefully tuned. Dropout, a powerful regularization strategy discussed below, can be seen as a process of
constructing new inputs by multiplying by noise.
Data augmentation refers to the process of generating new training examples for our dataset.
More training data means lower model variance and hence lower generalization error. Simple as that. It can
also be seen as a form of noise injection into the training dataset.
Data augmentation can be achieved in many different ways. Let’s explore some of them.
Feature space augmentation (see the sketch below): Instead of transforming data in the input space as above, we
can apply transformations on the feature space. For example, an autoencoder might be used to
extract the latent representation. Noise can then be added in the latent representation which
results in a transformation of the original data point.
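A rough sketch of the idea (the "encoder" and "decoder" here are stand-in random linear maps, not a trained autoencoder; the noise scale is illustrative):

import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(8, 32)) * 0.1    # stand-in encoder: 32-dim input -> 8-dim latent
W_dec = rng.normal(size=(32, 8)) * 0.1    # stand-in decoder: latent -> input space

def feature_space_augment(x, noise_std=0.1):
    z = W_enc @ x                                       # latent representation
    z_noisy = z + noise_std * rng.normal(size=z.shape)  # add noise in the feature space
    return W_dec @ z_noisy                              # decode to a new input-space example

x = rng.normal(size=32)          # stand-in training example
x_aug = feature_space_augment(x)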
Having more data is the most desirable way to improve a machine learning model's performance. In
many cases, it is relatively easy to artificially generate data. For a classification task, we want the
model to be invariant to certain types of transformations, and we can generate the
corresponding (x, y) pairs by transforming the input x. But for certain problems, like density estimation,
we can't apply this directly unless we have already solved the density estimation problem.
However, caution needs to be maintained while augmenting data to make sure that the class doesn’t
change. For e.g., if the labels contain both “b” and “d”, then horizontal flipping would be a bad idea for
data augmentation. Adding random noise to the inputs is another form of data augmentation, while
adding noise to hidden units can be seen as doing data augmentation at multiple levels of abstraction.
Finally, when comparing machine learning models, we need to evaluate them using the same hand-
designed data augmentation schemes or else it might happen that algorithm A outperforms algorithm
B, just because it was trained on a dataset which had more / better data augmentation.
5. NOISE ROBUSTNESS
For some models, the addition of noise with infinitesimal variance at the input is equivalent to imposing
a penalty on the norm of the weights. Noise added to hidden units is very important and is discussed under Dropout.
Noise can even be added to the weights. This has several interpretations. One of them is that adding
noise to weights is a stochastic implementation of Bayesian inference over the weights, where the
weights are considered to be uncertain, with the uncertainty being modelled by a probability
distribution. It is also interpreted as a more traditional form of regularization by ensuring stability in
learning.
For example, in the linear regression case, we want to learn a mapping ŷ(x) for each feature vector x by
minimizing the least-squares cost between the model prediction ŷ(x) and the true value y:
J = E_{p(x,y)}[(ŷ(x) − y)²]
The training set consists of m labelled examples {(x⁽¹⁾, y⁽¹⁾), ..., (x⁽ᵐ⁾, y⁽ᵐ⁾)}.
Now, suppose a zero-mean Gaussian random noise ϵ_W, with variance ηI, is added to the weights.
We still want to learn the appropriate mapping by reducing the mean squared error. Minimizing the loss
after adding noise to the weights turns out to be equivalent to adding a regularization term,
ηE[‖∇_W ŷ(x)‖²], which ensures that small perturbations in the weight values don't change the
predictions much, thus stabilising training. Despite the injection of noise, we are still interested in
minimizing the squared error of the output of the network. The objective function thus becomes:
J̃_W = E_{p(x,y)}[(ŷ_{ϵ_W}(x) − y)²] ≈ E_{p(x,y)}[(ŷ(x) − y)²] + ηE[‖∇_W ŷ(x)‖²]
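A minimal training-step sketch of this kind of weight-noise injection (the model, data, learning rate and noise variance η are all illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5)

w = np.zeros(5)
lr, eta = 0.01, 0.01                 # learning rate and noise variance (made up)

for _ in range(1000):
    eps = np.sqrt(eta) * rng.normal(size=w.shape)   # eps_W ~ N(0, eta I)
    grad = X.T @ (X @ (w + eps) - y) / len(y)       # gradient of the squared error at the noisy weights
    w -= lr * grad                                  # update the underlying (noise-free) weights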
Sometimes we may have wrong output labels, in which case maximizing p(y | x) may not be a good
idea. In such a case, we can add noise to the labels by assuming that the label is correct with probability
(1 − ϵ) and incorrect with probability ϵ; in the latter case, all the other labels are equally likely. Label
smoothing regularizes a model with k softmax outputs by replacing the hard 0/1 classification targets with
targets of (1 − ϵ) for the correct class and ϵ/(k − 1) for each of the remaining (k − 1) classes.
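A small sketch of the smoothed targets (k, ϵ and the label index are illustrative):

import numpy as np

def smooth_labels(label_index, k, eps=0.1):
    # 1 - eps on the correct class, eps/(k-1) on each of the remaining k-1 classes.
    target = np.full(k, eps / (k - 1))
    target[label_index] = 1.0 - eps
    return target

print(smooth_labels(label_index=2, k=5))   # [0.025 0.025 0.9 0.025 0.025]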
6. SEMI-SUPERVISED LEARNING
In the paradigm of semi-supervised learning, both unlabeled examples from P(x) and labeled
examples from P(x, y) are used to estimate P(y | x) or predict y from x.
P(x, y) denotes the joint distribution of x and y, i.e., corresponding to a training sample x, we
have a label y. P(x) denotes the marginal distribution of x, i.e., just the training examples without any
labels. In semi-supervised learning, we use both P(x, y) (some labelled samples) and P(x) (unlabelled
samples) to estimate P(y | x) (since we want to predict the class, given the training sample). We want to
learn some representation h = f(x) such that samples which are closer in the input space have similar
representations, so that a linear classifier in the new space achieves better generalization error.
Instead of separating the supervised and unsupervised criteria, we can instead have a
generative model of P(x) or P(x, y) which shares parameters with the discriminative model P(y | x).
The idea is to combine the unsupervised/generative criterion with the supervised criterion to express a
prior belief that the structure of P(x) or P(x, y) is connected to the structure of P(y | x), which is
expressed by the shared parameters.
7. MULTITASK LEARNING
The idea is to improve the generalization error by pooling together examples from multiple
tasks. Similar to how more data leads to more generalization, using a part of the model for different
tasks constrains that part to learn good values.
The diagram above illustrates a very common form of multi-task learning, in which different
supervised tasks (predicting y⁽ⁱ⁾ given x) share the same input x, as well as some intermediate-level
representation h (shared) capturing a common pool of factors. The model can generally be divided into
two kinds of parts and associated parameters:
1. Task-specific parameters (which only benefit from the examples of their task to achieve good
generalization). These are the upper layers of the neural network in the figure.
2. Generic parameters, shared across all the tasks (which benefit from the pooled data of all the tasks).
These are the lower layers of the neural network in the figure.
Improved generalization and generalization error bounds can be achieved because of the
shared parameters, for which statistical strength can be greatly improved (in proportion with the
increased number of examples for the shared parameters, compared to the scenario of single-task
models). Of course this will happen only if some assumptions about the statistical relationship
between the different tasks are valid, meaning that there is something shared across some of the tasks.
Multitask learning leads to better generalization when there is actually some relationship
between the tasks, which actually happens in the context of Deep Learning where some of the factors,
which explain the variation observed in the data, are shared across different tasks.
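A minimal architectural sketch of this kind of hard parameter sharing (layer sizes and the two task heads are illustrative; plain NumPy is used just to keep the idea self-contained):

import numpy as np

rng = np.random.default_rng(0)

W_shared = rng.normal(size=(16, 32)) * 0.1            # generic (shared) lower layer
W_task = {"task_a": rng.normal(size=(3, 16)) * 0.1,   # task-specific upper layer: 3-way classifier
          "task_b": rng.normal(size=(1, 16)) * 0.1}   # task-specific upper layer: scalar regressor

def forward(x, task):
    h = np.tanh(W_shared @ x)    # shared intermediate representation h
    return W_task[task] @ h      # task-specific output

x = rng.normal(size=32)
print(forward(x, "task_a").shape, forward(x, "task_b").shape)   # (3,) (1,)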
8. EARLY STOPPING
Early stopping is one of the most commonly used strategies because it is very simple and
quite effective. It refers to stopping the training when the validation error starts to rise even though the
training error is still decreasing.
When training large models with sufficient representational capacity to overfit the task, we
often observe that training error decreases steadily over time, but validation set error begins to rise
again.
This means we can obtain a model with better validation set error (and thus, hopefully better
test set error) by returning to the parameter setting at the point in time with the lowest validation set
error. Every time the error on the validation set improves, we store a copy of the model parameters.
When the training algorithm terminates, we return these parameters, rather than the latest parameters.
The algorithm terminates when no parameters have improved over the best recorded validation error
for some pre-specified number of iterations.
Algorithm
The early stopping meta-algorithm determines the best amount of time to train. This
meta-algorithm is a general strategy that works well with a variety of training algorithms and ways of
quantifying error on the validation set.
This strategy is known as early stopping. It is probably the most commonly used form of
regularization in deep learning. Its popularity is due both to its effectiveness and its simplicity.
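A runnable sketch of the meta-algorithm on a synthetic problem (the data, learning rate, and patience value are all made up):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=200)
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]   # train / validation split

w = np.zeros(20)
lr, patience = 0.01, 20
best_err, best_w, bad_steps = np.inf, w.copy(), 0

for step in range(5000):
    w -= lr * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)   # one training step
    val_err = np.mean((X_val @ w - y_val) ** 2)        # monitor the validation error
    if val_err < best_err:
        best_err, best_w, bad_steps = val_err, w.copy(), 0   # store a copy of the best parameters
    else:
        bad_steps += 1
        if bad_steps >= patience:                      # no improvement for `patience` steps: stop
            break

w = best_w                                             # return the best parameters, not the last ones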
One way to think of early stopping is as a very efficient hyperparameter selection algorithm.
In this view, the number of training steps is just another hyperparameter.
This implies that we store the trainable parameters periodically and track the validation error.
After training stops, we restore the trainable parameters from the point with the lowest validation
error, instead of keeping the last ones.
It can also be proven that in the case of a simple linear model with a quadratic error function
and simple gradient descent, early stopping is equivalent to L2 regularization.
After a certain point of time during training, for a model with extremely high representational
capacity, the training error continues to decrease but the validation error begins to increase (which we
referred to as overfitting). In such a scenario, a better idea would be to return to the point where
the validation error was the least. Thus, we keep evaluating the validation metric after each
epoch, and if there is any improvement, we store that parameter setting. Upon termination of training,
we return the most recently saved (i.e., best) parameters.
The idea of Early Stopping is that if the validation error doesn’t improve over a certain fixed
number of iterations, we terminate the algorithm. This effectively reduces the capacity of the model by
reducing the number of steps required to fit the model. The evaluation on the validation set can be done
either in parallel on a separate GPU or after each epoch. A drawback of weight decay was that we had
to manually tweak the weight decay coefficient, which, if chosen wrongly, can lead the model to local
minima by squashing the weight values too much. In Early Stopping, no such parameter needs to be
tweaked which reduces the number of hyperparameters that we need to tune.
However, since we are setting aside some part of the training data for validation, we are not using the
complete training set. So, once Early Stopping is done, a second phase of training can be done where
the complete training set is used. There are two choices here:
• Train from scratch for the same number of steps as in the Early Stopping case.
• Use the weights learned from the first phase of training and retrain using the complete data.
Other than lowering the number of training steps, it reduces the computational cost also by
regularizing the model without having to add additional penalty terms. It affects the optimization
procedure by restricting it to a small volume of the parameter space, in the neighbourhood of the initial
parameters. Suppose 𝛕 and ϵ represent the number of training iterations and the learning rate respectively.
Then, ϵ𝛕 effectively controls the capacity of the model. Intuitively, ϵ𝛕 behaves as the inverse of the
weight decay coefficient α: when ϵ𝛕 is small (or α is large), the reachable parameter space is small, and vice
versa. This equivalence holds for a linear model with a quadratic cost function and initial parameters
w(0) = 0. Taking the Taylor-series approximation of J(w) around the empirically optimal weights w*, the
gradient descent updates become:
w(𝛕) − w* = (I − ϵH)(w(𝛕−1) − w*)
Multiplying with Qᵀ on both sides and using the fact that QᵀQ = I (Q is orthonormal, with H = QΛQᵀ):
Qᵀw(𝛕) = [I − (I − ϵΛ)^𝛕] Qᵀw*
Comparing this with the weight-decay solution Qᵀw̃ = [I − α(Λ + αI)⁻¹] Qᵀw*, the two coincide when
(I − ϵΛ)^𝛕 = α(Λ + αI)⁻¹, which for small ϵλᵢ and λᵢ/α gives 𝛕 ≈ 1/(ϵα), i.e., ϵ𝛕 ≈ 1/α.
9. PARAMETER TYING AND PARAMETER SHARING
Example
Two models perform the same classification task (with the same set of classes) but
with different input data.
• Model A with parameters w(A).
• Model B with parameters w(B).
The two models map the input to two different but related outputs.
Some standard regularisers like l1 and l2 penalize model parameters for deviating
from the fixed value of zero. One of the side effects of Lasso or group-Lasso regularization in
learning deep neural networks is that many of the parameters may become zero.
The issues we face while using Lasso or group-Lasso can be countered by a regularizer based on
the group version of the ordered weighted ℓ₁ norm, known as group-OWL (GrOWL). GrOWL
promotes sparsity and simultaneously learns which parameters should share a similar value. GrOWL
has been effective in linear regression, identifying and coping with strongly correlated covariates.
Unlike standard sparsity-inducing regularizers (e.g., Lasso), GrOWL eliminates unimportant neurons
by setting all their weights to zero and explicitly identifies strongly correlated neurons by tying the
corresponding weights to a common value.
This ability of GrOWL motivates the following two-stage procedure:
(i) use GrOWL regularization during training to simultaneously identify significant neurons and
groups of parameters that should be tied together.
(ii) retrain the network, enforcing the structure unveiled in the previous phase, i.e., keeping only the
significant neurons and implementing the learned tying structure.
Let us imagine that the tasks are similar enough (perhaps with similar input and output
distributions) that we believe the model parameters should be close to each other: ∀i, wᵢ(A) should be
close to wᵢ(B).
We can leverage this information through regularization. Specifically, we can use a parameter
norm penalty of the form:
Ω(w(A), w(B)) = ‖w(A) − w(B)‖₂²
Here we used an L2 penalty, but other choices are also possible.
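A sketch of how such a tying penalty could enter a joint training objective (the weight vectors and the coefficient α are illustrative; loss_A and loss_B are assumed task losses, not defined here):

import numpy as np

def tying_penalty(w_a, w_b):
    # Omega(w_A, w_B) = ||w_A - w_B||_2^2, encouraging the two parameter sets to stay close.
    return np.sum((w_a - w_b) ** 2)

w_a = np.array([0.5, -1.2, 2.0])    # parameters of model A (made up)
w_b = np.array([0.4, -1.0, 2.3])    # parameters of model B (made up)
alpha = 0.01

# total_loss = loss_A(w_a) + loss_B(w_b) + alpha * tying_penalty(w_a, w_b)
print(tying_penalty(w_a, w_b))      # 0.14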
Parameter Sharing
Parameter sharing forces sets of parameters to be equal: we interpret the various models or
model components as sharing a unique set of parameters. As a result, we only need to store a subset of
the parameters in memory.
Suppose two models, A and B, perform a classification task on similar input and output
distributions. In such a case, we'd expect the parameters of the two models to be close to each other
as well. We could impose a norm penalty on the distance between the weights, but a more popular
method is to force the sets of parameters to be equal; this is the essence of parameter sharing. A
significant benefit here is that we need to store only a subset of
the parameters (e.g., storing only the parameters for model A instead of storing for both A and B),
which leads to significant memory savings.
10. SPARSE REPRESENTATIONS
We can place penalties even on the activation values of the units, which indirectly imposes a
penalty on the parameters. This leads to representational sparsity, where many of the activation values
of the units are zero. In the figure below, h is a representation of x, which is sparse. Representational
sparsity is obtained similarly to the way parameter sparsity is obtained, by placing a penalty on the
representation h instead of the weights.
In the first expression, we have an example of a sparsely parametrized linear regression
model. In the second, we have linear regression with a sparse representation h of the data x. That is, h
is a function of x that, in some sense, represents the information present in x, but does so with a sparse
vector.
Representational regularization is accomplished by the same sorts of mechanisms that we
have used in parameter regularization.
Norm penalty regularization of representations is performed by adding to the loss function J a
norm penalty on the representation. This penalty is denoted Ω(h). As before, we denote the
regularized loss function by J˜:
J˜(θ;X, y) = J(θ;X, y) + αΩ(h)
where α ∈ [0, ∞) weights the relative contribution of the norm penalty term, with larger values of α
corresponding to more regularization.
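A sketch of such a penalty on the representation rather than on the weights (the network, data and α are illustrative):

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32)) * 0.1
x = rng.normal(size=32)

h = np.maximum(0.0, W @ x)             # representation h of the input x
alpha = 0.01

task_loss = 0.0                        # stand-in for J(theta; X, y)
penalty = alpha * np.sum(np.abs(h))    # Omega(h) = ||h||_1 pushes activations towards zero
total_loss = task_loss + penalty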
Another idea could be to average the activation values across various examples and push this
average towards some target value. An example of obtaining representational sparsity by imposing a hard
constraint on the activation values is the orthogonal matching pursuit (OMP) algorithm, where a
representation h is learned for the input x by solving the constrained optimization problem:
arg min_{h, ‖h‖₀ < k} ‖x − Wh‖²
where ‖h‖₀ is the number of non-zero entries of h. This problem can be solved efficiently when W is
constrained to be orthogonal. This method is often called OMP-k, with the value of k specified to
indicate the number of non-zero features allowed.
11. BAGGING AND OTHER ENSEMBLE METHODS
The techniques which train multiple models and then combine their outputs (for example by voting
or averaging) to form the final prediction are called ensemble methods. The idea is that it is unlikely that
multiple models would all make the same errors on the test set.
Suppose that we have K regression models, with model i making an error ϵᵢ on each
example, where the ϵᵢ are drawn from a zero-mean multivariate normal distribution such that 𝔼(ϵᵢ²) = v and
𝔼(ϵᵢϵⱼ) = c. The error made by the ensemble on each example is then the average across all the models,
(1/K)∑ᵢϵᵢ.
The mean of this average error is 0 (as the mean of each individual ϵᵢ is 0). The expected squared
error of the ensemble is given by:
𝔼[((1/K)∑ᵢϵᵢ)²] = (1/K)v + ((K − 1)/K)c
In the case where the errors are perfectly correlated and c = v, the mean squared error reduces
to v, so the model averaging does not help at all. In the case where the errors are perfectly
uncorrelated and c = 0, the expected squared error of the ensemble is only (1/K)v. This means that the
expected squared error of the ensemble decreases linearly with the ensemble size. In other words, on
average, the ensemble will perform at least as well as any of its members, and if the members make
independent errors, the ensemble will perform significantly better than its members.
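A quick Monte-Carlo check of this analysis (K, v and c are made up; the covariance matrix has v on the diagonal and c elsewhere):

import numpy as np

K, v, c = 5, 1.0, 0.3
cov = np.full((K, K), c) + (v - c) * np.eye(K)     # E[eps_i^2] = v, E[eps_i eps_j] = c

rng = np.random.default_rng(0)
eps = rng.multivariate_normal(np.zeros(K), cov, size=200_000)
ensemble_err = eps.mean(axis=1)                    # (1/K) * sum_i eps_i for each example

print(np.mean(ensemble_err ** 2))                  # approximately 0.44
print(v / K + (K - 1) * c / K)                     # v/K + (K-1)c/K = 0.44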
There are various ensembling techniques. In the case of bagging (bootstrap aggregating), the
same training algorithm is used multiple times. K different datasets, each of the same size as the original,
are constructed by sampling from the original dataset with replacement (see figure below for clarity), and a
model is trained on each of them. Because of sampling with replacement, each constructed dataset misses
some of the original examples and contains duplicates of others; these
differences cause the differences in the predictions of the K models.
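A minimal sketch of constructing the bootstrap datasets (the dataset and K are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

K = 5
bootstrap_sets = []
for _ in range(K):
    idx = rng.integers(0, len(X), size=len(X))   # sample indices with replacement
    bootstrap_sets.append((X[idx], y[idx]))      # each resampled dataset trains one ensemble member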
Model averaging is an extremely powerful and reliable method for reducing generalization error. Its use
is usually discouraged when benchmarking
algorithms for scientific papers, because any machine learning algorithm can benefit substantially
from model averaging at the price of increased computation and memory. For this reason, benchmark
comparisons are usually made using a single model.
Not all techniques for constructing ensembles are designed to make the ensemble more
regularized than the individual models. For example, a technique called boosting constructs an
ensemble with higher capacity than the individual models. Boosting has been applied to build
ensembles of neural networks by incrementally adding neural networks to the ensemble. Boosting has
also been applied by interpreting an individual neural network as an ensemble and incrementally adding
hidden units to the network.
12. DROPOUT
You can imagine that if neurons are randomly dropped out of the network during training,
other neurons will have to step in and handle the representation required to make predictions for the
missing neurons. This is believed to result in multiple independent internal representations being
learned by the network.
The effect is that the network becomes less sensitive to the specific weights of neurons. This,
in turn, results in a network capable of better generalization and less likely to overfit the training data.
In a simplistic view, dropout trains the ensemble of all sub-networks formed by randomly
removing a few non-output units by multiplying their outputs by 0. For every training sample, a mask
is computed for all the input and hidden units independently. For clarification, suppose we
have h hidden units in some layer. Then, a mask for that layer refers to an h-dimensional vector with
values either 0 (remove the unit) or 1 (keep the unit).
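A minimal sketch of sampling and applying such a mask (the layer size and keep probability are illustrative):

import numpy as np

rng = np.random.default_rng(0)
h = np.maximum(0.0, rng.normal(size=10))             # activations of a layer with 10 hidden units
keep_prob = 0.8

mask = (rng.random(10) < keep_prob).astype(float)    # 1 = keep the unit, 0 = remove it
h_dropped = h * mask                                 # a different sub-network for every sample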
In bagging, the models are independent of each other, whereas in dropout, the different models
share parameters, with each model taking as input, a sample of the total parameters.
In bagging, each model is trained till convergence, but in dropout, each model is trained for just
one step and the parameter sharing makes sure that subsequent updates ensure better predictions
in the future.
At test time, we combine the predictions of all the models. In the case of bagging with K models, this
was given by the arithmetic mean. In case of dropout, the probability that a model is chosen is given by
p(μ), with μ denoting the mask vector. The prediction then becomes ∑ p(μ)p(y|x, μ). This is not
computationally feasible, and there’s a better method to compute this in one go, using the geometric
mean instead of the arithmetic mean.
We need to take care of two main things when working with the geometric mean: none of the sub-models
should assign probability zero to any event, and the resulting distribution must be renormalized so that it
sums to one.
The advantage of dropout is that the ensemble prediction can be approximated in one pass of the complete
model by dividing the weight values by the keep probability (the weight scaling inference rule). The motivation
behind this is to capture the right expected values from the output of each unit, i.e. the total expected
input to a unit at train time is equal to the total expected input at test time. A big advantage of dropout
then is that it doesn’t place any restriction on the type of model or training procedure to use.
Points to note:
• Dropout reduces the effective capacity of the model; hence, the model should be large enough
to begin with.
• For linear regression, dropout is equivalent to L² regularization with a different weight decay
coefficient for each input feature.
Biological Interpretation:
During sexual reproduction, genes are swapped between organisms, so a gene cannot rely on a fixed set of
partner genes being present and is forced to be useful in many different contexts. Similarly, the units in
dropout learn to perform well regardless of the presence of other hidden units, and in many different
contexts.
Adding noise in the hidden layer is more effective than adding noise in the input layer. For e.g. let’s
assume that some unit learns to detect a nose in a face recognition task. Now, if this unit is removed,
then some other unit either learns to redundantly detect a nose or associates some other feature (like
mouth) with recognising a face. Either way, the model learns to make more use of the information in
the input. On the other hand, adding noise to the input won't completely remove the nose information
unless the noise is so large as to remove most of the information from the input.
Another strategy to regularize deep neural networks is dropout. Dropout falls into noise
injection techniques and can be seen as noise injection into the hidden units of the network.
In practice, during training, some number of layer outputs are randomly ignored (dropped
out) with probability p.
During test time, all units are present, but they have been scaled down by p. This is happening
because after dropout, the next layers will receive lower values. In the test phase though, we are
keeping all units so the values will be a lot higher than expected. That’s why we need to scale them
down.
By using dropout, the same layer will alter its connectivity and will search for alternative
paths to convey the information in the next layer. As a result, each update to a layer during training is
performed with a different “view” of the configured layer. Conceptually, it approximates training a
large number of neural networks with different architectures in parallel.
"Dropping" values means temporarily removing them from the network for the current
forward pass, along with all its incoming and outgoing connections. Dropout has the effect of making
the training process noisy. The choice of the probability p depends on the architecture.
This conceptualization suggests that perhaps dropout breaks up situations where network
layers co-adapt to correct mistakes from prior layers, making the model more robust. It increases the
sparsity of the network and in general, encourages sparse representations! Sparsity can be added to
any model with hidden units and is a powerful tool in our regularization arsenal.
Many more variations of dropout have been proposed over the years. A few of them are briefly
described below; see paperswithcode.com for more details on each one, alongside the original paper
and code.
1. Inverted dropout also randomly drops some units with a probability p. The difference from
traditional dropout is that, during training, it also scales the retained activations by the inverse of the
keep probability, i.e., by 1/(1 − p). This keeps the expected activations at their original scale, so no
modification of the network is needed during the testing phase. The end result is similar to
traditional dropout (a sketch appears after this list).
2. Gaussian dropout: instead of dropping units during training, it multiplies the output of each unit by
(more often than not) Gaussian multiplicative noise with mean 1, so the expected activation is
unchanged and no rescaling is needed at test time.
3. DropConnect follows a slightly different approach. Instead of zeroing out random activations
(units), it zeros random weights during each forward pass. The weights are dropped with a
probability of 1−p. This essentially transforms a fully connected layer to a sparsely connected
layer. Mathematically we can represent DropConnect as: r=a((M∗W)v) where r is the layers’
output, v the input, W the weights and M a binary matrix. M is a mask that instantiates a
different connectivity pattern from each data sample. Usually, the mask is derived from each
training example. DropConnect can be seen as a generalization of Dropout to the full-
connection structure of a layer.
4. Variational Dropout: we use the same dropout mask on each timestep. This means that we
will drop the same network units each time. This was initially introduced for Recurrent
Neural Networks and it follows the same principles as variational inference.
5. Attention Dropout: popular over the past years because of the rapid advancements of
attention-based models like Transformers. As you may have guessed, we randomly drop
certain attention units with a probability p.
6. Adaptive Dropout: a technique that extends dropout by allowing the dropout probability to be
different for different units. The intuition is that there may be hidden units that can
individually make confident predictions for the presence or absence of an important feature or
combination of features.
7. Embedding Dropout: a strategy that performs dropout on the embedding matrix and is used
for a full forward and backward pass.
8. DropBlock: used in convolutional neural networks, it discards all units in a contiguous
region of the feature map.
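Sketches of two of these variants, inverted dropout (item 1) and DropConnect (item 3); the layer sizes, activation function, and probabilities are illustrative:

import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.7

# Inverted dropout: scale the retained activations up at training time.
h = np.maximum(0.0, rng.normal(size=10))         # layer activations
mask = (rng.random(10) < keep_prob).astype(float)
h_train = h * mask / keep_prob                   # scaled at train time; nothing changes at test time
h_test = h

# DropConnect: zero out individual weights rather than whole units.
W = rng.normal(size=(4, 8))                      # layer weights
v = rng.normal(size=8)                           # layer input
M = (rng.random(W.shape) < keep_prob).astype(float)   # binary mask over individual weights
r = np.tanh((M * W) @ v)                         # r = a((M * W) v)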
Stochastic Depth
Stochastic depth goes a step further. It drops entire network blocks while keeping the model intact
during testing. The most popular application is in large ResNets where we bypass certain blocks
through their skip connections.
In particular, Stochastic depth drops out each layer in the network that has residual connections
around it. It does so with a specified probability p that is a function of the layer depth.
Hₗ = ReLU(bₗ · fₗ(Hₗ₋₁) + id(Hₗ₋₁))
where bₗ is a Bernoulli random variable indicating whether the block is active: if bₗ = 0 the block is
inactive, and if bₗ = 1 it is active.
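A sketch of a residual block with stochastic depth (the block function and the survival probability are illustrative):

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16)) * 0.1

def residual_block(H_prev, survival_prob=0.8, training=True):
    f = np.maximum(0.0, W @ H_prev)                  # residual branch f_l(H_{l-1})
    if training:
        b = float(rng.random() < survival_prob)      # Bernoulli gate: keep or skip the whole block
        return np.maximum(0.0, b * f + H_prev)       # H_l = ReLU(b_l * f_l(H_{l-1}) + id(H_{l-1}))
    return np.maximum(0.0, survival_prob * f + H_prev)   # test time: scale the branch by its survival probability

H = residual_block(rng.normal(size=16))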
Generally, use a small dropout value of 20%-50% of neurons, with 20% providing a good starting
point. A probability too low has minimal effect, and a value too high results in under-learning by the
network.
Use a larger network. You are likely to get better performance when Dropout is used on a larger
network, giving the model more of an opportunity to learn independent representations.
Use Dropout on incoming (visible) as well as hidden units. Application of Dropout at each layer of
the network has shown good results.
Use a large learning rate with decay and a large momentum. Increase your learning rate by a factor of
10 to 100 and use a high momentum value of 0.9 or 0.99.
Constrain the size of network weights. A large learning rate can result in very large network weights.
Imposing a constraint on the size of network weights, such as max-norm regularization, with a size of
4 or 5 has been shown to improve results.
Dropout is extremely common in computer vision applications. Convolutional neural
networks are computer vision’s most widely used deep learning models. Dropout, on the other hand,
is not particularly useful on convolutional layers. This is because dropout tries to increase robustness
by making neurons redundant: the model should learn parameters that do not rely on any single neuron
being present. This is most helpful when a layer has a lot of parameters.
Key Takeaways:
• As a result, dropout layers in convolutional neural networks are often found after fully connected
layers but not after convolutional layers.
• Other regularising techniques, such as batch normalization in convolutional networks, have largely
overtaken dropout in recent years.
• Because convolutional layers have fewer parameters, they necessitate less regularisation.
Drawbacks of Dropout
A dropout network may take 2–3 times longer to train than a normal network. One way to reap the
benefits of dropout without slowing down training would be a regularizer that is virtually
equivalent to a dropout layer. For linear regression, such a regularizer is a modified variant of L²
regularisation; an analogous regularizer for more complex models has yet to be discovered.
13. ADVERSARIAL TRAINING
Deep learning has matched or exceeded human accuracy on some image recognition benchmarks, which
might lead us to believe that these models have acquired a human-level understanding of an image. However,
experimentally searching for an x′ (given an x), such that prediction made by the model changes, shows
otherwise. As shown in the image below, although the newly formed image (adversarial image) looks
almost exactly the same to a human, the model classifies it wrongly and that too with very high
confidence:
Adversarial examples have many implications, for example, in computer security, that are beyond the
scope of this chapter. However, they are interesting in the context of regularization because one can
reduce the error rate on the original i.i.d. test set via adversarial training—training on adversarially
perturbed examples from the training set.
Goodfellow et al. showed that one of the primary causes of these adversarial examples is
excessive linearity. Neural networks are built primarily out of linear building blocks, and in some
experiments the overall function they implement proves to be highly linear as a result. These linear
functions are easy to optimize. Unfortunately, the value of a linear function can change very rapidly if
it has numerous inputs. If we change each input by ϵ, then a linear function with weights w can change
by as much as ϵ‖w‖₁, which can be a very large amount if w is high-dimensional. Adversarial training
discourages this highly sensitive locally linear behavior by encouraging the network to be locally
constant in the neighbourhood of the training data. This can be seen as a way of explicitly introducing
a local constancy prior into supervised neural nets.
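A sketch of how such an adversarial input can be constructed with the fast gradient sign method, which follows directly from this linearity argument (the stand-in model is a simple logistic regressor; ϵ and the data are made up):

import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=100)      # stand-in linear model: logit = w . x
x = rng.normal(size=100)
y = 1.0                       # true label in {0, 1}
eps = 0.05

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

grad_x = (sigmoid(w @ x) - y) * w        # gradient of the cross-entropy loss w.r.t. the input
x_adv = x + eps * np.sign(grad_x)        # small step in the direction that most increases the loss

print(sigmoid(w @ x), sigmoid(w @ x_adv))   # the predicted probability of the true class drops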
Adversarial training helps to illustrate the power of using a large function family in
combination with aggressive regularization. Purely linear models, like logistic regression, are not able
to resist adversarial examples because they are forced to be linear. Neural networks are able to
represent functions that can range from nearly linear to nearly locally constant and thus have the
flexibility to capture linear trends in the training data while still learning to resist local perturbation.
Adversarial training refers to training on images which are adversarially generated and it has
been shown to reduce the error rate. The main factor behind the above-mentioned behaviour is the
linearity of the model (say y = Wx), caused by the main building blocks being primarily linear. Thus, a
small change of ϵ in the input causes a drastic change of Wϵ in the output. The idea of adversarial
training is to avoid this jump and induce the model to be locally constant in the neighbourhood of
the training data.
This can also be used in semi-supervised learning. For an unlabelled sample x, we can assign the label
ŷ (x) using our model. Then, we find an adversarial example, x′, such that y(x′)≠ŷ (x) (an adversary
found this way is called virtual adversarial example). The objective then is to assign the same class to
both x and x′. The idea behind this is that different classes are assumed to lie on disconnected
manifolds and a little push from one manifold shouldn’t land in any other manifold.
14. TANGENT DISTANCE, TANGENT PROP AND MANIFOLD TANGENT CLASSIFIER
Many ML models assume the data to lie on a low dimensional manifold to overcome the curse
of dimensionality. The inherent assumption which follows is that small perturbations that cause the
data to move along the manifold (it originally belonged to), shouldn’t lead to different class
predictions. The idea of the tangent distance algorithm is to perform nearest-neighbour classification
using, as the distance metric, the distance between the manifolds the points lie on. A manifold Mᵢ is
approximated by its tangent plane at xᵢ; hence, this technique needs the tangent vectors to be specified.
The tangent prop algorithm trains a neural network classifier, f(x), to be
invariant to known transformations that cause the input to move along its manifold. Local invariance
requires that ∇ₓf(x) be perpendicular to the known manifold tangent vectors v⁽ⁱ⁾ at x. This can be
achieved by adding a penalty term that minimizes the directional derivative of f(x) along each of the v⁽ⁱ⁾:
Ω(f) = Σᵢ ((∇ₓf(x))ᵀ v⁽ⁱ⁾)²
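A rough numerical sketch of this penalty using a finite-difference approximation of the directional derivative (the classifier, the tangent vectors and the step size are all illustrative):

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(1, 16)) * 0.1

def f(x):
    return np.tanh(W @ x)[0]                          # stand-in scalar classifier output

x = rng.normal(size=16)
tangents = [rng.normal(size=16) for _ in range(2)]    # stand-in manifold tangent vectors v^(i)
delta = 1e-4

# Penalize the squared directional derivative of f along each tangent direction.
penalty = sum(((f(x + delta * v) - f(x)) / delta) ** 2 for v in tangents)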
It is similar to data augmentation. In both cases, the user of the algorithm encodes his or her
prior knowledge of the task by specifying a set of transformations that should not alter the output of
the network. The difference is that in the case of dataset augmentation, the network is explicitly
trained to correctly classify distinct inputs that were created by applying more than an infinitesimal
amount of these transformations. Tangent propagation does not require explicitly visiting a new input
point. Instead, it analytically regularizes the model to resist perturbation in the directions
corresponding to the specified transformation.
Drawbacks.
First, it only regularizes the model to resist infinitesimal perturbation. Explicit dataset
augmentation confers resistance to larger perturbations.
Second, the infinitesimal approach poses difficulties for models based on rectified
linear units.
These models can only shrink their derivatives by turning units off or shrinking their weights.
They are not able to shrink their derivatives by saturating at a high value with large weights, as
sigmoid or tanh units can. Dataset augmentation works well with rectified linear units because
different subsets of rectified units can activate for different transformed versions of each original
input.
The manifold tangent classifier avoids the need for user-specified tangent vectors: it uses
autoencoders to estimate the manifold tangent vectors via unsupervised learning. These estimated tangent
vectors go beyond the classical invariants that arise out of the geometry of images (such as translation,
rotation and scaling) and include factors that must be learned because they are object-specific (such as
moving body parts).
Tangent propagation is also related to double backprop and adversarial training. Double
backprop regularizes the Jacobian to be small, while adversarial training finds inputs near the original
inputs and trains the model to produce the same output on these as on the original inputs. Tangent
propagation and dataset augmentation using manually specified transformations both require that the
model should be invariant to certain specified directions of change in the input.
Double backprop and adversarial training both require that the model should be invariant to
all directions of change in the input, so long as the change is small. Just as dataset augmentation is the
non-infinitesimal version of tangent propagation, adversarial training is the non-infinitesimal version
of double backprop.