Unit4 DL Final
_________________________________________________________________________________
REGULARIZATION
In Machine Learning, and more so in Deep Learning, overfitting is a major issue that occurs
during training. A model is considered to be overfitting the training data when the training error keeps
decreasing but the test error (or the generalisation error) starts increasing. At this point we tend to
believe that the model is memorising the training data rather than generalising to unseen data.
Linear models such as linear regression and logistic regression allow simple, straightforward
and effective regularization strategies. The idea here is to limit the capacity (the space of functions the
model can represent) of the model by adding a parameter norm penalty, Ω(θ), to the objective
function, J:
J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)
Here, α ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty term Ω
relative to the standard objective function J. Setting α to 0 results in no regularization, while larger values
of α correspond to more regularization.
One of the simplest and most common parameter norm penalties is L² parameter regularization,
commonly known as weight decay. Here, we obtain the regularized objective by adding the following
penalty term to the objective function:
Ω(θ) = ½‖w‖₂² = ½ wᵀw,  so that  J̃(w; X, y) = J(w; X, y) + (α/2) wᵀw
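As a minimal illustration (a sketch with made-up data; the names X, y, alpha and lr are illustrative, not from the text), weight decay simply adds α·w to the gradient of the original objective:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                       # made-up design matrix
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])        # made-up targets

alpha = 0.1      # weight decay coefficient
lr = 0.01        # learning rate
w = np.zeros(5)

for _ in range(2000):
    grad_J = X.T @ (X @ w - y) / len(y)   # gradient of the unregularized objective J(w)
    w -= lr * (grad_J + alpha * w)        # gradient of (alpha/2)*||w||^2 is alpha*w

print(w)   # shrunk towards zero relative to the unregularized least-squares solution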
Applying the second-order Taylor-series approximation of J (ignoring all terms of order greater than 2
in the Taylor-series expansion) at the point w* (where J(w; X, y) assumes its minimum value, i.e.,
∇J(w*) = 0), we get the following expression (the first-order gradient term vanishes):
Ĵ(w) = J(w*) + ½ (w − w*)ᵀ H (w − w*)
where H is the Hessian of J evaluated at w*. Then ∇Ĵ(w) = H(w − w*), since the first term is a constant
and the derivative of (w − w*)ᵀ H (w − w*) with respect to w is 2H(w − w*). The overall gradient of the
regularized objective (gradient of Ĵ plus gradient of αΩ(θ)) becomes:
αw̃ + H(w̃ − w*) = 0,  so that  w̃ = (H + αI)⁻¹ H w*
As α approaches 0, w̃ comes closer to w*. Since H is real and symmetric, it can be decomposed
into a diagonal matrix Λ of eigenvalues and an orthonormal basis of eigenvectors Q; that is, H = QΛQᵀ.
Substituting this decomposition gives w̃ = Q(Λ + αI)⁻¹ΛQᵀw*, so the value of w* is rescaled along the
eigenvectors of H: the component along eigenvector i is rescaled by λᵢ/(λᵢ + α), where λᵢ represents the
eigenvalue corresponding to that eigenvector.
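As a quick numerical check (a sketch with a made-up Hessian H and minimizer w*, not taken from the text), we can verify that (H + αI)⁻¹Hw* indeed rescales the components of w* along each eigenvector of H by λᵢ/(λᵢ + α):

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
H = A @ A.T + np.eye(4)                  # made-up symmetric positive-definite Hessian
w_star = np.array([1.0, -2.0, 0.5, 3.0]) # made-up unregularized minimizer
alpha = 0.5

w_tilde = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)   # (H + alpha I)^(-1) H w*

lam, Q = np.linalg.eigh(H)                                     # H = Q diag(lam) Q^T
rescaled = Q @ np.diag(lam / (lam + alpha)) @ Q.T @ w_star     # shrink each eigen-component

print(np.allclose(w_tilde, rescaled))   # True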
To see its application to machine learning, we can look at linear regression. The objective
function there is exactly quadratic (the sum of squared errors (Xw − y)ᵀ(Xw − y)), so the analysis
above applies directly, with L² regularization shrinking the weights along the eigenvectors of XᵀX.
Another common choice is L¹ regularization. Here, the parameter norm penalty is given by
Ω(θ) = ‖w‖₁, i.e., the sum of absolute values of the individual parameters. This makes the gradient of the
overall objective function:
∇J̃(w; X, y) = αsign(w) + ∇J(w; X, y)
Now, the term sign(w) creates some difficulty, as the gradient no longer scales linearly with w.
This leads to a few complexities in arriving at the optimal solution. Assuming a diagonal Hessian, the
solution for each parameter takes the form:
wᵢ = sign(w*ᵢ) max{ |w*ᵢ| − α/Hᵢ,ᵢ , 0 }
The interpretation of the max term is that the regularized weight should not cross zero: either |w*ᵢ| is
large enough that wᵢ is merely shifted towards zero by α/Hᵢ,ᵢ, or wᵢ is pushed exactly to zero, since the
absolute value function is not differentiable at zero.
Thus, L¹ regularization has the property of sparsity, which is its fundamental distinguishing feature
from L². Hence, L¹ regularization is used for feature selection, as in LASSO (Least Absolute Shrinkage
and Selection Operator).
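A small illustrative comparison (using scikit-learn's Lasso and Ridge on synthetic data; the dataset and penalty strengths are made up) of the sparsity induced by the L¹ penalty versus the L² penalty:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[[2, 7]] = [3.0, -1.5]                 # only two of the ten features matter
y = X @ true_w + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

print(np.round(lasso.coef_, 3))      # most coefficients are exactly zero (sparse)
print(np.round(ridge.coef_, 3))      # coefficients are shrunk but generally nonzero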
We know that to minimize any function under some constraints, we can construct a
generalized Lagrange function consisting of the original objective function along with a set of
penalties. Each penalty is a product between a coefficient, called a Karush-Kuhn-Tucker (KKT)
multiplier, and a function representing whether the constraint is satisfied. Suppose we wanted
Ω(θ) < k; then we could construct the following Lagrangian:
L(θ, α; X, y) = J(θ; X, y) + α(Ω(θ) − k)
This is now similar to the parameter norm penalty regularized objective function as both of
them encourage lower values of the norm. Thus, parameter norm penalties can be seen as imposing a
constraint: L² regularization, for example, confines the weights to an L² ball. A larger α implies a smaller
constraint region, since it pushes the norm to lower values (a smaller ball radius), and vice versa.
The idea of using explicit constraints rather than penalties matters for several reasons. Large penalties
might cause non-convex optimization algorithms to get stuck in local minima with small values of θ,
leading to the formation of so-called dead units, as the weights entering and leaving them are too small
to have an impact. Constraints don't force the weights to be near zero; they only require the weights to
remain inside a constraint region.
These are units that do not contribute much to the behavior of the function learned by the
network because the weights going into or out of them are all very small. When training with a
penalty on the norm of the weights, these configurations can be locally optimal, even if it is possible
to significantly reduce J by making the weights larger. Explicit constraints implemented by
re-projection can work much better in these cases because they do not encourage the weights to approach
the origin. Explicit constraints implemented by re-projection only have an effect when the weights
become large and attempt to leave the constraint region.
Another reason is that constraints induce higher stability. With high learning rates, a large
weight can produce a large gradient, which in turn produces an even larger weight; iterated, this can lead
to numerical overflow in the value of θ. Constraints, along with reprojection (onto the corresponding
ball), prevent the weights from becoming too large, thus maintaining stability.
Finally, explicit constraints with reprojection can be useful because they impose some stability on the
optimization procedure. When using high learning rates, it is possible to encounter a positive feedback
loop in which large weights induce large gradients which then induce a large update to the weights. If
these updates consistently increase the size of the weights, then θ rapidly moves away from
the origin until numerical overflow occurs. Explicit constraints with reprojection prevent this
feedback loop from continuing to increase the magnitude of the weights without bound. Hinton et al.
(2012c) recommend using constraints combined with a high learning rate to allow rapid exploration of
parameter space while maintaining some stability.
A final suggestion made by Hinton was to restrict the individual column norms of the weight
matrix rather than the Frobenius norm of the entire weight matrix, so as to prevent any hidden unit
from having a large weight. The idea here is that if we restrict the Frobenius norm, it doesn’t guarantee
that the individual weights would be small, just their norm. So, we might have large weights being
compensated by extremely small weights to make the overall norm small. Restricting each hidden unit
individually gives us the required guarantee.
In some cases, regularization is also needed for a problem to be properly defined. Linear problems
such as linear regression have closed-form solutions only when the relevant matrix (XᵀX) is invertible;
many forms of regularization correspond to inverting XᵀX + αI instead, which is guaranteed to be
invertible. It is also possible for a problem with no closed-form solution to be underdetermined. An example is
logistic regression applied to a problem where the classes are linearly separable. If a weight vector w
is able to achieve perfect classification, then 2w will also achieve perfect classification and higher
likelihood. An iterative optimization procedure like stochastic gradient descent will continually
increase the magnitude of w and, in theory, will never halt. In practice, a numerical implementation of
gradient descent will eventually reach sufficiently large weights to cause numerical overflow, at
which point its behavior will depend on how the programmer has decided to handle values that are not
real numbers.
Most forms of regularization are able to guarantee the convergence of iterative methods
applied to underdetermined problems. For example, weight decay will cause gradient descent to quit
increasing the magnitude of the weights when the slope of the likelihood is equal to the weight decay
coefficient.
The idea of using regularization to solve underdetermined problems extends beyond machine
learning. The same idea is useful for several basic linear algebra problems.
Regularization can also be used to solve underdetermined linear algebra problems. For example, one
definition of the Moore-Penrose pseudoinverse X⁺ of a matrix X is:
X⁺ = lim_{α→0} (XᵀX + αI)⁻¹Xᵀ
which can be interpreted as performing linear regression with weight decay and letting the regularization
coefficient shrink to zero.
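A small numerical sketch of this definition (the matrix is made up, and a small α stands in for the limit α → 0):

import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])     # wide matrix: X^T X is singular, the problem is underdetermined

alpha = 1e-8                        # a small alpha standing in for the limit alpha -> 0
approx_pinv = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T)

print(np.allclose(approx_pinv, np.linalg.pinv(X), atol=1e-5))   # True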
4. DATASET AUGMENTATION
The best way to make a machine learning model generalize better is to train it on more data.
Of course, in practice, the amount of data we have is limited. One way to get around this problem is to
create fake data and add it to the training set.
For some machine learning tasks, it is reasonably straightforward to create new fake data. This
approach is easiest for classification. A classifier needs to take a complicated, high dimensional input
x and summarize it with a single category identity y. This means that the main task facing a classifier
is to be invariant to a wide variety of transformations. We can generate new (x, y) pairs easily just by
transforming the x inputs in our training set.
Although data augmentation is not strictly a regularization method, it certainly has its place here.
This approach is not as readily applicable to many other tasks. For example, it is difficult to
generate new fake data for a density estimation task unless we have already solved the density
estimation problem. Dataset augmentation has been a particularly effective technique for a specific
classification problem: object recognition. Images are high dimensional and include an enormous
variety of factors of variation, many of which can be easily simulated. Operations like translating the
training images a few pixels in each direction can often greatly improve generalization, even if the
model has already been designed to be partially translation invariant by using the convolution and
pooling techniques.
One must be careful not to apply transformations that would change the correct class. For example,
optical character recognition tasks require recognizing the difference between ‘b’ and ‘d’ and the
difference between ‘6’ and ‘9’, so horizontal flips and 180◦ rotations are not appropriate ways of
augmenting datasets for these tasks.
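A minimal sketch of label-preserving augmentation in plain NumPy (the image, the shift range, and the decision to omit flips are illustrative; flips are left out because, as noted above, they can change the correct class for character data):

import numpy as np

def augment(image, rng, max_shift=2):
    # Translate the image by a few pixels in each direction (with wrap-around, for simplicity).
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(np.roll(image, dy, axis=0), dx, axis=1)

rng = np.random.default_rng(0)
image = rng.random((28, 28))                            # stand-in for a training image
new_examples = [augment(image, rng) for _ in range(5)]  # five new (x, y) pairs with the same label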
There are also transformations that we would like our classifiers to be invariant to, but which
are not easy to perform. For example, out-of-plane rotation can not be implemented as a simple
geometric operation on the input pixels. Dataset augmentation is effective for speech recognition tasks
as well. Injecting noise in the input to a neural network (Sietsma and Dow, 1991) can also be seen as a
form of data augmentation. For many classification and even some regression tasks, the task should
still be possible to solve even if small random noise is added to the input. Neural networks prove not
to be very robust to noise, however (Tang and Eliasmith, 2010). One way to improve the robustness
of neural networks is simply to train them with random noise applied to their inputs. Input noise
injection is part of some unsupervised learning algorithms such as the denoising autoencoder
(Vincent et al., 2008). Noise injection also works when the noise is applied to the hidden units, which
can be seen as doing dataset augmentation at multiple levels of abstraction. Poole et al. (2014)
recently showed that this approach can be highly effective provided that the magnitude of the noise is
carefully tuned. Dropout, a powerful regularization strategy discussed below, can be seen as a process of
constructing new inputs by multiplying by noise.
Data augmentation refers to the process of generating new training examples for our dataset.
More training data means lower model variance and hence lower generalization error. Simple as that. It can
also be seen as a form of noise injection into the training dataset.
Data augmentation can be achieved in many different ways. Let’s explore some of them.
Feature space augmentation (see the sketch below): Instead of transforming data in the input space as above, we
can apply transformations on the feature space. For example, an autoencoder might be used to
extract the latent representation. Noise can then be added in the latent representation which
results in a transformation of the original data point.
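A rough sketch of the idea (the "encoder" and "decoder" here are stand-in random linear maps, not a trained autoencoder; the noise scale is illustrative):

import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(8, 32)) * 0.1    # stand-in encoder: 32-dim input -> 8-dim latent
W_dec = rng.normal(size=(32, 8)) * 0.1    # stand-in decoder: latent -> input space

def feature_space_augment(x, noise_std=0.1):
    z = W_enc @ x                                       # latent representation
    z_noisy = z + noise_std * rng.normal(size=z.shape)  # add noise in the feature space
    return W_dec @ z_noisy                              # decode to a new input-space example

x = rng.normal(size=32)          # stand-in training example
x_aug = feature_space_augment(x)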
Having more data is the most desirable way to improve a machine learning model's performance. In
many cases, it is relatively easy to artificially generate data. For a classification task, we want the
model to be invariant to certain types of transformations, and we can generate the
corresponding (x, y) pairs by transforming the input x. But for certain problems, like density estimation,
we can't apply this directly unless we have already solved the density estimation problem.
However, caution needs to be maintained while augmenting data to make sure that the class doesn’t
change. For e.g., if the labels contain both “b” and “d”, then horizontal flipping would be a bad idea for
data augmentation. Adding random noise to the inputs is another form of data augmentation, while
adding noise to hidden units can be seen as doing data augmentation at multiple levels of abstraction.
Finally, when comparing machine learning models, we need to evaluate them using the same hand-
designed data augmentation schemes or else it might happen that algorithm A outperforms algorithm
B, just because it was trained on a dataset which had more / better data augmentation.
5. NOISE ROBUSTNESS
For some models, the addition of noise with infinitesimal variance at the input is equivalent to imposing
a penalty on the norm of the weights. Noise added to hidden units is very important and is discussed under Dropout.
Noise can even be added to the weights. This has several interpretations. One of them is that adding
noise to weights is a stochastic implementation of Bayesian inference over the weights, where the
weights are considered to be uncertain, with the uncertainty being modelled by a probability
distribution. It is also interpreted as a more traditional form of regularization by ensuring stability in
learning.
For example, in the linear regression case, we want to learn a mapping ŷ(x) for each feature vector x by
minimizing the least-squares cost between the model prediction ŷ(x) and the true value y:
J = E_{p(x,y)}[(ŷ(x) − y)²]
The training set consists of m labelled examples {(x⁽¹⁾, y⁽¹⁾), ..., (x⁽ᵐ⁾, y⁽ᵐ⁾)}.
Now, suppose a zero-mean Gaussian random noise ϵ_W, with variance ηI, is added to the weights.
We still want to learn the appropriate mapping by reducing the mean squared error. Minimizing the loss
after adding noise to the weights turns out to be equivalent to adding a regularization term,
ηE[‖∇_W ŷ(x)‖²], which ensures that small perturbations in the weight values don't change the
predictions much, thus stabilising training. Despite the injection of noise, we are still interested in
minimizing the squared error of the output of the network. The objective function thus becomes:
J̃_W = E_{p(x,y)}[(ŷ_{ϵ_W}(x) − y)²] ≈ E_{p(x,y)}[(ŷ(x) − y)²] + ηE[‖∇_W ŷ(x)‖²]
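A minimal training-step sketch of this kind of weight-noise injection (the model, data, learning rate and noise variance η are all illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5)

w = np.zeros(5)
lr, eta = 0.01, 0.01                 # learning rate and noise variance (made up)

for _ in range(1000):
    eps = np.sqrt(eta) * rng.normal(size=w.shape)   # eps_W ~ N(0, eta I)
    grad = X.T @ (X @ (w + eps) - y) / len(y)       # gradient of the squared error at the noisy weights
    w -= lr * grad                                  # update the underlying (noise-free) weights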
Sometimes we may have wrong output labels, in which case maximizing p(y | x) may not be a good
idea. In such a case, we can add noise to the labels by assuming that the label is correct with probability
(1 − ϵ) and incorrect with probability ϵ; in the latter case, all the other labels are equally likely. Label
smoothing regularizes a model with k softmax outputs by replacing the hard 0/1 classification targets with
targets of (1 − ϵ) for the correct class and ϵ/(k − 1) for each of the remaining (k − 1) classes.
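A small sketch of the smoothed targets (k, ϵ and the label index are illustrative):

import numpy as np

def smooth_labels(label_index, k, eps=0.1):
    # 1 - eps on the correct class, eps/(k-1) on each of the remaining k-1 classes.
    target = np.full(k, eps / (k - 1))
    target[label_index] = 1.0 - eps
    return target

print(smooth_labels(label_index=2, k=5))   # [0.025 0.025 0.9 0.025 0.025]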
6. SEMI-SUPERVISED LEARNING
In the paradigm of semi-supervised learning, both unlabeled examples from P(x) and labeled
examples from P(x, y) are used to estimate P(y | x) or predict y from x.
P(x, y) denotes the joint distribution of x and y, i.e., corresponding to a training sample x, we
have a label y. P(x) denotes the marginal distribution of x, i.e., just the training examples without any
labels. In semi-supervised learning, we use both P(x, y) (some labelled samples) and P(x) (unlabelled
samples) to estimate P(y | x) (since we want to predict the class, given the training sample). We want to
learn some representation h = f(x) such that samples which are closer in the input space have similar
representations, so that a linear classifier in the new space achieves better generalization error.
Instead of separating the supervised and unsupervised criteria, we can instead have a
generative model of P(x) or P(x, y) which shares parameters with the discriminative model P(y | x).
The idea is to combine the unsupervised/generative criterion with the supervised criterion to express a
prior belief that the structure of P(x) or P(x, y) is connected to the structure of P(y | x), which is
expressed by the shared parameters.
7. MULTITASK LEARNING
The idea is to improve the generalization error by pooling together examples from multiple
tasks. Similar to how more data leads to more generalization, using a part of the model for different
tasks constrains that part to learn good values.
The diagram above illustrates a very common form of multi-task learning, in which different
supervised tasks (predicting y⁽ⁱ⁾ given x) share the same input x, as well as some intermediate-level
representation h (shared) capturing a common pool of factors. The model can generally be divided into
two kinds of parts and associated parameters:
1. Task-specific parameters (which only benefit from the examples of their task to achieve good
generalization). These are the upper layers of the neural network in the figure.
2. Generic parameters, shared across all the tasks (which benefit from the pooled data of all the tasks).
These are the lower layers of the neural network in the figure.
Improved generalization and generalization error bounds can be achieved because of the
shared parameters, for which statistical strength can be greatly improved (in proportion with the
increased number of examples for the shared parameters, compared to the scenario of single-task
models). Of course this will happen only if some assumptions about the statistical relationship
between the different tasks are valid, meaning that there is something shared across some of the tasks.
Multitask learning leads to better generalization when there is actually some relationship
between the tasks, which actually happens in the context of Deep Learning where some of the factors,
which explain the variation observed in the data, are shared across different tasks.
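A minimal architectural sketch of this kind of hard parameter sharing (layer sizes and the two task heads are illustrative; plain NumPy is used just to keep the idea self-contained):

import numpy as np

rng = np.random.default_rng(0)

W_shared = rng.normal(size=(16, 32)) * 0.1            # generic (shared) lower layer
W_task = {"task_a": rng.normal(size=(3, 16)) * 0.1,   # task-specific upper layer: 3-way classifier
          "task_b": rng.normal(size=(1, 16)) * 0.1}   # task-specific upper layer: scalar regressor

def forward(x, task):
    h = np.tanh(W_shared @ x)    # shared intermediate representation h
    return W_task[task] @ h      # task-specific output

x = rng.normal(size=32)
print(forward(x, "task_a").shape, forward(x, "task_b").shape)   # (3,) (1,)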
8. EARLY STOPPING
Early stopping is one of the most commonly used strategies because it is very simple and
quite effective. It refers to stopping the training when the validation error starts to rise even though the
training error is still decreasing.
When training large models with sufficient representational capacity to overfit the task, we
often observe that training error decreases steadily over time, but validation set error begins to rise
again.
This means we can obtain a model with better validation set error (and thus, hopefully better
test set error) by returning to the parameter setting at the point in time with the lowest validation set
error. Every time the error on the validation set improves, we store a copy of the model parameters.
When the training algorithm terminates, we return these parameters, rather than the latest parameters.
The algorithm terminates when no parameters have improved over the best recorded validation error
for some pre-specified number of iterations.
Algorithm
The early stopping meta-algorithm determines the best amount of time to train. This
meta-algorithm is a general strategy that works well with a variety of training algorithms and ways of
quantifying error on the validation set.
This strategy is known as early stopping. It is probably the most commonly used form of
regularization in deep learning. Its popularity is due both to its effectiveness and its simplicity.
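A runnable sketch of the meta-algorithm on a synthetic problem (the data, learning rate, and patience value are all made up):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=200)
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]   # train / validation split

w = np.zeros(20)
lr, patience = 0.01, 20
best_err, best_w, bad_steps = np.inf, w.copy(), 0

for step in range(5000):
    w -= lr * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)   # one training step
    val_err = np.mean((X_val @ w - y_val) ** 2)        # monitor the validation error
    if val_err < best_err:
        best_err, best_w, bad_steps = val_err, w.copy(), 0   # store a copy of the best parameters
    else:
        bad_steps += 1
        if bad_steps >= patience:                      # no improvement for `patience` steps: stop
            break

w = best_w                                             # return the best parameters, not the last ones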
One way to think of early stopping is as a very efficient hyperparameter selection algorithm.
In this view, the number of training steps is just another hyperparameter.
This implies that we store the trainable parameters periodically and track the validation error.
After training stops, we restore the trainable parameters from the point with the lowest validation
error, instead of keeping the last ones.
It can also be proven that in the case of a simple linear model with a quadratic error function
and simple gradient descent, early stopping is equivalent to L2 regularization.
After a certain point of time during training, for a model with extremely high representational
capacity, the training error continues to decrease but the validation error begins to increase (which we
referred to as overfitting). In such a scenario, a better idea would be to return to the point where
the validation error was the least. Thus, we keep evaluating the validation metric after each
epoch, and if there is any improvement, we store that parameter setting. Upon termination of training,
we return the most recently saved (i.e., best) parameters.
The idea of Early Stopping is that if the validation error doesn’t improve over a certain fixed
number of iterations, we terminate the algorithm. This effectively reduces the capacity of the model by
reducing the number of steps required to fit the model. The evaluation on the validation set can be done
either in parallel on a separate GPU or after each epoch. A drawback of weight decay was that we had
to manually tweak the weight decay coefficient, which, if chosen wrongly, can lead the model to local
minima by squashing the weight values too much. In Early Stopping, no such parameter needs to be
tweaked which reduces the number of hyperparameters that we need to tune.
However, since we are setting aside some part of the training data for validation, we are not using the
complete training set. So, once Early Stopping is done, a second phase of training can be done where
the complete training set is used. There are two choices here:
• Train from scratch for the same number of steps as in the Early Stopping case.
• Use the weights learned from the first phase of training and retrain using the complete data.
Other than lowering the number of training steps, it reduces the computational cost also by
regularizing the model without having to add additional penalty terms. It affects the optimization
procedure by restricting it to a small volume of the parameter space, in the neighbourhood of the initial
parameters. Suppose 𝛕 and ϵ represent the number of training iterations and the learning rate respectively.
Then, ϵ𝛕 effectively controls the capacity of the model. Intuitively, ϵ𝛕 behaves as the inverse of the
weight decay coefficient α: when ϵ𝛕 is small (or α is large), the reachable parameter space is small, and vice
versa. This equivalence holds for a linear model with a quadratic cost function and initial parameters
w(0) = 0. Taking the Taylor-series approximation of J(w) around the empirically optimal weights w*, the
gradient descent updates become:
w(𝛕) − w* = (I − ϵH)(w(𝛕−1) − w*)
Multiplying with Qᵀ on both sides and using the fact that QᵀQ = I (Q is orthonormal, with H = QΛQᵀ):
Qᵀw(𝛕) = [I − (I − ϵΛ)^𝛕] Qᵀw*
Comparing this with the weight-decay solution Qᵀw̃ = [I − α(Λ + αI)⁻¹] Qᵀw*, the two coincide when
(I − ϵΛ)^𝛕 = α(Λ + αI)⁻¹, which for small ϵλᵢ and λᵢ/α gives 𝛕 ≈ 1/(ϵα), i.e., ϵ𝛕 ≈ 1/α.
9. PARAMETER TYING AND PARAMETER SHARING
Example
Two models perform the same classification task (with the same set of classes) but
with different input data.
• Model A with parameters w(A).
• Model B with parameters w(B).
The two models map the input to two different but related outputs.
Some standard regularisers like l1 and l2 penalize model parameters for deviating
from the fixed value of zero. One of the side effects of Lasso or group-Lasso regularization in
learning deep neural networks is that many of the parameters may become zero.
The issues we face while using Lasso or group-Lasso can be countered by a regularizer based on
the group version of the ordered weighted ℓ₁ norm, known as group-OWL (GrOWL). GrOWL
promotes sparsity and simultaneously learns which parameters should share a similar value. GrOWL
has been effective in linear regression, identifying and coping with strongly correlated covariates.
Unlike standard sparsity-inducing regularizers (e.g., Lasso), GrOWL eliminates unimportant neurons
by setting all their weights to zero and explicitly identifies strongly correlated neurons by tying the
corresponding weights to a common value.
This ability of GrOWL motivates the following two-stage procedure:
(i) use GrOWL regularization during training to simultaneously identify significant neurons and
groups of parameters that should be tied together.
(ii) retrain the network, enforcing the structure unveiled in the previous phase, i.e., keeping only the
significant neurons and implementing the learned tying structure.
Let us imagine that the tasks are similar enough (perhaps with similar input and output
distributions) that we believe the model parameters should be close to each other: ∀i, wᵢ(A) should be
close to wᵢ(B).
We can leverage this information through regularization. Specifically, we can use a parameter
norm penalty of the form:
Ω(w(A), w(B)) = ‖w(A) − w(B)‖₂²
Here we used an L2 penalty, but other choices are also possible.
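A sketch of how such a tying penalty could enter a joint training objective (the weight vectors and the coefficient α are illustrative; loss_A and loss_B are assumed task losses, not defined here):

import numpy as np

def tying_penalty(w_a, w_b):
    # Omega(w_A, w_B) = ||w_A - w_B||_2^2, encouraging the two parameter sets to stay close.
    return np.sum((w_a - w_b) ** 2)

w_a = np.array([0.5, -1.2, 2.0])    # parameters of model A (made up)
w_b = np.array([0.4, -1.0, 2.3])    # parameters of model B (made up)
alpha = 0.01

# total_loss = loss_A(w_a) + loss_B(w_b) + alpha * tying_penalty(w_a, w_b)
print(tying_penalty(w_a, w_b))      # 0.14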
Parameter Sharing
Parameter sharing forces sets of parameters to be equal: we interpret the various models or
model components as sharing a unique set of parameters. As a result, we only need to store a subset of
the parameters in memory.
Suppose two models, A and B, perform a classification task on similar input and output
distributions. In such a case, we'd expect the parameters of the two models to be close to each other
as well. We could impose a norm penalty on the distance between the weights, but a more popular
method is to force the sets of parameters to be equal; this is the essence of parameter sharing. A
significant benefit here is that we need to store only a subset of
the parameters (e.g., storing only the parameters for model A instead of storing for both A and B),
which leads to significant memory savings.
10. SPARSE REPRESENTATIONS
We can place penalties even on the activation values of the units, which indirectly imposes a
penalty on the parameters. This leads to representational sparsity, where many of the activation values
of the units are zero. In the figure below, h is a representation of x, which is sparse. Representational
sparsity is obtained similarly to the way parameter sparsity is obtained, by placing a penalty on the
representation h instead of the weights.
In the first expression, we have an example of a sparsely parametrized linear regression
model. In the second, we have linear regression with a sparse representation h of the data x. That is, h
is a function of x that, in some sense, represents the information present in x, but does so with a sparse
vector.
Representational regularization is accomplished by the same sorts of mechanisms that we
have used in parameter regularization.
Norm penalty regularization of representations is performed by adding to the loss function J a
norm penalty on the representation. This penalty is denoted Ω(h). As before, we denote the
regularized loss function by J˜:
J˜(θ;X, y) = J(θ;X, y) + αΩ(h)
where α ∈ [0, ∞) weights the relative contribution of the norm penalty term, with larger values of α
corresponding to more regularization.
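A sketch of such a penalty on the representation rather than on the weights (the network, data and α are illustrative):

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32)) * 0.1
x = rng.normal(size=32)

h = np.maximum(0.0, W @ x)             # representation h of the input x
alpha = 0.01

task_loss = 0.0                        # stand-in for J(theta; X, y)
penalty = alpha * np.sum(np.abs(h))    # Omega(h) = ||h||_1 pushes activations towards zero
total_loss = task_loss + penalty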
Another idea could be to average the activation values across various examples and push this
average towards some target value. An example of obtaining representational sparsity by imposing a hard
constraint on the activation values is the orthogonal matching pursuit (OMP) algorithm, where a
representation h is learned for the input x by solving the constrained optimization problem:
arg min_{h, ‖h‖₀ < k} ‖x − Wh‖²
where ‖h‖₀ is the number of non-zero entries of h. This problem can be solved efficiently when W is
constrained to be orthogonal. This method is often called OMP-k, with the value of k specified to
indicate the number of non-zero features allowed.
11. BAGGING AND OTHER ENSEMBLE METHODS
The techniques which train multiple models and then combine their outputs (for example by voting
or averaging) to form the final prediction are called ensemble methods. The idea is that it is unlikely that
multiple models would all make the same errors on the test set.
Suppose that we have K regression models, with model i making an error ϵᵢ on each
example, where the ϵᵢ are drawn from a zero-mean multivariate normal distribution such that 𝔼(ϵᵢ²) = v and
𝔼(ϵᵢϵⱼ) = c. The error made by the ensemble on each example is then the average across all the models,
(1/K)∑ᵢϵᵢ.
The mean of this average error is 0 (as the mean of each individual ϵᵢ is 0). The expected squared
error of the ensemble is given by:
𝔼[((1/K)∑ᵢϵᵢ)²] = (1/K)v + ((K − 1)/K)c
In the case where the errors are perfectly correlated and c = v, the mean squared error reduces
to v, so the model averaging does not help at all. In the case where the errors are perfectly
uncorrelated and c = 0, the expected squared error of the ensemble is only (1/K)v. This means that the
expected squared error of the ensemble decreases linearly with the ensemble size. In other words, on
average, the ensemble will perform at least as well as any of its members, and if the members make
independent errors, the ensemble will perform significantly better than its members.
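A quick Monte-Carlo check of this analysis (K, v and c are made up; the covariance matrix has v on the diagonal and c elsewhere):

import numpy as np

K, v, c = 5, 1.0, 0.3
cov = np.full((K, K), c) + (v - c) * np.eye(K)     # E[eps_i^2] = v, E[eps_i eps_j] = c

rng = np.random.default_rng(0)
eps = rng.multivariate_normal(np.zeros(K), cov, size=200_000)
ensemble_err = eps.mean(axis=1)                    # (1/K) * sum_i eps_i for each example

print(np.mean(ensemble_err ** 2))                  # approximately 0.44
print(v / K + (K - 1) * c / K)                     # v/K + (K-1)c/K = 0.44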
There are various ensembling techniques. In the case of bagging (bootstrap aggregating), the
same training algorithm is used multiple times. K different datasets, each of the same size as the original,
are constructed by sampling from the original dataset with replacement (see figure below for clarity), and a
model is trained on each of them. Because of sampling with replacement, each constructed dataset misses
some of the original examples and contains duplicates of others; these
differences cause the differences in the predictions of the K models.
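A minimal sketch of constructing the bootstrap datasets (the dataset and K are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

K = 5
bootstrap_sets = []
for _ in range(K):
    idx = rng.integers(0, len(X), size=len(X))   # sample indices with replacement
    bootstrap_sets.append((X[idx], y[idx]))      # each resampled dataset trains one ensemble member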
Model averaging is an extremely powerful and reliable method for reducing generalization error. Its use
is usually discouraged when benchmarking
algorithms for scientific papers, because any machine learning algorithm can benefit substantially
from model averaging at the price of increased computation and memory. For this reason, benchmark
comparisons are usually made using a single model.
Not all techniques for constructing ensembles are designed to make the ensemble more
regularized than the individual models. For example, a technique called boosting constructs an
ensemble with higher capacity than the individual models. Boosting has been applied to build
ensembles of neural networks by incrementally adding neural networks to the ensemble. Boosting has
also been applied by interpreting an individual neural network as an ensemble and incrementally adding
hidden units to the network.
12. DROPOUT
You can imagine that if neurons are randomly dropped out of the network during training,
other neurons will have to step in and handle the representation required to make predictions for the
missing neurons. This is believed to result in multiple independent internal representations being
learned by the network.
The effect is that the network becomes less sensitive to the specific weights of neurons. This,
in turn, results in a network capable of better generalization and less likely to overfit the training data.
In a simplistic view, dropout trains the ensemble of all sub-networks formed by randomly
removing a few non-output units by multiplying their outputs by 0. For every training sample, a mask
is computed for all the input and hidden units independently. For clarification, suppose we
have h hidden units in some layer. Then, a mask for that layer refers to an h-dimensional vector with
values either 0 (remove the unit) or 1 (keep the unit).
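A minimal sketch of sampling and applying such a mask (the layer size and keep probability are illustrative):

import numpy as np

rng = np.random.default_rng(0)
h = np.maximum(0.0, rng.normal(size=10))             # activations of a layer with 10 hidden units
keep_prob = 0.8

mask = (rng.random(10) < keep_prob).astype(float)    # 1 = keep the unit, 0 = remove it
h_dropped = h * mask                                 # a different sub-network for every sample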
In bagging, the models are independent of each other, whereas in dropout, the different models
share parameters, with each model taking as input, a sample of the total parameters.
In bagging, each model is trained till convergence, but in dropout, each model is trained for just
one step and the parameter sharing makes sure that subsequent updates ensure better predictions
in the future.
At test time, we combine the predictions of all the models. In the case of bagging with K models, this
was given by the arithmetic mean. In case of dropout, the probability that a model is chosen is given by
p(μ), with μ denoting the mask vector. The prediction then becomes ∑ p(μ)p(y|x, μ). This is not
computationally feasible, and there’s a better method to compute this in one go, using the geometric
mean instead of the arithmetic mean.
We need to take care of two main things when working with the geometric mean: none of the sub-models
should assign probability zero to any event, and the resulting distribution must be renormalized so that it
sums to one.
The advantage of dropout is that the ensemble prediction can be approximated in one pass of the complete
model by dividing the weight values by the keep probability (the weight scaling inference rule). The motivation
behind this is to capture the right expected values from the output of each unit, i.e. the total expected
input to a unit at train time is equal to the total expected input at test time. A big advantage of dropout
then is that it doesn’t place any restriction on the type of model or training procedure to use.
Points to note:
• Dropout reduces the effective capacity of the model; hence, the model should be large enough
to begin with.
• For linear regression, dropout is equivalent to L² regularization with a different weight decay
coefficient for each input feature.
Biological Interpretation:
During sexual reproduction, genes are swapped between organisms, so a gene cannot rely on a fixed set of
partner genes being present and is forced to be useful in many different contexts. Similarly, the units in
dropout learn to perform well regardless of the presence of other hidden units, and in many different
contexts.
Adding noise in the hidden layer is more effective than adding noise in the input layer. For e.g. let’s
assume that some unit learns to detect a nose in a face recognition task. Now, if this unit is removed,
then some other unit either learns to redundantly detect a nose or associates some other feature (like
mouth) with recognising a face. Either way, the model learns to make more use of the information in
the input. On the other hand, adding noise to the input won't completely remove the nose information
unless the noise is so large as to remove most of the information from the input.
Another strategy to regularize deep neural networks is dropout. Dropout falls into noise
injection techniques and can be seen as noise injection into the hidden units of the network.
In practice, during training, some number of layer outputs are randomly ignored (dropped
out) with probability p.
During test time, all units are present, but they have been scaled down by p. This is happening
because after dropout, the next layers will receive lower values. In the test phase though, we are
keeping all units so the values will be a lot higher than expected. That’s why we need to scale them
down.
By using dropout, the same layer will alter its connectivity and will search for alternative
paths to convey the information in the next layer. As a result, each update to a layer during training is
performed with a different “view” of the configured layer. Conceptually, it approximates training a
large number of neural networks with different architectures in parallel.
"Dropping" values means temporarily removing them from the network for the current
forward pass, along with all its incoming and outgoing connections. Dropout has the effect of making
the training process noisy. The choice of the probability p depends on the architecture.
This conceptualization suggests that perhaps dropout breaks up situations where network
layers co-adapt to correct mistakes from prior layers, making the model more robust. It increases the
sparsity of the network and in general, encourages sparse representations! Sparsity can be added to
any model with hidden units and is a powerful tool in our regularization arsenal.
Many more variations of dropout have been proposed over the years. A few of them are briefly
described below; see paperswithcode.com for more details on each one, alongside the original paper
and code.
1. Inverted dropout also randomly drops some units with a probability p. The difference from
traditional dropout is that, during training, it also scales the retained activations by the inverse of the
keep probability, i.e., by 1/(1 − p). This keeps the expected activations at their original scale, so no
modification of the network is needed during the testing phase. The end result is similar to
traditional dropout (a sketch appears after this list).
2. Gaussian dropout: instead of dropping units during training, it multiplies the output of each unit by
(more often than not) Gaussian multiplicative noise with mean 1, so the expected activation is
unchanged and no rescaling is needed at test time.
3. DropConnect follows a slightly different approach. Instead of zeroing out random activations
(units), it zeros random weights during each forward pass. The weights are dropped with a
probability of 1−p. This essentially transforms a fully connected layer to a sparsely connected
layer. Mathematically we can represent DropConnect as: r=a((M∗W)v) where r is the layers’
output, v the input, W the weights and M a binary matrix. M is a mask that instantiates a
different connectivity pattern from each data sample. Usually, the mask is derived from each
training example. DropConnect can be seen as a generalization of Dropout to the full-
connection structure of a layer.
4. Variational Dropout: we use the same dropout mask on each timestep. This means that we
will drop the same network units each time. This was initially introduced for Recurrent
Neural Networks and it follows the same principles as variational inference.
5. Attention Dropout: popular over the past years because of the rapid advancements of
attention-based models like Transformers. As you may have guessed, we randomly drop
certain attention units with a probability p.
6. Adaptive Dropout: a technique that extends dropout by allowing the dropout probability to be
different for different units. The intuition is that there may be hidden units that can
individually make confident predictions for the presence or absence of an important feature or
combination of features.
7. Embedding Dropout: a strategy that performs dropout on the embedding matrix and is used
for a full forward and backward pass.
8. DropBlock: used in convolutional neural networks, it discards all units in a contiguous
region of the feature map.
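Sketches of two of these variants, inverted dropout (item 1) and DropConnect (item 3); the layer sizes, activation function, and probabilities are illustrative:

import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.7

# Inverted dropout: scale the retained activations up at training time.
h = np.maximum(0.0, rng.normal(size=10))         # layer activations
mask = (rng.random(10) < keep_prob).astype(float)
h_train = h * mask / keep_prob                   # scaled at train time; nothing changes at test time
h_test = h

# DropConnect: zero out individual weights rather than whole units.
W = rng.normal(size=(4, 8))                      # layer weights
v = rng.normal(size=8)                           # layer input
M = (rng.random(W.shape) < keep_prob).astype(float)   # binary mask over individual weights
r = np.tanh((M * W) @ v)                         # r = a((M * W) v)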
Stochastic Depth
Stochastic depth goes a step further. It drops entire network blocks while keeping the model intact
during testing. The most popular application is in large ResNets where we bypass certain blocks
through their skip connections.
In particular, Stochastic depth drops out each layer in the network that has residual connections
around it. It does so with a specified probability p that is a function of the layer depth.
Hₗ = ReLU(bₗ · fₗ(Hₗ₋₁) + id(Hₗ₋₁))
where bₗ is a Bernoulli random variable indicating whether the block is active: if bₗ = 0 the block is
inactive, and if bₗ = 1 it is active.
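A sketch of a residual block with stochastic depth (the block function and the survival probability are illustrative):

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16)) * 0.1

def residual_block(H_prev, survival_prob=0.8, training=True):
    f = np.maximum(0.0, W @ H_prev)                  # residual branch f_l(H_{l-1})
    if training:
        b = float(rng.random() < survival_prob)      # Bernoulli gate: keep or skip the whole block
        return np.maximum(0.0, b * f + H_prev)       # H_l = ReLU(b_l * f_l(H_{l-1}) + id(H_{l-1}))
    return np.maximum(0.0, survival_prob * f + H_prev)   # test time: scale the branch by its survival probability

H = residual_block(rng.normal(size=16))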
Generally, use a small dropout value of 20%-50% of neurons, with 20% providing a good starting
point. A probability too low has minimal effect, and a value too high results in under-learning by the
network.
Use a larger network. You are likely to get better performance when Dropout is used on a larger
network, giving the model more of an opportunity to learn independent representations.
Use Dropout on incoming (visible) as well as hidden units. Application of Dropout at each layer of
the network has shown good results.
Use a large learning rate with decay and a large momentum. Increase your learning rate by a factor of
10 to 100 and use a high momentum value of 0.9 or 0.99.
Constrain the size of network weights. A large learning rate can result in very large network weights.
Imposing a constraint on the size of network weights, such as max-norm regularization, with a size of
4 or 5 has been shown to improve results.
Dropout is extremely common in computer vision applications. Convolutional neural
networks are computer vision’s most widely used deep learning models. Dropout, on the other hand,
is not particularly useful on convolutional layers. This is because dropout tries to increase robustness
by making neurons redundant: the model should learn parameters that do not rely on any single neuron
being present. This is most helpful when a layer has a lot of parameters.
Key Takeaways:
• As a result, dropout layers in convolutional neural networks are often found after fully connected
layers but not after convolutional layers.
• Other regularising techniques, such as batch normalization in convolutional networks, have largely
overtaken dropout in recent years.
• Because convolutional layers have fewer parameters, they necessitate less regularisation.
Drawbacks of Dropout
A dropout network may take 2–3 times longer to train than a normal network. One way to reap the
benefits of dropout without slowing down training would be a regularizer that is virtually
equivalent to a dropout layer. For linear regression, such a regularizer is a modified variant of L²
regularisation; an analogous regularizer for more complex models has yet to be discovered.
13. ADVERSARIAL TRAINING
Deep learning has matched or exceeded human accuracy on some image recognition benchmarks, which
might lead us to believe that these models have acquired a human-level understanding of an image. However,
experimentally searching for an x′ (given an x), such that prediction made by the model changes, shows
otherwise. As shown in the image below, although the newly formed image (adversarial image) looks
almost exactly the same to a human, the model classifies it wrongly and that too with very high
confidence:
Adversarial examples have many implications, for example, in computer security, that are beyond the
scope of this chapter. However, they are interesting in the context of regularization because one can
reduce the error rate on the original i.i.d. test set via adversarial training—training on adversarially
perturbed examples from the training set.
Goodfellow et al. showed that one of the primary causes of these adversarial examples is
excessive linearity. Neural networks are built primarily out of linear building blocks, and in some
experiments the overall function they implement proves to be highly linear as a result. These linear
functions are easy to optimize. Unfortunately, the value of a linear function can change very rapidly if
it has numerous inputs. If we change each input by ϵ, then a linear function with weights w can change
by as much as ϵ‖w‖₁, which can be a very large amount if w is high-dimensional. Adversarial training
discourages this highly sensitive locally linear behavior by encouraging the network to be locally
constant in the neighbourhood of the training data. This can be seen as a way of explicitly introducing
a local constancy prior into supervised neural nets.
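A sketch of how such an adversarial input can be constructed with the fast gradient sign method, which follows directly from this linearity argument (the stand-in model is a simple logistic regressor; ϵ and the data are made up):

import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=100)      # stand-in linear model: logit = w . x
x = rng.normal(size=100)
y = 1.0                       # true label in {0, 1}
eps = 0.05

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

grad_x = (sigmoid(w @ x) - y) * w        # gradient of the cross-entropy loss w.r.t. the input
x_adv = x + eps * np.sign(grad_x)        # small step in the direction that most increases the loss

print(sigmoid(w @ x), sigmoid(w @ x_adv))   # the predicted probability of the true class drops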
Adversarial training helps to illustrate the power of using a large function family in
combination with aggressive regularization. Purely linear models, like logistic regression, are not able
to resist adversarial examples because they are forced to be linear. Neural networks are able to
represent functions that can range from nearly linear to nearly locally constant and thus have the
flexibility to capture linear trends in the training data while still learning to resist local perturbation.
Adversarial training refers to training on images which are adversarially generated and it has
been shown to reduce the error rate. The main factor behind the above-mentioned behaviour is the
linearity of the model (say y = Wx), caused by the main building blocks being primarily linear. Thus, a
small change of ϵ in the input causes a drastic change of Wϵ in the output. The idea of adversarial
training is to avoid this jump and induce the model to be locally constant in the neighbourhood of
the training data.
This can also be used in semi-supervised learning. For an unlabelled sample x, we can assign the label
ŷ (x) using our model. Then, we find an adversarial example, x′, such that y(x′)≠ŷ (x) (an adversary
found this way is called virtual adversarial example). The objective then is to assign the same class to
both x and x′. The idea behind this is that different classes are assumed to lie on disconnected
manifolds and a little push from one manifold shouldn’t land in any other manifold.
14. TANGENT DISTANCE, TANGENT PROP AND MANIFOLD TANGENT CLASSIFIER
Many ML models assume the data to lie on a low dimensional manifold to overcome the curse
of dimensionality. The inherent assumption which follows is that small perturbations that cause the
data to move along the manifold (it originally belonged to), shouldn’t lead to different class
predictions. The idea of the tangent distance algorithm is to perform nearest-neighbour classification
using, as the distance metric, the distance between the manifolds the points lie on. A manifold Mᵢ is
approximated by its tangent plane at xᵢ; hence, this technique needs the tangent vectors to be specified.
The tangent prop algorithm trains a neural network classifier, f(x), to be
invariant to known transformations that cause the input to move along its manifold. Local invariance
requires that ∇ₓf(x) be perpendicular to the known manifold tangent vectors v⁽ⁱ⁾ at x. This can be
achieved by adding a penalty term that minimizes the directional derivative of f(x) along each of the v⁽ⁱ⁾:
Ω(f) = Σᵢ ((∇ₓf(x))ᵀ v⁽ⁱ⁾)²
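A rough numerical sketch of this penalty using a finite-difference approximation of the directional derivative (the classifier, the tangent vectors and the step size are all illustrative):

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(1, 16)) * 0.1

def f(x):
    return np.tanh(W @ x)[0]                          # stand-in scalar classifier output

x = rng.normal(size=16)
tangents = [rng.normal(size=16) for _ in range(2)]    # stand-in manifold tangent vectors v^(i)
delta = 1e-4

# Penalize the squared directional derivative of f along each tangent direction.
penalty = sum(((f(x + delta * v) - f(x)) / delta) ** 2 for v in tangents)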
It is similar to data augmentation. In both cases, the user of the algorithm encodes his or her
prior knowledge of the task by specifying a set of transformations that should not alter the output of
the network. The difference is that in the case of dataset augmentation, the network is explicitly
trained to correctly classify distinct inputs that were created by applying more than an infinitesimal
amount of these transformations. Tangent propagation does not require explicitly visiting a new input
point. Instead, it analytically regularizes the model to resist perturbation in the directions
corresponding to the specified transformation.
Drawbacks.
First, it only regularizes the model to resist infinitesimal perturbation. Explicit dataset
augmentation confers resistance to larger perturbations.
Second, the infinitesimal approach poses difficulties for models based on rectified
linear units.
These models can only shrink their derivatives by turning units off or shrinking their weights.
They are not able to shrink their derivatives by saturating at a high value with large weights, as
sigmoid or tanh units can. Dataset augmentation works well with rectified linear units because
different subsets of rectified units can activate for different transformed versions of each original
input.
The manifold tangent classifier avoids the need for user-specified tangent vectors: it uses
autoencoders to estimate the manifold tangent vectors via unsupervised learning. These estimated tangent
vectors go beyond the classical invariants that arise out of the geometry of images (such as translation,
rotation and scaling) and include factors that must be learned because they are object-specific (such as
moving body parts).
Tangent propagation is also related to double backprop and adversarial training. Double
backprop regularizes the Jacobian to be small, while adversarial training finds inputs near the original
inputs and trains the model to produce the same output on these as on the original inputs. Tangent
propagation and dataset augmentation using manually specified transformations both require that the
model should be invariant to certain specified directions of change in the input.
Double backprop and adversarial training both require that the model should be invariant to
all directions of change in the input, so long as the change is small. Just as dataset augmentation is the
non-infinitesimal version of tangent propagation, adversarial training is the non-infinitesimal version
of double backprop.