
Deep Feed Forward Neural Networks

Recap- MLP can represent any function


The MLP can be constructed to represent anything
• But how do we construct it? – i.e. how do we determine the weights (and biases) of the network
to best represent a target function
• Assuming that the architecture of the network is given

Training Neural Networks-Empirical Risk Minimization


Learning is cast as optimization. Ideally we would optimize the classification error directly; in practice we minimize a differentiable surrogate loss averaged over the training set (the empirical risk).

(Figure: error surface over the weight space)
Recap- Patterns are learnt hierarchically, from lower levels up

• The neurons in an MLP build up complex patterns from simple patterns, hierarchically
– Each layer learns to “detect” simple combinations of the patterns detected by earlier layers
• This is because the basic units themselves are simple
– Typically linear classifiers or thresholding units
– Incapable of individually holding complex patterns
Deep neural architecture with multiple layers

Example: an 8-9-9-9-4 network has
8×9 = 72, 9×9 = 81, 9×9 = 81, and 9×4 = 36 weights,
for a total of 270 parameters (weights only; biases are not counted here).

i j=err*f’(netj)
wij
ai

i= jwij*f’(neti)
Training the multi-layer neural network
When using sigmoid neurons, the gradient will usually vanish as it is propagated back to the earlier layers.
Activation functions – which is better?

Sigmoid Activation function:


• It is an activation function of the form f(x) = 1 / (1 + exp(-x)).
• Its range is between 0 and 1.
• It is an S-shaped curve.
• It is easy to understand and apply, but it has major drawbacks that have made it fall out of popularity:
• Vanishing gradient problem.
• Its output is not zero-centered: since 0 < output < 1, the gradients of the weights feeding a neuron all share the same sign, so the updates zig-zag and optimization becomes harder.
• Sigmoids saturate and kill gradients.
• Sigmoids have slow convergence.

Reference:
https://medium.com/datadriveninvestor/deep-learning-best-practices-activation-functions-weight-initialization-methods-
part-1-c235ff976ed#:~:text=The%20activation%20function%20is%20the,the%20next%20layer%20as%20input.
Activation functions – which is better?

Hyperbolic Tangent function- Tanh :


• Its mathematical formula is f(x) = (1 − exp(-2x)) / (1 + exp(-2x)).
• Its output is zero-centered because its range is between -1 and 1, i.e. -1 < output < 1.
• Hence optimization is easier, and in practice it is usually preferred over the Sigmoid function. But it still suffers from the vanishing gradient problem.
Activation functions – which is better?
Then how do we deal with and rectify the vanishing gradient problem?
ReLU - Rectified Linear Units:
• It has become very popular in the past few years.
• It has been reported to give roughly a 6x improvement in convergence speed over the Tanh function.
• It is simply R(x) = max(0, x), i.e. if x < 0, R(x) = 0, and if x >= 0, R(x) = x.
• As its mathematical form shows, it is very simple and efficient. In machine learning and computer science we often find that the simplest, most consistent techniques are the ones preferred and the ones that work best. Because its gradient is 1 for positive inputs, ReLU avoids the vanishing gradient problem, and almost all deep learning models use it nowadays.
• Its limitation is that it should only be used within the hidden layers of a neural network model.
• For the output layer we should use a Softmax function in a classification problem, to compute probabilities for the classes; for a regression problem we should simply use a linear function. (A short code sketch of these functions follows below.)
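The sketch below is a minimal NumPy illustration of the activation functions discussed above (Sigmoid, Tanh, ReLU) together with the Softmax output function mentioned for classification; it is an illustration, not a library implementation.

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + exp(-x)): range (0, 1), not zero-centered
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # f(x) = (1 - exp(-2x)) / (1 + exp(-2x)): range (-1, 1), zero-centered
    return np.tanh(x)

def relu(x):
    # R(x) = max(0, x): gradient is 1 for x > 0, so it does not saturate there
    return np.maximum(0.0, x)

def softmax(z):
    # Output layer for classification: maps scores to a probability distribution
    z = z - np.max(z)            # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x), softmax(x), sep="\n")
```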
Loss functions and output

Classification:
  Training examples: R^n x {class_1, ..., class_K} (one-hot encoding)
  Output layer: Soft-max [maps R^n to a probability distribution] or Sigmoid
  Cost (loss) function: Cross-entropy
    J(θ) = −(1/n) Σ_{i=1}^{n} Σ_{k=1}^{K} [ y_k^(i) log ŷ_k^(i) + (1 − y_k^(i)) log(1 − ŷ_k^(i)) ]
    - (for a binary classifier with scalar output y = 0/1)
    - (for a desired one-hot output y = [0, 0, .., 1, .., 0])

Regression:
  Training examples: R^n x R^m
  Output layer: Linear (identity), f(x) = x
  Cost (loss) function: Mean Squared Error
    J(θ) = (1/n) Σ_{i=1}^{n} (ŷ^(i) − y^(i))²
    or Mean Absolute Error
    J(θ) = (1/n) Σ_{i=1}^{n} |ŷ^(i) − y^(i)|
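The cost functions above can be sketched directly in NumPy; the snippet below assumes targets and predictions of shape (n, K) for classification and any matching shape for regression, and adds a small epsilon (not part of the formulas above) purely to avoid log(0).

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    # J = -(1/n) * sum_i sum_k [ y log y_hat + (1 - y) log(1 - y_hat) ]
    # y_true: one-hot (or 0/1) targets, y_pred: predicted probabilities, shape (n, K)
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred)
                           + (1.0 - y_true) * np.log(1.0 - y_pred), axis=1))

def mse(y_true, y_pred):
    # J = (1/n) * sum_i (y_hat - y)^2
    return np.mean((y_pred - y_true) ** 2)

def mae(y_true, y_pred):
    # J = (1/n) * sum_i |y_hat - y|
    return np.mean(np.abs(y_pred - y_true))
```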
Overfitting
(Figure: decision boundary in the x1-x2 plane bending around noisy points)

Learned hypothesis may fit the training data very well, even outliers (noise), but fail to generalize to new examples (test data).
Regularization techniques reduce the effective capacity of a model, i.e., the ability for the model to handle
more complex classification tasks.
This makes sense since the basic cause of Overfitting is that the model capacity exceeds the requirements
for the problem.
Some commonly used Regularization techniques include: Early Stopping, L1 Regularization, L2
Regularization, Dropout Regularization, Training Data Augmentation, Batch Normalization
Overfitting- Drop out
Regularization to prevent over-fit

In Deep Learning there are two well-known regularization techniques:


L1 and L2 regularization.
Both add a penalty to the cost based on the model complexity, so instead of
calculating the cost by simply using a loss function, there will be an additional
element (called “regularization term”) that will be added in order to penalize
complex models.
Regularization to prevent over-fit: L1 Regularization
The L1-regularized cost adds a penalty λ Σ |w| (the sum of the absolute values of the weights) to the data loss computed from the targets and outputs, where
tk = target output of kth neuron
Ok = predicted output of kth neuron
Ok = f(net) = f(X·W)
Regularization to prevent over-fit L2 Regularization

It’s correct to say that neural network L2 regularization and


weight decay are the same thing, but it’s also correct to say they
do the same thing but in slightly different ways.
L2 regularization is a technique used to reduce the likelihood of
neural network model overfitting. Overfitting occurs when you
train a neural network too long. The trained model predicts very
well on the training data (often nearly 100% accuracy) but when
presented with new data the model predicts poorly.
Neural networks that have been over-trained are often
characterized by having weights that are large. A good NN might
have weight values that range between -5.0 to +5.0 but a NN that
is overfitted might have some weight values such as 25.0 or -32.0.
So, one approach for discouraging overfitting is to prevent weight
values from getting large in magnitude.

Weight decay: To prevent overfitting, every time we update a weight w with the gradient ∇E with respect to w, we also subtract λ∙w from it. This gives the weights a tendency to decay towards zero, hence the name.
Regularization to prevent over-fit

Ridge regression uses L2 regularization, which adds a penalty term of the form λ Σ β² to the ordinary least squares equation.
The L2 term is equal to the square of the magnitude of the coefficients.

This constraint results in minimized coefficients (aka shrinkage) that trend towards zero the larger the value of lambda.
Shrinking the coefficients leads to a lower variance and in turn a lower error value. Therefore Ridge regression
decreases the complexity of a model but does not reduce the number of variables, it rather just shrinks their effect.
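Below is a minimal sketch of the weight-decay view of L2 regularization described above: each update takes the usual gradient step and additionally subtracts λ·w. The function name and the default learning rate and λ values are illustrative.

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=0.01, lam=1e-4):
    """One update of the weight vector w.

    grad is the gradient of the unregularized loss with respect to w;
    subtracting lr * lam * w on top of the usual step is the 'weight decay'
    view of L2 regularization, pulling the weights towards zero.
    """
    return w - lr * grad - lr * lam * w

# Illustrative usage on a dummy weight vector and gradient
w = np.array([25.0, -32.0, 1.5])
g = np.array([0.3, -0.1, 0.05])
w = sgd_step_with_weight_decay(w, g)
```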
Ridge Regression

An important fact we need to notice about


ridge regression is that it enforces the β
coefficients to be lower, but it does not enforce
them to be zero.
That is, it will not get rid of irrelevant features
but rather minimize their impact on the
trained model.
Lasso regression uses the L1 penalty term and stands for Least Absolute Shrinkage and Selection Operator. The penalty applied for L1 is equal to the absolute value of the magnitude of the coefficients, λ Σ |β|.

Given a suitable lambda value lasso regression can drive some coefficients to zero. The larger the value of lambda the more
features are shrunk to zero. This can eliminate some features entirely and give us a subset of predictors that helps mitigate
multi-collinearity and model complexity. Predictors not shrunk towards zero signify that they are important and thus L1
regularization allows for feature selection (sparse selection).
Lasso Regression
Lasso method overcomes the disadvantage of Ridge
regression by not only punishing high values of the
coefficients β but actually setting them to zero if
they are not relevant. Therefore, you might end up
with fewer features included in the model than you
started with, which is a huge advantage.
Regularization to prevent overfit
L1 regularization (LASSO regression) produces sparse weight matrices. A sparse matrix is one in which most elements are zero (the sparsity refers to the zeros); in this context a sparse weight matrix has many zero or close-to-zero values and a few larger values.
If we find a model with neurons whose weights are close to zero it means we don’t need
those neurons because the model deactivates them with zeros and we might not need a
specific feature/input leading to a simpler model. For instance, if we have 50 coefficients
but only 10 are non-zero, the other 40 are irrelevant to make our predictions. This is not
only interesting from the efficiency point of view but also from the economic point of
view: gathering data and extracting its features might be a very expensive task (in terms
of time and money). Reducing this will benefit us.
Due to the absolute value, the L1 regularization term is non-differentiable at zero, but despite that, there are methods to minimize it.
Regularization to prevent overfit

L2 regularization (Ridge regression) on the other hand leads to a balanced minimization


of the weights. Since L2 uses squares, it emphasizes the errors, and it can be a problem
when there are outliers in the data. Unlike L1, L2 has an analytical solution which makes
it computationally efficient.
Both regularizations have a λ parameter which controls the strength of the penalty: the larger λ, the stronger the penalty on complex models and the more likely the model is to avoid them. Likewise, if λ is zero, regularization is deactivated.
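To summarize the two penalties, here is a small sketch of how the L1 (Lasso-style) and L2 (Ridge-style) terms are added to a data loss; lam1 and lam2 are illustrative regularization strengths.

```python
import numpy as np

def l1_penalty(w, lam1):
    # LASSO-style penalty: lam1 * sum |w_j| -> encourages sparse (exactly-zero) weights
    return lam1 * np.sum(np.abs(w))

def l2_penalty(w, lam2):
    # Ridge-style penalty: lam2 * sum w_j^2 -> shrinks weights but rarely zeroes them
    return lam2 * np.sum(w ** 2)

def regularized_loss(data_loss, w, lam1=0.0, lam2=0.0):
    # Total cost = data loss + regularization term(s)
    return data_loss + l1_penalty(w, lam1) + l2_penalty(w, lam2)

w = np.array([0.9, -0.01, 0.0, 2.3])
print(regularized_loss(0.42, w, lam1=0.01, lam2=0.001))
```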
Batch Normalization

Neural networks learn the problem using BackPropagation algorithm. By


backpropagation, the neurons learn how much error they did and correct
themselves, i.e, correct their “weights” and “biases”. By this they learn the
problem to produce correct outputs given the inputs. BackPropagation
involves computing gradients for each layer and propagating it backward,
hence the name.
But during backpropagation of errors to the weights and biases, we face an undesirable property called Internal Covariate Shift. This makes the network take too long to train.
Batch Normalization

Internal Covariate Shift: during training, each layer tries to correct itself for the error made during the forward propagation, but every single layer acts separately.
For example, in the network given above, the 2nd layer adjusts its weights and biases to correct for the output. But due to this readjustment, the output of the 2nd layer, i.e. the input of the 3rd layer, changes for the same initial input. So the third layer has to learn from scratch to produce the correct outputs for the same data.
This presents the problem of a layer only starting to learn after its previous layer, i.e. the 3rd layer learns after the 2nd has finished, the 4th starts learning after the 3rd, and so on.
Batch Normalization

More specifically, due to changes in weights of previous layers, the distribution of input
values for current layer changes, forcing it to learn from new “input distribution”
Normalization - In a dataset, all the features (columns) may not be in the same range, e.g. price of a house (in thousands) vs. age of the house (within 100). It takes a lot of time to train on these kinds of datasets.
Usually, in simpler ML algorithms like linear regression, the input is “normalized” before training to bring the features into a single distribution. Normalization converts the distribution of all inputs to have mean = 0 and standard deviation = 1, so most of the values lie between -1 and 1.
Raw and normalized data
Batch Normalization

We can apply this same normalization to the input of a neural network. It speeds up training, just as in linear regression. But since the 2nd layer changes this distribution, the subsequent layers do not benefit. So, what can we do?
Why not add normalization between each pair of layers? This is what Batch Normalization does.

To reduce the problem of internal covariate shift, Batch Normalization adds a normalization “layer” between layers. An important thing to note here is that the normalization is done separately for each dimension (input neuron), over the ‘mini-batches’, and not altogether over all dimensions. Hence the name ‘batch’ normalization.
Batch Normalization
Due to these normalization “layers” between the fully connected layers, the input distribution of each layer stays in the same range, no matter how the previous layer changes.
Given inputs x_k from the k-th neuron, the normalized value is

x̂_k = (x_k − E[x_k]) / √(Var[x_k])

where E[x_k] is the mean and Var[x_k] the variance of x_k over the mini-batch.

Normalization keeps all the inputs centered around 0, so there is not much change in each layer's input distribution. Layers in the network can therefore learn from back-propagation simultaneously, without waiting for the previous layer to finish learning. This speeds up the training of networks.
Batch Normalization

There are usually two types in which Batch Normalization can be applied:
1. Before activation function (non-linearity)
2. After non-linearity

Most of the activation functions have problems while applied this way. For sigmoid and
tanh activation, normalized region is more of linear than nonlinear. For Relu activation,
half of the inputs are zeroed out.

So, some transformation has to be done to move the distribution away from 0.
Batch Normalization

A scaling factor γ and shifting factor β are used to do this.

As training progresses, these γ and β are also learnt through backpropagation so as to improve accuracy; this adds 2 extra learnable parameters per normalized layer.
Use of scaling and shifting is particularly useful because it provides more flexibility. If we decide not to use BatchNorm for a layer, we can set γ = σ and β = mean, thus giving back the original values.
Recently, it has been observed that BatchNorm performs better, and even gives better accuracy, when applied after the activation. In that case we may decide to use the normalization alone and not scaling and shifting, i.e. set γ = 1 and β = 0. Nevertheless, γ and β are included in the Batch Normalization algorithm.
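A minimal sketch of the Batch Normalization forward pass described above: statistics are computed per neuron over the mini-batch, followed by the learnable scale γ and shift β. The small constant eps is an assumption added here for numerical stability.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalization over a mini-batch.

    x: activations of shape (batch_size, num_neurons); statistics are computed
    separately for each neuron (dimension) over the mini-batch.
    gamma, beta: learnable scale and shift, one per neuron.
    """
    mean = x.mean(axis=0)                    # E[x_k] per neuron
    var = x.var(axis=0)                      # Var[x_k] per neuron
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalized: zero mean, unit variance
    return gamma * x_hat + beta              # scaled and shifted output

# Illustrative usage: a mini-batch of 32 examples, a layer with 9 neurons
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 9))
out = batch_norm_forward(x, gamma=np.ones(9), beta=np.zeros(9))
```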
Other regularization methods

1.Data Augmentation: Suppose we are building an image classification model and are
lacking the requisite data due to various reasons. In such cases, we can use data
augmentation, i.e., applying some changes such as flipping the image, taking random crops
of the image, randomly rotating images, etc. These can potentially help us get more training
data and hence reduce overfitting.
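As a simple illustration of the augmentations mentioned above, the sketch below randomly flips and randomly crops a NumPy image array; the crop size and the dummy image are illustrative.

```python
import numpy as np

def augment(image, rng, crop=24):
    """Return a randomly flipped and randomly cropped copy of an image.

    image: array of shape (H, W, C); crop: side length of the random crop.
    """
    if rng.random() < 0.5:                   # random horizontal flip
        image = image[:, ::-1, :]
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)      # random crop position
    left = rng.integers(0, w - crop + 1)
    return image[top:top + crop, left:left + crop, :]

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))                # a dummy 32x32 RGB image
augmented = augment(img, rng)
```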
Other regularization methods
2.Early Stopping: Early Stopping is one of the most popular, and also effective, techniques
to prevent overfitting. Use the validation data set to compute the loss function at the end of
each training epoch, and once the loss stops decreasing, stop the training and use the test
data to compute the final classification accuracy. In practice it is more robust to wait until
the validation loss has stopped decreasing for four or five successive epochs before
stopping. The justification for this rule is quite simple: The point at which the validation
loss starts to increase is when the model starts to overfit the training data, since from this
point onwards its generalization ability starts to decrease. Early Stopping can be used by
itself or in combination with other Regularization techniques.
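A generic sketch of the early-stopping rule described above; train_epoch and validation_loss are hypothetical callables standing in for one epoch of training and a validation-loss computation, and patience corresponds to the four-or-five-epoch wait suggested in the text.

```python
def train_with_early_stopping(train_epoch, validation_loss, max_epochs=200, patience=5):
    """Stop training once the validation loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_epoch()                        # one pass over the training data
        loss = validation_loss()             # loss on the held-out validation set
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0   # also save the model weights here
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                        # validation loss has plateaued: stop
    return best_loss
```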
Other regularization methods
Once the DLN model has been trained, its true test is how well it is able to classify inputs that it
has not seen before, which is also known as its Generalization Ability.
There are two kinds of problems that can afflict ML models in general:
i) Even after the model has been fully trained such that its training error is small,
it exhibits a high test error rate. This is known as the problem of Over-fitting.
ii) The training error fails to come down in-spite of several epochs of training. This is
known as the problem of Under-fitting.
The following factors determine how well a model is able to generalize from the training dataset to
the test dataset:
The model capacity and its relation to data complexity:
Model’s capacity is its ability to fit a wide variety of functions. In general if the model capacity is
less than the data complexity then it leads to under-fitting, while if the converse is true, then it can
lead to over-fitting. Hence we should try to choose a model whose capacity matches the
complexity of the training and test datasets.
Even if the model capacity and the data complexity are well matched, we can still encounter the
overfitting problem. This is due to an insufficient amount of training data.
Based on these observations, a useful rule of thumb is the following: the chances of encountering the overfitting problem increase as the model capacity grows, but decrease as more training data becomes available.
Other regularization methods
Note that we are attempting to reduce the training error (to avoid the underfitting problem) and the
test error (to avoid the overfitting problem) at the same time. This leads to conflicting demands on
the model capacity, since training error reduces monotonically as the model capacity increases, but
the test error starts to increase if the model capacity is too high. In general, if we plot the test error as a function of model capacity, it exhibits a characteristic U-shaped curve. The ideal model capacity is the point at which the test error starts to increase. This criterion is used extensively in DLNs to determine the best set of hyper-parameters to use.
Other regularization methods

The generalization ability of DLN models relies on a very important assumption, which is that both the training and test datasets can be generated by the same probabilistic model.
In practice what this means is that if we train the model to recognize a certain type of
object, human faces for example, we cannot expect it to perform well if the test data
consists entirely of cat faces.
There is a famous result called the No Free Lunch Theorem that states that if this
assumption is not satisfied, i.e., the training and test datasets distributions are un-
constrained, then every classification algorithm has the same error rate when classifying
previously unobserved points. Hence the only way to do better, is by constraining the
training and test datasets to a narrower class of data that is relevant to the problem being
solved.
Bias- Variance Trade-off
Bias is the difference between the average prediction of our model and the correct
value which we are trying to predict. Model with high bias pays very little attention
to the training data and oversimplifies the model. It always leads to high error on
training and test data.
Variance is the variability of the model prediction for a given data point, or a value which tells us the spread of our predictions. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn’t seen before. As a result, such models perform very well on training data but have high error rates on test data.

Mathematically, let the variable we are trying to predict be Y and the covariates be X. We assume there is a relationship between the two such that
Y = f(X) + e
where e is the error term, normally distributed with a mean of 0.
We build a model f̂(X) of f(X) using linear regression or any other modeling technique.
Bias- Variance Trade-off
So the expected squared error at a point x is

Err(x) = E[ (Y − f̂(x))² ]

Err(x) can be further decomposed as

Err(x) = ( E[f̂(x)] − f(x) )² + E[ ( f̂(x) − E[f̂(x)] )² ] + σ_e²
       = Bias² + Variance + Irreducible error
Irreducible error is the error that can’t be reduced by creating good models. It
is a measure of the amount of noise in our data. Here it is important to
understand that no matter how good we make our model, our data will have
certain amount of noise or irreducible error that can not be removed.
Bias- Variance Trade-off

In the above diagram, the center of the target is a model that perfectly predicts the correct values. As we move away from the bulls-eye, our predictions get worse and worse. We can repeat our process of model building to get separate hits on the target.
Bias- Variance Trade-off

In supervised learning, underfitting happens when a model is unable to capture the underlying pattern of the data. These models usually have high bias and low variance. It happens when we have too little data to build an accurate model, or when we try to fit a linear model to nonlinear data. Models that are too simple to capture complex patterns in the data, like linear and logistic regression, are prone to underfitting.
In supervised learning, overfitting happens when our model captures the noise along with the underlying pattern in the data. It happens when we train our model for too long on a noisy dataset. These models have low bias and high variance. Very complex models, like deep decision trees, are prone to overfitting.
Bias- Variance Trade-off

Why is there a Bias-Variance Tradeoff?

If our model is too simple and has very few parameters then it may have high bias and low variance. On the other hand, if our model has a large number of parameters then it is going to have high variance and low bias. So we need to find the right balance, without overfitting or underfitting the data.
This tradeoff in complexity is why there is a tradeoff between bias and variance: an algorithm can’t be more complex and less complex at the same time.
Bias- Variance Trade-off
Total Error
To build a good model, we need to find a good balance between bias and variance
such that it minimizes the total error.

• Underfitting model (high bias problem): cross-validation error and training error are both high and roughly equal.
• Overfitting model (high variance problem): cross-validation error is high while training error is low.
Bias- Variance Trade-off
Effect of different operations on bias & variance
[Table: operations classified by whether they give less or more model complexity, and hence how they shift the bias-variance balance]
New optimization methods
Effect of learning rate

https://srdas.github.io/DLBook/GradientDescentTechniques.ht
ml

If the learning rate η is set to a large value then the algorithm moves quickly at the start of the iterations, but the large step size can cause the parameters to overshoot as the system approaches the minimum, which can lead to oscillations. If it is set too small then the algorithm converges with high likelihood, but it can take a very long time to do so. Hence ideally η should be set adaptively, so that it is large in the initial stages of the optimization and becomes smaller as it gets closer to the minimum.
New optimization methods
Effect of learning rate

A good Learning Rate, on the other hand, combines a quick decrease during the initial epochs with a lower steady-state value. Learning Rate Annealing is the strategy of reducing the Learning Rate as the system approaches the minimum, such that the rate is high at the start of training and gradually falls as training progresses. This reduction can be done in several ways; popular approaches are:
• Track the validation accuracy and decrease the Learning Rate when it appears to plateau.
• Automatically anneal the Learning Rate based on the number of epochs that the Gradient Descent
algorithm has been through.
New optimization methods
Improvements to the parameter update equations
The base parameter update equation is

w ← w − η ∇J(w)

and several modifications of it help to improve the performance of the Gradient Descent algorithm.


Some of these algorithms automatically adapt the effective Learning Rate as the training
progresses (for example the ADAGRAD, RMSPROP and Adam algorithms), while others
improve the speed of convergence (for example the Momentum, Nesterov Momentum and Adam
algorithms).
New optimization methods - Momentum
Momentum is one of the most popular techniques used to improve the speed of convergence of the
Gradient Descent algorithm.
If the gradient along one dimension is very large while along another dimension it is small, then the gradient descent iterates for the parameter on the steep side will oscillate, while the parameter on the shallow side will progress very slowly. This slows the speed of convergence. To counteract this, the raw gradient step is replaced by the sequence v(n):

v(n) = ρ v(n−1) + η ∇J(w(n))

so that the change in the parameter values on each iteration is now

w(n+1) = w(n) − v(n)

where ρ (typically around 0.9) is the momentum parameter.
New optimization methods - Momentum
Nesterov Momentum is a variation on the plain Momentum method described above. Its parameter update equations can be written as:

v(n) = ρ v(n−1) + η ∇J( w(n) − ρ v(n−1) )
w(n+1) = w(n) − v(n)

i.e. the gradient is evaluated at the “look-ahead” point w(n) − ρ v(n−1). With this change the Gradient Descent process speeds up considerably when compared to the plain Momentum method.
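The momentum and Nesterov momentum updates above can be sketched as follows; grad_fn stands for the gradient of the loss, and the learning rate and momentum values are illustrative defaults.

```python
import numpy as np

def momentum_step(w, v, grad_fn, lr=0.01, rho=0.9):
    # Plain momentum: accumulate a velocity and step with it
    v = rho * v + lr * grad_fn(w)
    return w - v, v

def nesterov_step(w, v, grad_fn, lr=0.01, rho=0.9):
    # Nesterov momentum: evaluate the gradient at the look-ahead point w - rho * v
    v = rho * v + lr * grad_fn(w - rho * v)
    return w - v, v

# Illustrative usage on the simple quadratic loss J(w) = 0.5 * ||w||^2 (gradient = w)
grad = lambda w: w
w, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(100):
    w, v = nesterov_step(w, v, grad)
```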
New optimization methods- Adaptive learning rate
AdaGrad – Adaptive Gradient algorithm

Modifies the learning rate η at each time step t for every parameter θ_i, based on the past gradients that have been computed for θ_i: the accumulated sum of squared past gradients scales down the effective learning rate for parameters that have received large gradients.
New optimization methods- Adaptive learning rate Root Mean Squared Propagation

RMS Prop modifies AdaGrad to perform better in the non-convex setting by changing gradient accumulation
into an exponentially weighted moving average

The exponentially decaying average discards history from the extreme past so that it can converge rapidly.
New optimization methods- Adaptive learning rate
Adam (adaptive moments) also stores a running average of past squared gradients. It is a variant on the combination of RMSProp and momentum, with a few important distinctions: in Adam, momentum is incorporated directly as an estimate of the first-order moment (with exponential weighting) of the gradient, rather than by simply applying momentum to the rescaled gradients of RMSProp.
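A compact sketch of the three adaptive-learning-rate updates discussed here (AdaGrad, RMSProp, Adam), using the standard forms of these rules; hyper-parameter defaults and variable names are illustrative.

```python
import numpy as np

def adagrad_step(w, g, acc, lr=0.01, eps=1e-8):
    # Accumulate *all* past squared gradients; the effective rate decays monotonically
    acc = acc + g ** 2
    return w - lr * g / (np.sqrt(acc) + eps), acc

def rmsprop_step(w, g, acc, lr=0.001, beta=0.9, eps=1e-8):
    # Exponentially weighted moving average of squared gradients (discards old history)
    acc = beta * acc + (1 - beta) * g ** 2
    return w - lr * g / (np.sqrt(acc) + eps), acc

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment (momentum-like) and second moment (RMSProp-like), bias-corrected;
    # t is the 1-based step count.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```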
Choosing the models and algorithms
Choosing the models
•If the input data is such that the various classes are approximately linearly separable (which can be verified by plotting a subset of the input data points in two dimensions using projection techniques), then a linear model will work.
•Deep Learning networks are needed for more complex datasets with non-linear boundaries between classes. If the input data has a 1-D structure, then a Deep Feed Forward Network will suffice.
•If the input data has a 2-D structure (such as black and white images) or a 3-D structure (such as color images), then a Convolutional Neural Network or ConvNet is called for. In some cases 1-D or 2-D data can be aggregated into a higher-level structure which then becomes amenable to processing using ConvNets. ConvNets excel at object detection and recognition in images and have also been applied to tasks such as DLN-based image generation.
•If the input data forms a sequence with dependencies between the elements of the sequence, then a Recurrent Neural
Network or RNN is required. Typical examples of this kind of data include: Speech waveforms, natural language
sentences, stock prices etc. RNNs are ideal for tasks such as speech recognition, machine translation, captioning etc
Choosing the algorithms
Choice of SGD parameter update rule

• Momentum and Nesterov momentum: improve the speed of convergence.
• AdaGrad and RMSProp: automatically adapt the effective learning rate.
• Adam: combines both advantages and is a good default choice.
Choosing the algorithms
Choice of learning rate

• Keep track of the validation error, and reduce the Learning Rate by a factor of 2 when the error appears to plateau.
• Automatically reduce the Learning Rate using a predetermined schedule. Popular schedules are:
(a) Exponential decrease: η = η₀ · 10^(−t/r), so that the Learning Rate drops by a factor of 10 every r steps.
(b) η = η₀ · (1 + t/r)^(−c), which leads to a smaller rate of decrease compared to the exponential schedule (both schedules are sketched in code below).
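The two annealing schedules in (a) and (b) can be written directly as small functions; the parameter values used in the example are illustrative.

```python
def exponential_decay(lr0, t, r):
    # eta = eta_0 * 10^(-t / r): drops by a factor of 10 every r steps
    return lr0 * 10 ** (-t / r)

def inverse_decay(lr0, t, r, c):
    # eta = eta_0 * (1 + t / r)^(-c): decreases more slowly than the exponential schedule
    return lr0 * (1 + t / r) ** (-c)

# Illustrative values: initial rate 0.1, schedule parameter r = 10 epochs
exp_rates = [exponential_decay(0.1, t, r=10) for t in range(30)]
inv_rates = [inverse_decay(0.1, t, r=10, c=1.0) for t in range(30)]
```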
Choosing the algorithms
Choice of activation function

The general rules in this area are: Avoid Sigmoid and Tanh functions, use ReLu as a default
and Leaky ReLU to improve performance, try out MaxOut and ELU on an experimental
basis.
Choosing the algorithms
Weight Initialization Rules: Use the Xavier/He initializations as a default.

• Linear neuron with random weights W: pick the weights from a Gaussian distribution with zero mean and a variance of 1/n_in, where n_in is the number of input neurons.
• Backpropagation network with random weights W: pick the weights from a Gaussian distribution with zero mean and a variance of 1/n_in, where n_in is the number of input neurons.
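A small sketch of Gaussian weight initialization following the 1/n_in rule stated above; the 2/n_in variant shown for the second layer is the common He initialization for ReLU layers, included here as an assumption beyond the slide text.

```python
import numpy as np

def gaussian_init(n_in, n_out, rng, variance=None):
    """Initialize an (n_in, n_out) weight matrix from a zero-mean Gaussian.

    The default variance 1/n_in matches the rule stated above; He initialization
    for ReLU layers commonly uses 2/n_in instead (shown below as an option).
    """
    if variance is None:
        variance = 1.0 / n_in
    return rng.normal(0.0, np.sqrt(variance), size=(n_in, n_out))

rng = np.random.default_rng(0)
W1 = gaussian_init(8, 9, rng)                     # variance 1/n_in (Xavier-style)
W2 = gaussian_init(9, 9, rng, variance=2.0 / 9)   # variance 2/n_in (He-style, for ReLU)
```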
