
Unit No. 4: Ensemble Methods
Regularization :
Regularization is one of the most important concepts of machine learning. It is a technique to
prevent the model from overfitting by adding extra information to it.

Sometimes a machine learning model performs well on the training data but does not perform
well on the test data. This means the model cannot predict the output for unseen data because it
has also fit the noise in the training data; such a model is called overfitted. This problem can
be dealt with using a regularization technique.

This technique allows us to keep all the variables or features in the model while reducing their
magnitudes. Hence, it maintains the accuracy as well as the generalization of the model.

It mainly regularizes or reduces the coefficient of features toward zero. In simple words, "In
regularization technique, we reduce the magnitude of the features by keeping the same number of
features."

How does Regularization Work?


Regularization works by adding a penalty or complexity term to the complex model. Let's consider
the simple linear regression equation:

y = β0 + β1x1 + β2x2 + β3x3 + ⋯ + βnxn + b

In the above equation, y represents the value to be predicted,

x1, x2, …, xn are the features for y, and

β1, β2, …, βn are the weights or magnitudes attached to the features, respectively. β0 represents
the bias of the model, and b represents the residual error.

Techniques of Regularization
There are mainly two types of regularization techniques, which are given below:

o Ridge Regression
o Lasso Regression

Ridge Regression (L2 regularization)


o Ridge regression is one of the types of linear regression in which a small amount of bias is
introduced so that we can get better long-term predictions.
o Ridge regression is a regularization technique, which is used to reduce the complexity of
the model. It is also called L2 regularization.
o In this technique, the cost function is altered by adding the penalty term to it. The amount
of bias added to the model is called the Ridge Regression penalty. We can calculate it by
multiplying lambda by the squared weight of each individual feature.
o The equation for the cost function in ridge regression will be:

   Cost = Σ(yi − ŷi)² + λΣ(βj)²

o In the above equation, the penalty term λΣ(βj)² regularizes the coefficients of the model, and
hence ridge regression reduces the magnitudes of the coefficients, which decreases the complexity
of the model.
o As we can see from the above equation, if the values of λ tend to zero, the equation
becomes the cost function of the linear regression model. Hence, for the minimum value
of λ, the model will resemble the linear regression model.
o A general linear or polynomial regression will fail if there is high collinearity between the
independent variables, so to solve such problems, Ridge regression can be used.
o It helps to solve the problems if we have more parameters than samples.

Lasso Regression:

o Lasso regression is another regularization technique to reduce the complexity of the model.
It stands for Least Absolute Shrinkage and Selection Operator.
o It is similar to Ridge Regression except that the penalty term contains only the absolute
weights instead of the square of weights.
o Since it takes absolute values, it can shrink the slope exactly to 0, whereas Ridge Regression
can only shrink it close to 0.
o It is also called L1 regularization. The equation for the cost function of Lasso
regression will be:

   Cost = Σ(yi − ŷi)² + λΣ|βj|
o Some of the features in this technique are completely neglected for model evaluation.
o Hence, the Lasso regression can help us to reduce the overfitting in the model as well as
the feature selection.

Key Difference between Ridge Regression and Lasso Regression

o Ridge regression is mostly used to reduce the overfitting in the model, and it includes all
the features present in the model. It reduces the complexity of the model by shrinking the
coefficients.
o Lasso regression helps to reduce the overfitting in the model as well as feature selection.
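The contrast is easy to see with scikit-learn's linear models. The following is a minimal sketch, assuming an illustrative toy dataset and alpha values (none of which come from this document): Lasso drives some coefficients exactly to zero, while Ridge only shrinks them.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# Toy dataset: 10 features, only 5 of them informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=5,
                       noise=10, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients toward 0
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: can set coefficients exactly to 0

print(np.round(ridge.coef_, 2))      # all coefficients non-zero, but shrunk
print(np.round(lasso.coef_, 2))      # several coefficients exactly 0.0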

Regularization is a set of techniques that can prevent overfitting in neural networks and thus
improve the accuracy of a Deep Learning model when facing completely new data from the
problem domain. We will address the most popular regularization techniques, which are called
L1, L2, and dropout.
If the model is able to perform well on the testing dataset, the model can be said to have
generalized well, i.e., correctly understood the patterns provided in the training dataset. This type
of model is called a correct fit model. However, if the model performs really well on the training
data and doesn’t perform well on the testing data, it can be concluded that the model has
memorized the patterns of training data but is not able to generalize well on unseen data. This
model is called an overfit model.

To summarize, overfitting is a phenomenon where the machine learning model learns patterns
and performs well on data that it has been trained on and does not perform well on unseen data.
A typical error graph shows that as the model is trained for a longer duration, the training error
lessens. However, the testing error starts increasing after a specific point. This indicates that
the model has started to overfit.

Methods used to handle overfitting


Overfitting is indicated when the training accuracy/metric is noticeably higher than the validation
accuracy/metric. It can be handled through the following techniques:

1. Training on more training data to better identify the patterns.


2. Data augmentation for better model generalization.
3. Early stopping, i.e., stopping the model training when the validation metric starts decreasing
or the loss starts increasing.
4. Regularization techniques.

In these techniques, data augmentation and more training data don’t change the model
architecture but try to improve the performance by altering the input data. Early stopping is used
to stop the model training at an appropriate time - before the model overfits, rather than
addressing the issue of overfitting directly. However, regularization is a more robust technique
that can be used to avoid overfitting.

Types of regularization techniques


Regularization is a technique used to address overfitting by modifying the model's training
process (and, in the case of dropout, the effective architecture of the model). The following are
the commonly used regularization techniques:

1. L2 regularization
2. L1 regularization
3. Dropout regularization

Here’s a look at each in detail.

L2 regularization

In regression analysis, L2 regularization is also called ridge regression. In this type of
regularization, the squared magnitude of the coefficients or weights, multiplied by a regularizer
term, is added to the loss or cost function. L2 regularization can be represented with the
following mathematical equation.

Loss = Error(y, ŷ) + λ Σ wi²

In the above equation, λ is the regularizer (penalty) term and wi are the model weights.

You can see that a fraction of the sum of squared values of weights is added to the loss function.
Thus, when gradient descent is applied to the loss, the weight updates tend to be consistent,
giving almost equal emphasis to all features. You can observe the following:

• Lambda is the hyperparameter that is tuned to prevent overfitting i.e. penalize the
insignificant weights by forcing them to be small but not zero.
• L2 regularization works best when all the weights are roughly of the same size, i.e., input
features are of the same range.
• This technique also helps the model to learn more complex patterns from data without
overfitting easily.
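Most frameworks expose L2 regularization directly. Below is a minimal Keras sketch (the layer sizes and the 0.01 strength are illustrative assumptions): the kernel_regularizer argument adds lambda * sum(w**2) for that layer's weights to the training loss.

from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    # L2 penalty 0.01 * sum(w**2) on this layer's weights is added to the loss
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(0.01), input_shape=(20,)),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')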

L1 regularization

L1 regularization is also referred to as lasso regression. In this type of regularization, the
absolute value of the magnitude of the coefficients or weights, multiplied by a regularizer term,
is added to the loss or cost function. It can be represented with the following equation.

Loss = Error(y, ŷ) + λ Σ |wi|

In the above equation, λ is the regularizer (penalty) term and wi are the model weights.

A fraction of the sum of absolute values of weights is added to the loss function in L1
regularization. In this way, you can eliminate some coefficients with smaller values by
pushing those values towards 0. You can observe the following by using L1 regularization:

• Since the L1 regularization adds an absolute value as a penalty to the cost function, the
feature selection will be done by retaining only some important features and eliminating
the lower or unimportant features.
• This technique is also robust to outliers, i.e., the model is not thrown off as easily by
outliers in the dataset.
• This technique will not be able to learn complex patterns from the input data.
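L1 can be wired in the same way via regularizers.l1, or the penalty can be added to the loss by hand. A hedged PyTorch sketch follows (model, loss_fn, and the 1e-4 strength are assumed placeholders, not from this document):

import torch

def l1_regularized_loss(model, loss_fn, outputs, targets, lam=1e-4):
    # base loss + lambda * sum of absolute weight values
    base = loss_fn(outputs, targets)
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    return base + lam * l1_penalty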

Dropout regularization

Dropout regularization is a technique in which some of the neurons are randomly disabled
during training so that the model is forced to extract more useful, robust features from the data.
This prevents overfitting. You can see the dropout regularization in the following diagram:

• In figure (a), the neural network is fully connected. If all the neurons are trained with the
entire training dataset, some neurons might memorize the patterns occurring in training
data. This leads to overfitting since the model is not generalizing well.
• In figure (b), the neural network is sparsely connected, i.e., only some neurons are active
during the model training. This forces the neurons to extract robust features/patterns from
training data to prevent overfitting.

The following are the characteristics of dropout regularization:

• Dropout randomly disables some percent of neurons in each layer. So for every epoch,
different neurons will be dropped leading to effective learning.
• Dropout is applied by specifying the ‘p’ values, which is the fraction of neurons to be
dropped.
• Dropout reduces the dependencies of neurons on other neurons, resulting in more robust
model behavior.
• Dropout is applied only during the model training phase and is not applied during the
inference phase.
• Because the model receives complete activations at inference time while the layers saw only a
fraction of them during training, the layer outputs 'x' are scaled by 'p' at inference so that
their expected magnitude matches training. (Many implementations instead use "inverted dropout",
scaling by 1/(1 − p) during training so that nothing needs to change at inference.)
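A minimal PyTorch sketch of these points (layer sizes and p=0.5 are illustrative assumptions): nn.Dropout is active in train() mode and becomes a no-op in eval() mode, and since PyTorch implements inverted dropout, no manual scaling is needed at inference.

import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes 50% of activations each forward pass
    nn.Linear(64, 10),
)

x = torch.randn(8, 20)
model.train()            # dropout active: different neurons dropped every pass
out_train = model(x)
model.eval()             # dropout disabled for inference
out_infer = model(x)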
These are some of the most popular regularization techniques that are used to reduce overfitting
during model training. They can be applied according to the use case or dataset being considered
for more accurate model performance on the testing data.

Bias-Variance Tradeoff :


The bias–variance tradeoff is a central problem in supervised learning. Ideally,
one wants to choose a model that both accurately captures the regularities in
its training data, but also generalizes well to unseen data. Unfortunately, it is
typically impossible to do both simultaneously.
It is important to understand prediction errors (bias and variance) when it comes to
accuracy in any machine learning algorithm. There is a tradeoff between a model's ability to
minimize bias and its ability to minimize variance; balancing the two is, for example, what
guides the choice of the regularization constant. A proper understanding of these errors helps
to avoid the overfitting and underfitting of a data set while training the algorithm.
What is Bias?
Bias is the difference between the predictions of the Machine Learning model and the correct
values. High bias gives a large error on training as well as testing data. It is recommended
that an algorithm should always be low-biased to avoid the problem of underfitting. With high
bias, the predicted values follow a straight-line format, thus not fitting the data in the data
set accurately. Such fitting is known as Underfitting of Data. This happens when
the hypothesis is too simple or linear in nature. Refer to the graph given below for an
example of such a situation.
[Figure: High Bias in the Model]

What is Variance?
The variability of model predictions for a given data point, which tells us the spread of
our data, is called the variance of the model. A model with high variance has a very
complex fit to the training data and thus is not able to fit accurately data it
hasn't seen before. As a result, such models perform very well on training data but have
high error rates on test data. When a model has high variance, it is said to be
Overfitting the Data. Overfitting means fitting the training set accurately via a complex
curve and a high-order hypothesis, but it is not the solution, as the error on unseen data is
high. While training a model, variance should be kept low. High variance looks as follows.
[Figure: High Variance in the Model]

Bias Variance Tradeoff


If the algorithm is too simple (a hypothesis with a linear equation) then it may be in a high-
bias, low-variance condition and thus error-prone. If the algorithm fits too complex a
hypothesis (one with a high-degree equation) then it may be in a high-variance, low-bias
condition; in the latter case, the model will not perform well on new entries. There is
something between both of these conditions, known as a Trade-off or Bias-Variance Trade-off.
This tradeoff in complexity is why there is a tradeoff between bias and variance: an
algorithm can't be more complex and less complex at the same time. For the graph, the
perfect tradeoff looks like this.
We try to optimize the value of the total error for the model by using the Bias-Variance
Tradeoff.

The best fit is given by the hypothesis at the tradeoff point. The error-versus-complexity
graph showing the trade-off is given as:
[Figure: Region for the Least Value of Total Error]

This is referred to as the best point chosen for the training of the algorithm which gives
low error in training as well as testing data.
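For a squared-error loss, this tradeoff can be written as the standard decomposition of the expected test error (a well-known identity, stated here for reference):

Total Error = Bias² + Variance + Irreducible Error

Increasing model complexity shrinks the bias term but grows the variance term; the best point described above is where their sum is smallest.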
Early Stopping :-
Early Stopping is a regularization technique for deep neural networks that
stops training when parameter updates no longer yield improvements on a
validation set.
What is Early Stopping?
In Regularization by Early Stopping, we stop training the model when the performance
on the validation set is getting worse: increasing loss, decreasing accuracy, or poorer
scores on the scoring metric. When the error on the training dataset and the
validation dataset is plotted together, both errors decrease with the number of iterations
until the point where the model starts to overfit. After this point, the training error
still decreases but the validation error increases.
So, even if training is continued after this point, early stopping essentially returns the
set of parameters that were used at this point and so is equivalent to stopping training at
that point. So, the final parameters returned will enable the model to have low variance
and better generalization. The model at the time the training is stopped will have a
better generalization performance than the model with the least training error.

Early stopping can be thought of as implicit regularization, in contrast to regularization
via weight decay. The method is also efficient: it works with a smaller amount of training
data, which is not always available, and it requires less training time than other
regularization methods. However, repeating the early stopping process many times may result
in the model overfitting the validation dataset, in much the same way that overfitting
occurs on the training data.
The number of iterations (i.e., epochs) taken to train the model can be considered
a hyperparameter. We then have to find an optimum value for this
hyperparameter (by hyperparameter tuning) for the best performance of the learning
model.
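In Keras, early stopping is available as a built-in callback. A minimal sketch (the patience value and the commented model/data variables are illustrative assumptions):

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor='val_loss',          # watch the validation loss
    patience=5,                  # stop after 5 epochs with no improvement
    restore_best_weights=True,   # return the parameters from the best epoch
)

# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=200, callbacks=[early_stop])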
Benefits of Early Stopping:
• Helps in reducing overfitting
• It improves generalisation
• It requires a smaller amount of training data
• Takes less time compared to other regularisation models
• It is simple to implement
Limitations of Early Stopping:
• If the model stops too early, there is a risk of underfitting
• It may not be beneficial for all types of models
• If the validation set is not chosen properly, it may not lead to the most optimal
stopping point
To summarize, early stopping is best used to prevent overfitting of the model and to save
resources. It gives the best results if a few things are taken care of, such as parameter
tuning and ensuring that the model learns enough from the data before stopping.

Dataset Augmentation : -
Data augmentation is a technique of artificially increasing the training set by
creating modified copies of a dataset using existing data. It includes making
minor changes to the dataset or using deep learning to generate new data
points.
Suppose our model was effectively trained to classify the training data but did not generalize
well to the validation data, i.e., it suffers from overfitting. Now, let's discuss one more
technique to improve the model training process. This technique is known as data augmentation.
It is the process by which we create new data for our model to use during the training process.

This is done by taking our existing dataset and transforming or altering the image in useful ways
to create new images.
After applying the transformations, the newly created images are known as augmented images
because they essentially allow us to augment our dataset by adding new data to it. The data
augmentation technique is useful because it allows our model to look at each image in our
dataset from a variety of different perspectives. This allows our model to extract relevant
features more accurately and to obtain more feature-related data from each training image.

Now our biggest question is how we will use augmentation to reduce overfitting.
Overfitting occurs when our model fits the training set too closely.

There is no need to start collecting new images and adding them to our datasets. We can use data
augmentation, which introduces minor alterations to our existing datasets, such as darker shading,
flips, zooming, rotations, or translations. Our model will interpret them as separate, distinct
images. This not only reduces overfitting but also prevents our network from learning irrelevant
patterns and boosts overall performance. We have the following steps to perform data augmentation:

Step 1:

To perform data augmentation on the training dataset, we have to make a separate transform
statement. For the validation dataset the transform will remain the same. So we first copy our
transform1 statement and treat it as transform_train:

transform_train = transforms.Compose([transforms.Resize((32,32)),
                                      transforms.ToTensor(),
                                      transforms.Normalize((0.5,), (0.5,))])
Step 2:

Now, we will add alterations to our transform_train statement. The alterations will be
RandomHorizontalFlip and RandomRotation, which rotates an image by a certain angle; that
angle is passed as an argument.

transform_train = transforms.Compose([transforms.Resize((32,32)),
                                      transforms.RandomHorizontalFlip(),
                                      transforms.RandomRotation(10),
                                      transforms.ToTensor(),
                                      transforms.Normalize((0.5,), (0.5,))])

To add even more variety to our dataset, we will use an affine transformation. Affine
transformations are simple transformations that preserve straight lines and planes within the
object. Scaling, translation, shear, and zooming are transformations that fit this category.

transform_train = transforms.Compose([transforms.Resize((32,32)),
                                      transforms.RandomHorizontalFlip(),
                                      transforms.RandomRotation(10),
                                      transforms.RandomAffine(0, shear=10, scale=(0.8, 1.2)),
                                      transforms.ToTensor(),
                                      transforms.Normalize((0.5,), (0.5,))])

In RandomAffine(), the first argument is degrees, which we set to zero to deactivate rotation
here; the second argument is the shear transformation; and the last one is the scaling
transformation, which uses a tuple to define the range of zoom we require. We defined a lower and
upper limit of 0.8 and 1.2 to scale images to 80 or 120 percent of their size.

Step 3:

Now, we move on to our next augmentation to create new augmented images with a randomized
variety of brightness, contrast, and saturation. We will add another transformation, ColorJitter:

transform_train = transforms.Compose([transforms.Resize((32,32)),
                                      transforms.RandomHorizontalFlip(),
                                      transforms.RandomRotation(10),
                                      transforms.RandomAffine(0, shear=10, scale=(0.8, 1.2)),
                                      transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
                                      transforms.ToTensor(),
                                      transforms.Normalize((0.5,), (0.5,))])

Step 4:

Before executing our code, we have to change the training_dataset statement because now we have
another transform for the training dataset. So

training_dataset = datasets.CIFAR10(root='./data', train=True, download=True,
                                    transform=transform_train)

Parameter Sharing & Tying :


What are the benefits of parameter sharing in CNNs?
Convolutional Neural Networks have a couple of techniques known as
parameter sharing and parameter tying. Parameter sharing is the method of
sharing weights among all neurons in a particular feature map. This helps
to reduce the number of parameters in the whole system, making it
computationally cheap.
Parameter Tying
Parameter tying is a regularization technique. We divide the parameters or weights of a machine
learning model into groups by leveraging prior knowledge, and all parameters in each group are
constrained to take the same value. In simple terms, we want to express that specific parameters
should be close to each other.
Example
Two models perform the same classification task (with the same set of classes) but with different
input data.
• Model A with parameters wt(A).
• Model B with parameters wt(B).
The two models map the input to two different but related outputs.

Some standard regularizers like L1 and L2 penalize model parameters for deviating from the fixed
value of zero. One of the side effects of Lasso or group-Lasso regularization when learning a Deep
Neural Network is that many of the parameters may become zero, thus reducing the amount of
memory required to store the model and lowering the computational cost of applying it. A
significant drawback of Lasso (or group-Lasso)
regularization is that in the presence of groups of highly correlated features, it tends to select
only one or an arbitrary convex combination of elements from each group. Moreover, the
learning process of Lasso tends to be unstable because the subsets of parameters that end up
selected may change dramatically with minor changes in the data or algorithmic procedure. In
Deep Neural Networks, it is almost unavoidable to encounter correlated features due to the high
dimensionality of the input to each layer and because neurons tend to adapt, producing strongly
correlated features that we pass as an input to the subsequent layer.

The issues we face while using Lasso or group-Lasso are countered by a regularizer, the group
version of the ordered weighted one norm, known as group-OWL (GrOWL). GrOWL
supports sparsity and simultaneously learns which parameters should share a similar value.
GrOWL has been effective in linear regression, identifying and coping with strongly correlated
covariates. Unlike standard sparsity-inducing regularizers (e.g., Lasso), GrOWL eliminates
unimportant neurons by setting all their weights to zero and explicitly identifies strongly
correlated neurons by tying the corresponding weights to an expected value. This ability of
GrOWL motivates the following two-stage procedure:
(i) use GrOWL regularization during training to simultaneously identify significant neurons and
groups of parameters that should be tied together.
(ii) retrain the network, enforcing the structure unveiled in the previous phase, i.e., keeping only
the significant neurons and implementing the learned tying structure.

Parameter Sharing
Parameter sharing forces sets of parameters to be similar: we interpret various models or model
components as sharing a unique set of parameters. As a result, we only need to store a subset of
the parameters in memory.
Suppose two models, A and B, perform a classification task on similar input and output
distributions. In such a case, we'd expect the parameters of both models to be similar to each
other as well. We could impose a norm penalty on the distance between the weights, but a more
popular method is to force the parameters to be equal. Forcing the parameters to be similar is
the essence of parameter sharing. A significant benefit here is that we need to store
only a subset of the parameters (e.g., storing only the parameters for model A instead of both
A and B), which leads to significant memory savings.
Example
The most extensive use of parameter sharing is in convolutional neural networks. Natural images
have specific statistical properties that are robust to translation. For example, a photo of a cat
remains a photo of a cat if it is translated one pixel to the right. Convolutional Neural Networks
exploit this property by sharing parameters across multiple image locations. Thus we can find a
cat with the same cat detector in column i or i+1 of the image.
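A hedged PyTorch sketch of both ideas (layer shapes and the tie_strength value are illustrative assumptions): hard sharing reuses one module in two places so its weights are stored once, while soft tying penalizes the distance between the weights of two models, as described above.

import torch
from torch import nn

# Hard parameter sharing: the same Linear layer (one set of weights)
# is reused by two task heads, so its parameters are stored only once.
shared = nn.Linear(32, 16)
head_a = nn.Sequential(shared, nn.ReLU(), nn.Linear(16, 4))
head_b = nn.Sequential(shared, nn.ReLU(), nn.Linear(16, 2))

# Soft parameter tying: penalize the distance between the weights of two
# models with identical architectures instead of forcing exact equality.
def tying_penalty(model_a, model_b, tie_strength=1e-3):
    return tie_strength * sum(
        (pa - pb).pow(2).sum()
        for pa, pb in zip(model_a.parameters(), model_b.parameters())
    )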

INJECTING NOISE AT INPUT : -


The method of noise injection refers to adding “noise” artificially to the
ANN input data during the training process. Jitter is one particular method of
implementing noise injection. With this method, a noise vector is added to
each training case in between training iterations.
We know that there are mainly two problems when building or training a neural network:
overfitting and underfitting. The concept of overfitting is that the model is overtrained on the
given input samples; this leads to nearly 100% accuracy on the training data but lower accuracy
on the test data. To reduce this problem, we have methodologies such as regularization
techniques. Alongside them, the concept of noise injection is used to reduce the problem of
overfitting.

Noise Injection
The concept of noise injection is simple. We know that a main root cause of the problem of
overfitting is the size of the dataset: if the dataset we are dealing with is too small, our
model will achieve complete accuracy on the training data, but it won't show much accuracy on
the holdout dataset. So, we need to increase the size of the dataset, either by collecting new
data or by adding some noise to the existing data. Collecting new data samples and adding them
to the dataset is a routine but effort-intensive task; thus the concept of injecting noise into
the dataset was developed. The type of noise you are going to add to the dataset is based purely
on the actual dataset.
Why add Noise?
The small-dataset problem is what challenges machine learning to develop this procedure. The main
problem we face with a small dataset is that we have very few samples, so our model will
effectively memorize all of them and work well on the training data. At the same time, since the
model learned from fewer samples, it can't build a good mapping between the input and output
data, resulting in poor outputs for unseen inputs.
You might wonder: why would adding noise improve the model rather than degrade its performance?
The answer lies in how regularization works. Adding noise to a dataset that is causing
overfitting has a regularization effect during training and thus improves model performance.
Thus, adding noise expands the size of the training dataset. Each time a training sample is
exposed to the model, some random noise is added to the input variables, making the sample
different every time it is seen. In this way, adding noise to input samples is a simple form
of data augmentation. The added noise prevents the model from memorizing the samples
efficiently, resulting in a smoother mapping function.

Key Points on adding Noise


The most common noise added during training is Gaussian noise, or white noise, which has a mean
of zero and, in its standard form, a standard deviation of one. The addition of this Gaussian
noise to the inputs of a neural network is called "jitter".
The next and most important point is how much noise to add: too little noise has no effect,
while too much makes the mapping too difficult to learn.
The main advantage of Gaussian noise is that we can inspect the standard deviation of the random
noise and use it to control the amount of spread.
The major point is that noise should be added only during the training stage; besides the inputs,
noise can also be added to activations, weights, gradients, and outputs.
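Keras ships a layer for exactly this kind of jitter. A minimal sketch (the layer sizes and the 0.1 standard deviation are illustrative assumptions); GaussianNoise is active only during training, matching the point above.

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.GaussianNoise(0.1, input_shape=(20,)),  # zero-mean noise on the inputs
    layers.Dense(64, activation='relu'),
    layers.Dense(1),
])
# The noise layer perturbs inputs during fit() only; it is a no-op at inference.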

Noise Injection is a technique used in machine learning and deep learning models to improve
their generalization capabilities and robustness. It involves adding random noise to the input data
or the model’s internal layers during training. This process helps the model learn more complex
and diverse patterns, ultimately leading to better performance on unseen data. Noise injection is
particularly useful in scenarios where the training data is limited or noisy, as it can help
prevent overfitting and improve the model’s ability to generalize to new data.

Overview
In machine learning, models are trained to learn patterns from the input data and make
predictions or decisions based on those patterns. However, when the training data is limited or
contains noise, the model may learn to fit the noise rather than the underlying patterns, leading to
overfitting. Overfitting occurs when a model performs well on the training data but poorly on
new, unseen data.

Noise injection is a regularization technique that helps mitigate overfitting by introducing


random noise into the input data or the model’s internal layers during training. This forces the
model to learn more complex and diverse patterns, making it more robust and better able to
generalize to new data. Noise injection can be applied in various ways, including adding noise to
the input data, weights, activations, or gradients.

Types of Noise Injection


Input Noise

Input noise is added directly to the input data during training. This can be done by adding
Gaussian noise, uniform noise, or other types of random noise to the input features. The added
noise makes it harder for the model to memorize the training data, forcing it to learn more
general patterns. Input noise is particularly useful when the input data is noisy or when the model
is prone to overfitting.

Weight Noise

Weight noise is added to the model’s weights during training. This can be done by adding
Gaussian noise, uniform noise, or other types of random noise to the weights before each update.
Weight noise helps regularize the model by preventing it from relying too much on any single
weight or feature. This can improve the model’s generalization capabilities and make it more
robust to changes in the input data.

Activation Noise

Activation noise is added to the model’s activations (i.e., the outputs of each layer) during
training. This can be done by adding Gaussian noise, uniform noise, or other types of random
noise to the activations before they are passed to the next layer. Activation noise helps the model
learn more complex and diverse patterns by introducing randomness into the model’s internal
representations. This can improve the model’s generalization capabilities and make it more
robust to changes in the input data.

Gradient Noise

Gradient noise is added to the gradients during the optimization process. This can be done by
adding Gaussian noise, uniform noise, or other types of random noise to the gradients before
they are used to update the model’s weights. Gradient noise helps regularize the model by
introducing randomness into the optimization process, making it harder for the model to
converge to a single solution. This can improve the model’s generalization capabilities and make
it more robust to changes in the input data.
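As a sketch of the gradient-noise variant in PyTorch (the model, optimizer, loss, and the 0.01 scale are assumed placeholders), the noise is injected after backpropagation and before the weight update:

import torch

def noisy_update(model, optimizer, loss):
    optimizer.zero_grad()
    loss.backward()
    for p in model.parameters():
        if p.grad is not None:
            # add zero-mean Gaussian noise to every gradient
            p.grad += 0.01 * torch.randn_like(p.grad)
    optimizer.step()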

Applications
Noise injection has been successfully applied in various machine learning and deep learning
tasks, including image classification, natural language processing, and reinforcement learning. It
has been shown to improve the performance of models in scenarios where the training data is
limited or noisy, as well as in cases where the model is prone to overfitting.
In addition to its regularization benefits, noise injection can also be used as a form of data
augmentation, especially in image classification tasks. By adding noise to the input images
during training, the model is exposed to a wider range of variations, which can help improve its
ability to generalize to new, unseen data.

Overall, noise injection is a valuable technique for improving the generalization capabilities and
robustness of machine learning and deep learning models, making it an essential tool for data
scientists working with limited or noisy data.

Ensemble Methods:
Ensemble learning is a machine learning technique that enhances accuracy
and resilience in forecasting by merging predictions from multiple models. It
aims to mitigate errors or biases that may exist in individual models by
leveraging the collective intelligence of the ensemble.

Ensemble methods are techniques that create multiple models and then combine them to produce
improved results. Ensemble methods in machine learning usually produce more accurate
solutions than a single model would. This has been the case in a number of machine learning
competitions, where the winning solutions used ensemble methods. In the popular Netflix
Competition, the winner used an ensemble method to implement a powerful collaborative
filtering algorithm. Another example is KDD 2009, where the winner also used ensembling. You
can also find winners who used these methods in Kaggle competitions, for example the winner of
the CrowdFlower competition.
It is important that we understand a few terminologies before we continue with this article.
Throughout the article I use the term "model" to describe the output of an algorithm trained
on data. This model is then used for making predictions. The algorithm can be
any machine learning algorithm such as logistic regression, decision tree, etc. These models,
when used as inputs of ensemble methods, are called "base models," and the end result is an
ensemble model.
In this blog post I will cover ensemble methods for classification and describe some widely
known methods of ensemble: voting, stacking, bagging and boosting.

Voting and Averaging Based Ensemble Methods


Voting and averaging are two of the easiest examples of ensemble learning in machine learning.
They are both easy to understand and implement. Voting is used for classification and averaging
is used for regression.

In both methods, the first step is to create multiple classification/regression models using some
training dataset. Each base model can be created using different splits of the same training
dataset and same algorithm, or using the same dataset with different algorithms, or any other
method.

Majority Voting

Every model makes a prediction (votes) for each test instance and the final output prediction is
the one that receives more than half of the votes. If none of the predictions get more than half of
the votes, we may say that the ensemble method could not make a stable prediction for this
instance. Although this is one of the more popular ensemble techniques, you may try the most
voted prediction (even if that is less than half of the votes) as the final prediction. In some
articles, you may see this method being called “plurality voting”.

Weighted Voting

Unlike majority voting, where each model has the same rights, we can increase the importance of
one or more models. In weighted voting you count the prediction of the better models multiple
times. Finding a reasonable set of weights is up to you.
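scikit-learn bundles both schemes in one estimator. A minimal sketch (the choice of base models and weights is an illustrative assumption): voting='hard' performs majority voting, and the weights turn it into weighted voting.

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

ensemble = VotingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('dt', DecisionTreeClassifier()),
                ('nb', GaussianNB())],
    voting='hard',      # majority (plurality) voting
    weights=[2, 1, 1],  # count the logistic regression vote twice
)
# ensemble.fit(X_train, y_train); ensemble.predict(X_test)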

Simple Averaging
In the simple averaging method, for every instance of the test dataset, the average of the
predictions is calculated. This method often reduces overfitting and creates a smoother
regression model. The following pseudocode shows the simple averaging method:

import numpy as np

# predictions: array of shape (n_samples, n_models)
final_predictions = []

for row_number in range(len(predictions)):

    final_predictions.append(
        np.mean(predictions[row_number, :])
    )

Weighted Averaging

Weighted averaging is a slightly modified version of simple averaging, where the prediction of
each model is multiplied by its weight and then the weighted average is calculated. The following
pseudocode shows weighted averaging:

weights = [..., ..., ...]  # length is equal to len(algorithms)

final_predictions = []

for row_number in range(len(predictions)):

    final_predictions.append(
        np.average(predictions[row_number, :], weights=weights)
    )

Stacking Multiple Machine Learning Models

Stacking, also known as stacked generalization, is an ensemble method where the models are
combined using another machine learning algorithm. The basic idea is to train machine learning
algorithms with the training dataset and then generate a new dataset from these models'
predictions. This new dataset is then used as input for the combiner machine learning algorithm.
The pseudocode of a stacking procedure is summarized as below:

base_algorithms = [logistic_regression, decision_tree_classification, ...]  # for classification

stacking_train_dataset = np.zeros((len(target), len(base_algorithms)))

stacking_test_dataset = np.zeros((len(test), len(base_algorithms)))

for i, base_algorithm in enumerate(base_algorithms):

    stacking_train_dataset[:, i] = base_algorithm.fit(train, target).predict(train)

    stacking_test_dataset[:, i] = base_algorithm.predict(test)

final_predictions = combiner_algorithm.fit(stacking_train_dataset, target).predict(stacking_test_dataset)

As you can see in the above pseudocode, the training dataset for the combiner algorithm is
generated using the outputs of the base algorithms. In the pseudocode, each base algorithm is
trained on the training dataset, and then the same dataset is used again to make predictions.
But as we know, in the real world we do not predict on the same data we trained on, so to
overcome this problem you may see some implementations of stacking where the training dataset
is split. Below you can see a pseudocode where the training dataset is split before training
the base algorithms:

base_algorithms = [logistic_regression, decision_tree_classification, ...]  # for classification

stacking_train_dataset = np.zeros((len(target), len(base_algorithms)))

stacking_test_dataset = np.zeros((len(test), len(base_algorithms)))

for i, base_algorithm in enumerate(base_algorithms):

    for trainix, testix in split(train, k=10):  # you may use sklearn.model_selection.KFold

        stacking_train_dataset[testix, i] = base_algorithm.fit(train[trainix], target[trainix]).predict(train[testix])

    stacking_test_dataset[:, i] = base_algorithm.fit(train, target).predict(test)

final_predictions = combiner_algorithm.fit(stacking_train_dataset, target).predict(stacking_test_dataset)

Bootstrap Aggregating

The name Bootstrap Aggregating, also known as “Bagging”, summarizes the key elements of
this strategy. In the bagging algorithm, the first step involves creating multiple models. These
models are generated using the same algorithm on random sub-samples of the dataset, which
are drawn from the original dataset randomly with the bootstrap sampling method. In bootstrap
sampling, some original examples appear more than once and some original examples are not
present in the sample. If you want to create a sub-dataset with m elements, you should select a
random element from the original dataset m times. And if the goal is to generate n datasets, you
follow this step n times.
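A quick numpy sketch of bootstrap sampling as just described (the toy dataset is an illustrative assumption): each sub-dataset draws m indices with replacement, so some rows repeat and others never appear.

import numpy as np

rng = np.random.default_rng(0)
dataset = np.arange(100).reshape(50, 2)  # toy dataset of 50 rows

m, n = len(dataset), 5                   # m draws per sample, n sub-datasets
sub_datasets = []
for _ in range(n):
    idx = rng.integers(0, m, size=m)     # sample m row indices with replacement
    sub_datasets.append(dataset[idx])    # some rows repeat, some are left out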

Ensemble learning helps improve machine learning results by combining several


models. This approach allows the production of better predictive performance
compared to a single model. The basic idea is to learn a set of classifiers (experts) and to
allow them to vote.
Advantage : Improvement in predictive accuracy.
Disadvantage : It is difficult to understand an ensemble of classifiers.

Why do ensembles work?


Main Challenge for Developing Ensemble Models?
The main challenge is not to obtain highly accurate base models, but rather to obtain
base models which make different kinds of errors. For example, if ensembles are used
for classification, high accuracies can be accomplished if different base models
misclassify different training examples, even if the base classifier accuracy is low.
Methods for Independently Constructing Ensembles –
• Majority Vote
• Bagging and Random Forest
• Randomness Injection
• Feature-Selection Ensembles
• Error-Correcting Output Coding
Methods for Coordinated Construction of Ensembles –
• Boosting
• Stacking
Reliable Classification: Meta-Classifier Approach
Co-Training and Self-Training
Types of Ensemble Classifier –
Bagging:
Bagging (Bootstrap Aggregation) is used to reduce the variance of a decision tree.
Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled
with replacement from D (i.e., a bootstrap sample). Then a classifier model Mi is learned for
each training set Di. Each classifier Mi returns its class prediction. The bagged classifier
M* counts the votes and assigns the class with the most votes to X (an unknown sample).
Implementation steps of Bagging –
1. Multiple subsets are created from the original data set with equal tuples,
selecting observations with replacement.
2. A base model is created on each of these subsets.
3. Each model is learned in parallel from each training set and independent of
each other.
4. The final predictions are determined by combining the predictions from all
the models.
Random Forest:
Random Forest is an extension over bagging. Each classifier in the ensemble is a
decision tree classifier and is generated using a random selection of attributes at
each node to determine the split. During classification, each tree votes and the
most popular class is returned.
Implementation steps of Random Forest –
1. Multiple subsets are created from the original data set, selecting
observations with replacement.
2. A subset of features is selected randomly and whichever feature gives
the best split is used to split the node iteratively.
3. Each tree is grown to its largest extent.
4. Repeat the above steps and prediction is given based on the
aggregation of predictions from n number of trees.
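Both ensemble types are available directly in scikit-learn. A minimal sketch (the dataset and hyperparameters are illustrative assumptions; the estimator argument was named base_estimator in older scikit-learn versions):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Bagging: bootstrap samples + full-feature decision trees
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=50).fit(X, y)

# Random Forest: bagging + a random subset of features at each split
forest = RandomForestClassifier(n_estimators=50).fit(X, y)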
Greedy Layer-wise Pre-Training
Introduction
Let us get to the basics of training a neural network. While training a neural network, we
perform forward propagation, calculate the cost of the network, and then use the total
calculated cost to backpropagate through the layers and update the weights; this process is
repeated until a minimum of the cost is reached. This technique is efficient for small neural
networks (networks with a small number of hidden layers).
In the case of large neural networks, traditional training methods are not efficient, as they
lead to the vanishing gradient problem.
Vanishing gradient problem: when layers use certain activation functions, the gradients of the
loss function tend toward zero during backpropagation, making the network hard to train.
Due to this problem, the weights near the input layer are not updated; only the weights near
the output layer get updated.
Mechanism of Greedy Layer-Wise Pre-Training
To solve the vanishing gradient problem, we use this technique. The mechanism of the greedy
layer-wise pretraining method is as follows (a sketch in code follows this list):
1. Make a base model of the input and output layer and train the model using the available
dataset.
2. Remove the output layer and store it in another variable. Add a new hidden layer in the
model (this will be the first hidden layer) and re-add the output layer. Now there are three
layers in the model: the input layer, hidden layer 1, and the output layer. Train the model
again after inserting hidden layer 1.
3. To add one more hidden layer, remove the output layer, set all existing layers as
non-trainable (no further change in the weights of the input layer and hidden layer 1), insert
the new hidden layer 2, and re-add the output layer. Train the model after inserting the new
hidden layer. The model structure is now, in order: input layer, hidden layer 1, hidden
layer 2, output layer.
4. Repeat the above steps for every new hidden layer you want to add, training the model on
the same dataset each time a new hidden layer is inserted.
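A hedged Keras sketch of one such growth step (the layer width, activation, and training settings are illustrative assumptions, following the common Sequential pop/re-add pattern):

from tensorflow.keras import layers

def add_layer_and_retrain(model, X, y):
    output_layer = model.layers[-1]   # remember the output layer
    model.pop()                       # remove it
    for layer in model.layers:
        layer.trainable = False       # freeze the already-trained layers
    # new hidden layer; its width must match what the output layer expects
    model.add(layers.Dense(10, activation='relu'))
    model.add(output_layer)           # re-add the stored output layer
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    model.fit(X, y, epochs=100, verbose=0)  # retrain on the same dataset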
Multiclass Classification problem
To understand greedy layer-wise pre-training, we will build a classification model.
The dataset includes two input features and one output. The output is classified into four
categories. The two input features represent the X and Y coordinates of each point. Every
cluster of points in the dataset has a standard deviation of 2.0, and the random state is set
to 2.
We will use the sklearn library for creating the dataset. The make_blobs() function is used to
make the data points, and Matplotlib is used to plot the dataset.
Supervised Greedy Layer-Wise Pretraining
After creating the dataset, we will be preparing the deep multilayer perceptron(MLP) model. We
will implement greedy layer-wise supervised learning for preparing the MLP model.
We do not require pretraining to address this simple predictive modeling problem. The main aim
behind implementing the model is to build a supervised greedy layer-wise pretraining procedure
that can be used as a standard template and reused on larger datasets.
We will be using the dataset that was created earlier. To train a model using keras sequential, we
must convert the output data to one-hot encoding. For that purpose, we will use the
to_categorical() function from TensorFlow to convert the y into a one-hot encoding format.
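Putting those two steps together, a minimal sketch (the sample count is an illustrative assumption; centers=4, cluster_std=2.0, and random_state=2 follow the description above):

from sklearn.datasets import make_blobs
from tensorflow.keras.utils import to_categorical

# 2 input features, 4 output classes, std 2.0, random state 2
X, y = make_blobs(n_samples=1000, centers=4, n_features=2,
                  cluster_std=2.0, random_state=2)
y_onehot = to_categorical(y)   # one-hot encoding for Keras training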

Activation functions

In the process of building a neural network, one of the choices you get to make is
what Activation Function to use in the hidden layer as well as at the output layer of the
network. This article discusses some of the choices.
Elements of a Neural Network
Input Layer: This layer accepts input features. It provides information from the outside
world to the network, no computation is performed at this layer, nodes here just pass on
the information(features) to the hidden layer.
Hidden Layer: Nodes of this layer are not exposed to the outer world, they are part of
the abstraction provided by any neural network. The hidden layer performs all sorts of
computation on the features entered through the input layer and transfers the result to
the output layer.
Output Layer: This layer brings the information learned by the network to the outer
world.
What is an activation function and why use them?
The activation function decides whether a neuron should be activated or not by
calculating the weighted sum and further adding bias to it. The purpose of the activation
function is to introduce non-linearity into the output of a neuron.
Explanation: We know, the neural network has neurons that work in correspondence
with weight, bias, and their respective activation function. In a neural network, we
would update the weights and biases of the neurons on the basis of the error at the
output. This process is known as back-propagation. Activation functions make the
back-propagation possible since the gradients are supplied along with the error to
update the weights and biases.
Why do we need Non-linear activation function?
A neural network without an activation function is essentially just a linear regression
model. The activation function does the non-linear transformation to the input making it
capable to learn and perform more complex tasks.

Variants of Activation Function


Linear Function
• Equation : A linear function has an equation similar to that of a straight line,
i.e. y = x
• No matter how many layers we have, if all are linear in nature, the final
activation function of the last layer is nothing but a linear function of the
input of the first layer.
• Range : -inf to +inf
• Uses : The linear activation function is used in just one place, i.e. the output layer.
• Issues : The derivative of a linear function is a constant, so the gradient no
longer depends on the input "x"; backpropagation therefore cannot introduce any
input-dependent updates, and stacking such layers brings nothing new to our algorithm.
For example: Calculating the price of a house is a regression problem. House prices may
take any big or small value, so we can apply linear activation at the output layer. Even in
this case, the neural net must have a non-linear function at the hidden layers.
Sigmoid Function
• It is a function which is plotted as ‘S’ shaped graph.
• Equation : A = 1/(1 + e^(-x))
• Nature : Non-linear. Notice that for X values between -2 and 2, the Y values are
very steep. This means small changes in x bring about large changes in the value of Y.
• Value Range : 0 to 1
• Uses : Usually used in output layer of a binary classification, where result is
either 0 or 1, as value for sigmoid function lies between 0 and 1 only so,
result can be predicted easily to be 1 if value is greater
than 0.5 and 0 otherwise.
Tanh Function
• The activation that works almost always better than the sigmoid function
is the Tanh function, also known as the Tangent Hyperbolic function. It is actually a
mathematically shifted version of the sigmoid function. Both are similar and
can be derived from each other.
• Equation :- A = tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)) = 2/(1 + e^(−2x)) − 1
• Value Range :- -1 to +1
• Nature :- non-linear
• Uses :- Usually used in hidden layers of a neural network, as its values lie
between -1 and 1; hence the mean for the hidden layer comes out to be 0 or very
close to it, which helps in centering the data by bringing the mean close to 0.
This makes learning for the next layer much easier.
RELU Function

• It stands for Rectified Linear Unit. It is the most widely used activation
function, chiefly implemented in the hidden layers of neural networks.
• Equation :- A(x) = max(0,x). It gives an output x if x is positive and 0
otherwise.
• Value Range :- [0, inf)
• Nature :- non-linear, which means we can easily backpropagate the errors
and have multiple layers of neurons being activated by the ReLU function.
• Uses :- ReLU is less computationally expensive than tanh and sigmoid
because it involves simpler mathematical operations. At a time only a few
neurons are activated, making the network sparse and hence efficient and easy
for computation.
In simple words, RELU learns much faster than sigmoid and Tanh function.
Softmax Function
The softmax function is also a type of sigmoid function but is handy when we are trying
to handle multi-class classification problems.
• Nature :- non-linear
• Uses :- Usually used when trying to handle multiple classes. The softmax
function is commonly found in the output layer of image classification
problems. The softmax function squeezes the outputs for each class
between 0 and 1 and also divides by the sum of the outputs.
• Output:- The softmax function is ideally used in the output layer of the
classifier where we are actually trying to attain the probabilities to define the
class of each input.
• The basic rule of thumb is if you really don’t know what activation function
to use, then simply use RELU as it is a general activation function in hidden
layers and is used in most cases these days.
• If your output is for binary classification then the sigmoid function is a very
natural choice for the output layer.
• If your output is for multi-class classification then Softmax is very useful for
predicting the probabilities of each class.

Activation Functions

To put it in simple terms, an artificial neuron calculates the 'weighted sum' of its inputs and
adds a bias, giving the net input.

Mathematically, net input = Σ (weight × input) + bias

Now the value of the net input can be anything from -inf to +inf. The neuron doesn't
really know how to bound the value and thus is not able to decide the firing pattern. Thus
the activation function is an important part of an artificial neural network: it basically
decides whether a neuron should be activated or not, bounding the value of the net
input. The activation function is a non-linear transformation that we apply to the input
before sending it to the next layer of neurons or finalizing it as output. Types of
Activation Functions – several different types of activation functions are used in Deep
Learning. Some of them are explained below:
1. Step Function: The step function is one of the simplest kinds of activation
functions. In this, we consider a threshold value, and if the value of the net input,
say y, is greater than the threshold, then the neuron is activated. Mathematically,

   f(x) = 1 if x >= threshold, else f(x) = 0

2. Sigmoid Function: The sigmoid function is a widely used activation function. It is
defined as:

   f(x) = 1/(1 + e^(-x))

   This is a smooth function and is continuously differentiable. The biggest advantage it
has over the step and linear functions is that it is non-linear. This is an incredibly
useful feature of the sigmoid function. It essentially means that when multiple neurons
have the sigmoid function as their activation function, the output is non-linear as well.
The function ranges from 0 to 1, having an S shape.

3. ReLU: The ReLU function is the Rectified Linear Unit. It is the most widely used
activation function. It is defined as:

   f(x) = max(0, x)

   The main advantage of using the ReLU function over other activation functions is that it
does not activate all the neurons at the same time. What does this mean? If you look at the
ReLU function, if the input is negative it is converted to zero and the neuron does not get
activated.

4. Leaky ReLU: The Leaky ReLU function is nothing but an improved version of the ReLU
function. Instead of defining the ReLU function as 0 for x less than 0, we define it as a
small linear component of x. It can be defined as:

   f(x) = x for x >= 0, and f(x) = 0.01x for x < 0
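For concreteness, here is a minimal numpy sketch of the activations discussed above (the 0.01 leak factor is the conventional choice, assumed here):

import numpy as np

def step(x, threshold=0.0):
    return np.where(x >= threshold, 1.0, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()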

Weight Initialization Techniques for Deep Neural


Networks

While building and training neural networks, it is crucial to initialize the weights
appropriately to ensure a model with high accuracy. If the weights are not correctly
initialized, it may give rise to the Vanishing Gradient problem or the Exploding Gradient
problem. Hence, selecting an appropriate weight initialization strategy is critical when
training DL models. In this article, we will learn some of the most common weight
initialization techniques, along with their implementation in Python
using Keras in TensorFlow.
As prerequisites, the readers of this article are expected to have a basic knowledge of
weights, biases and activation functions. In order to understand what these are and what
role they play in Deep Neural Networks, you are advised to read through the
article Deep Neural Network With L - Layers.
Terminology or Notations
Following notations must be kept in mind while understanding the Weight Initialization
Techniques. These notations may vary at different publications. However, the ones used
here are the most common, usually found in research papers.
fan_in = Number of input paths towards the neuron
fan_out = Number of output paths towards the neuron
Example: Consider the following neuron as a part of a Deep Neural Network.

For the above neuron,


fan_in = 3 (Number of input paths towards the neuron)
fan_out = 2 (Number of output paths from the neuron)
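One way to see fan_in and fan_out concretely: for a fully connected layer, they can be
read off the shape of the weight matrix. A small illustrative sketch (the 3×2 shape is an
assumption chosen to match the example neuron above):

import numpy as np

# Weight matrix of a dense layer with 3 inputs and 2 outputs,
# matching the example neuron above
W = np.zeros((3, 2))
fan_in, fan_out = W.shape
print(fan_in, fan_out)  # 3 2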

Weight Initialization Techniques


1. Zero Initialization

As the name suggests, in zero initialization all the weights are assigned zero as their
initial value. This kind of initialization is highly ineffective because all neurons
learn the same feature during each iteration. The same issue occurs with any kind of
constant initialization. Thus, constant initializations are not preferred.
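For reference, a minimal sketch of zero initialization in Keras (the layer size here is
an arbitrary assumption):

import tensorflow as tf

# A Dense layer whose weights and bias all start at zero --
# every neuron computes the same output and learns the same feature.
layer = tf.keras.layers.Dense(
    units=32,
    kernel_initializer=tf.keras.initializers.Zeros(),
    bias_initializer=tf.keras.initializers.Zeros(),
)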

2. Random Initialization

In an attempt to overcome the shortcomings of zero or constant initialization, random
initialization assigns random non-zero values as weights to neuron paths. However, when
values are assigned to the weights randomly, problems such as overfitting, the vanishing
gradient problem, and the exploding gradient problem might occur.
Random Initialization can be of two kinds:
• Random Normal
• Random Uniform
a) Random Normal: The weights are initialized from values drawn from a normal distribution.

b) Random Uniform: The weights are initialized from values drawn from a uniform distribution.
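A minimal Keras sketch of both variants (the mean, standard deviation, and range shown
are Keras's defaults, stated here as assumptions):

import tensorflow as tf

# a) Random Normal: weights drawn from a normal distribution
random_normal = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.05)

# b) Random Uniform: weights drawn from a uniform distribution
random_uniform = tf.keras.initializers.RandomUniform(minval=-0.05, maxval=0.05)

layer_a = tf.keras.layers.Dense(32, kernel_initializer=random_normal)
layer_b = tf.keras.layers.Dense(32, kernel_initializer=random_uniform)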

3. Xavier/Glorot Initialization

In Xavier/Glorot weight initialization, the weights are assigned from values of a
uniform distribution as follows:

    W ~ U[ −√(6 / (fan_in + fan_out)), +√(6 / (fan_in + fan_out)) ]

Xavier/Glorot Initialization, often termed Xavier Uniform Initialization, is suitable
for layers where the activation function used is Sigmoid.

4. Normalized Xavier/Glorot Initialization

In Normalized Xavier/Glorot weight initialization, the weights are assigned from values
of a normal distribution as follows:

    W ~ N(0, σ²)

Here, σ is given by:

    σ = √(2 / (fan_in + fan_out))

Normalized Xavier/Glorot Initialization, too, is suitable for layers where the
activation function used is Sigmoid.
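In Keras, these correspond (as a sketch, assuming the standard Glorot formulas above) to
the built-in GlorotUniform and GlorotNormal initializers:

import tensorflow as tf

# Glorot/Xavier uniform: U[-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))]
xavier_uniform = tf.keras.initializers.GlorotUniform()

# Glorot/Xavier normal: sigma = sqrt(2/(fan_in+fan_out))
# (Keras actually draws from a truncated normal with this stddev)
xavier_normal = tf.keras.initializers.GlorotNormal()

layer = tf.keras.layers.Dense(32, activation='sigmoid',
                              kernel_initializer=xavier_uniform)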

5. He Uniform Initialization

In He Uniform weight initialization, the weights are assigned from values of a uniform
distribution as follows:

    W ~ U[ −√(6 / fan_in), +√(6 / fan_in) ]

He Uniform Initialization is suitable for layers where the ReLU activation function is used.

6. He Normal Initialization

In He Normal weight initialization, the weights are assigned from values of a normal
distribution as follows:

    W ~ N(0, σ²)

Here, σ is given by:

    σ = √(2 / fan_in)

He Normal Initialization, too, is suitable for layers where the ReLU activation function
is used.
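A corresponding Keras sketch (assuming the standard He formulas above):

import tensorflow as tf

# He uniform: U[-sqrt(6/fan_in), +sqrt(6/fan_in)]
he_uniform = tf.keras.initializers.HeUniform()

# He normal: sigma = sqrt(2/fan_in)
# (Keras draws from a truncated normal with this stddev)
he_normal = tf.keras.initializers.HeNormal()

layer = tf.keras.layers.Dense(32, activation='relu',
                              kernel_initializer=he_normal)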

Batch normalization:-
Batch normalization works by normalizing the output of a previous activation layer by
subtracting the batch mean and dividing by the batch standard deviation. After this
step, the result is then scaled and shifted by two learnable parameters, gamma and
beta, which are unique to each layer. This process allows the model to maintain the
mean activation close to 0 and the activation standard deviation close to 1.

The normalization step is as follows:

1. Calculate the mean and variance of the activations for each feature in a mini-
batch.
2. Normalize the activations of each feature by subtracting the mini-batch mean
and dividing by the mini-batch standard deviation.
3. Scale and shift the normalized values using the learnable parameters gamma
and beta, which allow the network to undo the normalization if that is what the
learned behavior requires.
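In symbols, a standard formulation of these three steps (for a mini-batch B of size m,
with a small constant ε for numerical stability) is:

    μ_B = (1/m) Σᵢ xᵢ
    σ_B² = (1/m) Σᵢ (xᵢ − μ_B)²
    x̂ᵢ = (xᵢ − μ_B) / √(σ_B² + ε)
    yᵢ = γ · x̂ᵢ + β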

Batch normalization is typically applied before the activation function in a network


layer, although some variations may apply it after the activation function.

Benefits of Batch Normalization


Batch normalization offers several benefits to the training process of deep neural
networks:

• Improved Optimization: It allows the use of higher learning rates, speeding
up the training process by reducing the need for careful tuning of parameters.
• Regularization: It adds a slight noise to the activations, similar to dropout.
This can help to regularize the model and reduce overfitting.
• Reduced Sensitivity to Initialization: It makes the network less sensitive to
the initial starting weights.
• Allows Deeper Networks: By reducing internal covariate shift, batch
normalization allows for the training of deeper networks.

Batch Normalization During Inference


While batch normalization is straightforward to apply during training, it requires
special consideration during inference. Since the mini-batch mean and variance are
not available during inference, the network uses the moving averages of these
statistics that were computed during training. This ensures that the normalization is
consistent and the network's learned behavior is maintained.
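As a rough sketch of the bookkeeping involved (the momentum value and feature count are
illustrative assumptions, not taken from a specific framework):

import numpy as np

class BatchNormStats:
    """Tracks moving averages of batch statistics for use at inference time."""
    def __init__(self, num_features, momentum=0.9):  # momentum value is illustrative
        self.momentum = momentum
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def train_step(self, x, gamma, beta, eps=1e-5):
        # Normalize with this batch's statistics and update the moving averages
        mu, var = x.mean(axis=0), x.var(axis=0)
        self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
        self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        return gamma * (x - mu) / np.sqrt(var + eps) + beta

    def inference(self, x, gamma, beta, eps=1e-5):
        # Normalize with the stored moving averages instead of batch statistics
        return gamma * (x - self.running_mean) / np.sqrt(self.running_var + eps) + beta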

Challenges and Considerations


Despite its benefits, batch normalization is not without challenges:

• Dependency on Mini-Batch Size:

The effectiveness of batch normalization can depend on the size of the mini-
batch. Very small batch sizes can lead to inaccurate estimates of the mean and
variance, which can destabilize the training process.
• Computational Overhead: Batch normalization introduces additional
computations and parameters into the network, which can increase the
complexity and computational cost.
• Sequence Data:

Applying batch normalization to recurrent neural networks and other


architectures that handle sequence data can be less straightforward and may
require alternative approaches.

One of the most common problems data science professionals face is avoiding overfitting.
Have you come across a situation where your model performed very well on the training
data but was unable to predict the test data accurately? The reason is that your model
is overfitting. The solution to such a problem is regularization.
The regularization techniques help to improve a model and allow it to converge faster.
We have several regularization tools at our end; some of them are early stopping,
dropout, weight initialization techniques, and batch normalization. Regularization helps
prevent overfitting of the model, and the learning process becomes more efficient.

What is Batch Normalization?

Before entering into Batch normalization let’s understand the term “Normalization”.

Normalization is a data pre-processing tool used to bring the numerical data to a common

scale without distorting its shape.

Generally, when we input the data to a machine or deep learning algorithm we tend to

change the values to a balanced scale. The reason we normalize is partly to ensure that our

model can generalize appropriately.

Now coming back to batch normalization, it is a process to make neural networks faster
and more stable through adding extra layers in a deep neural network. The new layer
performs the standardizing and normalizing operations on the input of a layer coming
from a previous layer.

But what is the reason behind the term "Batch" in batch normalization? A typical neural
network is trained using a collected set of input data called a batch. Similarly, the
normalizing process in batch normalization takes place in batches, not as a single
input.

Let's understand this through an example: we have a deep neural network as shown in the
following image.

Initially, our inputs X1, X2, X3, X4 are in normalized form, as they are coming from the
pre-processing stage. When the input passes through the first layer, it transforms, as a
sigmoid function is applied over the dot product of input X and the weight matrix W.
Similarly, this transformation will take place for the second layer and go on till the
last layer L, as shown in the following image.

Although our input X was normalized, with time the output will no longer be on the same
scale. As the data goes through multiple layers of the neural network and L activation
functions are applied, it leads to an internal covariate shift in the data.

How does Batch Normalization work?

Since by now we have a clear idea of why we need batch normalization, let's understand
how it works. It is a two-step process: first, the input is normalized, and later
rescaling and offsetting are performed.

Normalization of the Input


Normalization is the process of transforming the data to have a mean of zero and a
standard deviation of one. In this step we have our batch input from layer h; first, we
need to calculate the mean of this hidden activation:

    μ = (1/m) Σᵢ hᵢ

Here, m is the number of neurons at layer h.

Once we have the mean, the next step is to calculate the standard deviation of the
hidden activations:

    σ = √( (1/m) Σᵢ (hᵢ − μ)² )

Further, as we have the mean and the standard deviation ready, we will normalize the
hidden activations using these values. For this, we will subtract the mean from each
input and divide the whole value by the sum of the standard deviation and the smoothing
term (ε):

    h(norm)ᵢ = (hᵢ − μ) / (σ + ε)

The smoothing term (ε) assures numerical stability within the operation by preventing a
division by a zero value.

Rescaling and Offsetting

In the final operation, rescaling and offsetting of the input take place. Here two
components of the BN algorithm come into the picture, γ (gamma) and β (beta). These
parameters are used for rescaling (γ) and shifting (β) of the vector containing values
from the previous operations:

    h(out)ᵢ = γ · h(norm)ᵢ + β

These two are learnable parameters; during training, the neural network ensures the
optimal values of γ and β are used. That will enable the accurate normalization of each
batch.
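Putting the two steps together, a minimal NumPy sketch of one BN forward pass over a
batch of hidden activations (the array shapes are illustrative assumptions; this follows
the article's formulation of dividing by σ + ε, whereas frameworks typically divide by
√(σ² + ε)):

import numpy as np

def batch_norm_forward(h, gamma, beta, eps=1e-5):
    # Step 1: normalize the batch to zero mean and unit standard deviation
    mu = h.mean(axis=0)
    sigma = h.std(axis=0)
    h_norm = (h - mu) / (sigma + eps)
    # Step 2: rescale (gamma) and shift (beta) with the learnable parameters
    return gamma * h_norm + beta

batch = np.random.randn(64, 32) * 5 + 3   # 64 samples, 32 hidden activations
out = batch_norm_forward(batch, gamma=np.ones(32), beta=np.zeros(32))
print(out.mean(axis=0)[:3], out.std(axis=0)[:3])  # ~0 and ~1 per feature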

Batch Normalization techniques

Batch normalization is a technique used in deep learning that helps our models learn and

adapt quickly. It’s like a teacher who helps students by breaking down complex topics into

simpler parts.

Why do we need it?

Imagine you’re trying to hit a moving target with a dart. It would be much harder than

hitting a stationary one, right? Similarly, in deep learning, our target keeps changing during

training due to the continuous updates in weights and biases. This is known as the “internal

covariate shift”. Batch normalization helps us stabilize this moving target, making our task

easier.

How does it work?

Batch normalization works by normalizing the output of a previous activation layer by

subtracting the batch mean and dividing by the batch standard deviation. However, these
normalized values may not follow the original distribution. To tackle this, batch

normalization introduces two learnable parameters, gamma and beta, which can shift and

scale the normalized values.

Benefits of Batch Normalization

• Speeds up learning: By reducing internal covariate shift, it helps the model train faster.

• Regularizes the model: It adds a little noise to your model, and in some cases, you might

not even need to use dropout or other regularization techniques.

• Allows higher learning rates: Gradient descent usually requires small learning rates for the

network to converge. Batch normalization helps us use much larger learning rates, speeding

up the training process.

Advantages of Batch Normalization

Now let’s look into the advantages the BN process offers.

Speed Up the Training

By normalizing the hidden layer activations, batch normalization speeds up the training
process.

Handles internal covariate shift

It solves the problem of internal covariate shift. Through this, we ensure that the
input for every layer is distributed around the same mean and standard deviation. If you
are unaware of what an internal covariate shift is, look at the following example.

Internal covariate shift


Suppose we are training an image classification model that classifies images into Dog or
Not Dog. Let's say we have images of white dogs only; these images will have a certain
distribution. Using these images, the model will update its parameters.

Later, if we get a new set of images consisting of non-white dogs, these new images will
have a slightly different distribution from the previous images. Now the model will
change its parameters according to these new images. Hence the distribution of the
hidden activations will also change. This change in the hidden activations is known as
an internal covariate shift.
However, according to a study by MIT researchers, batch normalization does not solve the
problem of internal covariate shift.

In this research, they trained three models:

Model-1: standard VGG network without batch normalization.

Model-2: standard VGG network with batch normalization.

Model-3: standard VGG with batch normalization and random noise.

This random noise has non-zero mean and non-unit variance and is added after the batch
normalization layer. This experiment reached two conclusions.

• The first conclusion: the third model has a less stable distribution across all
layers. We can see the noisy model has a higher variance than the other two models.

• The second conclusion: the training accuracy of the second and third models is higher
than that of the first model. So it can be concluded that internal covariate shift might
not be a contributing factor in the performance of batch normalization.

Smoothens the Loss Function

Batch normalization smoothens the loss function, which in turn improves the training
speed of the model by making optimization of the model parameters easier.

What is Batch Normalization?


Gradients are used to update weights during training, and they can become unstable or
vanish entirely, hindering the network's ability to learn effectively. Batch
Normalization (BN) is a powerful technique that addresses these issues by stabilizing
the learning process and accelerating convergence. It is a popular technique used in
deep learning to improve the training of neural networks by normalizing the inputs of
each layer. Implementing batch normalization in PyTorch models requires understanding
its concepts and best practices to achieve optimal performance.
Batch Normalization makes training more consistent and faster, improves performance,
and avoids problems like the gradient becoming too small or too large during training,
ensuring that the network doesn't get stuck or make big mistakes while learning. It is
helpful when a neural network faces issues like slow training or unstable gradients.

How Batch Normalization works?
1. During each training iteration (epoch), BN takes a mini batch of data and
normalizes the activations (outputs) of a hidden layer. This normalization
transforms the activations to have a mean of 0 and a standard deviation of 1.
2. While normalization helps with stability, it can also disrupt the network’s
learned features. To compensate, BN introduces two learnable parameters:
gamma and beta. Gamma rescales the normalized activations, and beta shifts
them, allowing the network to recover the information present in the original
activations.
It ensures that each element or component is in the right proportion before distributing
the inputs into the layers and each layer is normalized before being passed to the next
layer.
Correct Batch Size:
• Reasonably sized mini-batches must be taken into consideration during
training. BN performs better with large batch sizes, as it computes more
accurate batch statistics, leading to more stable gradients and faster
convergence.
Implementing Batch Normalization in PyTorch
PyTorch provides the nn.BatchNormXd module (where X is 1 for 1D data, 2 for 2D data
like images, and 3 for 3D data) for convenient BN implementation. In this tutorial, we
will see the implementation of batch normalizationa and it’s effect on model. We will
train the model and highlight the loss before and after using batch normalization with
MNIST dataset widely used dataset in the field of machine learing and computer vision.
This dataset consists of a collection of 28X28 pixel grayscale images of handwritten
digits ranges from (0 to 9) inclusive along with their corresponding labels.
Prerequsite: Install the PyTorch library:
pip install torch torchvision
Step 1: Importing necessary libraries
1. torch: Imports the PyTorch library for deep learning operations.
2. nn: Imports the neural network module from PyTorch for building neural
network architectures.
3. DataLoader: Imports the DataLoader class from PyTorch, which helps in loading
datasets efficiently for training and testing.
4. transforms: Imports the transforms module from torchvision, which
provides common image transformations.
5. time / datetime: Imports the time and datetime modules for timing the
training run.
6. os: Imports the os module, which provides functions for interacting with the
operating system.
• Python3

import torch
from torch import nn
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader
from torchvision import transforms
import time
import datetime
import os

Step 2: Implementing Batch Normalization to the model


In the code snippet, Batch Normalization (BN) is incorporated into the neural network
architecture using the nn.BatchNorm1d layer; the BN layers are added after the fully
connected layers.
• nn.BatchNorm1d(64) is applied after the first fully connected layer (64 neurons).
• nn.BatchNorm1d(32) is applied after the second fully connected layer (32
neurons).
The arguments (64 and 32) represent the number of features (neurons) in the respective
layers to which Batch Normalization is applied. Following Batch Normalization, the
ReLU activation function is applied to introduce non-linearity. In the forward method,
the input tensor x is passed through the layers, including those with Batch
Normalization.
• Python3

# Define your neural network architecture with batch normalization
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),            # Flatten the input image tensor
            nn.Linear(28 * 28, 64),  # Fully connected layer from 28*28 to 64 neurons
            nn.BatchNorm1d(64),      # Batch normalization for stability and faster convergence
            nn.ReLU(),               # ReLU activation function
            nn.Linear(64, 32),       # Fully connected layer from 64 to 32 neurons
            nn.BatchNorm1d(32),      # Batch normalization for stability and faster convergence
            nn.ReLU(),               # ReLU activation function
            nn.Linear(32, 10)        # Fully connected layer from 32 to 10 neurons (for MNIST classes)
        )

    def forward(self, x):
        return self.layers(x)

Step 3: The next step is loading the MNIST dataset for this simple MLP neural network
architecture and creating the DataLoader for training.

• Python3

if __name__ == '__main__':

    # Set random seed for reproducibility
    torch.manual_seed(47)

    # Load the MNIST dataset
    transform = transforms.Compose([
        transforms.ToTensor()
    ])
    train_data = MNIST(os.getcwd(), download=True, transform=transform)
    train_loader = DataLoader(train_data, batch_size=64, shuffle=True)

Step 4: Initialize the MLP model, define the loss function (CrossEntropyLoss), and the optimizer (Adam).

• Python3

    # (continuing inside the if __name__ == '__main__' block)
    mlp = MLP()  # Initialize MLP model
    loss_function = nn.CrossEntropyLoss()  # Cross-entropy loss function for classification
    optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-3)  # Adam optimizer with learning rate 0.001

Step 5: Define Training Loop

We are training the model for 3 epochs using a training loop. It will iterate over mini-
batches of training data, compute the loss, perform backpropagation, and update the
model parameters.
• Python3

    start_time = time.time()

    # Training loop
    for epoch in range(3):  # Iterate over 3 epochs
        print(f'Starting epoch {epoch + 1}')
        running_loss = 0.0
        for i, data in enumerate(train_loader, 0):
            inputs, labels = data
            optimizer.zero_grad()  # Zero the gradients
            outputs = mlp(inputs.view(inputs.shape[0], -1))  # Flatten the input for MLP and forward pass
            loss = loss_function(outputs, labels)  # Compute the loss
            loss.backward()  # Backpropagation
            optimizer.step()  # Optimizer step to update parameters
            running_loss += loss.item()
            if i % 100 == 99:  # Print every 100 mini-batches
                print(f'Epoch {epoch + 1}, Mini-batch {i + 1}, Loss: {running_loss / 100}')
                running_loss = 0.0

    print('Training finished')

    end_time = time.time()  # Record end time
    print('Training process has been completed.')
    training_time = end_time - start_time
    # Print the training time in hours:minutes:seconds format
    print('Training time:', str(datetime.timedelta(seconds=training_time)))

Output:
Starting epoch 1
Epoch 1, Mini-batch 100, Loss: 1.107109518647194
Epoch 1, Mini-batch 200, Loss: 0.48408970028162
Epoch 1, Mini-batch 300, Loss: 0.3104418055713177
Epoch 1, Mini-batch 400, Loss: 0.2633690595626831
Epoch 1, Mini-batch 500, Loss: 0.2228860107809305
Epoch 1, Mini-batch 600, Loss: 0.20098184436559677
Epoch 1, Mini-batch 700, Loss: 0.18423103891313075
Epoch 1, Mini-batch 800, Loss: 0.16403419613838197
Epoch 1, Mini-batch 900, Loss: 0.14670498583465816
Starting epoch 2
Epoch 2, Mini-batch 100, Loss: 0.1223447759822011
Epoch 2, Mini-batch 200, Loss: 0.11535881120711565
Epoch 2, Mini-batch 300, Loss: 0.12264159372076393
Epoch 2, Mini-batch 400, Loss: 0.1274782767519355
Epoch 2, Mini-batch 500, Loss: 0.12688526364043354
Epoch 2, Mini-batch 600, Loss: 0.10709397405385972
Epoch 2, Mini-batch 700, Loss: 0.12462730823084713
Epoch 2, Mini-batch 800, Loss: 0.10854666410945356
Epoch 2, Mini-batch 900, Loss: 0.10740736600011587
Starting epoch 3
Epoch 3, Mini-batch 100, Loss: 0.09494352690875531
Epoch 3, Mini-batch 200, Loss: 0.08548182763159275
Epoch 3, Mini-batch 300, Loss: 0.08944599309004843
Epoch 3, Mini-batch 400, Loss: 0.08315778982825578
Epoch 3, Mini-batch 500, Loss: 0.0855206391401589
Epoch 3, Mini-batch 600, Loss: 0.08882722020149231
Epoch 3, Mini-batch 700, Loss: 0.0896124207880348
Epoch 3, Mini-batch 800, Loss: 0.08545528341084718
Epoch 3, Mini-batch 900, Loss: 0.09168351721018553
Training finished
Training process has been completed.
Training time: 0:00:21.384532

Note: The loss after mini-batch 900 of epoch 3 with batch normalization is about 0.0917.
Benefits of Batch Normalization
• Faster Convergence: By stabilizing the gradients, BN allows you to use
higher learning rates, which can significantly speed up training.
• Reduced Internal Covariate Shift: As the network trains, the distribution of
activations within a layer can change (internal covariate shift). BN helps
mitigate this by normalizing activations before subsequent layers, making the
training process less sensitive to these shifts.
• Initialization Insensitivity: BN makes the network less reliant on the initial
weight values, allowing for more robust training and potentially better
performance.
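Finally, tying back to the inference discussion above: in PyTorch, calling model.eval()
switches the BatchNorm layers from batch statistics to the running statistics tracked
during training. A minimal sketch, reusing the mlp model and train_data from the
tutorial (its placement inside the training script is assumed):

mlp.eval()  # BatchNorm layers now use running mean/variance instead of batch statistics
with torch.no_grad():
    sample, label = train_data[0]                       # one 28x28 MNIST image
    prediction = mlp(sample.view(1, -1)).argmax(dim=1)  # forward pass on a single sample
    print('Predicted:', prediction.item(), 'Actual:', label)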
