Unit 4
4 Ensemble Methods
Regularization :
Regularization is one of the most important concepts of machine learning. It is a technique to
prevent the model from overfitting by adding extra information to it.
Sometimes a machine learning model performs well on the training data but does not perform
well on the test data. This means the model has learned the noise in the training data and so
cannot predict the output correctly for unseen data; such a model is called overfitted. This
problem can be dealt with using a regularization technique.
This technique can be used in such a way that it will allow to maintain all variables or features in
the model by reducing the magnitude of the variables. Hence, it maintains accuracy as well as a
generalization of the model.
It mainly regularizes or reduces the coefficient of features toward zero. In simple words, "In
regularization technique, we reduce the magnitude of the features by keeping the same number of
features."
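$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_n x_n + b$$
Here y is the value to be predicted and x1, …, xn are the features. β1, …, βn are the weights or
magnitudes attached to the features, respectively. β0 and b are both constant terms, so together
they act as the bias (intercept) of the model.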
Techniques of Regularization
There are mainly two types of regularization techniques, which are given below:
o Ridge Regression
o Lasso Regression
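Ridge Regression:
o Ridge regression adds a penalty on the squared coefficients to the cost function of linear
regression. It is also called L2 regularization. The cost function is:
$$\text{Cost} = \sum_{i} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{n} \beta_j^2$$
o In the above equation, the penalty term regularizes the coefficients of the model, and hence
ridge regression reduces the magnitudes of the coefficients, which decreases the complexity
of the model.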
o As we can see from the above equation, if the values of λ tend to zero, the equation
becomes the cost function of the linear regression model. Hence, for the minimum value
of λ, the model will resemble the linear regression model.
o A general linear or polynomial regression will fail if there is high collinearity between the
independent variables, so to solve such problems, Ridge regression can be used.
o It helps to solve the problems if we have more parameters than samples.
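The behaviour described in the bullets above can be checked numerically. The following is a minimal sketch, assuming the closed-form ridge solution without an intercept term; the synthetic data, the function name ridge_fit and the values of lam are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X^T X + lam * I)^(-1) X^T y, no intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 0.01 * rng.normal(size=50)      # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=50)

w_ols = ridge_fit(X, y, lam=0.0)      # lam = 0 gives ordinary least squares
w_ridge = ridge_fit(X, y, lam=1.0)    # lam > 0 shrinks the coefficient vector

print("lam = 0:", w_ols)
print("lam = 1:", w_ridge)
```

With lam = 0 the result is the plain linear regression fit; with a positive lam the norm of the coefficient vector is smaller, and no coefficient is removed from the model.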
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model.
It stands for Least Absolute Shrinkage and Selection Operator.
o It is similar to Ridge Regression except that the penalty term contains the absolute values
of the weights instead of the squares of the weights.
o Since it uses absolute values, it can shrink a slope exactly to 0, whereas Ridge Regression
can only shrink it close to 0.
o It is also called L1 regularization. The cost function of Lasso regression is:
$$\text{Cost} = \sum_{i} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{n} |\beta_j|$$
o Some of the features in this technique are completely neglected for model evaluation.
o Hence, the Lasso regression can help us to reduce the overfitting in the model as well as
the feature selection.
o Ridge regression is mostly used to reduce the overfitting in the model, and it includes all
the features present in the model. It reduces the complexity of the model by shrinking the
coefficients.
o Lasso regression helps to reduce the overfitting in the model as well as feature selection.
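The difference between the two penalties is easiest to see for a single coefficient. The sketch below assumes the simplified one-dimensional problems named in the comments (their closed-form minimizers are standard results); the function names and the numbers in z are made up for illustration.

```python
import numpy as np

def ridge_shrink(z, lam):
    # Minimizer of 0.5 * (w - z)**2 + 0.5 * lam * w**2
    return z / (1.0 + lam)

def lasso_shrink(z, lam):
    # Minimizer of 0.5 * (w - z)**2 + lam * |w| (soft-thresholding)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([2.0, 0.3, -0.1, -1.5])    # unregularized coefficients (made-up values)
ridge_w = ridge_shrink(z, lam=0.5)
lasso_w = lasso_shrink(z, lam=0.5)

print("ridge:", ridge_w)   # every coefficient is smaller, none is exactly zero
print("lasso:", lasso_w)   # small coefficients are set exactly to zero
```

Ridge scales every coefficient by the same factor, so all features stay in the model, while lasso subtracts a fixed amount and clips at zero, which is what performs the feature selection.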
To summarize, overfitting is a phenomenon where the machine learning model learns patterns
and performs well on data that it has been trained on and does not perform well on unseen data.
The graph shows that as the model is trained for a longer duration, the training error lessens.
However, the testing error starts increasing after a specific point. This indicates that the model
has started to overfit.
Among the techniques used to avoid overfitting, data augmentation and collecting more training
data do not change the model architecture; they try to improve performance by altering the input
data. Early stopping halts model training at an appropriate time, before the model overfits,
rather than addressing the issue of overfitting directly. Regularization is a more robust
technique for avoiding overfitting. The main regularization techniques are:
1. L2 regularization
2. L1 regularization
3. Dropout regularization
L2 regularization
In regression analysis, L2 regularization is also called ridge regression. In this type of
regularization, the squared magnitude of the coefficients or weights, multiplied by a regularizer
term, is added to the loss or cost function. L2 regularization can be represented with the
following mathematical equation.
$$\text{Loss} = \text{Error}(y, \hat{y}) + \lambda \sum_{i=1}^{N} w_i^2$$
• Lambda is the hyperparameter that is tuned to prevent overfitting i.e. penalize the
insignificant weights by forcing them to be small but not zero.
• L2 regularization works best when all the weights are roughly of the same size, i.e., input
features are of the same range.
• This technique also helps the model to learn more complex patterns from data without
overfitting easily.
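The effect of the L2 penalty on training can be made explicit with a short derivation. It is a sketch under the following assumptions: E(w) denotes the data error term, λ the regularization hyperparameter, η a gradient-descent learning rate, and the penalty has the form λ Σ w_i² (some texts use λ/2 instead, which only changes the constant).

```latex
\begin{aligned}
L(w) &= E(w) + \lambda \sum_i w_i^2 \\
\frac{\partial L}{\partial w_i} &= \frac{\partial E}{\partial w_i} + 2\lambda w_i \\
w_i &\leftarrow w_i - \eta \frac{\partial L}{\partial w_i}
     = (1 - 2\eta\lambda)\, w_i - \eta \frac{\partial E}{\partial w_i}
\end{aligned}
```

Each update first multiplies the weight by a factor slightly below 1 and then applies the usual gradient step, which is why the weights become small but not exactly zero, and why this penalty is often called weight decay.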
L1 regularization
$$\text{Loss} = \text{Error}(y, \hat{y}) + \lambda \sum_{i=1}^{N} |w_i|$$
In the above equation, λ is the regularization hyperparameter and w_i are the weights of the model.
In L1 regularization, a fraction of the sum of the absolute values of the weights is added to the
loss function. In this way, you will be able to eliminate some coefficients with smaller values by
pushing those values towards 0. You can observe the following by using L1 regularization:
• Since the L1 regularization adds an absolute value as a penalty to the cost function, the
feature selection will be done by retaining only some important features and eliminating
the lower or unimportant features.
• This technique is also considered robust to outliers, i.e., the model is less easily skewed by
outliers in the dataset.
• This technique will not be able to learn complex patterns from the input data.
Dropout regularization
Dropout regularization is a technique in which some of the neurons are randomly disabled
during training so that the model learns more useful, robust features from the data.
This prevents overfitting. You can see the dropout regularization in the following diagram:
• In figure (a), the neural network is fully connected. If all the neurons are trained with the
entire training dataset, some neurons might memorize the patterns occurring in training
data. This leads to overfitting since the model is not generalizing well.
• In figure (b), the neural network is sparsely connected, i.e., only some neurons are active
during the model training. This forces the neurons to extract robust features/patterns from
training data to prevent overfitting.
• Dropout randomly disables some percentage of the neurons in each layer. In every training
iteration a different set of neurons is dropped, leading to effective learning.
• Dropout is applied by specifying the ‘p’ values, which is the fraction of neurons to be
dropped.
• Dropout reduces the dependencies of neurons on other neurons, resulting in more robust
model behavior.
• Dropout is applied only during the model training phase and is not applied during the
inference phase.
• At inference time all neurons are active, so the layer outputs are larger on average than they
were during training. To compensate, scale the layer outputs ‘x’ by (1 − p) at inference time or,
equivalently, divide the surviving activations by (1 − p) during training (inverted dropout).
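The points above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the "inverted dropout" convention in which the rescaling is done during training so that inference needs no change; the function name dropout_forward and the array sizes are hypothetical.

```python
import numpy as np

def dropout_forward(x, p, training, rng):
    """Inverted dropout; p is the fraction of units to drop."""
    if not training or p == 0.0:
        return x                                  # inference: all units stay active
    keep_prob = 1.0 - p
    mask = rng.random(x.shape) < keep_prob        # True for units that stay active
    return x * mask / keep_prob                   # rescale to keep the expected value

rng = np.random.default_rng(0)
activations = np.ones((4, 1000))
train_out = dropout_forward(activations, p=0.5, training=True, rng=rng)
test_out = dropout_forward(activations, p=0.5, training=False, rng=rng)

print("mean activation while training:", train_out.mean())
print("mean activation at inference:  ", test_out.mean())
```

A new mask is drawn on every call, so a different subset of units is dropped in each training step, while the average activation passed to the next layer stays roughly the same in both phases.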
These are some of the most popular regularization techniques that are used to reduce overfitting
during model training. They can be applied according to the use case or dataset being considered
for more accurate model performance on the testing data.
What is Variance?
Variance is the variability of the model's prediction for a given data point, which tells us
how widely the predictions spread. A model with high variance fits the training data with a
very complex curve and is therefore not able to fit accurately on data which it hasn’t seen
before. As a result, such models perform very well on training data but have high error
rates on test data. A model with high variance is said to be overfitting the data.
Overfitting means fitting the training set accurately via a complex curve and a high-order
hypothesis, but it is not the solution because the error on unseen data is high. While
training a model, variance should be kept low. A high-variance fit looks as follows.
High Variance in the Model
The best fit will be given by the hypothesis at the trade-off point. The graph of error
against model complexity, which shows the trade-off, is given below.
Region for the Least Value of Total Error
This is referred to as the best point chosen for the training of the algorithm which gives
low error in training as well as testing data.
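The trade-off can be reproduced with a small experiment. The sketch below is only an illustration under assumed conditions: synthetic data drawn from a noisy sine curve, and polynomial models of degree 1, 3 and 9 as the low, medium and high complexity hypotheses (none of these choices come from the notes).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)   # least-squares polynomial fit
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))

for degree, (train_err, test_err) in errors.items():
    print(f"degree {degree}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```

The training error can only go down as the degree grows, so a low training error alone says nothing about variance; it is the gap between the training and test columns that reveals an overfitted, high-variance model.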
Early Stopping :-
Early Stopping is a regularization technique for deep neural networks that
stops training when parameter updates no longer yield improvements on a
validation set.
What is Early Stopping?
In Regularization by Early Stopping, we stop training the model when the performance
on the validation set starts getting worse: increasing loss, decreasing accuracy, or poorer
scores of the scoring metric. When the errors on the training dataset and the validation
dataset are plotted together, both errors decrease with the number of iterations until the
point where the model starts to overfit. After this point, the training error still decreases
but the validation error increases.
So, even if training is continued after this point, early stopping essentially returns the
set of parameters that were used at this point and so is equivalent to stopping training at
that point. So, the final parameters returned will enable the model to have low variance
and better generalization. The model at the time the training is stopped will have a
better generalization performance than the model with the least training error.
Early stopping can be thought of as implicit regularization, in contrast to explicit
regularization via weight decay. This method is also described as efficient since it requires
a smaller amount of training data, which is not always available, and it requires less
training time than other regularization methods. Repeating the early stopping
process many times may result in the model overfitting the validation dataset, just as
overfitting occurs on the training data.
The number of iterations (i.e. epochs) taken to train the model can be considered
a hyperparameter. An optimum value for this hyperparameter then has to be found
(by hyperparameter tuning) for the best performance of the learning
model.
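The stopping rule described above can be sketched as a small function. It is a minimal illustration, not a definitive implementation: the parameters patience and min_delta and the list of validation losses are assumptions introduced here, and a real training loop would save and restore model weights where the comments indicate.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops once the loss has failed to improve by more than
    min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0       # a real loop would checkpoint weights here
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch         # stop and restore the best checkpoint
    return best_epoch, len(val_losses) - 1

val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]   # made-up validation curve
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=2)
print("best epoch:", best_epoch, "stopped at epoch:", stop_epoch)
```

The returned parameters are those of the best epoch, not of the epoch where training stopped; a larger patience tolerates noisy validation curves at the cost of a few extra epochs.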
Benefits of Early Stopping:
• Helps in reducing overfitting
• It improves generalisation
• It requires a smaller amount of training data
• Takes less time compared to other regularisation methods
• It is simple to implement
Limitations of Early Stopping:
• If the model stops too early, there might be risk of underfitting
• It may not be beneficial for all types of models
• If validation set is not chosen properly, it may not lead to the most optimal
stopping
To summarize, early stopping is best used to prevent overfitting of the model and to save
resources. It gives the best results when a few things are taken care of, such as parameter
tuning, preventing the model from overfitting, and ensuring that the model learns
enough from the data.
Dataset Augmentation : -
Data augmentation is a technique of artificially increasing the training set by
creating modified copies of a dataset using existing data. It includes making
minor changes to the dataset or using deep learning to generate new data
points.
Our model was trained effectively to classify the training data, but it did not generalize well
to the validation data, which is an overfitting issue. To fix it, let's discuss one more technique
to improve the model training process. This technique is known as data augmentation. It is the
process by which we create new data for our model to use during the training process.
This is done by taking our existing dataset and transforming or altering the image in useful ways
to create new images.
After applying the transformation, the newly created images are known as augmented images
because they essentially allow us to augment our dataset by adding new data to it. The data
augmentation technique is useful because it allows our model to look at each image in our
dataset from a variety of different perspectives. This allows our model to extract relevant
features more accurately and to obtain more feature-related data from each training image.
Now our biggest question is how we will use augmentation to reduce overfitting.
Overfitting occurs when our model fits the training set too closely.
There is no need to start collecting new images and adding them to our datasets. We can use data
augmentation, which introduces minor alterations to our existing dataset such as darker shading,
flips, zooming, rotations or translations. Our model will interpret them as separate, distinct
images. This not only reduces overfitting but also prevents our network from learning irrelevant
patterns and boosts overall performance. We have the following steps to perform data augmentation:
Step 1:
To perform data augmentation on the training dataset, we have to make a separate transform
statement. For the validation dataset the transform will remain the same. So we first copy our
transform1 statement and treat it as transform_train:
from torchvision import transforms
transform_train = transforms.Compose([transforms.Resize((32, 32)), transforms.ToTensor(),
                                      transforms.Normalize((0.5,), (0.5,))])
Step 2:
Now, we will add alterations to our transform_train statement. The alterations will be
RandomHorizontalFlip and RandomRotation, which rotates an image by a random angle within a
range of degrees that is passed as an argument.
transform_train = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),  # random angle between -10 and 10 degrees
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])
To add even more variety to our dataset, we will use an affine transformation. Affine
transformations are simple transformations which preserve the straight lines and planes of the
object. Scaling, translation, shear and zooming are transformations which fit this category.
transform_train = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.RandomAffine(0, shear=10, scale=(0.8, 1.2)),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])
In RandomAffine(), the first argument is degrees, which we set to zero to deactivate rotation,
the second argument is the shear transformation, and the last one is the scaling transformation,
for which we use a tuple to define the range of zoom we require. We defined a lower and upper
limit of 0.8 and 1.2 to scale images to between 80 and 120 percent of their size.
Step 3:
Now, we move onto our next augmentation to create new augmented images with a randomized
variety of brightness, contrast and saturation. We will add another transformation i.e. ColorJitter
as:
transform_train = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.RandomAffine(0, shear=10, scale=(0.8, 1.2)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])
Step 4:
Before executing our code, we have to change the training_dataset statement because now we have
another transform for the training dataset. So the statement becomes:
from torchvision import datasets
training_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform_train)
Some standard regularizers like L1 and L2 penalize model parameters for deviating from the fixed
value of zero. One of the side effects of Lasso or group-Lasso regularization in learning a Deep
Neural Network is that many of the parameters may become zero, which reduces the amount of
memory required to store the model and lowers the
computational cost of applying it. A significant drawback of Lasso (or group-Lasso)
regularization is that in the presence of groups of highly correlated features, it tends to select
only one or an arbitrary convex combination of elements from each group. Moreover, the
learning process of Lasso tends to be unstable because the subsets of parameters that end up
selected may change dramatically with minor changes in the data or algorithmic procedure. In
Deep Neural Networks, it is almost unavoidable to encounter correlated features due to the high
dimensionality of the input to each layer and because neurons tend to adapt, producing strongly
correlated features that we pass as an input to the subsequent layer.
The issues we face while using Lasso or group-Lasso are countered by a regularizer based on
the group version of the ordered weighted ℓ1 norm, known as group-OWL (GrOWL). GrOWL
supports sparsity and simultaneously learns which parameters should share a similar value.
GrOWL has been effective in linear regression, identifying and coping with strongly correlated
covariates. Unlike standard sparsity-inducing regularizers (e.g., Lasso), GrOWL eliminates
unimportant neurons by setting all their weights to zero and explicitly identifies strongly
correlated neurons by tying the corresponding weights to a common value. This ability of
GrOWL motivates the following two-stage procedure:
(i) use GrOWL regularization during training to simultaneously identify significant neurons and
groups of parameters that should be tied together.
(ii) retrain the network, enforcing the structure unveiled in the previous phase, i.e., keeping only
the significant neurons and implementing the learned tying structure.
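To make the regularizer more concrete, the following sketch computes a GrOWL-style penalty. It assumes the usual ordered-weighted form, in which the rows of a weight matrix are the groups, the row norms are sorted from largest to smallest, and the i-th largest norm is multiplied by the i-th of a non-increasing sequence of non-negative weights; the function name growl_penalty and the example numbers are hypothetical.

```python
import numpy as np

def growl_penalty(W, lambdas):
    """Sum over i of lambdas[i] times the i-th largest row norm of W.

    Each row of W is treated as one group (e.g. the weights attached to one neuron).
    """
    lambdas = np.asarray(lambdas, dtype=float)
    if lambdas.shape[0] != W.shape[0]:
        raise ValueError("need one lambda per row of W")
    if np.any(lambdas < 0) or np.any(np.diff(lambdas) > 0):
        raise ValueError("lambdas must be non-negative and non-increasing")
    row_norms = np.linalg.norm(W, axis=1)
    sorted_norms = np.sort(row_norms)[::-1]       # largest group first
    return float(np.dot(lambdas, sorted_norms))

W = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [6.0, 8.0]])
penalty = growl_penalty(W, lambdas=[3.0, 2.0, 1.0])
print(penalty)
```

Because the largest groups receive the largest weights, rows of similar norm are pushed towards the same value, while rows with small norm can be driven to zero as with group-Lasso.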
Parameter Sharing
Parameter sharing forces sets of parameters to be equal, as we interpret various models or model
components as sharing a unique set of parameters. We then need to store only a subset of the
parameters in memory.
Suppose two models A and B, perform a classification task on similar input and output
distributions. In such a case, we'd expect the parameters for both models to be identical to each
other as well. We could impose a norm penalty on the distance between the weights, but a more
popular method is to force the parameters to be equal. Forcing sets of parameters to be equal is
the essence of Parameter Sharing. A significant benefit here is that we need to store
only a subset of the parameters (e.g., storing only the parameters for model A instead of storing
for both A and B), which leads to significant memory savings.
Example
The most extensive use of parameter sharing is in convolutional neural networks. Natural images
have specific statistical properties that are robust to translation. For example, a photo of a cat
remains a photo of a cat if it is translated one pixel to the right. Convolutional Neural Networks
take this property into account by sharing parameters across multiple image locations. Thus we
can find a cat with the same cat detector in column i or i+1 in the image.
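The cat-detector idea can be shown in one dimension. In this sketch a single three-weight kernel (a stand-in for the shared detector, chosen arbitrarily) is slid over a signal and over a copy shifted one position to the right; the function name conv1d_valid and all values are illustrative assumptions.

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """Slide one shared kernel over every position (cross-correlation, no padding)."""
    k = len(kernel)
    return np.array([np.dot(signal[i:i + k], kernel)
                     for i in range(len(signal) - k + 1)])

kernel = np.array([1.0, -1.0, 1.0])                      # hypothetical shared detector
signal = np.array([0, 0, 1, -1, 1, 0, 0, 0], dtype=float)
shifted = np.roll(signal, 1)                             # same pattern, one step to the right

response = conv1d_valid(signal, kernel)
response_shifted = conv1d_valid(shifted, kernel)

print("peak position, original:", int(np.argmax(response)))
print("peak position, shifted: ", int(np.argmax(response_shifted)))
```

Only three weights are stored no matter how long the signal is, and the detector's peak simply moves with the pattern instead of requiring a separate set of weights for each position.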
Noise Injection
The concept of noise injection is simple. A main root cause of overfitting is the size of the
dataset: if the dataset we are dealing with is too small, our model will reach near-complete
accuracy on the training data but will not show much accuracy on the holdout dataset. So, we
need to increase the size of the dataset, either by collecting new data or by adding some noise
to the existing data. Collecting new data samples and adding them to the dataset is a routine
but effort-intensive task, which is why noise injection was developed. The type of noise to add
depends on the actual dataset.
Why add Noise?
The small dataset problem is what motivates this procedure. The main problem with a small
dataset is that we have very few samples, so our model will effectively memorize all of them and
work well on the training data. At the same time, since the model has learned from only a few
samples, it cannot build a good mapping between the input and output data, which results in
poor predictions for new inputs.
Wait! Why would the addition of noise improve the model? Doesn’t it degrade the model
performance?
Well, the answer is no. Recall what regularization is and how it works. Adding noise to a
dataset that is causing overfitting has a regularization effect while training the model and
thus improves the model performance.
Thus, adding noise expands the training dataset size. Each time a training sample is
exposed to the model, some random noise is added to the input variables, making them different
every time. In this way, adding noise to input samples is a simple form
of “data augmentation”. The added noise keeps the model from memorizing the samples,
resulting in a smoother mapping function.
Noise Injection is a technique used in machine learning and deep learning models to improve
their generalization capabilities and robustness. It involves adding random noise to the input data
or the model’s internal layers during training. This process helps the model learn more complex
and diverse patterns, ultimately leading to better performance on unseen data. Noise injection is
particularly useful in scenarios where the training data is limited or noisy, as it can help
prevent overfitting and improve the model’s ability to generalize to new data.
Overview
In machine learning, models are trained to learn patterns from the input data and make
predictions or decisions based on those patterns. However, when the training data is limited or
contains noise, the model may learn to fit the noise rather than the underlying patterns, leading to
overfitting. Overfitting occurs when a model performs well on the training data but poorly on
new, unseen data.
Input Noise
Input noise is added directly to the input data during training. This can be done by adding
Gaussian noise, uniform noise, or other types of random noise to the input features. The added
noise makes it harder for the model to memorize the training data, forcing it to learn more
general patterns. Input noise is particularly useful when the input data is noisy or when the model
is prone to overfitting.
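Input noise can be sketched as follows. This is a minimal illustration assuming zero-mean Gaussian noise with a hand-picked standard deviation sigma; the helper add_gaussian_noise, the feature values, and the commented-out training call are all hypothetical.

```python
import numpy as np

def add_gaussian_noise(batch, sigma, rng):
    """Return a noisy copy of the batch; the original array is left untouched."""
    return batch + rng.normal(loc=0.0, scale=sigma, size=batch.shape)

rng = np.random.default_rng(0)
X_train = np.array([[0.2, 1.5],
                    [0.9, -0.3],
                    [1.1, 0.4]])            # made-up feature values

for epoch in range(3):
    noisy_batch = add_gaussian_noise(X_train, sigma=0.1, rng=rng)
    # a real training loop would fit the model on noisy_batch here
    print(f"epoch {epoch}: first sample -> {noisy_batch[0]}")
```

Fresh noise is drawn every epoch, so the model never sees exactly the same input twice. The noise is applied only during training, and sigma should stay small relative to the scale of the features.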
Weight Noise
Weight noise is added to the model’s weights during training. This can be done by adding
Gaussian noise, uniform noise, or other types of random noise to the weights before each update.
Weight noise helps regularize the model by preventing it from relying too much on any single
weight or feature. This can improve the model’s generalization capabilities and make it more
robust to changes in the input data.
Activation Noise
Activation noise is added to the model’s activations (i.e., the outputs of each layer) during
training. This can be done by adding Gaussian noise, uniform noise, or other types of random
noise to the activations before they are passed to the next layer. Activation noise helps the model
learn more complex and diverse patterns by introducing randomness into the model’s internal
representations. This can improve the model’s generalization capabilities and make it more
robust to changes in the input data.
Gradient Noise
Gradient noise is added to the gradients during the optimization process. This can be done by
adding Gaussian noise, uniform noise, or other types of random noise to the gradients before
they are used to update the model’s weights. Gradient noise helps regularize the model by
introducing randomness into the optimization process, making it harder for the model to
converge to a single solution. This can improve the model’s generalization capabilities and make
it more robust to changes in the input data.
Applications
Noise injection has been successfully applied in various machine learning and deep learning
tasks, including image classification, natural language processing, and reinforcement learning. It
has been shown to improve the performance of models in scenarios where the training data is
limited or noisy, as well as in cases where the model is prone to overfitting.
In addition to its regularization benefits, noise injection can also be used as a form of data
augmentation, especially in image classification tasks. By adding noise to the input images
during training, the model is exposed to a wider range of variations, which can help improve its
ability to generalize to new, unseen data.
Overall, noise injection is a valuable technique for improving the generalization capabilities and
robustness of machine learning and deep learning models, making it an essential tool for data
scientists working with limited or noisy data.
Ensemble Methods:
Ensemble learning is a machine learning technique that enhances accuracy
and resilience in forecasting by merging predictions from multiple models. It
aims to mitigate errors or biases that may exist in individual models by
leveraging the collective intelligence of the ensemble.
Ensemble methods are techniques that create multiple models and then combine them to produce
improved results. Ensemble methods in machine learning usually produce more accurate
solutions than a single model would. This has been the case in a number of machine learning
competitions, where the winning solutions used ensemble methods. In the popular Netflix
Competition, the winner used an ensemble method to implement a powerful collaborative
filtering algorithm. Another example is KDD 2009, where the winner also used ensembling. You
can also find winners who used these methods in Kaggle competitions, for example the
winner of the CrowdFlower competition.
It is important that we understand a few terminologies before we continue with this article.
Throughout the article I used the term “model” to describe the output of the algorithm that
trained with data. This model is then used for making predictions. This algorithm can be
any machine learning algorithm such as logistic regression, decision tree, etc. These models,
when used as inputs of ensemble methods, are called ”base models,” and the end result is an
ensemble model.
In this blog post I will cover ensemble methods for classification and describe some widely
known methods of ensemble: voting, stacking, bagging and boosting.
In voting and averaging, the first step is to create multiple classification/regression models using some
training dataset. Each base model can be created using different splits of the same training
dataset and same algorithm, or using the same dataset with different algorithms, or any other
method.
Majority Voting
Every model makes a prediction (votes) for each test instance and the final output prediction is
the one that receives more than half of the votes. If none of the predictions gets more than half
of the votes, we may say that the ensemble method could not make a stable prediction for this
instance. Although majority voting is the more widely used technique, you may instead take the
most voted prediction (even if it has less than half of the votes) as the final prediction. In
some articles, you may see this method being called “plurality voting”.
Weighted Voting
Unlike majority voting, where each model has the same rights, we can increase the importance of
one or more models. In weighted voting you count the prediction of the better models multiple
times. Finding a reasonable set of weights is up to you.
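The three voting rules can be sketched with the standard library. The function names, the labels and the weights below are illustrative assumptions; each list entry stands for the prediction of one base model on a single test instance.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label with more than half of the votes, or None if there is none."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None

def plurality_vote(votes):
    """Return the most frequent label, even without an absolute majority."""
    return Counter(votes).most_common(1)[0][0]

def weighted_vote(votes, weights):
    """Each model's vote counts as much as its weight."""
    totals = Counter()
    for label, weight in zip(votes, weights):
        totals[label] += weight
    return totals.most_common(1)[0][0]

votes = ["cat", "dog", "bird", "cat", "fish"]      # one prediction per base model
print(majority_vote(votes))                         # no label has more than half
print(plurality_vote(votes))
print(weighted_vote(votes, weights=[1, 3, 1, 1, 1]))
```

In the example no label reaches a majority, plurality voting still picks the most frequent label, and giving the second model a weight of 3 is enough to change the outcome.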
Simple Averaging
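In the simple averaging method, the average of the base models' predictions is calculated for
every instance of the test dataset. This method often reduces overfitting and creates a smoother
regression model. The following code shows the simple averaging method, assuming predictions is
a NumPy array with one row per test instance and one column per base model:
import numpy as np
final_predictions = []
for row_number in range(len(predictions)):
    final_predictions.append(np.mean(predictions[row_number, :]))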
Weighted Averaging
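Weighted averaging is a slightly modified version of simple averaging, where the prediction of
each model is multiplied by its weight before the average is calculated. The following
code shows weighted averaging, assuming predictions is a NumPy array with one column per base
model and weights holds one weight per model:
import numpy as np
final_predictions = []
for row_number in range(len(predictions)):
    # np.average divides by the sum of the weights
    final_predictions.append(np.average(predictions[row_number, :], weights=weights))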
Stacking
Stacking, also known as stacked generalization, is an ensemble method where the models are
combined using another machine learning algorithm. The basic idea is to train machine learning
algorithms with training dataset and then generate a new dataset with these models. Then this
new dataset is used as input for the combiner machine learning algorithm.
The pseudocode of a stacking procedure is summarized as below:
stacking_test_dataset[,i] = base_algorithm.predict(test)
As you can see in the above pseudocode, the training dataset for combiner algorithm is generated
using the outputs of the base algorithms. In the pseudocode, the base algorithm is generated
using training dataset and then the same dataset is used again to make predictions. But as we
know, in the real world we do not use the same training dataset for prediction, so to overcome
this problem you may see some implementations of stacking where the training dataset is split.
Below you can see a pseudocode where the training dataset is split before training the base
algorithms:
stacking_test_dataset[,i] = base_algorithm.fit(train).predict(test)
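A complete stacking run with a split training set might look like the sketch below. It is an illustration under assumptions, not the original pseudocode: the LeastSquares class stands in for any base or combiner algorithm, the data are synthetic, and each base model is restricted to one feature so that the base models make different errors.

```python
import numpy as np

class LeastSquares:
    """Tiny linear regressor with intercept, standing in for any learning algorithm."""
    def fit(self, X, y):
        A = np.column_stack([X, np.ones(len(X))])
        self.coef_, *_ = np.linalg.lstsq(A, y, rcond=None)
        return self

    def predict(self, X):
        return np.column_stack([X, np.ones(len(X))]) @ self.coef_

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
X_train, y_train, X_test = X[:150], y[:150], X[150:]

# Split the training data: one part fits the base models, the other fits the combiner.
X_base, y_base = X_train[:100], y_train[:100]
X_comb, y_comb = X_train[100:], y_train[100:]

feature_sets = [[0], [1]]                 # hypothetical: one feature per base model
base_models = [LeastSquares().fit(X_base[:, cols], y_base) for cols in feature_sets]

def stack_features(X_part):
    # Column i holds the predictions of base model i.
    return np.column_stack([model.predict(X_part[:, cols])
                            for model, cols in zip(base_models, feature_sets)])

combiner = LeastSquares().fit(stack_features(X_comb), y_comb)
final_predictions = combiner.predict(stack_features(X_test))
```

The combiner never sees predictions on the rows the base models were trained on, which avoids the problem discussed above of predicting on the same data used for fitting.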
Bootstrap Aggregating
The name Bootstrap Aggregating, also known as “Bagging”, summarizes the key elements of
this strategy. In the bagging algorithm, the first step involves creating multiple models. These
models are generated using the same algorithm on random sub-samples of the dataset, which
are drawn from the original dataset with the bootstrap sampling method. In bootstrap
sampling, elements are drawn with replacement, so some original examples appear more than once
and some are not present in the sample. If you want to create a sub-dataset with m elements, you
select a random element from the original dataset m times. If the goal is to generate n
datasets, you follow this step n times.
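The sampling step can be sketched with the standard library. The function names, the seed and the toy dataset of ten integers are assumptions made for this illustration.

```python
import random

def bootstrap_sample(dataset, m, rng):
    """Draw m elements from the dataset with replacement."""
    return [rng.choice(dataset) for _ in range(m)]

def make_bootstrap_datasets(dataset, n, m, seed=0):
    """Repeat the sampling step n times to obtain n sub-datasets of size m."""
    rng = random.Random(seed)
    return [bootstrap_sample(dataset, m, rng) for _ in range(n)]

original = list(range(10))
samples = make_bootstrap_datasets(original, n=3, m=10)
for sample in samples:
    missing = sorted(set(original) - set(sample))
    print(sorted(sample), "missing:", missing)
```

Each sub-dataset would then be used to train one base model with the same algorithm, and the models' predictions would be combined by voting or averaging.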
Activation functions
In the process of building a neural network, one of the choices you get to make is
what Activation Function to use in the hidden layer as well as at the output layer of the
network. This article discusses some of the choices.
Elements of a Neural Network
Input Layer: This layer accepts input features. It provides information from the outside
world to the network, no computation is performed at this layer, nodes here just pass on
the information(features) to the hidden layer.
Hidden Layer: Nodes of this layer are not exposed to the outer world, they are part of
the abstraction provided by any neural network. The hidden layer performs all sorts of
computation on the features entered through the input layer and transfers the result to
the output layer.
Output Layer: This layer brings the information learned by the network to the outer
world.
What is an activation function and why use them?
The activation function decides whether a neuron should be activated or not by
calculating the weighted sum and further adding bias to it. The purpose of the activation
function is to introduce non-linearity into the output of a neuron.
Explanation: We know, the neural network has neurons that work in correspondence
with weight, bias, and their respective activation function. In a neural network, we
would update the weights and biases of the neurons on the basis of the error at the
output. This process is known as back-propagation. Activation functions make the
back-propagation possible since the gradients are supplied along with the error to
update the weights and biases.
Why do we need Non-linear activation function?
A neural network without an activation function is essentially just a linear regression
model. The activation function applies a non-linear transformation to the input, making the
network capable of learning and performing more complex tasks.
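The claim can be verified numerically: two layers without an activation function compute exactly the same function as one linear layer. The sketch below uses random weights of arbitrary, assumed sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # first layer: 3 inputs -> 4 units
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # second layer: 4 units -> 2 outputs
x = rng.normal(size=3)

# Two layers with no activation function ...
two_layer = W2 @ (W1 @ x + b1) + b2

# ... equal a single linear layer with merged weights and bias.
W_merged, b_merged = W2 @ W1, W2 @ b1 + b2
one_layer = W_merged @ x + b_merged

# With a ReLU between the layers, no single merged linear layer reproduces the network in general.
with_relu = W2 @ np.maximum(0.0, W1 @ x + b1) + b2

print(np.allclose(two_layer, one_layer))
```

However many linear layers are stacked, the result is still one linear map, so depth adds expressive power only when a non-linear activation sits between the layers.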
ReLU Function
• ReLU stands for Rectified Linear Unit. It is the most widely used activation
function, chiefly implemented in the hidden layers of a neural network.
• Equation :- A(x) = max(0,x). It gives an output x if x is positive and 0
otherwise.
• Value Range :- [0, inf)
• Nature :- non-linear, which means we can easily backpropagate the errors
and have multiple layers of neurons being activated by the ReLU function.
• Uses :- ReLU is less computationally expensive than tanh and sigmoid
because it involves simpler mathematical operations. At any time only some of
the neurons are activated, which makes the network sparse and therefore
efficient and easy to compute.
In simple words, networks using ReLU typically learn much faster than those using the
sigmoid or tanh function.
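A minimal sketch of the ReLU equation A(x) = max(0, x) in plain Python. The input values are made up, and treating the gradient at exactly x = 0 as 0 is a common convention assumed here, not something stated in these notes.

```python
def relu(x):
    """Return x if x is positive, otherwise 0."""
    return max(0.0, x)

def relu_derivative(x):
    """Gradient used in back-propagation: 1 for positive inputs, else 0 (convention at x = 0)."""
    return 1.0 if x > 0 else 0.0

inputs = [-3.0, -0.5, 0.0, 2.0, 7.5]      # hypothetical net inputs of five neurons
outputs = [relu(x) for x in inputs]
print(outputs)                            # [0.0, 0.0, 0.0, 2.0, 7.5]

# Fraction of neurons that output exactly zero: this is the sparsity mentioned above.
sparsity = sum(1 for y in outputs if y == 0.0) / len(outputs)
print(sparsity)                           # 0.6
```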
Softmax Function
The softmax function generalizes the sigmoid function to more than two classes, which
makes it handy for multi-class classification problems.
• Nature :- non-linear
• Uses :- Usually used when trying to handle multiple classes. The softmax
function is commonly found in the output layer of image classification
models. It exponentiates the output for each class and divides by the sum of
the exponentiated outputs, so each value lies between 0 and 1 and all values
sum to 1.
• Output :- The softmax function is ideally used in the output layer of the
classifier, where we are trying to obtain the probabilities that define the
class of each input.
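A short sketch of softmax using only the standard library. The three class scores are made up; subtracting the maximum score before exponentiating is a common numerical-stability trick assumed here and does not change the result.

```python
import math

def softmax(scores):
    """Turn a list of raw class scores into probabilities that sum to 1."""
    shift = max(scores)                       # avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])              # hypothetical scores for three classes
predicted_class = probs.index(max(probs))     # class with the highest probability
print(probs, sum(probs), predicted_class)
```

With only two classes and the second score fixed at 0, the first softmax output equals the sigmoid of the first score, which is the sense in which softmax extends sigmoid to several classes.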
• The basic rule of thumb is: if you really don't know which activation function
to use, simply use ReLU, as it is a general-purpose activation function for
hidden layers and is used in most cases these days.
• If your output is for binary classification, then the sigmoid function is a very
natural choice for the output layer.
• If your output is for multi-class classification, then softmax is very useful for
predicting the probability of each class.
Activation Functions
To put it in simple terms, an artificial neuron calculates the ‘weighted sum’ of its inputs and
adds a bias, shown in the figure below as the net input.
Mathematically,
net input = Σ (weight × input) + bias
Now the value of the net input can be anything from -inf to +inf. The neuron doesn't
really know how to bound the value and thus is not able to decide the firing pattern. Thus
the activation function is an important part of an artificial neural network. It basically
decides whether a neuron should be activated or not, and thus it bounds the value of the net
input. The activation function is a non-linear transformation that we apply to the input
before sending it to the next layer of neurons or finalizing it as output.
Types of Activation Functions: Several different types of activation functions are used in
Deep Learning. Some of them are explained below:
1. Step Function: The step function is one of the simplest kinds of activation
functions. Here we consider a threshold value, and if the value of the net input,
say y, is greater than the threshold, then the neuron is activated. Mathematically,
f(y) = 1 if y > threshold, and f(y) = 0 otherwise.
Given below is the graphical representation of the step function.
2. Sigmoid Function: Graphically, this is a smooth
function and is continuously differentiable. The biggest advantage that it has
over the step and linear functions is that it is non-linear. This is an incredibly cool
feature of the sigmoid function. It essentially means that when multiple neurons
have the sigmoid function as their activation function, the
output is non-linear as well. The function ranges from 0 to 1 and has an S shape.
3. ReLU: The ReLU function is the Rectified Linear Unit. It is the most widely
used activation function. It is defined as:
A(x) = max(0, x)
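The step and sigmoid functions described in this list can be sketched as follows. The neuron's inputs, weights, and bias are hypothetical values chosen for illustration, and the sigmoid formula 1 / (1 + e^(-y)) is the standard definition, which these notes do not spell out.

```python
import math

def net_input(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def step(y, threshold=0.0):
    """Fire (output 1) only if the net input exceeds the threshold."""
    return 1 if y > threshold else 0

def sigmoid(y):
    """Smooth, S-shaped squashing of y into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-y))

# Hypothetical neuron with two inputs
y = net_input(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.1)
print(y, step(y), sigmoid(y))
```

The step function jumps abruptly at the threshold, while the sigmoid changes gradually, which is why the sigmoid can be differentiated everywhere for back-propagation.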
1. Zero Initialization
As the name suggests, in zero initialization all the weights are assigned zero as the
initial value. This kind of initialization is highly ineffective, because all neurons in a
layer compute the same output and receive the same update, so they learn the same
feature during each iteration. The same issue occurs with any kind of constant
initialization. Thus, constant initializations are not preferred.
2. Random Initialization
b) Random Uniform: The weights are initialized from values in a uniform distribution.
3. Xavier/Glorot Initialization
In Normalized Xavier/Glorot weight initialization, the weights are drawn from a zero-mean
normal distribution whose standard deviation is scaled by the number of inputs and
outputs of the layer (fan-in and fan-out).
Xavier/Glorot Initialization, too, is suitable for layers where the activation function
used is Sigmoid.
5. He Uniform Initialization
In He Uniform weight initialization, the weights are drawn from a uniform
distribution whose range is scaled by the number of inputs to the layer (fan-in).
He Uniform Initialization is suitable for layers where the ReLU activation function is used.
6. He Normal Initialization
In He Normal weight initialization, the weights are drawn from a zero-mean normal
distribution whose standard deviation is scaled by the number of inputs to the layer (fan-in).
He Normal Initialization, too, is suitable for layers where the ReLU activation function is
used.
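The initialization schemes above can be sketched in plain Python. The exact formulas are not reproduced in these notes, so this sketch assumes the commonly cited ones: standard deviation sqrt(2 / (fan_in + fan_out)) for Glorot normal, limit sqrt(6 / fan_in) for He uniform, and standard deviation sqrt(2 / fan_in) for He normal. The function names and layer sizes are illustrative.

```python
import math
import random

def glorot_normal(fan_in, fan_out, rng):
    """Zero-mean normal weights; spread depends on both fan-in and fan-out (assumed formula)."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

def he_uniform(fan_in, fan_out, rng):
    """Uniform weights in [-limit, limit]; limit depends on fan-in only (assumed formula)."""
    limit = math.sqrt(6.0 / fan_in)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]

def he_normal(fan_in, fan_out, rng):
    """Zero-mean normal weights; spread depends on fan-in only (assumed formula)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

rng = random.Random(0)                      # fixed seed so the run is repeatable
weights = he_uniform(fan_in=784, fan_out=64, rng=rng)
print(len(weights), len(weights[0]))        # 784 64
print(min(min(row) for row in weights), max(max(row) for row in weights))
```

Unlike zero or constant initialization, every weight starts with a different value, so neurons in the same layer can learn different features.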
Batch normalization:-
Batch normalization works by normalizing the output of a previous activation
layer by subtracting the batch mean and dividing by the batch standard
deviation. After this step, the result is then scaled and shifted by two
learnable parameters, gamma and beta, which are unique to each layer.
1. Calculate the mean and variance of the activations for each feature in a mini-
batch.
2. Normalize the activations of each feature by subtracting the mini-batch mean
and dividing by the mini-batch standard deviation.
3. Scale and shift the normalized values using the learnable parameters gamma
and beta, which allow the network to undo the normalization if that is what the
learned behavior requires.
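The three steps above can be sketched for a single feature in plain Python. This is a training-time sketch only. It assumes the usual form that divides by sqrt(variance + eps), where eps is a small constant for numerical stability; the activations and the function name are made up for illustration.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-normalize one feature across a mini-batch."""
    n = len(values)
    mean = sum(values) / n                                   # step 1: mini-batch mean
    var = sum((v - mean) ** 2 for v in values) / n           # step 1: mini-batch variance
    normalized = [(v - mean) / math.sqrt(var + eps) for v in values]  # step 2: normalize
    return [gamma * x + beta for x in normalized]            # step 3: scale and shift

activations = [2.0, 4.0, 6.0, 8.0]   # one feature over a mini-batch of 4
out = batch_norm(activations)
print(out)
```

With the default gamma = 1 and beta = 0, the output has a mean close to 0 and a standard deviation close to 1. During training, gamma and beta would be updated by gradient descent like any other parameter.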
The effectiveness of batch normalization can depend on the size of the mini-
batch. Very small batch sizes can lead to inaccurate estimates of the mean and
variance, which can destabilize the training process.
• Computational Overhead: Batch normalization introduces additional
computations and parameters into the network, which can increase the
complexity and computational cost.
• Sequence Data:
One of the most common problems for data science professionals is avoiding
over-fitting. Have you come across a situation where your model performs very
well on the training data but is unable to predict the test data accurately? The
reason is that your model is overfitting. The solution to such a problem is
regularization.
Regularization techniques help to improve a model and allow it to converge
faster. We have several regularization tools at our disposal, among them early
stopping, dropout, weight initialization techniques, and batch normalization.
Regularization helps prevent over-fitting of the model, and the learning
process becomes more efficient.
Before entering into Batch normalization let’s understand the term “Normalization”.
Normalization is a data pre-processing tool used to bring the numerical data to a common
scale.
Generally, when we input the data to a machine or deep learning algorithm we tend to
change the values to a balanced scale. The reason we normalize is partly to ensure that our
model can generalize appropriately.
Now coming back to Batch normalization, it is a process that makes neural networks faster and
more stable by adding extra layers to a deep neural network. The new layer performs
the standardizing and normalizing operations on the input of a layer coming from a previous
layer.
But what is the reason behind the term “Batch” in batch normalization? A typical neural
network is trained using a collected set of input data called a batch. Similarly, the
normalizing process in batch normalization takes place in batches, not on a single input.
Let’s understand this through an example. We have a deep neural network as shown in the
following image.
Initially, our inputs X1, X2, X3, X4 are in normalized form as they are coming from the
pre-processing stage. When the input passes through the first layer, it is transformed: a
sigmoid function is applied over the dot product of input X and the weight matrix W.
Similarly, this transformation takes place for the second layer and continues up to the
last layer L, as shown in the image.
Although our input X was normalized, over time the output will no longer be on the same
scale. As the data goes through multiple layers of the neural network and L activation
functions are applied, it leads to an internal covariate shift in the data.
Since by now we have a clear idea of why we need Batch normalization, let’s understand
how it works. It is a two-step process. First, the input is normalized, and then rescaling and
offsetting are performed.
Normalization is the process of transforming the data to have a mean of zero and a standard
deviation of one. In this step we have our batch input from layer h. First, we need to
calculate the mean of this hidden activation.
Once we have the mean, the next step is to calculate the standard deviation of the
hidden activations.
With the mean and the standard deviation ready, we normalize the
hidden activations using these values. For this, we subtract the mean from each input
and divide the result by the sum of the standard deviation and the smoothing term (ε).
The smoothing term (ε) ensures numerical stability within the operation by preventing
division by zero.
Rescaling and Offsetting
In the final operation, the re-scaling and offsetting of the input take place. Here two
components of the BN algorithm come into the picture, γ (gamma) and β (beta). These
parameters are used for re-scaling (γ) and shifting (β) the vector containing values from
the previous operations.
These two are learnable parameters; during training, the neural network learns the optimal
values of γ and β. This enables accurate normalization of each batch.
Batch normalization is a technique used in deep learning that helps our models learn and
adapt quickly. It’s like a teacher who helps students by breaking down complex topics into
simpler parts.
Imagine you’re trying to hit a moving target with a dart. It would be much harder than
hitting a stationary one, right? Similarly, in deep learning, our target keeps changing during
training due to the continuous updates in weights and biases. This is known as the “internal
covariate shift”. Batch normalization helps us stabilize this moving target, making our task
easier.
Batch normalization normalizes the output of the previous layer by
subtracting the batch mean and dividing by the batch standard deviation. However, these
normalized values may not follow the original distribution. To tackle this, batch
normalization introduces two learnable parameters, gamma and beta, which can shift and
scale the normalized values.
• Speeds up learning: By reducing internal covariate shift, it helps the model train faster.
• Regularizes the model: It adds a little noise to your model, and in some cases, you might
not even need other regularization techniques such as dropout.
• Allows higher learning rates: Gradient descent usually requires small learning rates for the
network to converge. Batch normalization helps us use much larger learning rates, speeding
up the training process.
By normalizing the hidden layer activations, batch normalization speeds up the training
process.
It addresses the problem of internal covariate shift. Through this, we ensure that the input for
every layer is distributed around the same mean and standard deviation. If you are unaware
of what internal covariate shift is, consider the following example. Suppose we are training
an image classification model that classifies images as Dog or Not Dog. Let’s say we have
images of white dogs only; these images will have a certain distribution. Using these
images, the model will update its parameters.
Later, if we get a new set of images consisting of non-white dogs, these new images will
have a slightly different distribution from the previous images. Now the model will change
its parameters according to these new images. Hence the distribution of the hidden
activations will also change. This change in hidden activations is known as internal
covariate shift.
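However, according to a study by MIT researchers, batch normalization does not solve
the problem of internal covariate shift. In this study, three models were trained: a
standard network without batch normalization, the same network with batch normalization,
and the network with batch normalization plus random noise.
This random noise has non-zero mean and non-unit variance and is added after the batch
normalization layers. The experiment reached two conclusions:
• The third model has a less stable distribution across all layers. We can see the noisy model
has a higher variation than the other two models.
• The second conclusion was that the training accuracy of the second and third models is
higher than that of the first model. So it can be concluded that internal covariate shift
might not be a contributing factor in the performance of batch normalization.
Batch normalization smoothens the loss function, which in turn makes optimizing the model
parameters easier and improves the training speed of the model.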
How does Batch Normalization work?
1. During each training iteration, BN takes a mini-batch of data and
normalizes the activations (outputs) of a hidden layer. This normalization
transforms the activations to have a mean of 0 and a standard deviation of 1.
2. While normalization helps with stability, it can also disrupt the network’s
learned features. To compensate, BN introduces two learnable parameters:
gamma and beta. Gamma rescales the normalized activations, and beta shifts
them, allowing the network to recover the information present in the original
activations.
It ensures that the inputs distributed to each layer are in the right proportion: the output
of each layer is normalized before being passed to the next
layer.
Correct Batch Size:
• Reasonably sized mini-batches must be used during training. Batch
normalization performs better with larger batch sizes, as they give more
accurate batch statistics.
• This leads to more stable gradients and faster convergence.
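The effect of batch size on the accuracy of batch statistics can be illustrated with a small simulation. It assumes activations drawn from a standard normal distribution (true mean 0), and the function name, batch sizes, and seed are arbitrary choices. For each batch size it measures how much the batch mean fluctuates from one mini-batch to the next.

```python
import random
import statistics

def spread_of_batch_means(batch_size, num_batches=200, seed=0):
    """Standard deviation of the mini-batch mean across many simulated mini-batches."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_batches):
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(statistics.mean(batch))
    return statistics.pstdev(means)

small = spread_of_batch_means(batch_size=4)
large = spread_of_batch_means(batch_size=256)
print(small, large)   # the small-batch estimate fluctuates far more
```

The larger the fluctuation, the noisier the normalization applied at each step, which is why very small mini-batches can destabilize training.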
Implementing Batch Normalization in PyTorch
PyTorch provides the nn.BatchNormXd modules (where X is 1 for 1D data, 2 for 2D data
like images, and 3 for 3D data) for convenient BN implementation. In this tutorial, we
will see the implementation of batch normalization and its effect on the model. We will
train the model and highlight the loss before and after using batch normalization with the
MNIST dataset, a widely used dataset in the field of machine learning and computer vision.
This dataset consists of a collection of 28×28 pixel grayscale images of handwritten
digits ranging from 0 to 9 (inclusive) along with their corresponding labels.
Prerequisite: Install the PyTorch library:
pip install torch torchvision
Step 1: Importing necessary libraries
1. torch : Imports the PyTorch library for deep learning operations.
2. nn : Imports the neural network module from PyTorch for building neural
network architectures.
3. DataLoader : Imports the DataLoader class from PyTorch, which helps in loading the
datasets efficiently for training and testing.
4. transforms : Imports the transforms module from torchvision, which
provides common image transformations.
5. time : Imports the time module for time-related operations.
6. os : Imports the os module, which provides functions for interacting with the
operating system.
• Python3
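import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import time
import datetime
import os
Step 2: Define the MLP architecture with batch normalization layers.
• Python3
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        # Hidden-layer sizes (64 and 32) are illustrative assumptions
        self.layers = nn.Sequential(
            nn.Linear(28 * 28, 64), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Linear(64, 32), nn.BatchNorm1d(32), nn.ReLU(),
            nn.Linear(32, 10)  # Fully connected layer from 32 to 10 neurons (for MNIST classes)
        )

    def forward(self, x):
        return self.layers(x)
Step 3: The next step is loading the MNIST dataset and creating the DataLoader used to train
the simple MLP neural network architecture.
• Python3
if __name__ == '__main__':
    torch.manual_seed(47)  # Fix the random seed for reproducibility
    transform = transforms.Compose([
        transforms.ToTensor()  # Convert images to tensors with values in [0, 1]
    ])
    # Assumed: download MNIST into the current directory; batch_size=64 is an assumption
    train_dataset = datasets.MNIST(os.getcwd(), train=True, download=True, transform=transform)
    train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
Step 4: Initialize the MLP model, define the loss function (CrossEntropyLoss) and the optimizer
(Adam), and run the training loop.
• Python3
    mlp = MLP()
    loss_function = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(mlp.parameters())  # default learning rate (assumed)
    start_time = time.time()
    # Training loop (3 epochs, loss reported every 100 mini-batches)
    for epoch in range(3):
        print(f'Starting epoch {epoch + 1}')
        running_loss = 0.0
        for i, (inputs, targets) in enumerate(train_loader):
            optimizer.zero_grad()  # Reset gradients from the previous step
            outputs = mlp(inputs.view(inputs.shape[0], -1))  # Flatten the input for MLP and forward pass
            loss = loss_function(outputs, targets)
            loss.backward()  # Backpropagation
            optimizer.step()  # Update weights and biases
            running_loss += loss.item()
            if i % 100 == 99:  # Average loss over the last 100 mini-batches (assumed)
                print(f'Epoch {epoch + 1}, Mini-batch {i + 1}, Loss: {running_loss / 100}')
                running_loss = 0.0
    print('Training finished')
    training_time = time.time() - start_time
    print('Training time:', str(datetime.timedelta(seconds=training_time)))  # h:mm:ss format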
Output:
Starting epoch 1
Epoch 1, Mini-batch 100, Loss: 1.107109518647194
Epoch 1, Mini-batch 200, Loss: 0.48408970028162
Epoch 1, Mini-batch 300, Loss: 0.3104418055713177
Epoch 1, Mini-batch 400, Loss: 0.2633690595626831
Epoch 1, Mini-batch 500, Loss: 0.2228860107809305
Epoch 1, Mini-batch 600, Loss: 0.20098184436559677
Epoch 1, Mini-batch 700, Loss: 0.18423103891313075
Epoch 1, Mini-batch 800, Loss: 0.16403419613838197
Epoch 1, Mini-batch 900, Loss: 0.14670498583465816
Starting epoch 2
Epoch 2, Mini-batch 100, Loss: 0.1223447759822011
Epoch 2, Mini-batch 200, Loss: 0.11535881120711565
Epoch 2, Mini-batch 300, Loss: 0.12264159372076393
Epoch 2, Mini-batch 400, Loss: 0.1274782767519355
Epoch 2, Mini-batch 500, Loss: 0.12688526364043354
Epoch 2, Mini-batch 600, Loss: 0.10709397405385972
Epoch 2, Mini-batch 700, Loss: 0.12462730823084713
Epoch 2, Mini-batch 800, Loss: 0.10854666410945356
Epoch 2, Mini-batch 900, Loss: 0.10740736600011587
Starting epoch 3
Epoch 3, Mini-batch 100, Loss: 0.09494352690875531
Epoch 3, Mini-batch 200, Loss: 0.08548182763159275
Epoch 3, Mini-batch 300, Loss: 0.08944599309004843
Epoch 3, Mini-batch 400, Loss: 0.08315778982825578
Epoch 3, Mini-batch 500, Loss: 0.0855206391401589
Epoch 3, Mini-batch 600, Loss: 0.08882722020149231
Epoch 3, Mini-batch 700, Loss: 0.0896124207880348
Epoch 3, Mini-batch 800, Loss: 0.08545528341084718
Epoch 3, Mini-batch 900, Loss: 0.09168351721018553
Training finished
Training process has been completed.
Training time: 0:00:21.384532
Note: In the run above, the loss after mini-batch 900 of epoch 3 with batch normalization is about 0.092.
Benefits of Batch Normalization
• Faster Convergence: By stabilizing the gradients, BN allows you to use
higher learning rates, which can significantly speed up training.
• Reduced Internal Covariate Shift: As the network trains, the distribution of
activations within a layer can change (internal covariate shift). BN helps
mitigate this by normalizing activations before subsequent layers, making the
training process less sensitive to these shifts.
• Initialization Insensitivity: BN makes the network less reliant on the initial
weight values, allowing for more robust training and potentially better
performance.