UNIT II Deep Learning
In the real world, we are surrounded by humans who can learn everything from their
experiences, and we have computers or machines that work on our instructions. But can a
machine also learn from experiences or past data the way a human does? This is where
Machine Learning comes in.
Machine learning algorithms build a mathematical model from sample historical data
(training data) that helps make predictions or decisions without being explicitly
programmed. For the purpose of developing predictive models, machine learning brings
together statistics and computer science. Machine learning algorithms either construct
models from historical data or apply them to new data, and their performance improves in
proportion to the quantity of data we provide.
In short, a machine can learn if it can gain more data to improve its performance.
How does Machine Learning work?
A machine learning system builds prediction models: it learns from previous data and
predicts the output for new data whenever it receives it. The more data the system gets, the
better the model it can build, and the more accurate the predicted output.
Let's say we have a complex problem in which we need to make predictions. Instead of
writing code, we just need to feed the data to generic algorithms, which build the logic based
on the data and predict the output. Our perspective on the issue has changed as a result of
machine learning. The Machine Learning algorithm's operation is depicted in the following
block diagram:
The demand for machine learning is steadily rising. Because it is able to perform tasks that
are too complex for a person to directly implement, machine learning is required. Humans are
constrained by our inability to manually access vast amounts of data; as a result, we require
computer systems, which is where machine learning comes in to simplify our lives.
We can train machine learning algorithms by providing them with a large amount of data
and allowing them to automatically explore the data, build models, and predict the required
output. A cost function can be used to measure how well the trained algorithm performs.
Machine learning can save us both time and money.
Machine learning is broadly classified into three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1) Supervised Learning
In supervised learning, sample labeled data are provided to the machine learning system for
training, and the system then predicts the output based on the training data.
The system uses labeled data to build a model that understands the datasets and learns about
each one. After the training and processing are done, we test the model with sample data to
see if it can accurately predict the output.
The objective of supervised learning is to map input data to output data. Supervised
learning depends on supervision: it is analogous to a student learning under the guidance of
a teacher. Spam filtering is an example of supervised learning (see the sketch after this
list).
Supervised learning can be further divided into two categories of problems:
o Classification
o Regression
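As an illustration of supervised spam filtering, here is a minimal scikit-learn sketch; the library choice, toy texts, and labels are all assumptions for illustration, not part of the original text:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Labeled training data: each text comes with a known spam/not-spam label.
texts = ["win a free prize now", "meeting at 10 am tomorrow",
         "claim your free lottery ticket", "project report attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam (the supervision signal)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)      # turn texts into word-count features
model = MultinomialNB().fit(X, labels)   # learn the mapping from inputs to labels

# Predict the label of a new, unseen message.
print(model.predict(vectorizer.transform(["claim your free prize"])))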
2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any
supervision.
The training is provided to the machine with the set of data that has not been labeled,
classified, or categorized, and the algorithm needs to act on that data without any supervision.
The goal of unsupervised learning is to restructure the input data into new features or a group
of objects with similar patterns.
In unsupervised learning, we don't have a predetermined result. The machine tries to find
useful insights from the huge amount of data. It can be further classified into two categories
of algorithms:
o Clustering
o Association
3) Reinforcement Learning
Reinforcement learning is a feedback-based learning method, in which a learning agent gets a
reward for each right action and gets a penalty for each wrong action. The agent learns
automatically with these feedbacks and improves its performance. In reinforcement learning,
the agent interacts with the environment and explores it. The goal of an agent is to get the
most reward points, and hence, it improves its performance.
A robotic dog that automatically learns the movement of its limbs is an example of
reinforcement learning.
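To make the reward/penalty loop concrete, here is a minimal tabular Q-learning sketch on a made-up five-cell corridor; the environment, rewards, and hyperparameters are all illustrative assumptions:

import random

n_states, goal = 5, 4                       # corridor cells 0..4; reaching cell 4 is the goal
q = [[0.0, 0.0] for _ in range(n_states)]   # Q-values for actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.5, 0.9, 0.2       # learning rate, discount, exploration rate

for episode in range(200):
    s = 0
    while s != goal:
        # Explore occasionally; otherwise take the currently best-rated action.
        a = random.randint(0, 1) if random.random() < epsilon else q[s].index(max(q[s]))
        s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s2 == goal else -0.1     # reward for reaching the goal, penalty otherwise
        q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])  # Q-learning update
        s = s2

# Learned policy: the best action in each cell (should be "go right" everywhere).
print([q[s].index(max(q[s])) for s in range(n_states - 1)])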
The main objective of supervised learning algorithms is to learn an association between
input data samples and corresponding outputs after processing multiple training data instances.
In supervised machine learning, models are trained using a dataset that consists of input-
output pairs.
The supervised learning algorithm analyzes the dataset and learns the relation between the
input data (features) and the correct output (labels/targets). In the process of training, the
model's parameters are estimated by minimizing a loss function, which measures the
difference between the model's predictions and the actual target values. The model
iteratively updates its parameters until the loss/error has been sufficiently minimized.
Once the training is completed, the model parameters have optimal values. The model has
learned the optimal mapping/ relation between the inputs and targets. Now, the model can
predict values for the new and unseen input data.
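To make this training loop concrete, here is a minimal NumPy sketch that fits a one-feature linear model by iteratively reducing a mean-squared-error loss; the data and hyperparameters are made up for illustration:

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])   # input features
y = np.array([3.1, 4.9, 7.2, 8.8])   # labels/targets (roughly y = 2x + 1)
w, b, lr = 0.0, 0.0, 0.01            # initial parameters and learning rate

for step in range(2000):
    pred = w * X + b
    error = pred - y
    loss = np.mean(error ** 2)        # loss: gap between predictions and targets
    w -= lr * np.mean(2 * error * X)  # update parameters along the negative gradient
    b -= lr * np.mean(2 * error)

print(w, b, loss)  # w approaches 2 and b approaches 1 once the loss is minimized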
Supervised machine learning is categorized into two types of problems − classification and
regression.
1. Classification
The key objective of classification-based tasks is to predict categorical output labels or
responses (e.g., spam / not spam) for the given input data. Some popular classification
algorithms are decision trees, random forests, support vector machines (SVM), logistic
regression, etc.
2. Regression
The key objective of regression-based tasks is to predict output labels or responses, which are
continuous numeric values, for the given input data. Basically, regression models use the
input data features (independent variables) and their corresponding continuous numeric
output values (dependent or outcome variables) to learn specific associations between inputs
and corresponding outputs.
Some popular regression algorithms are linear regression, polynomial regression, Lasso
regression, etc.
Supervised learning is one of the important models of learning involved in training machines.
This chapter talks in detail about the same.
There are several algorithms available for supervised learning. Some of the widely used
algorithms of supervised learning are as shown below −
Linear Regression
k-Nearest Neighbors
Decision Trees
Naive Bayes
Logistic Regression
Support Vector Machines
Random Forest
Gradient Boosting
Let's discuss each of the above mentioned supervised machine learning algorithms in detail.
1. Linear Regression
Linear regression is an algorithm that tries to find the linear relation between input
features and output values for the prediction of future events. It is widely used in tasks
such as stock analysis and weather forecasting.
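As a minimal scikit-learn sketch (the library choice and the toy experience-vs-salary numbers are assumptions for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience vs. salary (in thousands).
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([30, 35, 41, 44, 50])

model = LinearRegression().fit(X, y)   # finds the best-fitting line
print(model.coef_, model.intercept_)   # slope and intercept of that line
print(model.predict([[6]]))            # predict an unseen input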
2. K-Nearest Neighbors
The k-Nearest Neighbors (kNN) algorithm is a statistical technique that can be used for
solving both classification and regression problems. It classifies or predicts values for new
data by calculating the distance from the new point to the points in the training data.
Let us discuss the case of classifying an unknown object using kNN. Consider the
distribution of objects as shown in the image given below −
The diagram shows three types of objects, marked in red, blue and green colors. When you
run the kNN classifier on the above dataset, the boundaries for each type of object will be
marked as shown below −
Now, consider a new unknown object you want to classify as red, green or blue. This is
depicted in the figure below.
As you can see visually, the unknown data point belongs to the class of blue objects.
Mathematically, this can be concluded by measuring the distance from this unknown point
to every other point in the data set. When you do so, you will find that most of its nearest
neighbors are blue: its average distance to the blue objects is definitely smaller than its
average distance to the red or green objects. Thus, this unknown object can be classified as
belonging to the blue class.
The kNN algorithm can also be used for regression problems. It is available ready-to-use in
most ML libraries; a minimal sketch follows.
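A minimal scikit-learn sketch of kNN classification (the dataset and the value of k are assumptions for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k=5: a new point is assigned the majority class of its 5 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))   # accuracy on unseen data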
3. Decision Trees
A decision tree is a tree-like structure used to make decisions and analyze their possible
consequences. The algorithm splits the data into subsets based on features, where each
internal node represents a decision and each leaf node represents a final prediction.
You could write code to classify your input data based on this flowchart. The flowchart is
self-explanatory and trivial: in this scenario, you are trying to classify an incoming email to
decide when to read it.
In reality, the decision trees can be large and complex. There are several algorithms available
to create and traverse these trees. As a Machine Learning enthusiast, you need to understand
and master these techniques of creating and traversing decision trees.
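A minimal sketch of the email-flowchart idea as a learned tree, using scikit-learn (the features and toy data are assumptions for illustration):

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical email features: [is_from_boss, mentions_deadline]
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = ["read now", "read now", "read later", "ignore"]

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["is_from_boss", "mentions_deadline"]))
print(tree.predict([[0, 1]]))   # traverse the learned tree for a new email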
4. Naive Bayes
Naive Bayes is used for creating classifiers. Suppose you want to sort out (classify) fruits of
different kinds from a fruit basket. You may use features such as color, size, and shape of
fruit; for example, any fruit that is red in color, round in shape, and about 10 cm in diameter
may be considered an Apple. So to train the model, you would use these features and test the
probability that a given feature matches the desired constraints. The probabilities of different
features are then combined to arrive at the probability that a given fruit is an Apple. Naive
Bayes generally requires only a small amount of training data for classification.
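A minimal sketch of the fruit example with scikit-learn's Gaussian Naive Bayes (the numeric feature encoding and toy values are assumptions for illustration):

from sklearn.naive_bayes import GaussianNB

# Hypothetical fruit features: [redness (0-1), roundness (0-1), diameter in cm]
X = [[0.9, 0.95, 10], [0.85, 0.9, 9], [0.2, 0.3, 18], [0.95, 0.8, 4]]
y = ["apple", "apple", "banana", "strawberry"]

# Combines per-feature probabilities, naively assuming the features are independent.
nb = GaussianNB().fit(X, y)
print(nb.predict([[0.88, 0.92, 9.5]]))   # most probable fruit for these features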
5. Logistic Regression
Logistic regression is a type of statistical algorithm that estimates the probability of
occurrence of an event.
Look at the following diagram. It shows the distribution of data points in the XY plane.
From the diagram, we can visually inspect the separation of red and green dots. You may
draw a boundary line to separate out these dots. Now, to classify a new data point, you will
just need to determine on which side of the line the point lies.
6. Support Vector Machines
Look at the following distribution of data. Here the three classes of data cannot be linearly
separated; the boundary curves are non-linear, and finding the curve's equation becomes a
complex job. Support Vector Machines (SVM) come in handy in determining the separation
boundaries in such situations.
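A minimal sketch contrasting the two ideas on synthetic, non-linearly-separable data (the dataset choice and parameters are assumptions for illustration):

from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # two interleaved classes

linear = LogisticRegression().fit(X, y)   # straight-line decision boundary
svm = SVC(kernel="rbf").fit(X, y)         # curved boundary via the kernel trick

print("logistic:", linear.score(X, y))    # lower: the classes aren't linearly separable
print("rbf svm: ", svm.score(X, y))       # higher: the curved boundary fits the moons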
7. Random Forest
Random forest is also a supervised learning algorithm, flexible enough for both
classification and regression. It combines multiple decision trees, whose predictions are
merged to improve the accuracy of the final prediction.
The following diagram illustrates how the Random Forest Algorithm works −
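In code, a minimal scikit-learn sketch looks like this (the dataset and parameters are assumptions for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 decision trees, each trained on a random bootstrap sample of the data;
# their individual votes are merged into one final prediction.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict(X[:3]))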
8. Gradient Boosting
Gradient boosting combines weak learners (typically shallow decision trees) to create a
strong model. It builds new models that correct the errors of the previous ones, with the
goal of minimizing the loss function. It can be efficiently used for both classification and
regression tasks, as sketched below.
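A minimal scikit-learn sketch (the dataset and parameters are assumptions for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new shallow tree is fit to the residual errors of the ensemble so far,
# gradually minimizing the loss function.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1).fit(X_train, y_train)
print(gb.score(X_test, y_test))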
Advantages of Supervised Learning
Supervised learning algorithms are among the most popular machine learning models.
Some of their benefits are:
The goal in supervised learning is well-defined, which improves the prediction accuracy.
Models trained using supervised learning are effective at prediction and classification,
since they use labeled datasets.
They are highly versatile, i.e., applicable to various problems, like spam detection, stock
price prediction, etc.
Disadvantages of Supervised Learning
Though supervised learning is the most used, it comes with certain challenges too. Some of
them are:
Supervised learning requires a large amount of labeled data for the model to train effectively.
It is practically very difficult to collect such huge data; it is expensive and time-consuming.
Supervised learning cannot predict accurately if the test data is different from the training
data.
Accurately labeling the data is complex and requires expertise and effort.
Supervised learning models are widely used in many applications in various sectors,
including the following:
Image recognition − A model is trained on a labeled dataset of images, where each image is
associated with a label. The model is fed with data, which allows it to learn patterns and
features. Once trained, the model can now be tested using new, unseen data. This is widely
used in applications like facial recognition and object detection.
Predictive analytics − Supervised learning algorithms are used to train labeled historical
data, allowing the model to learn patterns and relations between input features and output to
identify trends and make accurate predictions. Businesses use this method to make data-
driven decisions and enhance strategic planning.
Unlike supervised machine learning, unsupervised machine learning models are trained on
unlabeled datasets. Unsupervised learning algorithms are handy in scenarios in which we do
not have the liberty, as in supervised learning, of pre-labeled training data, and we want to
extract useful patterns from input data.
There are many approaches used in unsupervised machine learning, including association,
clustering, and dimensionality reduction. Some examples of unsupervised machine learning
algorithms include K-means clustering, principal component analysis (PCA), and the
Apriori algorithm.
In regression, we train the machine to predict a future value. In classification, we train the
machine to classify an unknown object into one of the categories we define. In short, we
have been training machines so that they can predict Y for our data X. Given a huge data
set without predefined categories, it would be difficult to train the machine using
supervised learning. What if the machine could look up and analyze big data running into
several Gigabytes and Terabytes and tell us that this data contains so many distinct
categories?
As an example, consider voter data. By considering some inputs from each voter (these
are called features in AI terminology), let the machine predict that there are so many voters
who would vote for X political party and so many who would vote for Y, and so on. Thus, in
general, we are asking the machine, given a huge set of data points X: "What can you tell
me about X?", "What are the five best groups we can make out of X?", or "What three
features occur together most frequently in X?".
In the training process, the algorithms learn and infer their own rules on the basis of the
similarities, patterns, and differences of data points. The algorithms learn without any labels
(target values) or pre-training.
The outcome of this training process is a machine learning model. As the data sets are
unlabeled (no target values, no human supervision), the model is an unsupervised machine
learning model.
Now the model is ready to perform unsupervised learning tasks such as clustering,
association, or dimensionality reduction.
Unsupervised learning models are suitable for complex tasks, like organizing large datasets
into clusters.
Unsupervised learning methods or approaches are broadly categorized into three categories −
clustering, association, and dimensionality reduction. Let us discuss these methods briefly
and list some related algorithms −
1. Clustering
Clustering is a technique used to group a set of objects or data points into clusters based on
their similarities. The goal is that data points within the same cluster are more similar to
each other than to points in other clusters.
Clustering is one of the popular unsupervised learning approaches. There are several
unsupervised learning algorithms used for clustering like −
K-Means Clustering − This algorithm assigns each data point to one of K clusters based on
its distance from the cluster centers. After assigning each data point to a cluster, new
centroids are recalculated. This process is iterated until the centroids no longer change,
which indicates that the clusters are stable (a code sketch follows this list).
Mean Shift Algorithm − It is a clustering technique that identifies clusters by finding areas
of high data density. It is an iterative process in which each data point is shifted towards
the densest area of the data.
Gaussian Mixture Models − It is a probabilistic model that is a combination of multiple
Gaussian distributions. These models are used to determine which distribution a given data
point belongs to.
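Here is the K-Means sketch promised above, using scikit-learn (the toy points and K = 2 are assumptions for illustration):

from sklearn.cluster import KMeans
import numpy as np

# Unlabeled 2-D points forming two loose groups.
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Iteratively assigns points to the nearest centroid and recomputes centroids
# until the assignments stop changing.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster index for each point
print(kmeans.cluster_centers_)   # final, stable centroids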
2. Association
This is a rule-based technique used to discover associations between parameters in a large
dataset. It is popularly used for Market Basket Analysis, allowing companies to make better
decisions and build recommendation engines. One of the main algorithms used for
Association Rule Mining is the Apriori algorithm.
Apriori Algorithm
The Apriori algorithm is a technique used in unsupervised learning to identify itemsets that
are frequently repeated and to discover association rules within transactional data.
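A minimal sketch using the third-party mlxtend library (assuming it is installed; the toy transactions are made up for illustration):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["bread", "milk"], ["bread", "butter"], ["milk", "butter"],
                ["bread", "milk", "butter"]]

# One-hot encode the transactions into a boolean item matrix.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

frequent = apriori(df, min_support=0.5, use_colnames=True)   # frequently repeated itemsets
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "confidence"]])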
3. Dimensionality Reduction
As the name suggests, dimensionality reduction is used to reduce the number of feature
variables for each data sample by selecting a set of principal or representative features.
A question arises here: why do we need to reduce the dimensionality? The reason is the
problem of feature space complexity, which arises when we start analyzing and extracting
millions of features from data samples. This problem is generally referred to as the "curse
of dimensionality". Some popular unsupervised learning algorithms used for dimensionality
reduction are Principal Component Analysis (PCA), autoencoders, and Singular Value
Decomposition (SVD).
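For instance, a minimal PCA sketch in scikit-learn (the dataset is an assumption for illustration):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)   # 4 feature variables per sample

# Project onto the 2 principal components that retain the most variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # variance retained by each component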
Algorithms are a very important part of machine learning model training. A machine
learning algorithm is a set of instructions that a program follows to analyze the data and
produce the outcomes. For specific tasks, suitable machine learning algorithms are selected
and trained on the data.
Algorithms used in unsupervised learning generally fall under one of the three categories −
clustering, association, or dimensionality reduction. The following are the most used
unsupervised learning algorithms −
K-Means Clustering
Hierarchical Clustering
Mean-shift Clustering
DBSCAN Clustering
HDBSCAN Clustering
BIRCH Clustering
Affinity Propagation
Agglomerative Clustering
Apriori Algorithm
Eclat algorithm
FP-growth algorithm
Principal Component Analysis (PCA)
Autoencoders
Singular value decomposition (SVD)
Unsupervised learning has many advantages that make it particularly useful in various
tasks −
No labeled data required − Unsupervised learning doesn't require a labeled dataset for
training, which makes it easier and cheaper to use.
Discovers hidden patterns − It helps in recognizing patterns and relationships in large data,
which can lead to gaining insights and efficient decision-making.
Suitable for complex tasks − It is efficiently used for various complex tasks like clustering,
anomaly detection, and dimensionality reduction.
While unsupervised learning has many advantages, some challenges can occur too while
training the model without human intervention. Some of the disadvantages of unsupervised
learning are:
Difficult to evaluate − Without labeled data and predefined targets, it would be difficult to
evaluate the performance of unsupervised learning algorithms.
Inaccurate outcomes − The outcome of an unsupervised learning algorithm might be less
accurate, especially if the input data is noisy; moreover, since the data is not labeled, the
algorithms do not know the exact output.
Unsupervised learning provides a path for businesses to identify patterns in large volumes
of data. Real-world applications of unsupervised learning include customer segmentation
(clustering), Market Basket Analysis and recommendation engines (association), and
anomaly detection.
Feedforward Neural Networks
In a feedforward neural network, data enters at the input nodes, travels through the hidden
layers, and exits at the output nodes. The network lacks feedback links, so information
leaving the output nodes is never sent back into the network.
The feedforward network maps y = f(x; θ) and learns the value of θ that most closely
approximates the function.
For example, feedforward neural networks are the foundation for photo object detection in
the Google Photos app.
Input layer
It contains the neurons that receive the input. The data is subsequently passed on to the
next layer. The input layer's total number of neurons equals the number of variables in the
dataset.
Hidden layer
This is the intermediate layer, which is concealed between the input and output layers. It
has many neurons that alter the inputs and then communicate with the output layer.
Output layer
It is the last layer and depends on the model's construction. The output layer produces the
network's final prediction, i.e., the desired outcome.
Neuron weights
Weights describe the strength of the connection between neurons. A weight is typically
initialized to a small random value (often between 0 and 1) and is adjusted as the network
learns.
Each layer computes its output as a = f(Wx + b), where:
W = weight matrix
b = biases
a = output vector
x = input
f = activation function
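As a quick NumPy sketch of this computation (layer sizes and values are made up for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # input vector (3 features)
W = np.random.rand(4, 3)         # weights of a layer with 4 neurons
b = np.zeros(4)                  # biases

a = sigmoid(W @ x + b)           # output vector of the layer
print(a)                         # feeds forward into the next layer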
The Gradient Descent algorithm repeatedly calculates the next point using the gradient at
the current location, scales it by a learning rate, and subtracts the obtained value from the
current position (makes a step). It subtracts the value because we want to minimize the
function (to maximize it, we would add). This procedure may be written as:
p(n+1) = p(n) − η ∇f(p(n))
There's a crucial parameter η, which scales the gradient and hence controls the step size. In
machine learning, it is termed the learning rate and substantially affects performance.
The smaller the learning rate, the longer GD takes to converge, or it may reach the
maximum number of iterations before finding the optimal point.
If the learning rate is too large, the algorithm may not converge to the optimal point (it
jumps around) or may even diverge altogether.
The Gradient Descent method's steps are:
1. Choose a starting point (initialization).
2. Calculate the gradient at this point.
3. Make a scaled step in the opposite direction to the gradient.
4. Repeat steps 2 and 3 until the maximum number of iterations is reached or the step size
becomes smaller than the tolerance.
The following is an example of how to construct the Gradient Descent algorithm (with
step tracking):
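A minimal sketch of such a routine (an illustrative reconstruction, recording every visited point so the steps can be inspected later):

import numpy as np

def gradient_descent(start, gradient, learn_rate, max_iter, tol=0.01):
    x = start
    steps = [start]                       # track each visited point
    for _ in range(max_iter):
        diff = learn_rate * gradient(x)   # scale the gradient by the learning rate
        if np.abs(diff) < tol:            # stop once the step is small enough
            break
        x = x - diff                      # step against the gradient
        steps.append(x)
    return steps, x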
Let us now write the example function and its gradient in Python:

def func1(x):
    return x**2 - 4*x + 1

def gradient_func1(x):
    return 2*x - 4
With a learning rate of 0.1 and a starting point of x=9, we can compute each step
manually for this function. Let us begin with the first three steps:
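Applying the update x(n+1) = x(n) − η·f′(x(n)) with η = 0.1:

Step 1: x1 = 9 − 0.1·(2·9 − 4) = 9 − 1.4 = 7.6
Step 2: x2 = 7.6 − 0.1·(2·7.6 − 4) = 7.6 − 1.12 = 6.48
Step 3: x3 = 6.48 − 0.1·(2·6.48 − 4) = 6.48 − 0.896 = 5.584

Each step is smaller than the last, because the gradient shrinks as x approaches the
minimum at x = 2.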
The animation below illustrates the GD algorithm's steps at learning rates of 0.1 and 0.8.
As the algorithm approaches the minimum, the steps become steadily smaller. At the larger
learning rate, the algorithm jumps from one side of the minimum to the other before
converging. These are the first ten steps taken by GD for the small and the large learning
rate.
The following diagram illustrates the trajectory, number of iterations, and final converged
result (within tolerance) for various learning rates:
For a network to learn, you must experiment with the weights: a small variation in a weight
should have only a negligible effect on the output. On the other hand, what if a minor
change in a weight results in a large change in the output? The sigmoid neuron model can
resolve this issue, because its smooth activation makes the output change gradually with
the weights. Such neural networks are utilized in a wide variety of applications.
Overfitting can lead to poor performance on new data, especially in the presence of
outliers or noise in the training set. This section addresses these concerns by exploring
techniques such as regularization in deep learning and the essential concepts of bagging,
boosting, and stacking in ensemble learning to improve model generalization.
What is Regularization?
Regularization is a technique that adds a penalty term to the model's loss function. This
penalty discourages the model from becoming too complex or having large parameter
values, which helps control the model's ability to fit noise in the training data. Common
techniques include L1/L2 regularization, dropout, early stopping, and more. By applying
regularization for deep learning, models become more robust and better at making accurate
predictions on unseen data.
Before we deep dive into the topic, take a look at this image:
Have you seen this image before? As we move towards the right in this image, our model
tries to learn the details and the noise in the training data too well, ultimately resulting in
poor performance on unseen data.
In other words, while going toward the right, the complexity of the model increases such
that the training error reduces but the testing error doesn't. This is shown in the image
below:
If you’ve built a neural network before, you know how complex they are. This makes
them more prone to overfitting.
Regularization is a technique that modifies the learning algorithm slightly so that the
model generalizes better. This, in turn, improves the model’s performance on unseen
data as well.
Let’s consider a neural network that is overfitting on the training data as shown in the
image below:
If you have studied the concept of regularization in machine learning, you will
have a fair idea that regularization penalizes the coefficients. In deep learning, it
penalizes the weight matrices of the nodes.
Assume that our regularization coefficient is so high that some of the weight
matrices are nearly equal to zero.
This will result in a much simpler linear network and slight underfitting of the training
data.
Such a large value of the regularization coefficient is not that useful. We need to
optimize the value of the regularization coefficient to obtain a well-fitted model as
shown in the image below:
Now that we understand how regularization helps reduce overfitting, we’ll learn a few
different techniques for applying regularization in deep learning.
L1 & L2 Regularization
L1 and L2 are the most common types of regularization in deep learning. These update the
general cost function by adding another term known as the regularization term:
Cost function = Loss + Regularization term
Due to the addition of this regularization term, the values of the weight matrices decrease,
because it assumes that a neural network with smaller weight matrices leads to simpler
models, which in turn reduces overfitting. The regularization term, however, differs
between L1 and L2.
For L2: Cost function = Loss + (λ/2m) Σ ||w||²
For L1: Cost function = Loss + (λ/2m) Σ ||w||
In L2, we have ||w||² = Σ wᵢ². This is known as ridge regression, where lambda is the
regularization parameter; it is a hyperparameter whose value is tuned for better results. L2
regularization is also known as weight decay, as it forces the weights to decay towards zero
(but not exactly zero).
In L1, we have ||w|| = Σ |wᵢ|. Here we penalize the absolute value of the weights. Unlike
L2, the weights may be reduced to exactly zero. L1 regularization is also called lasso
regression, and it is very useful when we are trying to compress our model.
In Keras, we can directly apply regularization to any layer using the regularizers module,
as sketched below.
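For example, an L2 penalty can be attached to a layer like this (a minimal sketch assuming an existing Sequential model named model):

from keras import regularizers
from keras.layers import Dense

# model is an existing Sequential model; the L2 penalty is added to this layer's weights.
model.add(Dense(64, input_dim=64,
                kernel_regularizer=regularizers.l2(0.01)))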
Note: Here the value 0.01 is the value of regularization parameter, i.e., lambda, which
we need to optimize further. We can optimize it using the grid-search method.
Similarly, we can also apply L1 regularization in deep learning. We will look at this in
more detail in a case study later in this section.
Dropout
This is one of the most interesting regularization techniques. It also produces very good
results and is consequently the most frequently used regularization technique in the field of
deep learning.
To understand dropout, let’s say our neural network structure is akin to the one shown
below:
So what does dropout do? At every iteration, it randomly selects some nodes and
removes them along with all of their incoming and outgoing connections as
shown below:
Each iteration has a different set of nodes, which results in a different set of
outputs. This can also be thought of as an ensemble technique in machine learning.
Ensemble models usually perform better than a single model as they capture more
randomness. Similarly, dropout models also perform better than normal neural network
models.
The probability of dropping a node is the hyperparameter of the dropout function. As seen
in the image above, dropout can be applied to both the hidden layers and the input layer.
In Keras, we can implement dropout using the Dropout core layer. Below is the Python
code for it:
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential([
    Dense(hidden1_num_units, input_dim=input_num_units, activation='relu'),
    Dropout(0.25),   # randomly drop 25% of the previous layer's units each iteration
    Dense(output_num_units, activation='softmax'),
])
As you can see, we have defined 0.25 as the probability of dropping. We can tune it
further for better results using the grid search method.
Data Augmentation
The simplest way to reduce overfitting is to increase the training data size. In traditional
machine learning, however, increasing the training data size is often impossible because
labeled data is too costly.
But now, let’s consider we are dealing with images. In this case, there are a few ways of
increasing the size of the training data—rotating the image, flipping, scaling, shifting,
etc. In the image below, some transformation has been done on the handwritten digits
dataset.
This technique is known as data augmentation. It usually provides a big leap in
improving the accuracy of the model, and it can be considered a mandatory trick to
improve our predictions. In Keras, the ImageDataGenerator utility offers a big list of
arguments that you can use to pre-process your training data, as sketched below.
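A minimal sketch (the specific argument values are assumptions for illustration):

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,       # randomly rotate images by up to 20 degrees
    width_shift_range=0.1,   # randomly shift horizontally
    height_shift_range=0.1,  # randomly shift vertically
    zoom_range=0.1,          # randomly zoom in/out
    horizontal_flip=True)    # randomly flip images

# datagen.flow(x_train, y_train, batch_size=32) then yields augmented batches for training.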
Early Stopping
Early stopping is a cross-validation strategy in which we keep one part of the training set
as the validation set. When we see that the performance on the validation set is getting
worse, we immediately stop training the model. This is known as early stopping.
In Keras, we can apply early stopping using callbacks. Below is sample code for it:
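A minimal sketch, following the text's 'val_err' quantity (in practice Keras monitors built-in quantities such as 'val_loss'; the model and data names here are assumptions):

from keras.callbacks import EarlyStopping

# Stop training once 'val_err' has not improved for 5 consecutive epochs.
es = EarlyStopping(monitor='val_err', patience=5)
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=100, callbacks=[es])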
Here, monitor denotes the quantity that needs to be tracked, and 'val_err' denotes the
validation error.
Patience denotes the number of epochs with no further improvement, after which
training stops. For a better understanding, look at the image above again: after the dotted
line, each epoch results in a higher validation error. Therefore, our model will stop 5
epochs after the dotted line (since our patience equals 5), because it sees no further
improvement.
Optimizers
Optimizers guide the training process of neural networks by iteratively refining the weights
and biases based on the feedback received from the data. Well-known optimizers in deep
learning come equipped with distinct update rules, learning rates, and momentum
strategies, all geared towards the overarching goal of discovering and converging upon
optimal model parameters.
Gradient Descent
Gradient Descent can be considered the popular kid among the class of optimizers. It is an
iterative algorithm that consistently modifies the parameter values to reach the local
minimum. Before moving on to the details, consider an analogy.
In simple terms, consider holding a ball at the top of a bowl. When you release the ball, it
rolls along the steepest direction and eventually settles at the bottom of the bowl. The
gradient points the ball in the steepest direction toward the local minimum, which is the
bottom of the bowl.
The parameters are updated as w = w − α ∇L(w). Here alpha is the step size that represents
how far to move against each gradient with each iteration.
The procedure can be summarized as:
1. Initialize Coefficients: Start with initial values for the coefficients.
2. Evaluate Cost: Calculate the cost for the current coefficients.
3. Search for Lower Cost: Look for a cost value lower than the current one.
4. Update Coefficients: Step against the gradient to update the coefficients' values, and
repeat until the cost stops decreasing.
Gradient descent works best for most purposes. However, it has some downsides too. It is
expensive to calculate the gradients if the data is huge. Gradient descent works well for
convex functions, but for non-convex functions it does not know how to escape a local
minimum.
Stochastic Gradient Descent (SGD)
At the end of the previous section, you learned why there might be better options than
using gradient descent on massive data. To tackle the challenges large datasets pose, we
have stochastic gradient descent: instead of processing the entire dataset during each
iteration, we randomly select small batches of data, so only a few samples from the dataset
are considered at a time.
The procedure is first to select the initial parameters w and learning rate n, then randomly
shuffle the data at each iteration and update the parameters until convergence.
Since we are not using the whole dataset but batches of it for each iteration, the path taken
by the algorithm is full of noise compared to the gradient descent algorithm. Thus, SGD
uses a higher number of iterations to reach the optimal minimum, and the overall
computation time increases. But even after increasing the number of iterations, the
computation cost is still less than that of the gradient descent optimizer. So the conclusion
is: if the data is enormous, prefer SGD over the batch gradient descent algorithm.
SGD with Momentum
As discussed in the earlier section, stochastic gradient descent takes a much noisier path
than the gradient descent algorithm. Due to this, it requires a higher number of iterations to
reach the optimal minimum, and hence computation is very slow. To overcome this
problem, we use stochastic gradient descent with a momentum algorithm.
What the momentum does is help the loss function converge faster. Plain SGD computes
the gradient at each step and updates the weights accordingly; adding a fraction of the
previous update to the current update makes the process a bit faster. One thing to
remember while using this algorithm is that the learning rate should be decreased when
using a high momentum, as sketched below.
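The update rule can be sketched in NumPy as follows (the quadratic example and all constants are illustrative assumptions):

import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, gamma=0.9):
    velocity = gamma * velocity + lr * grad(w)  # fraction of the previous update + new gradient
    w = w - velocity                            # move by the accumulated velocity
    return w, velocity

w, v = 9.0, 0.0
for _ in range(50):
    w, v = sgd_momentum_step(w, lambda x: 2*x - 4, v)  # same quadratic example as before
print(w)   # converges toward the minimum at x = 2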
In the above image, the left part shows the convergence graph of the stochastic gradient
descent algorithm, while the right side shows SGD with momentum. From the image, you
can compare the paths chosen by both algorithms and see that using momentum helps reach
convergence in less time. You might be thinking of using a large momentum and learning
rate to make the process even faster. But remember that while increasing the momentum,
the possibility of overshooting the optimal minimum also increases, and this might result in
poor accuracy and even more oscillations.
Mini-Batch Gradient Descent
In mini-batch gradient descent, we use a subset of the dataset to calculate the loss function.
Since we use a batch of data instead of the whole dataset, we need fewer iterations. That is
why mini-batch gradient descent is faster than both stochastic gradient descent and batch
gradient descent. It is more efficient and robust than the earlier variants of gradient descent.
As the algorithm uses batching, you do not need to load all the training data into memory,
which makes the process more efficient to implement. Moreover, the cost function in mini-
batch gradient descent is noisier than in batch gradient descent but smoother than in
stochastic gradient descent. Because of this, mini-batch gradient descent is ideal and
provides a good balance between speed and accuracy.
Despite all that, the mini-batch gradient descent algorithm has some downsides too. The
mini-batch size is an extra hyperparameter that needs to be tuned, and no single batch size
works well in almost every case. Also, in some cases, it results in poor final accuracy. Due
to this, there is a need to look for better alternatives.
Adagrad (Adaptive Gradient Descent)
The adaptive gradient descent algorithm is slightly different from the other gradient
descent algorithms, because it uses a different learning rate for each iteration. The change
in learning rate depends on how much the parameters change during training: the more a
parameter gets updated, the smaller its learning rate becomes. This modification is highly
beneficial because real-world datasets contain sparse as well as dense features, so it is
unfair to use the same learning rate for all the features. Adagrad updates the weights by
dividing the learning rate by the square root of the sum of squared past gradients; here
alpha(t) denotes the per-iteration learning rate, and epsilon is a small positive value added
to avoid division by 0, as sketched below.
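The per-parameter adaptation can be sketched in NumPy (an illustrative simplification, not the exact formula from the missing image):

import numpy as np

def adagrad_step(w, grad, g_sum, lr=1.0, eps=1e-8):
    g = grad(w)
    g_sum = g_sum + g**2                        # accumulate squared gradients
    w = w - (lr / (np.sqrt(g_sum) + eps)) * g   # effective step shrinks as g_sum grows
    return w, g_sum

w, G = 9.0, 0.0
for _ in range(100):
    w, G = adagrad_step(w, lambda x: 2*x - 4, G)
print(w)   # moves toward the minimum at x = 2 with automatically decaying steps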
The benefit of using Adagrad is that it abolishes the need to modify the learning rate
manually. It is more reliable than gradient descent algorithms and their variants.
One downside of the AdaGrad optimizer is that it decreases the learning rate aggressively
and monotonically. There might be a point when the learning rate becomes extremely
small, because the sum of squared gradients in the denominator keeps increasing. Such
small learning rates prevent the model from acquiring further knowledge, so its accuracy
stops improving.
RMSprop
RMSprop is one of the popular optimizers among deep learning enthusiasts. This is maybe
because it has never been formally published but is still very well-known in the
community. RMSprop is an extension of Rprop (resilient propagation), which resolves the
problem of varying gradients: some gradients are small while others may be huge, so
defining a single learning rate might not be the best idea. Rprop uses the sign of the
gradient, adapting the step size individually for each weight. In this algorithm, two
successive gradients are first compared for their signs. If they have the same sign, we're
going in the right direction, so the step size is increased by a small fraction. If they have
opposite signs, we must decrease the step size. Then the step size is limited, and the weight
update can be performed.
The problem with Rprop is that it doesn't work well with large datasets and mini-batch
updates; RMSprop resolves this by dividing the learning rate by a moving average of the
squared gradients for each weight.
AdaDelta
AdaDelta is another variant based on adaptive learning, designed to deal with significant
drawbacks of the AdaGrad and RMSprop optimizers. The main problem with the above
two optimizers is that the initial learning rate must be defined manually. Another problem
is the decaying learning rate, which becomes infinitesimally small at some point; due to
this, after a certain number of iterations, the model can no longer acquire new knowledge.
To deal with these problems, AdaDelta uses two state variables: one stores a leaky average
of the second moment of the gradients, and the other stores a leaky average of the second
moment of the parameter updates themselves. Here delta X(t−1) denotes the leaky average
of the squared rescaled gradients, and epsilon represents a small positive value added to
avoid division by zero.
Adam
The name 'Adam' comes from 'adaptive moment estimation,' highlighting its ability to
adaptively adjust the learning rate for each network weight individually. Unlike SGD,
which maintains a single learning rate throughout training, Adam keeps a separate adaptive
learning rate for each parameter. Like RMSprop, the Adam optimizer considers the second
moment of the gradients, but unlike RMSprop it also keeps an exponentially decaying
average of the first moment (the mean of the gradients). By incorporating both the first
moment (mean) and the second moment (uncentered variance) of the gradients, Adam
computes adaptive per-weight updates that can efficiently navigate the optimization
landscape during training. This makes it a robust default choice for training a deep neural
network, as sketched below.
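Both moments can be sketched in NumPy (an illustrative simplification of the published update rule):

import numpy as np

def adam_step(w, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    g = grad(w)
    m = b1 * m + (1 - b1) * g       # first moment: moving average of gradients
    v = b2 * v + (1 - b2) * g**2    # second moment: moving average of squared gradients
    m_hat = m / (1 - b1**t)         # bias correction for the early iterations
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-weight adaptive step
    return w, m, v

w, m, v = 9.0, 0.0, 0.0
for t in range(1, 201):
    w, m, v = adam_step(w, lambda x: 2*x - 4, m, v, t)
print(w)   # approaches the minimum at x = 2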
The Adam optimizer has several benefits, due to which it is used widely. It is simple to
implement, has a faster running time, low memory requirements, and requires less tuning
than other optimization algorithms.
If the Adam optimizer combines the good properties of all these algorithms and is the
best available optimizer, then why shouldn't you use Adam in every application? And why
learn about the other algorithms in depth? Because even Adam has some downsides. It
tends to focus on faster computation time, whereas algorithms like stochastic gradient
descent focus on the data points. That's why algorithms like SGD may generalize the data
in a better manner, at the cost of slower computation.
Hands-on Optimizers
We have covered enough theory; it's time to try what we have learned and compare the
results by training the same model with different optimizers. For keeping things simple,
what's better than the MNIST dataset? We will train a simple model using some basic
layers, keeping the batch size and epochs the same but varying the optimizer. For the sake
of fairness, we will use the default values for each optimizer.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

num_classes = 10
epochs = 10
input_shape = (28, 28, 1)

def build_model(optimizer):
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

# Assumes x_train, y_train, x_test, y_test hold the preprocessed MNIST arrays.
optimizers = ['sgd', 'rmsprop', 'adadelta', 'adam']  # assumed set; add SGD(momentum=0.9) for the momentum run
for i in optimizers:
    model = build_model(i)
    model.fit(x_train, y_train, batch_size=64, epochs=epochs, validation_data=(x_test, y_test))
We have run our model with a batch size of 64 for 10 epochs. After trying the
different optimizers, the results we get are pretty interesting. Before analyzing
the results, what do you think will be the best optimizer for this dataset?
Table Analysis
Optimizer         | Epoch 1 (Val acc | Val loss) | Epoch 5 (Val acc | Val loss) | Epoch 10 (Val acc | Val loss) | Total Time
SGD with momentum | .9168 | .2929               | .9585 | .1421               | .9697 | .1008                | 7:04 min
The above table shows the validation accuracy and loss at different epochs. It
also contains the total time that the model took to run on 10 epochs for each
optimizer. From the above table, we can make the following analysis.
The adam optimizer shows the best accuracy in a satisfactory amount of time.
RMSprop shows similar accuracy to that of Adam but with a comparatively much longer computation time.
Surprisingly, the SGD algorithm took the least time to train and produced good
results as well. But to reach the accuracy of the Adam optimizer, SGD will
require more iterations, and hence the computation time will increase.
SGD with momentum shows similar accuracy to SGD but with an unexpectedly larger
computation time. This means the value of momentum taken needs to be optimized.
Adadelta shows poor results both with accuracy and computation time.
You can analyze the accuracy of each optimizer at each epoch from the accompanying
graph. We've now reached the end of this comprehensive guide; the table and graph
summarize how each optimizer trades accuracy against training time.