UNIT II – MACHINE LEARNING AND DEEP NETWORKS

Machine Learning: Basics

What is Machine Learning

In the real world, humans learn from their experiences, while computers and machines simply follow our instructions. But can a machine also learn from experience or past data the way a human does? This is where Machine Learning comes in.

Introduction to Machine Learning

A subset of artificial intelligence known as machine learning focuses primarily on the creation of algorithms that enable a computer to learn independently from data and previous experiences. Arthur Samuel first used the term "machine learning" in 1959. It can be summarized as follows:

Without being explicitly programmed, machine learning enables a machine to automatically learn from data, improve performance from experience, and predict things.

Machine learning algorithms build a mathematical model that, without being explicitly programmed, helps in making predictions or decisions with the assistance of sample historical data, known as training data. For the purpose of developing predictive models, machine learning brings together statistics and computer science. Machine learning either constructs or uses algorithms that learn from historical data, and performance rises in proportion to the quantity of information we provide.

A machine can learn if it can gain more data to improve its performance.
How does Machine Learning work

A machine learning system builds prediction models by learning from previous data, and predicts the output for new data whenever it receives it. The more data that is available, the better the model that can be built, and the more accurate the predicted output will be.

Let's say we have a complex problem in which we need to make predictions. Instead of writing code, we just feed the data to generic algorithms, which build the logic based on the data and predict the output. Machine learning has changed our perspective on such problems. The operation of a machine learning algorithm can be summarized in a simple block diagram: past data is fed to the learning algorithm, which builds a model that is then used to produce the output for new inputs.

Features of Machine Learning:

o Machine learning uses data to detect various patterns in a given dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is quite similar to data mining, as both deal with huge amounts of data.

Need for Machine Learning

The demand for machine learning is steadily rising. Because it is able to perform tasks that
are too complex for a person to directly implement, machine learning is required. Humans are
constrained by our inability to manually access vast amounts of data; as a result, we require
computer systems, which is where machine learning comes in to simplify our lives.

By providing them with a large amount of data and allowing them to automatically explore the data, build models, and predict the required output, we can train machine learning algorithms. The performance of a machine learning algorithm can be measured with a cost function, and it improves with the amount of data provided. We can save both time and money by using machine learning.

Following are some key points which show the importance of Machine Learning:

o Rapid increase in the production of data
o Solving complex problems which are difficult for a human
o Decision making in various sectors, including finance
o Finding hidden patterns and extracting useful information from data.

Classification of Machine Learning

At a broad level, machine learning can be classified into three types:

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning

1) Supervised Learning
In supervised learning, sample labeled data are provided to the machine learning system for
training, and the system then predicts the output based on the training data.

The system uses labeled data to build a model that understands the datasets and learns about
each one. After the training and processing are done, we test the model with sample data to
see if it can accurately predict the output.

The objective of supervised learning is to map input data to output data. Supervised learning is based on supervision, in the same way that a student learns under the guidance of a teacher. Spam filtering is an example of supervised learning.

Supervised learning can be grouped further in two categories of algorithms:

o Classification
o Regression
2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any
supervision.

The training is provided to the machine with the set of data that has not been labeled,
classified, or categorized, and the algorithm needs to act on that data without any supervision.
The goal of unsupervised learning is to restructure the input data into new features or a group
of objects with similar patterns.

In unsupervised learning, we don't have a predetermined result. The machine tries to find useful insights from the huge amount of data. It can be further classified into two categories of algorithms:

o Clustering
o Association

3) Reinforcement Learning
Reinforcement learning is a feedback-based learning method, in which a learning agent gets a
reward for each right action and gets a penalty for each wrong action. The agent learns
automatically with these feedbacks and improves its performance. In reinforcement learning,
the agent interacts with the environment and explores it. The goal of an agent is to get the
most reward points, and hence, it improves its performance.

A robotic dog that automatically learns the movement of its limbs is an example of reinforcement learning.

Supervised learning algorithm

What is Supervised Machine Learning?

Supervised learning, also known as supervised machine learning, is a type of machine learning that trains the model using labeled datasets to predict outcomes. A labeled dataset is one that consists of input data (features) along with corresponding output data (targets).

The main objective of supervised learning algorithms is to learn an association between input data samples and corresponding outputs by training over multiple data instances.

How does Supervised Learning Work?

In supervised machine learning, models are trained using a dataset that consists of input-
output pairs.

The supervised learning algorithm analyzes the dataset and learns the relation between the
input data (features) and correct output (labels/ targets). In the process of training, the model
estimates the algorithm's parameters by minimizing a loss function. The loss function
measures the difference between the model's predictions and actual target values.
The model iteratively updates its parameters until the loss/ error has been sufficiently
minimized.

Once the training is completed, the model parameters have optimal values. The model has
learned the optimal mapping/ relation between the inputs and targets. Now, the model can
predict values for the new and unseen input data.

Types of Supervised Learning Algorithm

Supervised machine learning is categorized into two types of problems − classification and
regression.

1. Classification

The key objective of classification-based tasks is to predict categorical output labels or responses for the given input data, such as true/false, male/female, or yes/no. Categorical output responses are unordered, discrete values; hence, each output response belongs to a specific class or category.

Some popular classification algorithms are decision trees, random forests, support vector
machines (SVM), logistic regression, etc.

2. Regression

The key objective of regression-based tasks is to predict output labels or responses, which are
continuous numeric values, for the given input data. Basically, regression models use the
input data features (independent variables) and their corresponding continuous numeric
output values (dependent or outcome variables) to learn specific associations between inputs
and corresponding outputs.

Some popular regression algorithms are linear regression, polynomial regression, Lasso regression, etc.

Algorithms for Supervised Learning

Supervised learning is one of the most important approaches for training machine learning models. This section discusses its main algorithms in detail.

There are several algorithms available for supervised learning. Some of the widely used
algorithms of supervised learning are as shown below −

 Linear Regression
 k-Nearest Neighbors
 Decision Trees
 Naive Bayes
 Logistic Regression
 Support Vector Machines
 Random Forest
 Gradient Boosting

Let's discuss each of the above mentioned supervised machine learning algorithms in detail.

1. Linear Regression
Linear regression is an algorithm that tries to find the linear relationship between input features and output values for the prediction of future events. It is widely used in applications such as stock analysis and weather forecasting.
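To make this concrete, here is a minimal sketch using scikit-learn (an assumption; any library with a linear regression implementation would do). The data is synthetic and only illustrates the fit/predict workflow:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])      # one input feature
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])      # roughly y = 2x

model = LinearRegression()
model.fit(X, y)                              # learn slope and intercept from the data
print(model.coef_, model.intercept_)         # learned parameters
print(model.predict([[6]]))                  # prediction for an unseen input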

2. K-Nearest Neighbors
The k-Nearest Neighbors (kNN) is a statistical technique that can be used for solving
classification and regression problems. This algorithm classifies or predicts values for new
data by mathematically calculating the nearest distance with other points in training data.

Let us discuss the case of classifying an unknown object using kNN. Consider the
distribution of objects as shown in the image given below −

The diagram shows three types of objects, marked in red, blue and green colors. When you
run the kNN classifier on the above dataset, the boundaries for each type of object will be
marked as shown below −

Now, consider a new unknown object you want to classify as red, green or blue. This is
depicted in the figure below.
As you can see visually, the unknown data point belongs to the class of blue objects. Mathematically, this can be concluded by measuring the distance of this unknown point from every other point in the data set. When you do so, you will find that most of its nearest neighbors are blue. Its average distance to the red and green objects will be greater than its average distance to the blue objects. Thus, this unknown object can be classified as belonging to the blue class.

The kNN algorithm can also be used for regression problems. The kNN algorithm is available
as ready-to-use in most of the ML libraries.
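For illustration, here is a minimal classification sketch using scikit-learn's KNeighborsClassifier (assumed available); the points and labels are made up:

from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]   # two well-separated groups
y = [0, 0, 0, 1, 1, 1]                                  # their class labels

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3 nearest neighbours
knn.fit(X, y)
print(knn.predict([[4.5, 5.5]]))            # the closest points are all class 1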

3. Decision Trees
A decision tree is a tree-like structure used to make decisions and analyze their possible consequences. The algorithm splits the data into subsets based on features, where each internal node represents a decision and each leaf node represents a final prediction.

A simple decision tree in a flowchart format is shown below −

You could write code to classify your input data based on this flowchart. The flowchart is self-explanatory and simple. In this scenario, you are trying to classify an incoming email to decide when to read it.
In reality, the decision trees can be large and complex. There are several algorithms available
to create and traverse these trees. As a Machine Learning enthusiast, you need to understand
and master these techniques of creating and traversing decision trees.
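As a small illustration (assuming scikit-learn is available), loosely following the email example above with hypothetical features:

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: [is_from_known_sender, mentions_deadline]; label 1 = read now
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 1, 0]

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X, y)
print(export_text(tree))    # prints the learned splits as a small text tree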

4. Naive Bayes
Naive Bayes is used for creating classifiers. Suppose you want to sort out (classify) fruits of
different kinds from a fruit basket. You may use features such as color, size, and shape of
fruit; for example, any fruit that is red in color, round in shape, and about 10 cm in diameter
may be considered an Apple. So to train the model, you would use these features and test the
probability that a given feature matches the desired constraints. The probabilities of different
features are then combined to arrive at the probability that a given fruit is an apple. Naive Bayes generally requires only a small amount of training data for classification.

5. Logistic Regression
Logistic regression is a type of statistical algorithm that estimates the probability of
occurrence of an event.

Look at the following diagram. It shows the distribution of data points in the XY plane.

From the diagram, we can visually inspect the separation of red and green dots. You may
draw a boundary line to separate out these dots. Now, to classify a new data point, you will
just need to determine on which side of the line the point lies.

6. Support Vector Machines

The Support Vector Machines (SVM) algorithm can typically be used for both classification and regression. For classification tasks, the algorithm creates a hyperplane to separate the data into classes, while for regression, it tries to fit a regression line with minimal error.

Look at the following distribution of data. Here the three classes of data cannot be linearly
separated. The boundary curves are non-linear. In such a case, finding the curve's equation
becomes a complex job.
The Support Vector Machines (SVM) come in handy in determining the separation
boundaries in such situations.
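A minimal sketch, assuming scikit-learn is available: an SVM with an RBF kernel fitted on a synthetic dataset that is not linearly separable:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # not linearly separable
clf = SVC(kernel='rbf', C=1.0)     # RBF kernel allows a curved decision boundary
clf.fit(X, y)
print(clf.score(X, y))             # accuracy on the training data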

7. Random Forest
Random forest is also a supervised learning algorithm that is flexible enough for both classification and regression. The algorithm is a combination of multiple decision trees whose outputs are merged to improve the accuracy of prediction.

The following diagram illustrates how the Random Forest Algorithm works −

8. Gradient Boosting
Gradient boosting combines weak learners (typically decision trees) to create a strong model. It builds new models that correct the errors of the previous ones, with the goal of minimizing the loss function. It can be used efficiently for both classification and regression tasks.
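The short sketch below, assuming scikit-learn is available, fits both a random forest and a gradient boosting classifier on the same synthetic data so the two ensemble styles can be compared side by side:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (RandomForestClassifier(n_estimators=100, random_state=0),
              GradientBoostingClassifier(n_estimators=100, random_state=0)):
    model.fit(X_train, y_train)                     # bagged trees vs. boosted trees
    print(type(model).__name__, model.score(X_test, y_test))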

Advantages of Supervised Learning

Supervised learning algorithms are among the most popular machine learning models. Some benefits are:

 The goal in supervised learning is well-defined, which improves prediction accuracy.
 Models trained using supervised learning are effective at prediction and classification since they use labeled datasets.
 It is highly versatile, i.e., it can be applied to various problems such as spam detection, stock price prediction, etc.
Disadvantages of Supervised Learning

Though supervised learning is the most used, it comes with certain challenges too. Some of
them are:

 Supervised learning requires a large amount of labeled data for the model to train effectively.
It is practically very difficult to collect such huge data; it is expensive and time-consuming.
 Supervised learning cannot predict accurately if the test data is different from the training
data.
 Accurately labeling the data is complex and requires expertise and effort.

Applications of Supervised learning

Supervised learning models are widely used in many applications in various sectors,
including the following-

 Image recognition − A model is trained on a labeled dataset of images, where each image is
associated with a label. The model is fed with data, which allows it to learn patterns and
features. Once trained, the model can now be tested using new, unseen data. This is widely
used in applications like facial recognition and object detection.
 Predictive analytics − Supervised learning algorithms are used to train labeled historical
data, allowing the model to learn patterns and relations between input features and output to
identify trends and make accurate predictions. Businesses use this method to make data-
driven decisions and enhance strategic planning.

What is Unsupervised Machine Learning?

Unsupervised learning, also known as unsupervised machine learning, is a type of machine learning that learns patterns and structures within the data without human supervision. Unsupervised learning uses machine learning algorithms to analyze the data and discover underlying patterns within unlabeled data sets.

Unlike supervised machine learning, unsupervised machine learning models are trained on unlabeled datasets. Unsupervised learning algorithms are handy in scenarios where we do not have the luxury of pre-labeled training data, as we do in supervised learning, but we still want to extract useful patterns from the input data.

We can summarize unsupervised learning as −

 a machine learning approach or type that
 uses machine learning algorithms
 to find hidden patterns or structures
 within the data without human supervision.

There are many approaches used in unsupervised machine learning, including association, clustering, and dimensionality reduction. Some examples of unsupervised machine learning algorithms are K-means clustering, hierarchical clustering, and the Apriori algorithm.

In regression, we train the machine to predict a future value. In classification, we train the machine to classify an unknown object into one of the categories we define. In short, we have been training machines so that they can predict Y for our data X. Given a huge data set in which the categories are not known, it would be difficult for us to train the machine using supervised learning. What if the machine could look at and analyze big data running into several gigabytes and terabytes and tell us that this data contains so many distinct categories?

As an example, consider voter data. By considering some inputs from each voter (these are called features in AI terminology), let the machine predict that there are so many voters who would vote for political party X, so many who would vote for party Y, and so on. Thus, in general, we are asking the machine: given a huge set of data points X, what can you tell me about X? Or it may be a question like: what are the five best groups we can make out of X? Or: which three features occur together most frequently in X?

This is exactly what Unsupervised Learning is all about.

How does Unsupervised Learning Work?

In unsupervised learning, machine learning algorithms (called self-learning algorithms) are trained on unlabeled data sets, i.e., the input data is not categorized. Suitable algorithms are chosen for training based on the machine learning task or problem (such as clustering or association) and on the data sets.

In the training process, the algorithms learn and infer their own rules on the basis of the similarities, patterns, and differences among data points. The algorithms learn without any labels (target values) or pre-training.

The outcome of this training process is a machine learning model. As the data sets are unlabeled (no target values, no human supervision), the model is an unsupervised machine learning model.

Now the model is ready to perform the unsupervised learning tasks such as clustering,
association, or dimensionality reduction.

Unsupervised learning models are suitable for complex tasks, like organizing large datasets into clusters.

Unsupervised Machine Learning Methods

Unsupervised learning methods or approaches are broadly categorized into three categories −
clustering, association, and dimensionality reduction. Let us discuss these methods briefly
and list some related algorithms −
1. Clustering
Clustering is a technique used to group a set of objects or data points into clusters based on
their similarities. The goal of this technique is to make sure that the data points within the
same cluster should have more similarities than those in other clusters.

Clustering is sometimes called unsupervised classification because it produces the same result as classification does but without having predefined classes.

Clustering is one of the popular unsupervised learning approaches. There are several
unsupervised learning algorithms used for clustering like −

 K-Means Clustering − This algorithm assigns each data point to one of K clusters based on its distance from the cluster centers. After all data points are assigned, new centroids are recalculated, and the process repeats iteratively until the centroids no longer change, indicating that the clusters are stable (a minimal sketch follows this list).
 Mean Shift Algorithm − A clustering technique that identifies clusters by finding areas of high data density. It is an iterative process in which the mean of each data point is shifted towards the densest area of the data.
 Gaussian Mixture Models − A probabilistic model that represents the data as a combination of multiple Gaussian distributions. These models are used to determine which distribution a given data point belongs to.
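As referenced in the K-Means bullet above, here is a minimal sketch assuming scikit-learn is available; two obvious blobs of unlabeled points are grouped into K = 2 clusters:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [2, 1.5],    # blob near (1.5, 1.5)
              [8, 8], [8.5, 9], [9, 8.5]])   # blob near (8.5, 8.5)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster index assigned to each point
print(kmeans.cluster_centers_)   # final centroids once the iterations converge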

2. Association Rule Mining

This is a rule-based technique used to discover associations between variables in large datasets. It is popularly used for Market Basket Analysis, which allows companies to make better decisions and build recommendation engines. One of the main algorithms used for Association Rule Mining is the Apriori algorithm.

Apriori Algorithm
Apriori algorithm is a technique used in unsupervised learning to identify data points that are
frequently repeated and discover association rules within transactional data.

3. Dimensionality Reduction
As the name suggests, dimensionality reduction is used to reduce the number of feature variables for each data sample by selecting a set of principal or representative features.

A question arises here: why do we need to reduce the dimensionality? The reason is the problem of feature space complexity, which arises when we start analyzing and extracting millions of features from data samples. This problem is generally referred to as the "curse of dimensionality". Some popular unsupervised learning algorithms used for dimensionality reduction are listed below, followed by a short PCA sketch −

 Principal Component Analysis
 Missing Value Ratio
 Singular Value Decomposition
 Autoencoders
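A short PCA sketch, assuming scikit-learn is available, that compresses the 64-dimensional digits dataset down to 2 principal components:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)     # 1797 samples with 64 features each
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, '->', X_reduced.shape)   # (1797, 64) -> (1797, 2)
print(pca.explained_variance_ratio_)    # variance captured by each component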
Algorithms for Unsupervised Learning

Algorithms are a very important part of machine learning model training. A machine learning algorithm is a set of instructions that a program follows to analyze the data and produce the outcomes. For specific tasks, suitable machine learning algorithms are selected and trained on the data.

Algorithms used in unsupervised learning generally fall under one of the three categories −
clustering, association, or dimensionality reduction. The following are the most used
unsupervised learning algorithms −

 K-Means Clustering
 Hierarchical Clustering
 Mean-shift Clustering
 DBSCAN Clustering
 HDBSCAN Clustering
 BIRCH Clustering
 Affinity Propagation
 Agglomerative Clustering
 Apriori Algorithm
 Eclat algorithm
 FP-growth algorithm
 Principal Component Analysis(PCA)
 Autoencoders
 Singular value decomposition (SVD)

Advantages of Unsupervised Learning

Unsupervised learning has many advantages that make it particularly useful for various tasks −

 No labeled data required − Unsupervised learning doesn't require a labeled dataset for
training, which makes it easier and cheaper to use.
 Discovers hidden patterns − It helps in recognizing patterns and relationships in large data,
which can lead to gaining insights and efficient decision-making.
 Suitable for complex tasks − It is efficiently used for various complex tasks like clustering,
anomaly detection, and dimensionality reduction.

Disadvantages of Unsupervised Learning

While unsupervised learning has many advantages, some challenges can occur too while
training the model without human intervention. Some of the disadvantages of unsupervised
learning are:

 Difficult to evaluate − Without labeled data and predefined targets, it would be difficult to
evaluate the performance of unsupervised learning algorithms.
 Inaccurate outcomes − The outcome of an unsupervised learning algorithm might be less accurate, especially if the input data contains noise; also, since the data is not labeled, the algorithm does not know the exact expected output.

Applications of Unsupervised Learning

Unsupervised learning provides a path for businesses to identify patterns in large volumes of
data. Some real-world applications of unsupervised learning are:

 Customer Segmentation − In business and retail analysis, unsupervised learning is used to group customers into segments based on their purchases, past activity, or preferences.
 Anomaly Detection − Unsupervised learning algorithms are used in anomaly detection to
identify unusual patterns, which is crucial for fraud detection in financial transactions and
network security.
 Recommendation Engines − Unsupervised learning algorithms help to analyze large
customer data to gain valuable insights and understand patterns. This can help in target
marketing and personalization.
 Natural Language Processing − Unsupervised learning algorithms are used for various applications; for example, Google News uses them to categorize articles in the news section.

DEEP FEED FORWARD NETWORK

What is Feed-forward Neural Networks?

A feedforward neural network is an artificial neural network in which the connections between nodes do not form a loop. Often referred to as a multi-layered network of neurons, feedforward neural networks are so named because all information flows forward only.

Data enters at the input nodes, travels through the hidden layers, and exits at the output nodes. The network has no connections through which information leaving the output nodes could be fed back into the network.

The purpose of feedforward neural networks is to approximate functions.

A classifier uses the formula y = f*(x), which assigns an input x to a category y.

The feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that most closely approximates the function.

The Google Photos app shows that a feedforward neural network is the foundation for photo object detection.

Types of Neural Network Layers

The following are the components of a feedforward neural network:


Input layer

It contains the neurons that receive the input. The data is then passed on to the next layer. The total number of neurons in the input layer equals the number of input variables in the dataset.

Hidden layer

This is the intermediate layer, which is concealed between the input and output layers. It
has many neurons that alter the inputs and then communicate with the output layer.

Output layer

It is the final layer, and its form depends on how the model is constructed. The output layer produces the predicted output, which is compared against the desired outcome.

Neuron weights

Weights describe the strength of the connection between neurons. A weight is a real-valued number, typically initialized to a small value and adjusted during training.
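Putting the three layer types together, here is a minimal feedforward network sketch in Keras (assumed available); the layer sizes are purely illustrative:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(16, activation='relu', input_dim=20),   # hidden layer: 20 inputs -> 16 neurons
    Dense(1, activation='sigmoid'),               # output layer for a binary target
])
model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()   # information flows strictly forward: input -> hidden -> output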

Cost Function in Feedforward Neural Network

The cost function is an important component of a feedforward neural network. Generally, minor adjustments to weights and biases have little effect on the categorized data points. Thus, a method for improving performance can be found by making minor adjustments to weights and biases using a smooth cost function.

The mean squared error cost function is defined as follows:

C(w, b) = (1/2n) Σ_x ‖y(x) − a‖²

Where,

w = weights collected in the network

b = biases

n = number of training inputs

a = output vector of the network when x is the input

x = input

y(x) = desired output for input x

‖v‖ = usual length of vector v

Loss Function in Feedforward Neural Network


The cross-entropy loss associated with multi-class classification is as follows:

L = − Σ_i y_i log(ŷ_i)

where y_i is the true (one-hot) label for class i and ŷ_i is the predicted probability for class i.

Gradient Learning Algorithm

The Gradient Descent algorithm repeatedly calculates the next point using the gradient at the current location, scales it (by a learning rate), and subtracts the obtained value from the current position (i.e., makes a step). It subtracts the value because we want to minimize the function (to maximize it, we would add instead). This procedure may be written as:

p_{n+1} = p_n − η ∇f(p_n)

There is a crucial parameter η, which scales the gradient and hence controls the step size. In machine learning, it is termed the learning rate and substantially affects performance.

 The smaller the learning rate, the longer GD takes to converge, or it may reach the maximum number of iterations before finding the optimal point.
 If the learning rate is too large, the algorithm may not converge to the optimal point (it may jump around) or may even diverge altogether.

The Gradient Descent method's steps are:

1. Pick a starting point (initialization)
2. Compute the gradient at this point
3. Make a scaled step in the opposite direction of the gradient (objective: minimize)
4. Repeat steps 2 and 3 until one of the conditions is met:

 the maximum number of iterations is reached
 the step size is smaller than the tolerance.

The following is an example of how to construct the Gradient Descent algorithm (with
steps tracking):
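One possible sketch of such a function (matching the invocation shown below, with a tolerance defaulting to 0.01):

import numpy as np

def gradient_descent(start, gradient, learn_rate, max_iter, tol=0.01):
    # Minimal gradient descent with step tracking (a sketch, not a definitive implementation).
    x = start
    history = [x]
    for _ in range(max_iter):
        step = learn_rate * gradient(x)   # scale the gradient by the learning rate
        if np.abs(step) < tol:            # stop once the step becomes tiny
            break
        x = x - step                      # move against the gradient
        history.append(x)
    return np.array(history), x           # visited points and the final point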

The function accepts the following five parameters:

1. Starting point − in our example, we specify it manually, but it is often chosen randomly
2. Gradient function − must be defined in advance
3. Learning rate − the factor used to scale the step sizes
4. Maximum number of iterations
5. Tolerance − to stop the algorithm conditionally (in this case, the default value is 0.01)

Example − A quadratic function

Consider the following elementary quadratic function:

f(x) = x² − 4x + 1

Because it is a univariate function, the gradient function is simply its derivative:

f′(x) = 2x − 4

Let us now write these two functions in Python:

def func1(x):
    return x**2 - 4*x + 1

def gradient_func1(x):
    return 2*x - 4

With a learning rate of 0.1 and a starting point of x=9, we can compute each step
manually for this function. Let us begin with the first three steps:
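Working the update rule x_new = x − η·f′(x) by hand:

Step 1: f′(9) = 14, so x = 9 − 0.1 × 14 = 7.6
Step 2: f′(7.6) = 11.2, so x = 7.6 − 0.1 × 11.2 = 6.48
Step 3: f′(6.48) = 8.96, so x = 6.48 − 0.1 × 8.96 = 5.584

Each step moves x closer to the minimum at x = 2, and the step size shrinks as the gradient gets smaller.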

The Python function is invoked as follows:

history, result = gradient_descent(9, gradient_func1, 0.1, 100)

The animation below illustrates the GD algorithm's steps at learning rates of 0.1 and 0.8. As the algorithm approaches the minimum, the steps become steadily smaller. With the larger learning rate, the algorithm jumps from one side of the minimum to the other before converging.

The first ten steps taken by GD are shown for both the small and the large learning rate.

The following diagram illustrates the trajectory, number of iterations, and the final converged output (within tolerance) for various learning rates:

The Need for a Neuron Model

Suppose the inputs to the network are pixel data from a character scan. There are a few things you need to keep in mind while designing a network to classify a digit correctly:

You must experiment with the weights to see how the network learns. For learning to work well, small variations in the weights should produce only correspondingly small changes in the output.

On the other hand, what if a minor change in a weight results in a large change in the output? The sigmoid neuron model resolves this issue.

Applications of Feedforward Neural Network

These neural networks are utilized in a wide variety of applications. Several of them are listed below:

 Physiological feedforward system: Feedforward management is exemplified by the central involuntary system's normal preventative control of the heartbeat before exercise.

 Gene regulation and feedforward: A recurring motif throughout well-known gene networks has been demonstrated to act as a feedforward system for detecting non-temporary atmospheric change.

 Parallel feedforward compensation with derivative: A relatively recent approach for converting the non-minimum part of an open-loop transfer system into the minimum part.

Regularization in Deep Learning:

Overfitting can lead to poor performance on new data, especially in the presence of outliers or noise in the training set. This section addresses that problem by exploring regularization in deep learning, along with the related ensemble-learning ideas of bagging, boosting, and stacking, all of which improve model generalization.

Avoiding overfitting can single-handedly improve our model's performance.

In this section, you will delve into the concept of regularization in neural networks, discover different regularization techniques, and understand how these approaches can greatly enhance model performance while preventing overfitting.

What is Regularization?

Regularization is a technique used in machine learning and deep learning to prevent overfitting and improve a model's generalization performance. It involves adding a penalty term to the loss function during training.

This penalty discourages the model from becoming too complex or having large parameter values, which helps in controlling the model's ability to fit noise in the training data. Regularization methods in deep learning include L1 and L2 regularization, dropout, early stopping, and more. By applying regularization, models become more robust and better at making accurate predictions on unseen data.

Before we dive deeper into the topic, take a look at this image:

Have you seen this image before? As we move towards the right in this image, our model tries to learn the details and the noise in the training data too well, ultimately resulting in poor performance on unseen data.

In other words, as we go toward the right, the complexity of the model increases such that the training error reduces but the testing error does not. This is shown in the image below:
If you’ve built a neural network before, you know how complex they are. This makes
them more prone to overfitting.

Regularization is a technique that modifies the learning algorithm slightly so that the
model generalizes better. This, in turn, improves the model’s performance on unseen
data as well.

How does Regularization help Reduce Overfitting?

Let’s consider a neural network that is overfitting on the training data as shown in the
image below:

If you have studied the concept of regularization in machine learning, you will
have a fair idea that regularization penalizes the coefficients. In deep learning, it
penalizes the weight matrices of the nodes.
Assume that our regularization coefficient is so high that some of the weight
matrices are nearly equal to zero.

This will result in a much simpler linear network and slight underfitting of the training
data.

Such a large value of the regularization coefficient is not that useful. We need to
optimize the value of the regularization coefficient to obtain a well-fitted model as
shown in the image below:

Different Regularization Techniques in Deep Learning

Now that we understand how regularization helps reduce overfitting, we’ll learn a few
different techniques for applying regularization in deep learning.

L2 & L1 Regularization

L1 and L2 are the most common types of regularization in deep learning. They update the general cost function by adding another term known as the regularization term.

 Cost function = Loss (say, binary cross-entropy) + Regularization term

Due to the addition of this regularization term, the values of the weight matrices decrease, because the formulation assumes that a neural network with smaller weight matrices leads to simpler models. Therefore, it also reduces overfitting to quite an extent.

However, the regularization term differs between L1 and L2.

For L2:

 Cost function = Loss + (λ / 2m) × Σ ‖w‖²

For L1:

 Cost function = Loss + (λ / 2m) × Σ ‖w‖

In L2, we have ‖w‖² = Σ w_i². This is known as ridge regression, where lambda (λ) is the regularization parameter. It is the hyperparameter whose value is optimized for better results. L2 regularization is also known as weight decay, as it forces the weights to decay towards zero (but not exactly zero).

In L1, we have ‖w‖ = Σ |w_i|. Here, we penalize the absolute value of the weights. Unlike L2, the weights may be reduced to exactly zero. L1 regularization is also called lasso regression. Hence, it is very useful when we are trying to compress our model; otherwise, we usually prefer L2 over it.

In Keras, we can directly apply regularization to any layer using the regularizers module.

Below is the sample code to apply L2 regularization to a Dense layer:

from keras import regularizers

model.add(Dense(64, input_dim=64,
                kernel_regularizer=regularizers.l2(0.01)))

Note: Here, the value 0.01 is the value of the regularization parameter, i.e., lambda, which we need to optimize further. We can optimize it using the grid-search method.

Similarly, we can apply L1 regularization using regularizers.l1 in the same way.

Dropout

This is one of the most interesting types of regularization techniques. It also produces very good results and is consequently the most frequently used regularization technique in the field of deep learning.

To understand dropout, let’s say our neural network structure is akin to the one shown
below:

So what does dropout do? At every iteration, it randomly selects some nodes and
removes them along with all of their incoming and outgoing connections as
shown below:
Each iteration has a different set of nodes, which results in a different set of
outputs. This can also be thought of as an ensemble technique in machine learning.

Ensemble models usually perform better than a single model as they capture more
randomness. Similarly, dropout models also perform better than normal neural network
models.

The probability of dropping a node is the hyperparameter of the dropout function. As seen in the image above, dropout can be applied to both the hidden layers and the input layer.

In Keras, we can implement dropout using the Keras core layers. Below is the Python code for it:

from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential([
    Dense(hidden1_num_units, input_dim=input_num_units, activation='relu'),
    Dropout(0.25),
    Dense(output_num_units, activation='softmax'),
])

As you can see, we have defined 0.25 as the probability of dropping. We can tune it
further for better results using the grid search method.

Data Augmentation

The simplest way to reduce overfitting is to increase the size of the training data. In practice, however, increasing the training data size is often not feasible because labeled data is too costly to obtain.

But now, let’s consider we are dealing with images. In this case, there are a few ways of
increasing the size of the training data—rotating the image, flipping, scaling, shifting,
etc. In the image below, some transformation has been done on the handwritten digits
dataset.
This technique is known as data augmentation. It usually provides a big leap in improving the accuracy of the model, and it can be considered a mandatory trick to improve our predictions.

In Keras, we can perform all of these transformations using ImageDataGenerator. It has a big list of arguments that you can use to pre-process your training data.

Below is the sample code to implement it:

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(horizontal_flip=True)
datagen.fit(train)
Early Stopping

Early stopping is a cross-validation strategy in which we keep one part of the training set as the validation set. When we see that the performance on the validation set is getting worse, we immediately stop training the model.

In the above image, we will stop training at the dotted line since, after that, our model will start overfitting on the training data.

In Keras, we can apply early stopping using the callbacks function. Below is the sample code for it:

from keras.callbacks import EarlyStopping

EarlyStopping(monitor='val_err', patience=5)

Here, monitor refers to the quantity that you need to keep track of, and 'val_err' refers to the validation error.

Patience denotes the number of epochs with no further improvement, after which training stops. For a better understanding, let's look at the above image again. After the dotted line, each epoch results in a higher validation error. Therefore, our model will stop 5 epochs after the dotted line (since our patience equals 5), because it sees no further improvement.

Optimization for Data Models:

What are Optimizers in Deep Learning?

In deep learning, optimizers are crucial algorithms that dynamically fine-tune a model's parameters throughout the training process, aiming to minimize a predefined loss function. These specialized algorithms facilitate the learning process of neural networks by iteratively refining the weights and biases based on the feedback received from the data. Well-known optimizers in deep learning include Stochastic Gradient Descent (SGD), Adam, and RMSprop, each equipped with distinct update rules, learning rates, and momentum strategies, all geared towards the overarching goal of discovering and converging upon optimal model parameters, thereby enhancing overall performance.

Gradient Descent Deep Learning Optimizer

Gradient Descent can be considered the most popular among the class of optimizers in deep learning. This optimization algorithm uses calculus to consistently modify the parameter values and reach the local minimum. Before moving ahead, you might ask what a gradient is.

In simple terms, imagine you are holding a ball resting at the top of a bowl. When you release the ball, it rolls along the steepest direction and eventually settles at the bottom of the bowl. The gradient gives the ball the steepest direction to follow to reach the local minimum, which is the bottom of the bowl.

The gradient descent update can be written as:

w_new = w_old − α · ∇L(w_old)

Here, alpha (α) is the step size that represents how far to move against the gradient in each iteration.

Gradient descent works as follows:

1. Initialize Coefficients: Start with initial coefficient values.
2. Evaluate Cost: Calculate the cost associated with these coefficients.
3. Search for Lower Cost: Look for a cost value lower than the current one.
4. Update Coefficients: Move towards the lower cost by updating the coefficients' values.
5. Repeat Process: Continue this process iteratively.
6. Reach Local Minimum: Stop when a local minimum is reached, where further cost reduction is not possible.

Gradient descent works best for most purposes. However, it has some downsides too. It is expensive to calculate the gradients if the size of the data is huge. Gradient descent works well for convex functions, but it doesn't know how far to travel along the gradient for non-convex functions.

Stochastic Gradient Descent Deep Learning Optimizer

At the end of the previous section, you learned why there might be better options than using gradient descent on massive data. To tackle the challenges large datasets pose, we have stochastic gradient descent, a popular approach among optimizers in deep learning. The term stochastic denotes the element of randomness upon which the algorithm relies. In stochastic gradient descent, instead of processing the entire dataset during each iteration, we randomly select batches of data. This implies that only a few samples from the dataset are considered at a time, allowing for more efficient and computationally feasible optimization of deep learning models.

The procedure is to first select the initial parameters w and the learning rate n. Then the data is randomly shuffled at each iteration to reach an approximate minimum.

Since we are not using the whole dataset but batches of it for each iteration, the path taken by the algorithm is full of noise compared to the gradient descent algorithm. Thus, SGD uses a higher number of iterations to reach the local minimum. Due to the increased number of iterations, the overall computation time increases. But even after increasing the number of iterations, the computation cost is still less than that of the gradient descent optimizer. So the conclusion is: if the data is enormous and computational time is an essential factor, stochastic gradient descent should be preferred over the batch gradient descent algorithm.

Stochastic Gradient Descent With Momentum Deep Learning Optimizer

As discussed in the earlier section, stochastic gradient descent takes a much noisier path than the gradient descent algorithm. Due to this, it requires a more significant number of iterations to reach the optimal minimum, and hence computation time is slow. To overcome this problem, we use stochastic gradient descent with a momentum algorithm.

What the momentum does is help the loss function converge faster. Stochastic gradient descent oscillates in either direction of the gradient and updates the weights accordingly. Adding a fraction of the previous update to the current update makes the process a bit faster. One thing that should be remembered while using this algorithm is that the learning rate should be decreased when using a high momentum term.
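Below is a minimal sketch in plain NumPy of a single SGD-with-momentum update for one weight vector; the names and values are illustrative assumptions:

import numpy as np

w = np.array([0.5, -0.3])        # current weights
velocity = np.zeros_like(w)      # running "memory" of past updates
grad = np.array([0.2, -0.1])     # gradient from the current mini-batch
lr, momentum = 0.01, 0.9

velocity = momentum * velocity - lr * grad   # keep a fraction of the previous update
w = w + velocity                             # apply the smoothed step
print(w)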

In the image above, the left part shows the convergence graph of the stochastic gradient descent algorithm, while the right side shows SGD with momentum. From the image, you can compare the paths chosen by both algorithms and see that using momentum helps reach convergence in less time. You might be tempted to use a large momentum and learning rate to make the process even faster. But remember that while increasing the momentum, the possibility of overshooting the optimal minimum also increases. This might result in poor accuracy and even more oscillations.


Mini Batch Gradient Descent Deep Learning Optimizer

In this variant of gradient descent, instead of using all the training data, we use only a subset of the dataset to calculate the loss function. Since we use a batch of data instead of the whole dataset, we need fewer iterations. That is why the mini-batch gradient descent algorithm is faster than both the stochastic gradient descent and batch gradient descent algorithms. This algorithm is more efficient and robust than the earlier variants of gradient descent. As the algorithm uses batching, all the training data need not be loaded into memory at once, which makes the process more efficient to implement. Moreover, the cost function in mini-batch gradient descent is noisier than in the batch gradient descent algorithm but smoother than in stochastic gradient descent. Because of this, mini-batch gradient descent is ideal and provides a good balance between speed and accuracy.

Despite all that, the mini-batch gradient descent algorithm has some downsides too. It requires a hyperparameter called 'mini-batch size', which must be tuned to achieve the required accuracy. A batch size of 32 is generally appropriate for almost every case. Also, in some cases, it results in poor final accuracy. Because of this, there arose a need to look for other alternatives too.

Adagrad (Adaptive Gradient Descent) Deep Learning Optimizer

The adaptive gradient descent algorithm is slightly different from other gradient descent algorithms because it uses a different learning rate for each iteration. The change in learning rate depends on how much the parameters change during training: the more the parameters get changed, the smaller the learning rate becomes. This modification is highly beneficial because real-world datasets contain sparse as well as dense features, so it is unfair to use the same learning rate for all features. The Adagrad algorithm uses the formula below to update the weights. Here, alpha(t) denotes the learning rate at iteration t, n is a constant (the initial learning rate), and E is a small positive value used to avoid division by zero:

w_t = w_{t−1} − alpha(t) · ∂L/∂w_{t−1},   where   alpha(t) = n / sqrt(Σ_{τ=1..t} g_τ² + E)

The benefit of using Adagrad is that it abolishes the need to modify the learning rate manually. It is more reliable than gradient descent algorithms and their variants, and it reaches convergence at a higher speed.

One downside of the AdaGrad optimizer is that it decreases the learning rate aggressively and monotonically. There might come a point when the learning rate becomes extremely small, because the squared gradients in the denominator keep accumulating and thus the denominator keeps increasing. Such small learning rates prevent the model from acquiring more knowledge, which compromises its accuracy.

RMSProp (Root Mean Square Propagation) Deep Learning Optimizer

RMSProp is one of the popular optimizers among deep learning enthusiasts. This is perhaps because it was never formally published but is still very well known in the community. RMSProp is essentially an extension of RPROP and resolves the problem of varying gradients: some gradients are small while others may be huge, so defining a single learning rate might not be the best idea. RPROP uses the sign of the gradient, adapting the step size individually for each weight. In this algorithm, the two most recent gradients are first compared by sign. If they have the same sign, we are going in the right direction, so the step size is increased by a small fraction. If they have opposite signs, the step size must be decreased. The step size is then bounded, and the weight update can be made.

The problem with RPROP is that it does not work well with large datasets or when we want to perform mini-batch updates. Achieving the robustness of RPROP and the efficiency of mini-batches simultaneously was the main motivation behind RMSProp. RMSProp is an improvement on the AdaGrad optimizer, as it replaces AdaGrad's monotonically decreasing learning rate with a decaying average of squared gradients.
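A minimal NumPy sketch, with illustrative values, contrasting AdaGrad's ever-growing accumulator with RMSProp's leaky average of squared gradients:

import numpy as np

grad = np.array([0.2, -0.1])
lr, eps, decay = 0.01, 1e-8, 0.9

# AdaGrad: squared gradients only ever accumulate, so steps keep shrinking
adagrad_acc = np.zeros_like(grad)
adagrad_acc += grad ** 2
adagrad_step = lr * grad / (np.sqrt(adagrad_acc) + eps)

# RMSProp: old squared gradients decay away, so the step size does not vanish
rms_acc = np.zeros_like(grad)
rms_acc = decay * rms_acc + (1 - decay) * grad ** 2
rms_step = lr * grad / (np.sqrt(rms_acc) + eps)

print(adagrad_step, rms_step)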

AdaDelta Deep Learning Optimizer

AdaDelta can be seen as a more robust version of the AdaGrad optimizer. It is based on adaptive learning and is designed to deal with significant drawbacks of AdaGrad and the RMSProp optimizer. The main problem with the above two optimizers is that the initial learning rate must be defined manually. Another problem is the decaying learning rate, which becomes infinitesimally small at some point; as a result, after a certain number of iterations the model can no longer learn anything new.

To deal with these problems, AdaDelta uses two state variables: one stores a leaky average of the second moment of the gradient, and the other stores a leaky average of the second moment of the change in the model's parameters:

s_t = ρ · s_{t−1} + (1 − ρ) · g_t²
g′_t = sqrt((Δx_{t−1} + ε) / (s_t + ε)) · g_t
x_t = x_{t−1} − g′_t
Δx_t = ρ · Δx_{t−1} + (1 − ρ) · (g′_t)²

Here, s_t and Δx_t denote the state variables, g′_t denotes the rescaled gradient, Δx_{t−1} denotes the leaky average of the squared rescaled gradients, and epsilon (ε) represents a small positive constant used to handle division by zero.

Adam Optimizer in Deep Learning

Adam optimizer, short for Adaptive Moment Estimation optimizer, serves as an optimization algorithm commonly used in deep learning. It extends the stochastic gradient descent (SGD) algorithm and updates the weights of a neural network during training.

The name 'Adam' comes from 'adaptive moment estimation', highlighting its ability to adaptively adjust the learning rate for each network weight individually. Unlike SGD, which maintains a single learning rate throughout training, Adam optimizer dynamically computes individual learning rates based on the past gradients and their second moments.

The creators of Adam optimizer incorporated the beneficial features of other optimization algorithms such as AdaGrad and RMSProp. Similar to RMSProp, Adam optimizer considers the second moment of the gradients, but unlike RMSProp, it calculates the uncentered variance of the gradients (without subtracting the mean).

By incorporating both the first moment (mean) and second moment (uncentered variance) of the gradients, Adam optimizer achieves an adaptive learning rate that can efficiently navigate the optimization landscape during training. This adaptivity helps in faster convergence and improved performance of the neural network.

In summary, Adam optimizer is an optimization algorithm that extends SGD by dynamically adjusting learning rates based on individual weights. It combines the features of AdaGrad and RMSProp to provide efficient and adaptive updates to the network weights during deep learning training.

Adam Optimizer Formula

The Adam optimizer has several benefits, due to which it is used widely. It is adopted as a benchmark in deep learning papers and recommended as a default optimization algorithm. Moreover, the algorithm is straightforward to implement, has a faster running time, low memory requirements, and requires less tuning than any other optimization algorithm.

The Adam update can be written as:

m_t = β1 · m_{t−1} + (1 − β1) · g_t
v_t = β2 · v_{t−1} + (1 − β2) · g_t²
m̂_t = m_t / (1 − β1^t),   v̂_t = v_t / (1 − β2^t)
w_{t+1} = w_t − η · m̂_t / (sqrt(v̂_t) + ε)

The above formulas represent the working of the Adam optimizer. Here, β1 and β2 represent the decay rates of the moving averages of the gradients and of their squares.
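Below is a minimal NumPy sketch of a single Adam update following the moment estimates above; all names and values are illustrative assumptions:

import numpy as np

w = np.array([0.5, -0.3])        # current weights
grad = np.array([0.2, -0.1])     # gradient for this step
m = np.zeros_like(w)             # first moment (running mean of gradients)
v = np.zeros_like(w)             # second moment (running uncentered variance)
lr, beta1, beta2, eps, t = 0.001, 0.9, 0.999, 1e-8, 1

m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad ** 2
m_hat = m / (1 - beta1 ** t)     # bias correction for the earliest steps
v_hat = v / (1 - beta2 ** t)
w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
print(w)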

If the Adam optimizer uses the good properties of all the algorithms and is the best available optimizer, then why shouldn't you use Adam in every application? And what was the need to learn about other algorithms in depth? This is because even Adam has some downsides. It tends to focus on faster computation time, whereas algorithms like stochastic gradient descent focus on the data points. That is why algorithms like SGD often generalize the data in a better manner, at the cost of slower computation. So the optimization algorithm can be picked according to the requirements and the type of data.


The above visualizations create a better picture in mind and help in comparing the results of various optimization algorithms.

Hands-on Optimizers

We have learned enough theory, and now we need to do some practical analysis. It's time to try what we have learned and compare the results by choosing different optimizers on a simple neural network. As we are talking about keeping things simple, what's better than the MNIST dataset? We will train a simple model using some basic layers, keeping the batch size and epochs the same but with different optimizers. For the sake of fairness, we will use the default values with each optimizer.

The steps for building the network are given below:

Import Necessary Libraries

import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape, y_train.shape)

Load the Dataset

x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
input_shape = (28, 28, 1)
y_train = keras.utils.to_categorical(y_train)
y_test = keras.utils.to_categorical(y_test)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255

Build the Model

batch_size = 64
num_classes = 10
epochs = 10

def build_model(optimizer):
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, activation='softmax'))
    model.compile(loss=keras.losses.categorical_crossentropy, optimizer=optimizer,
                  metrics=['accuracy'])
    return model

Train the Model

optimizers = ['Adadelta', 'Adagrad', 'Adam', 'RMSprop', 'SGD']

for i in optimizers:
    model = build_model(i)
    hist = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1,
                     validation_data=(x_test, y_test))

We have run our model with a batch size of 64 for 10 epochs. After trying the different optimizers, the results we get are pretty interesting. Before analyzing the results, what do you think will be the best optimizer for this dataset?

Table Analysis

Optimizer           | Epoch 1 (val acc / val loss) | Epoch 5 (val acc / val loss) | Epoch 10 (val acc / val loss) | Total time
Adadelta            | 0.4612 / 2.2474              | 0.7776 / 1.6943              | 0.8375 / 0.9026               | 8:02 min
Adagrad             | 0.8411 / 0.7804              | 0.9133 / 0.3194              | 0.9286 / 0.2519               | 7:33 min
Adam                | 0.9772 / 0.0701              | 0.9884 / 0.0344              | 0.9908 / 0.0297               | 7:20 min
RMSprop             | 0.9783 / 0.0712              | 0.9846 / 0.0484              | 0.9857 / 0.0501               | 10:01 min
SGD with momentum   | 0.9168 / 0.2929              | 0.9585 / 0.1421              | 0.9697 / 0.1008               | 7:04 min
SGD                 | 0.9124 / 0.3157              | 0.9569 / 0.1451              | 0.9693 / 0.1040               | 6:42 min

The above table shows the validation accuracy and loss at different epochs. It also contains the total time that the model took to run for 10 epochs with each optimizer. From the table, we can make the following observations:

 The Adam optimizer shows the best accuracy in a satisfactory amount of time.
 RMSprop shows accuracy similar to Adam but with a comparatively much larger computation time.
 Surprisingly, the SGD algorithm took the least time to train and produced good results as well. But to reach the accuracy of the Adam optimizer, SGD would require more iterations, and hence the computation time would increase.
 SGD with momentum shows accuracy similar to plain SGD, with an unexpectedly larger computation time. This means the value of momentum taken needs to be optimized.
 Adadelta shows poor results both in accuracy and computation time.

You can analyze the accuracy of each optimizer at each epoch from the graph below.

We've now reached the end of this guide. To refresh your memory, we covered gradient descent and its stochastic, momentum, and mini-batch variants, followed by the adaptive methods Adagrad, RMSProp, AdaDelta, and Adam.
