
Deep learning is a branch of machine learning based on artificial neural networks. It is capable of learning complex patterns and relationships within data, and it does not require everything to be explicitly programmed. Deep learning has become increasingly popular in recent years due to advances in processing power and the availability of large datasets. It is built on artificial neural networks (ANNs), also known as deep neural networks (DNNs), which are inspired by the structure and function of the biological neurons in the human brain and are designed to learn from large amounts of data.
Deep Learning is a subfield of Machine Learning that involves the use of neural networks to model and solve complex problems. Neural networks are modeled after the structure and function of the human brain and consist of layers of interconnected nodes that process and transform data.
The key characteristic of Deep Learning is the use of deep neural networks, which have multiple layers of interconnected nodes. These networks can learn complex representations of data by discovering hierarchical patterns and features, and Deep Learning algorithms can automatically learn and improve from data without the need for manual feature engineering.
Deep Learning has achieved significant success in various fields, including image recognition, natural language processing, speech recognition, and recommendation systems. Some of the popular Deep Learning architectures include Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Deep Belief Networks (DBNs). Training deep neural networks typically requires a large amount of data and computational resources; however, the availability of cloud computing and the development of specialized hardware, such as Graphics Processing Units (GPUs), has made this much easier.
In summary, Deep Learning is a subfield of Machine Learning that uses deep neural networks to model and solve complex problems. It has achieved significant success in many fields, and its use is expected to continue to grow as more data and more powerful computing resources become available.
What is Deep Learning?
Deep learning is the branch of machine learning which is based on artificial neural
network architecture. An artificial neural network or ANN uses layers of
interconnected nodes called neurons that work together to process and learn from
the input data.
In a fully connected Deep neural network, there is an input layer and one or more
hidden layers connected one after the other. Each neuron receives input from the
previous layer neurons or the input layer. The output of one neuron becomes the
input to other neurons in the next layer of the network, and this process continues
until the final layer produces the output of the network. The layers of the neural
network transform the input data through a series of nonlinear transformations,
allowing the network to learn complex representations of the input data.
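To make the layer-by-layer flow concrete, here is a tiny NumPy sketch, added purely for illustration (the layer sizes are arbitrary), of a forward pass through one hidden layer and one output layer:

import numpy as np

def relu(z):
    # simple nonlinear activation
    return np.maximum(0, z)

x = np.random.randn(4)                        # input vector with 4 features
W1, b1 = np.random.randn(5, 4), np.zeros(5)   # input -> hidden layer (5 units)
W2, b2 = np.random.randn(3, 5), np.zeros(3)   # hidden -> output layer (3 units)

h = relu(W1 @ x + b1)    # hidden layer applies a nonlinear transformation
y = W2 @ h + b2          # output layer produces the network's output
print(y.shape)           # (3,)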

Today deep learning has become one of the most popular and visible areas of machine learning, due to its success in a variety of applications such as computer vision, natural language processing, and reinforcement learning.
Deep learning can be used for supervised, unsupervised, and reinforcement machine learning, each of which it handles in a different way:
Supervised Machine Learning: the neural network learns to make predictions or classify data based on labeled datasets. Here we provide the input features along with the target variables. The network learns from the cost, or error, that comes from the difference between the predicted and the actual target; this learning process is driven by backpropagation. Deep learning architectures like convolutional neural networks and recurrent neural networks are used for many supervised tasks such as image classification and recognition, sentiment analysis, and language translation.
Unsupervised Machine Learning: the neural network learns to discover patterns in, or to cluster, an unlabeled dataset. There are no target variables; the machine has to determine the hidden patterns or relationships within the data on its own. Deep learning models like autoencoders and generative models are used for unsupervised tasks such as clustering, dimensionality reduction, and anomaly detection.
Reinforcement Machine Learning: an agent learns to make decisions in an environment so as to maximize a reward signal. The agent interacts with the environment by taking actions and observing the resulting rewards. Deep learning can be used to learn policies, or sets of actions, that maximize the cumulative reward over time. Deep reinforcement learning algorithms like Deep Q-Networks and Deep Deterministic Policy Gradient (DDPG) are used for tasks such as robotics and game playing.
Artificial Neural Networks
Artificial neural networks are built on the principles of the structure and operation of human neurons; they are also known as neural networks or neural nets. The input layer of an artificial neural network, which is the first layer, receives input from external sources and passes it on to the hidden layer, which is the second layer. Each neuron in the hidden layer gets information from the neurons in the previous layer, computes the weighted total, and then transfers it to the neurons in the next layer. These connections are weighted, which means that the influence of each input from the preceding layer is scaled by giving it a distinct weight. These weights are then adjusted during the training process to enhance the performance of the model.
Fully Connected Artificial Neural Network
Artificial neurons, also known as units, are the building blocks of artificial neural networks. The whole artificial neural network is composed of these artificial neurons, which are arranged in a series of layers. How complex a network needs to be depends on the complexity of the underlying patterns in the dataset: a layer may have a dozen units or millions of units. Commonly, an artificial neural network has an input layer, an output layer, and hidden layers. The input layer receives data from the outside world which the neural network needs to analyze or learn about.
In a fully connected artificial neural network, there is an input layer and one or
more hidden layers connected one after the other. Each neuron receives input from
the previous layer neurons or the input layer. The output of one neuron becomes the
input to other neurons in the next layer of the network, and this process continues
until the final layer produces the output of the network. Then, after passing
through one or more hidden layers, this data is transformed into valuable data for
the output layer. Finally, the output layer provides an output in the form of an
artificial neural network’s response to the data that comes in.
In most neural networks, units are linked to one another from one layer to the next. Each of these links has a weight that controls how much one unit influences another. As the data moves from unit to unit, the neural network learns more and more about it, ultimately producing an output from the output layer.
Difference between Machine Learning and Deep Learning:
Machine learning and deep learning are both subsets of artificial intelligence, but there are many similarities and differences between them.
- Machine Learning applies statistical algorithms to learn the hidden patterns and relationships in the dataset. Deep Learning uses artificial neural network architecture to learn the hidden patterns and relationships in the dataset.
- Machine Learning can work on a smaller amount of data. Deep Learning requires a larger volume of data compared to machine learning.
- Machine Learning is better for low-label tasks. Deep Learning is better for complex tasks like image processing, natural language processing, etc.
- Machine Learning takes less time to train the model. Deep Learning takes more time to train the model.
- In Machine Learning, a model is created from relevant features that are manually extracted from images to detect an object in the image. In Deep Learning, relevant features are automatically extracted from images; it is an end-to-end learning process.
- Machine Learning is less complex and its results are easier to interpret. Deep Learning is more complex; it works like a black box and its results are not easy to interpret.
- Machine Learning can work on a CPU, or requires less computing power compared to deep learning. Deep Learning requires a high-performance computer with a GPU.
Types of neural networks
Deep Learning models are able to automatically learn features from the data, which
makes them well-suited for tasks such as image recognition, speech recognition, and
natural language processing. The most widely used architectures in deep learning
are feedforward neural networks, convolutional neural networks (CNNs), and
recurrent neural networks (RNNs).
Feedforward neural networks (FNNs) are the simplest type of ANN, with a linear flow
of information through the network. FNNs have been widely used for tasks such as
image classification, speech recognition, and natural language processing.
Convolutional Neural Networks (CNNs) are designed specifically for image and video recognition tasks. CNNs are able to automatically learn features from images, which makes them well-suited for tasks such as image classification, object detection, and image segmentation.
Recurrent Neural Networks (RNNs) are a type of neural network that is able to
process sequential data, such as time series and natural language. RNNs are able to
maintain an internal state that captures information about the previous inputs,
which makes them well-suited for tasks such as speech recognition, natural language
processing, and language translation.
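The three architectures above can be sketched in a few lines of Keras; this snippet is an illustration added here, and the layer sizes and input shapes are arbitrary placeholders rather than values from the text:

from tensorflow import keras

# Feedforward network: information flows straight through Dense layers
fnn = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(20,)),
    keras.layers.Dense(10, activation='softmax'),
])

# Convolutional network: convolution and pooling layers for image-like inputs
cnn = keras.Sequential([
    keras.layers.Conv2D(16, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation='softmax'),
])

# Recurrent network: a SimpleRNN keeps an internal state across a sequence
rnn = keras.Sequential([
    keras.layers.SimpleRNN(32, input_shape=(100, 8)),
    keras.layers.Dense(1, activation='sigmoid'),
])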
Applications of Deep Learning:
The main applications of deep learning can be divided into computer vision, natural language processing (NLP), and reinforcement learning.
Computer vision
In computer vision, deep learning models enable machines to identify and understand visual data. Some of the main applications of deep learning in computer vision include:
- Object detection and recognition: deep learning models can be used to identify and locate objects within images and videos, making it possible for machines to perform tasks such as self-driving, surveillance, and robotics.
- Image classification: deep learning models can be used to classify images into categories such as animals, plants, and buildings. This is used in applications such as medical imaging, quality control, and image retrieval.
- Image segmentation: deep learning models can be used to segment images into different regions, making it possible to identify specific features within images.
Natural language processing (NLP)
In NLP, deep learning models enable machines to understand and generate human language. Some of the main applications of deep learning in NLP include:
- Automatic text generation: deep learning models can learn from a corpus of text, and new text such as summaries or essays can be automatically generated using these trained models.
- Language translation: deep learning models can translate text from one language to another, making it possible to communicate with people from different linguistic backgrounds.
- Sentiment analysis: deep learning models can analyze the sentiment of a piece of text, making it possible to determine whether the text is positive, negative, or neutral. This is used in applications such as customer service, social media monitoring, and political analysis.
- Speech recognition: deep learning models can recognize and transcribe spoken words, making it possible to perform tasks such as speech-to-text conversion, voice search, and voice-controlled devices.
Reinforcement learning
In reinforcement learning, deep learning is used to train agents to take actions in an environment so as to maximize a reward. Some of the main applications of deep learning in reinforcement learning include:
- Game playing: deep reinforcement learning models have been able to beat human experts at games such as Go, Chess, and Atari.
- Robotics: deep reinforcement learning models can be used to train robots to perform complex tasks such as grasping objects, navigation, and manipulation.
- Control systems: deep reinforcement learning models can be used to control complex systems such as power grids, traffic management, and supply chain optimization.
Challenges in Deep Learning
Deep learning has made significant advancements in various fields, but there are still some challenges that need to be addressed. Here are some of the main challenges in deep learning:
- Data availability: deep learning requires large amounts of data to learn from, and gathering enough data for training is a major concern.
- Computational resources: training a deep learning model is computationally expensive and requires specialized hardware like GPUs and TPUs.
- Time-consuming: depending on the computational resources, training on sequential data can take a very long time, sometimes days or months.
- Interpretability: deep learning models are complex and work like a black box, so it is very difficult to interpret the results.
- Overfitting: when the model is trained again and again on the same data, it becomes too specialized for the training data, leading to overfitting and poor performance on new data.
Advantages of Deep Learning:
- High accuracy: Deep Learning algorithms can achieve state-of-the-art performance in various tasks, such as image recognition and natural language processing.
- Automated feature engineering: Deep Learning algorithms can automatically discover and learn relevant features from data without the need for manual feature engineering.
- Scalability: Deep Learning models can scale to handle large and complex datasets, and can learn from massive amounts of data.
- Flexibility: Deep Learning models can be applied to a wide range of tasks and can handle various types of data, such as images, text, and speech.
- Continual improvement: Deep Learning models can continually improve their performance as more data becomes available.
Disadvantages of Deep Learning:
- High computational requirements: Deep Learning models require large amounts of data and computational resources to train and optimize.
- Requires large amounts of labeled data: Deep Learning models often require a large amount of labeled data for training, which can be expensive and time-consuming to acquire.
- Interpretability: Deep Learning models can be challenging to interpret, making it difficult to understand how they make decisions.
- Overfitting: Deep Learning models can sometimes overfit to the training data, resulting in poor performance on new and unseen data.
- Black-box nature: Deep Learning models are often treated as black boxes, making it difficult to understand how they work and how they arrived at their predictions.
In summary, while Deep Learning offers many advantages, including high accuracy and scalability, it also has some disadvantages, such as high computational requirements, the need for large amounts of labeled data, and interpretability challenges. These limitations need to be carefully considered when deciding whether to use Deep Learning for a specific task.

Artificial Intelligence is basically the mechanism for incorporating human intelligence into machines through a set of rules (algorithms). AI is a combination of two words: "Artificial", meaning something made by humans or non-natural, and "Intelligence", meaning the ability to understand or think accordingly. Another definition could be that "AI is basically the study of training your machines (computers) to mimic a human brain and its thinking capabilities".
AI focuses on three major aspects (skills): learning, reasoning, and self-correction, to obtain the maximum efficiency possible.
Machine Learning:
Machine Learning is basically the study/process that enables a system (computer) to learn automatically on its own through experience and improve accordingly, without being explicitly programmed. ML is an application or subset of AI. ML focuses on the development of programs so that a system can access data and use it for itself. The entire process makes observations on data to identify possible patterns and make better future decisions based on the examples provided. The major aim of ML is to allow systems to learn by themselves through experience without any kind of human intervention or assistance.
Deep Learning:
Deep Learning is a sub-part of the broader family of Machine Learning which makes use of neural networks (similar to the neurons working in our brain) to mimic human-brain-like behavior. DL algorithms focus on information-processing patterns to identify patterns just as the human brain does, and classify the information accordingly. DL works on larger sets of data than ML, and the prediction mechanism is self-administered by the machine.
Below is a table of differences between Artificial Intelligence, Machine Learning
and Deep Learning:
Artificial Intelligence | Machine Learning | Deep Learning
- AI stands for Artificial Intelligence, and is basically the study/process which enables machines to mimic human behaviour through a particular algorithm. ML stands for Machine Learning, and is the study that uses statistical methods enabling machines to improve with experience. DL stands for Deep Learning, and is the study that makes use of neural networks (similar to neurons present in the human brain) to imitate functionality just like a human brain.
- AI is the broader family consisting of ML and DL as its components. ML is a subset of AI. DL is a subset of ML.
- AI is a computer algorithm which exhibits intelligence through decision making. ML is an AI algorithm which allows a system to learn from data. DL is an ML algorithm that uses deep (more than one layer) neural networks to analyze data and provide output accordingly.
- Search trees and much more complex math are involved in AI. If you have a clear idea about the logic (math) involved and you can visualize complex functionalities like K-Means, Support Vector Machines, etc., then it defines the ML aspect. If you are clear about the math involved but don't have an idea about the features, so you break the complex functionalities into linear/lower-dimension features by adding more layers, then it defines the DL aspect.
- In AI, the aim is basically to increase the chances of success and not accuracy. In ML, the aim is to increase accuracy without caring much about the success ratio. DL attains the highest rank in terms of accuracy when it is trained with a large amount of data.
- Three broad categories/types of AI are: Artificial Narrow Intelligence (ANI), Artificial General Intelligence (AGI), and Artificial Super Intelligence (ASI). Three broad categories/types of ML are: Supervised Learning, Unsupervised Learning, and Reinforcement Learning. DL can be considered as neural networks with a large number of parameters and layers lying in one of four fundamental network architectures: Unsupervised Pre-trained Networks, Convolutional Neural Networks, Recurrent Neural Networks, and Recursive Neural Networks.
- The efficiency of AI is basically the efficiency provided by ML and DL respectively. ML is less efficient than DL as it can't work for longer dimensions or a higher amount of data. DL is more powerful than ML as it can easily work with larger sets of data.
- Examples of AI applications include: Google's AI-powered predictions, ridesharing apps like Uber and Lyft, commercial flights using an AI autopilot, etc. Examples of ML applications include: virtual personal assistants such as Siri, Alexa, and Google Assistant, and email spam and malware filtering. Examples of DL applications include: sentiment-based news aggregation, image analysis and caption generation, etc.
- AI refers to the broad field of computer science that focuses on creating intelligent machines that can perform tasks that would normally require human intelligence, such as reasoning, perception, and decision-making. ML is a subset of AI that focuses on developing algorithms that can learn from data and improve their performance over time without being explicitly programmed. DL is a subset of ML that focuses on developing deep neural networks that can automatically learn and extract features from data.
- AI can be further broken down into various subfields such as robotics, natural language processing, computer vision, expert systems, and more. ML algorithms can be categorized as supervised, unsupervised, or reinforcement learning: in supervised learning, the algorithm is trained on labeled data, where the desired output is known; in unsupervised learning, the algorithm is trained on unlabeled data, where the desired output is unknown. DL algorithms are inspired by the structure and function of the human brain, and they are particularly well-suited to tasks such as image and speech recognition.
- AI systems can be rule-based, knowledge-based, or data-driven. In reinforcement learning, the ML algorithm learns by trial and error, receiving feedback in the form of rewards or punishments. DL networks consist of multiple layers of interconnected neurons that process data in a hierarchical manner, allowing them to learn increasingly complex representations of the data.
AI vs. Machine Learning vs. Deep Learning Examples:
Artificial Intelligence (AI) refers to the development of computer systems that can
perform tasks that would normally require human intelligence.
There are numerous examples of AI applications across various industries. Here are some common examples:
- Speech recognition: speech recognition systems use deep learning algorithms to recognize and transcribe speech. These systems are used in a variety of applications, such as virtual assistants, call centers, and voice-controlled devices.
- Personalized recommendations: e-commerce sites and streaming services like Amazon and Netflix use AI algorithms to analyze users' browsing and viewing history to recommend products and content that they are likely to be interested in.
- Predictive maintenance: AI-powered predictive maintenance systems analyze data from sensors and other sources to predict when equipment is likely to fail, helping to reduce downtime and maintenance costs.
- Medical diagnosis: AI-powered medical diagnosis systems analyze medical images and other patient data to help doctors make more accurate diagnoses and treatment plans.
- Autonomous vehicles: self-driving cars and other autonomous vehicles use AI algorithms and sensor data, such as cameras and lidar, to analyze their environment and make decisions about speed, direction, navigation, obstacle avoidance, and route planning.
- Virtual Personal Assistants (VPAs) like Siri or Alexa: these use natural language processing to understand and respond to user requests, such as playing music, setting reminders, and answering questions.
- Fraud detection: financial institutions use AI to analyze transactions and detect patterns that are indicative of fraud, such as unusual spending patterns or transactions from unfamiliar locations.
- Image recognition: AI is used in applications such as photo organization, security systems, and autonomous robots to identify objects, people, and scenes in images.
- Natural language processing: AI is used in chatbots and language translation systems to understand and generate human-like text.
- Predictive analytics: AI is used in industries such as healthcare and marketing to analyze large amounts of data and make predictions about future events, such as disease outbreaks or consumer behavior.
- Game-playing AI: AI algorithms have been developed to play games such as chess, Go, and poker at a superhuman level, by analyzing game data and making predictions about the outcomes of moves.
Examples of Machine Learning:
Machine Learning (ML) is a subset of Artificial Intelligence (AI) that involves the use of algorithms and statistical models to allow a computer system to "learn" from data and improve its performance over time, without being explicitly programmed to do so.
Here are some examples of Machine Learning:
- Image recognition: machine learning algorithms are used in image recognition systems to classify images based on their contents. These systems are used in a variety of applications, such as self-driving cars, security systems, and medical imaging.
- Speech recognition: machine learning algorithms are used in speech recognition systems to transcribe speech and identify the words spoken, allowing for voice-controlled interfaces and dictation software. These systems are used in virtual assistants like Siri and Alexa, as well as in call centers and other applications.
- Natural language processing (NLP): machine learning algorithms are used in NLP systems to understand and generate human language. These systems are used in chatbots, virtual assistants, and other applications that involve natural language interactions.
- Recommendation systems: machine learning algorithms are used in recommendation systems to analyze user data, such as browsing and purchase history, and recommend products or services that are likely to be of interest. These systems are used in e-commerce sites, streaming services, and other applications.
- Sentiment analysis: machine learning algorithms are used in sentiment analysis systems to classify the sentiment of text or speech as positive, negative, or neutral. These systems are used in social media monitoring and other applications.
- Predictive maintenance: machine learning algorithms are used in predictive maintenance systems, for example in manufacturing, to analyze data from sensors and other sources and predict when equipment is likely to fail, allowing for proactive maintenance and reducing downtime and maintenance costs.
- Spam filters in email: ML algorithms analyze email content and metadata to identify and flag messages that are likely to be spam.
- Credit risk assessment: ML algorithms are used by financial institutions to assess the credit risk of loan applicants, by analyzing data such as their income, employment history, and credit score.
- Customer segmentation: ML algorithms are used in marketing to segment customers into different groups based on their characteristics and behavior, allowing for targeted advertising and promotions.
- Fraud detection: ML algorithms are used in financial transactions to detect patterns of behavior that are indicative of fraud, such as unusual spending patterns or transactions from unfamiliar locations.
Examples of Deep Learning:
Deep Learning is a type of Machine Learning that uses artificial neural networks with multiple layers to learn and make decisions.
Here are some examples of Deep Learning:
- Image and video recognition: deep learning algorithms are used in image and video recognition systems to classify and analyze visual data, such as recognizing faces in photos or identifying items in an image for an e-commerce website. These systems are used in self-driving cars, security systems, and medical imaging.
- Generative models: deep learning algorithms are used in generative models to create new content based on existing data. These systems are used in image and video generation, text generation, and other applications.
- Autonomous vehicles: deep learning algorithms are used in self-driving cars and other autonomous vehicles to analyze sensor data and make decisions about speed, direction, and other factors.
- Speech recognition: deep learning algorithms are used to transcribe spoken words into text, allowing for voice-controlled interfaces and dictation software.
- Natural language processing: deep learning algorithms are used for tasks such as sentiment analysis, language translation, and text generation.
- Recommender systems: deep learning algorithms are used in recommendation systems to make personalized recommendations based on users' behavior and preferences.
- Fraud detection: deep learning algorithms are used in financial transactions to detect patterns of behavior that are indicative of fraud, such as unusual spending patterns or transactions from unfamiliar locations.
- Game-playing AI: deep learning algorithms have been used to develop game-playing AI that can compete at a superhuman level, such as the AlphaGo AI that defeated the world champion in the game of Go.
- Time series forecasting: deep learning algorithms are used to forecast future values in time series data, such as stock prices, energy consumption, and weather patterns.
AI vs. ML vs. DL: Is There a Difference?
Working in AI is not the same as being an ML or DL engineer. Here’s how you can
tell those careers apart and decide which one is the right call for you.
What Does an AI Engineer Do?
An AI Engineer is a professional who designs, develops, and implements artificial
intelligence (AI) systems and solutions. Here are some of the key responsibilities
and tasks of an AI Engineer:
- Design and development of AI algorithms: AI Engineers design, develop, and implement AI algorithms, such as decision trees, random forests, and neural networks, to solve specific problems.
- Data analysis: AI Engineers analyze and interpret data, using statistical and mathematical techniques, to identify patterns and relationships that can be used to train AI models.
- Model training and evaluation: AI Engineers train AI models on large datasets, evaluate their performance, and adjust the parameters of the algorithms to improve accuracy.
- Deployment and maintenance: AI Engineers deploy AI models into production environments and maintain and update them over time.
- Collaboration with stakeholders: AI Engineers work closely with stakeholders, including data scientists, software engineers, and business leaders, to understand their requirements and ensure that the AI solutions meet their needs.
- Research and innovation: AI Engineers stay current with the latest advancements in AI and contribute to the research and development of new AI techniques and algorithms.
- Communication: AI Engineers communicate the results of their work, including the performance of AI models and their impact on business outcomes, to stakeholders.
An AI Engineer must have a strong background in computer science, mathematics, and
statistics, as well as experience in developing AI algorithms and solutions. They
should also be familiar with programming languages, such as Python and R.
What Does a Machine Learning Engineer Do?
A Machine Learning Engineer is a professional who designs, develops, and implements
machine learning (ML) systems and solutions. Here are some of the key
responsibilities and tasks of a Machine Learning Engineer:
- Design and development of ML algorithms: Machine Learning Engineers design, develop, and implement ML algorithms, such as decision trees, random forests, and neural networks, to solve specific problems.
- Data analysis: Machine Learning Engineers analyze and interpret data, using statistical and mathematical techniques, to identify patterns and relationships that can be used to train ML models.
- Model training and evaluation: Machine Learning Engineers train ML models on large datasets, evaluate their performance, and adjust the parameters of the algorithms to improve accuracy.
- Deployment and maintenance: Machine Learning Engineers deploy ML models into production environments and maintain and update them over time.
- Collaboration with stakeholders: Machine Learning Engineers work closely with stakeholders, including data scientists, software engineers, and business leaders, to understand their requirements and ensure that the ML solutions meet their needs.
- Research and innovation: Machine Learning Engineers stay current with the latest advancements in ML and contribute to the research and development of new ML techniques and algorithms.
- Communication: Machine Learning Engineers communicate the results of their work, including the performance of ML models and their impact on business outcomes, to stakeholders.
A Machine Learning Engineer must have a strong background in computer science,
mathematics, and statistics, as well as experience in developing ML algorithms and
solutions. They should also be familiar with programming languages, such as Python
and R, and have experience working with ML frameworks and tools.
What Does a Deep Learning Engineer Do?
A Deep Learning Engineer is a professional who designs, develops, and implements
deep learning (DL) systems and solutions. Here are some of the key responsibilities
and tasks of a Deep Learning Engineer:
- Design and development of DL algorithms: Deep Learning Engineers design, develop, and implement deep neural networks and other DL algorithms to solve specific problems.
- Data analysis: Deep Learning Engineers analyze and interpret large datasets, using statistical and mathematical techniques, to identify patterns and relationships that can be used to train DL models.
- Model training and evaluation: Deep Learning Engineers train DL models on massive datasets, evaluate their performance, and adjust the parameters of the algorithms to improve accuracy.
- Deployment and maintenance: Deep Learning Engineers deploy DL models into production environments and maintain and update them over time.
- Collaboration with stakeholders: Deep Learning Engineers work closely with stakeholders, including data scientists, software engineers, and business leaders, to understand their requirements and ensure that the DL solutions meet their needs.
- Research and innovation: Deep Learning Engineers stay current with the latest advancements in DL and contribute to the research and development of new DL techniques and algorithms.
- Communication: Deep Learning Engineers communicate the results of their work, including the performance of DL models and their impact on business outcomes, to stakeholders.
Do you ever think of what it’s like to build anything like a brain, how these
things work, or what they do? Let us look at how nodes communicate with neurons and
what are some differences between artificial and biological neural networks.
1. Artificial Neural Network: An Artificial Neural Network (ANN) is a type of neural network that is based on a feed-forward strategy. It is called this because information is passed through the nodes continuously until it reaches the output node. This is also known as the simplest type of neural network.
Some advantages of ANN:
- Ability to learn irrespective of the type of data (linear or non-linear).
- ANN is highly volatile and serves best in financial time-series forecasting.
Some disadvantages of ANN:
- The simple architecture makes it difficult to explain the behavior of the network.
- This network is dependent on hardware.
2. Biological Neural Network: A Biological Neural Network (BNN) is a structure that consists of synapses, dendrites, a cell body (soma), and an axon. In this neural network, the processing is carried out by neurons. Dendrites receive signals from other neurons, the soma sums all the incoming signals, and the axon transmits the signals to other cells.
Some advantages of BNN:
- The synapses are the input processing elements.
- It is able to process highly complex parallel inputs.
Some disadvantages of BNN:
- There is no controlling mechanism.
- The speed of processing is slow because it is complex.
Differences between ANN and BNN :
Biological Neural Networks (BNNs) and Artificial Neural Networks (ANNs) are both
composed of similar basic components, but there are some differences between them.
Neurons: In both BNNs and ANNs, neurons are the basic building blocks that process and transmit information. However, BNN neurons are more complex and diverse than ANN neurons. In BNNs, neurons have multiple dendrites that receive input from multiple sources, and their axons transmit signals to other neurons, while in ANNs, neurons are simplified and usually only have a single output.
Synapses: In both BNNs and ANNs, synapses are the points of connection between
neurons, where information is transmitted. However, in ANNs, the connections
between neurons are usually fixed, and the strength of the connections is
determined by a set of weights, while in BNNs, the connections between neurons are
more flexible, and the strength of the connections can be modified by a variety of
factors, including learning and experience.
Neural Pathways: In both BNNs and ANNs, neural pathways are the connections between
neurons that allow information to be transmitted throughout the network. However,
in BNNs, neural pathways are highly complex and diverse, and the connections
between neurons can be modified by experience and learning. In ANNs, neural
pathways are usually simpler and predetermined by the architecture of the network.
Parameters: ANN vs. BNN
- Structure: ANN: input, weight, output, hidden layer. BNN: dendrites, synapse, axon, cell body.
- Learning: ANN: needs very precise structures and formatted data. BNN: can tolerate ambiguity.
- Processor: ANN: complex, high speed, one or a few. BNN: simple, low speed, a large number.
- Memory: ANN: separate from the processor, localized, not content-addressable. BNN: integrated into the processor, distributed, content-addressable.
- Computing: ANN: centralized, sequential, stored programs. BNN: distributed, parallel, self-learning.
- Reliability: ANN: very vulnerable. BNN: robust.
- Expertise: ANN: numerical and symbolic manipulations. BNN: perceptual problems.
- Operating environment: ANN: well-defined, well-constrained. BNN: poorly defined, unconstrained.
- Fault tolerance: ANN: has the potential for fault tolerance. BNN: performance degrades even on partial damage.
Overall, while BNNs and ANNs share many basic components, there are significant
differences in their complexity, flexibility, and adaptability. BNNs are highly
complex and adaptable systems that can process information in parallel, and their
plasticity allows them to learn and adapt over time. In contrast, ANNs are simpler
systems that are designed to perform specific tasks, and their connections are
usually fixed, with the network architecture determined by the designer.

In this article, we will be understanding the single-layer perceptron and its implementation in Python using the TensorFlow library. Neural networks work in much the same way that our biological neurons work.
Structure of a biological neuron
A biological neuron has three basic functions:
- Receive signals from outside.
- Process the signals and decide whether or not to pass information on.
- Communicate the signal to the target cell, which can be another neuron or a gland.
Neural networks work in the same way.
Neural Network in Machine Learning
What is a Single-Layer Perceptron?
It is one of the oldest and first introduced neural networks. It was proposed by Frank Rosenblatt in 1958. The perceptron is also considered the simplest artificial neural network. A perceptron is mainly used to compute logical gates like AND, OR, and NOR, which have binary inputs and binary outputs.
The main functionality of the perceptron is to:
- Take input from the input layer.
- Weight the inputs and sum them up.
- Pass the sum through a nonlinear activation function to produce the output.
Single-layer neural network
Here the activation function can be anything like sigmoid, tanh, or ReLU; based on the requirement, we choose the most appropriate nonlinear activation function to produce the best result. A toy example of these three steps is sketched below, after which we implement a single-layer perceptron on real data.
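A minimal NumPy sketch of those three steps for the logical AND gate (illustrative only; the weights and bias below are picked by hand rather than learned):

import numpy as np

def perceptron(x, w, b):
    # weighted sum of the inputs plus bias, followed by a step activation
    return 1 if np.dot(w, x) + b > 0 else 0

w, b = np.array([1.0, 1.0]), -1.5            # hand-picked values for AND
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x), w, b))  # prints 0, 0, 0, 1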
IMPLEMENTATION OF SINGLE-LAYER PERCEPTRON
Let us now implement a single-layer perceptron on the MNIST dataset using the TensorFlow library.
Step 1: Import the necessary libraries.
- NumPy: NumPy arrays are very fast and can perform large computations in a very short time.
- Matplotlib: this library is used to draw visualizations.
- TensorFlow: an open-source library used for machine learning and artificial intelligence; it provides a range of functions to achieve complex functionality with single lines of code.

Python3

import numpy as np
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
%matplotlib inline
Step 2: Now load the dataset using Keras from the imported version of TensorFlow.

Python3

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

Step 3: Now display the shape of the data and a single image from the dataset. Each image is a 28*28 matrix; the training set contains 60,000 images and the testing set contains 10,000.

Python3

len(x_train)
len(x_test)
x_train[0].shape
plt.matshow(x_train[0])

Output:
Sample image from the training dataset
Step 4: Now normalize the dataset so that the computations are fast and accurate.

Python3

# Normalizing the dataset
x_train = x_train / 255
x_test = x_test / 255

# Flattening the dataset in order
# to compute for model building
x_train_flatten = x_train.reshape(len(x_train), 28*28)
x_test_flatten = x_test.reshape(len(x_test), 28*28)

Step 5: Building the neural network with a single-layer perceptron. Here we can observe that, as the model is a single-layer perceptron, it contains only one input layer and one output layer; there are no hidden layers.

Python3

model = keras.Sequential([
    keras.layers.Dense(10, input_shape=(784,), activation='sigmoid')
])
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'])
model.fit(x_train_flatten, y_train, epochs=5)
Output:
Training progress per epoch
Step 6: Output the accuracy of the model on the testing data.

Python3

model.evaluate(x_test_flatten, y_test)

Output:
Model's performance on the testing data
In this article, we will understand the concept of a multi-layer perceptron and its
implementation in Python using the TensorFlow library.
Multi-layer Perceptron
A multi-layer perceptron is also known as an MLP. It is made of fully connected dense layers, which transform any input dimension to the desired dimension. A multi-layer perceptron is a neural network that has multiple layers. To create a neural network we combine neurons together so that the outputs of some neurons are the inputs of other neurons.
A gentle introduction to neural networks and TensorFlow can be found here:
Neural Networks | Introduction to TensorFlow
A multi-layer perceptron has one input layer with one neuron (or node) for each input, one output layer with a single node for each output, and it can have any number of hidden layers, each with any number of nodes. A schematic diagram of a Multi-Layer Perceptron (MLP) is depicted below.

In the multi-layer perceptron diagram above, we can see that there are three inputs
and thus three input nodes and the hidden layer has three nodes. The output layer
gives two outputs, therefore there are two output nodes. The nodes in the input
layer take input and forward it for further process, in the diagram above the nodes
in the input layer forwards their output to each of the three nodes in the hidden
layer, and in the same way, the hidden layer processes the information and passes
it to the output layer.
Every node in the multi-layer perceptron uses a sigmoid activation function. The sigmoid activation function takes real values as input and converts them to numbers between 0 and 1 using the sigmoid formula.
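For reference, the sigmoid formula mentioned above is sigmoid(x) = 1 / (1 + e^(-x)), which squashes any real-valued input into the range (0, 1).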

Now that we are done with the theory part of the multi-layer perceptron, let's go ahead and implement some code in Python using the TensorFlow library.
Stepwise Implementation
Step 1: Import the necessary libraries.

Python3

# importing modules
import tensorflow as tf
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Activation
import matplotlib.pyplot as plt
Step 2: Download the dataset.
TensorFlow allows us to read the MNIST dataset and we can load it directly in the
program as a train and test dataset.

Python3

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

Output:
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] – 2s 0us/step

Step 3: Now we will convert the pixels into floating-point values.

Python3

# Cast the records into float values
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

# normalize image pixel values by dividing by 255
gray_scale = 255
x_train /= gray_scale
x_test /= gray_scale

We are converting the pixel values into floating-point values to make the predictions. Scaling the grayscale values down is beneficial because the values become small and the computation becomes easier and faster. Since the pixel values range from 0 to 255, dividing all the values by 255 converts them to the range from 0 to 1.
Step 4: Understand the structure of the dataset

Python3
print("Feature matrix:", x_train.shape) print("Target matrix:", x_test.shape)
print("Feature matrix:", y_train.shape) print("Target matrix:", y_test.shape)

Output:
Feature matrix: (60000, 28, 28)
Target matrix: (10000, 28, 28)
Feature matrix: (60000,)
Target matrix: (10000,)
Thus we have 60,000 records in the training dataset and 10,000 records in the test dataset, and every image in the dataset is of size 28×28.
Step 5: Visualize the data.

Python3
fig, ax = plt.subplots(10, 10)
k = 0
for i in range(10):
    for j in range(10):
        ax[i][j].imshow(x_train[k].reshape(28, 28), aspect='auto')
        k += 1
plt.show()

Output

Step 6: Form the Input, hidden, and output layers.

Python3

model = Sequential([
    # reshape 28 row * 28 column data to 28*28 rows
    Flatten(input_shape=(28, 28)),

    # dense layer 1
    Dense(256, activation='sigmoid'),

    # dense layer 2
    Dense(128, activation='sigmoid'),

    # output layer
    Dense(10, activation='sigmoid'),
])
Some important points to note:
- The Sequential model allows us to create models layer-by-layer as we need in a multi-layer perceptron, and is limited to single-input, single-output stacks of layers.
- Flatten flattens the input provided without affecting the batch size. For example, if inputs are shaped (batch_size,) without a feature axis, then flattening adds an extra channel dimension and the output shape is (batch_size, 1).
- Activation is for using the sigmoid activation function.
- The first two Dense layers are used to make a fully connected model and are the hidden layers.
- The last Dense layer is the output layer, which contains 10 neurons that decide which category the image belongs to.
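Optionally, you can inspect the architecture built above before compiling; model.summary() is a standard Keras call, shown here as an extra illustration rather than as part of the original steps.

Python3

# Print each layer's output shape and parameter count for the model above
model.summary()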
Step 7: Compile the model.

Python

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
The compile function involves specifying a loss function, an optimizer, and metrics. The loss function used here is sparse_categorical_crossentropy and the optimizer is adam.
Step 8: Fit the model.

Python3

model.fit(x_train, y_train, epochs=10,
          batch_size=2000,
          validation_split=0.2)

Output:

Some important points to note:
- Epochs tell us the number of times the model will be trained in forward and backward passes.
- Batch size represents the number of samples per gradient update; if unspecified, batch_size defaults to 32.
- Validation split is a float value between 0 and 1. The model will set apart this fraction of the training data to evaluate the loss and any model metrics at the end of each epoch. (The model will not be trained on this data.)
Step 9: Find Accuracy of the model.

Python3
results = model.evaluate(x_test, y_test, verbose=0)
print('test loss, test acc:', results)

Output:
test loss, test acc: [0.27210235595703125, 0.9223999977111816]
We obtained an accuracy of about 92% for our model by using model.evaluate() on the test samples.
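As a further illustrative step not covered in the original walkthrough, the trained model can be used to predict individual digits; np.argmax picks the class with the highest predicted score.

Python3

# Predict class scores for the first 5 test images (the model expects 28x28
# inputs because of the Flatten layer) and report the most likely digit
probs = model.predict(x_test[:5])
print(np.argmax(probs, axis=1))   # predicted digits
print(y_test[:5])                 # true labels for comparison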

This article aims to implement a deep neural network from scratch. We will implement a deep neural network containing one hidden layer with four units and one output layer. The implementation will be built from scratch, and the following steps will be implemented.
Algorithm:
1. Visualizing the input data
2. Deciding the shapes of Weight and bias matrix
3. Initializing matrix, function to be used
4. Implementing the forward propagation method
5. Implementing the cost calculation
6. Backpropagation and optimizing
7. Prediction and visualizing the output

Architecture of the model:
The architecture of the model is defined by the following figure, where the hidden layer uses the hyperbolic tangent (tanh) as its activation function, while the output layer, since this is a classification problem, uses the sigmoid function.

Model Architecture
Weights and bias:
The weights and biases to be used for both layers have to be declared initially. The weights will be initialized randomly in order to avoid all units producing the same output, while the biases will be initialized to zero. The calculation will be done from scratch according to the rules given below, where W1, W2 and b1, b2 are the weights and biases of the first and second layers respectively. Here A stands for the activation of a particular layer.
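The original figure with these rules is not reproduced here; based on the forward_prop code further below, the layer rules are:

Z1 = W1 · X + b1,   A1 = tanh(Z1)
Z2 = W2 · A1 + b2,  A2 = sigmoid(Z2)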

Cost Function:
The cost function of the above model will pertain to the cost function used with
logistic regression. Hence, in this tutorial we will be using the cost function:

Code: Visualizing the data

# Package imports
import numpy as np
import matplotlib.pyplot as plt

# here planar_utils.py can be found on its github repo
from planar_utils import plot_decision_boundary, sigmoid, load_planar_dataset

# Loading the Sample data
X, Y = load_planar_dataset()

# Visualize the data:
plt.scatter(X[0, :], X[1, :], c=Y, s=40, cmap=plt.cm.Spectral)
Code: Initializing the weight and bias matrices
Here the number of hidden units is four, so the W1 weight matrix will be of shape (4, number of features) and the bias matrix b1 will be of shape (4, 1), which after broadcasting will be added to the product of the weight matrix and the inputs, according to the formula above. The same applies to W2 and b2.

# X --> input dataset of shape (input size, number of examples)
# Y --> labels of shape (output size, number of examples)
W1 = np.random.randn(4, X.shape[0]) * 0.01
b1 = np.zeros(shape=(4, 1))
W2 = np.random.randn(Y.shape[0], 4) * 0.01
b2 = np.zeros(shape=(Y.shape[0], 1))

Code: Forward propagation
Now we will perform forward propagation using W1, W2 and the biases b1, b2. In this step the corresponding outputs are calculated in the function defined as forward_prop.

def forward_prop(X, W1, W2, b1, b2):
    Z1 = np.dot(W1, X) + b1
    A1 = np.tanh(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)

    # here the cache is the data of the previous iteration
    # This will be used for backpropagation
    cache = {"Z1": Z1,
             "A1": A1,
             "Z2": Z2,
             "A2": A2}

    return A2, cache

Code: Defining the cost function

# Here Y is the actual output
def compute_cost(A2, Y):
    m = Y.shape[1]
    # implementing the above formula
    cost_sum = np.multiply(np.log(A2), Y) + np.multiply((1 - Y), np.log(1 - A2))
    cost = - np.sum(cost_sum) / m

    # Squeezing to avoid unnecessary dimensions
    cost = np.squeeze(cost)
    return cost
Code: Finally, the back-propagation function:
This is a very crucial step as it involves a lot of linear algebra for the implementation of backpropagation of deep neural nets. The formulas for finding the derivatives can be derived with some mathematical concepts of linear algebra, which we are not going to derive here. Just keep in mind that dZ, dW, db are the derivatives of the cost function w.r.t. the weighted sum, weights, and bias of the layers.

# Note: X, Y, m and learning_rate are assumed to be defined globally
def back_propagate(W1, b1, W2, b2, cache):

    # Retrieve A1 and A2 from the dictionary "cache"
    A1 = cache['A1']
    A2 = cache['A2']

    # Backward propagation: calculate dW1, db1, dW2, db2.
    dZ2 = A2 - Y
    dW2 = (1 / m) * np.dot(dZ2, A1.T)
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = np.multiply(np.dot(W2.T, dZ2), 1 - np.power(A1, 2))
    dW1 = (1 / m) * np.dot(dZ1, X.T)
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)

    # Updating the parameters according to the algorithm
    W1 = W1 - learning_rate * dW1
    b1 = b1 - learning_rate * db1
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * db2
    return W1, W2, b1, b2
Code: Training the custom model
Now we will train the model using the functions defined above; the number of epochs can be set as per the convenience and power of the processing unit.

# Please note that the weights and bias are global
# Here num_iterations is the number of epochs
for i in range(0, num_iterations):

    # Forward propagation. Inputs: "X, parameters". Returns: "A2, cache".
    A2, cache = forward_prop(X, W1, W2, b1, b2)

    # Cost function. Inputs: "A2, Y". Outputs: "cost".
    cost = compute_cost(A2, Y)

    # Backpropagation. Inputs: "parameters, cache, X, Y". Outputs: updated weights.
    W1, W2, b1, b2 = back_propagate(W1, b1, W2, b2, cache)

    # Print the cost every 1000 iterations
    if print_cost and i % 1000 == 0:
        print("Cost after iteration % i: % f" % (i, cost))

Output with learnt parameters

After training the model, take the final weights and predict the outcomes using the forward_prop function defined above, then use the predicted values to plot the output figure; a minimal prediction helper is sketched below. You should see a similar output.
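A minimal prediction helper consistent with the functions above (this exact helper is not part of the original code, so treat it as a sketch):

def predict(X, W1, W2, b1, b2):
    # Forward pass, then threshold the sigmoid output at 0.5
    A2, _ = forward_prop(X, W1, W2, b1, b2)
    return (A2 > 0.5).astype(int)

The predicted labels can then be used together with plot_decision_boundary from planar_utils to draw the decision regions.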
Visualizing the boundaries of data
Conclusion:
Deep Learning is a field in which the greatest progress is made by those who master the basics, so try to build such strong fundamentals that afterwards you may be the one developing a new model architecture that revolutionizes the community.

Let's understand how errors are calculated and weights are updated in backpropagation networks (BPNs).
Consider the network in the figure below.
Backpropagation Network (BPN)
The network in the above figure is a simple multi-layer feed-forward network, or backpropagation network. It contains three layers: the input layer with two neurons x1 and x2, the hidden layer with two neurons z1 and z2, and the output layer with one neuron whose net input is yin.
Now let’s write down the weights and bias vectors for each neuron.
Note: The weights are taken randomly.
Input layer: i/p – [x1 x2] = [0 1]
Here since it is the input layer only the input values are present.
Hidden layer: z1 – [v11 v21 v01] = [0.6 -0.1 0.3]
Here v11 refers to the weight of first input x1 on z1, v21 refers to the weight of
second input x2 on z1 and v01 refers to the bias value on z1.
z2 – [v12 v22 v02] = [-0.3 0.4 0.5]
Here v12 refers to the weight of first input x1 on z2, v22 refers to the weight of
second input x2 on z2 and v02 refers to the bias value on z2.
Output layer: yin – [w11 w21 w01] = [0.4 0.1 -0.2]
Here w11 refers to the weight of first neuron z1 in a hidden layer on yin, w21
refers to the weight of second neuron z2 in a hidden layer on yin and w01 refers to
the bias value on yin. Let’s consider three variables, k which refers to the
neurons in the output layer, ‘j’ which refers to the neurons in the hidden layer
and ‘i’ which refers to the neurons in the input layer.
Therefore,
k = 1
j = 1, 2(meaning first neuron and second neuron in hidden layer)
i = 1, 2(meaning first and second neuron in the input layer)
Below are some conditions to be followed in BPNs.
Conditions/Constraints:
- In a BPN, the activation function used should be differentiable.
- The input for the bias is always 1.
To proceed with the problem, let:
Target value, t = 1
Learning rate, α = 0.25
Activation function = Binary sigmoid function
Binary sigmoid function, f(x) = 1 / (1 + e^(-x))   eq. (1)
And, f'(x) = f(x)[1 - f(x)]   eq. (2)
There are three steps to solve the problem:
1. Computing the output, y.
2. Backpropagation of errors, i.e., between the output and hidden layer, and between the hidden and input layer.
3. Updating the weights.
Step 1:
The value y is calculated by finding yin and applying the activation function.
yin is calculated as:
yin = w01 + z1*w11 + z2*w21 eq. (3)
Here, z1 and z2 are the values from hidden layer, calculated by finding zin1, zin2
and applying activation function to them.
zin1 and zin2 are calculated as:
zin1 = v01 + x1*v11 + x2*v21 eq. (4)
zin2 = v02 + x1*v12 + x2*v22 eq. (5)
From (4)
zin1 = 0.3 + 0*0.6 + 1*(-0.1)
zin1 = 0.2
z1 = f(zin1) = 1 / (1 + e^(-0.2)) From (1)
z1 = 0.5498
From (5)
zin2 = 0.5 + 0*(-0.3) + 1*0.4
zin2 = 0.9
z2 = f(zin2) = 1 / (1 + e^(-0.9)) From (1)
z2 = 0.7109
From (3)
yin = (-0.2) + 0.5498*0.4 + 0.7109*0.1
yin = 0.0910
y = f(yin) = 1 / (1 + e^(-0.0910)) From (1)
y = 0.5227
Here, y is not equal to the target ‘t’, which is 1. And we proceed to calculate the
errors and then update weights from them in order to achieve the target value.
Step 2:
(a) Calculating the error between the output and hidden layer
Error between output and hidden layer is represented as δk, where k represents the
neurons in output layer as mentioned above. The error is calculated as:
δk = (tk – yk) * f'(yink) eq. (6)
where, f'(yink) = f(yink)[1 – f(yink)] From (2)
Since k = 1 (Assumed above),
δ = (t – y) f'(yin) eq. (7)
where, f'(yin) = f(yin)[1 – f(yin)]
f'(yin) = 0.5227[1 – 0.5227]
f'(yin) = 0.2495
Therefore,
δ = (1 – 0.5227) * 0.2495 From (7)
δ = 0.1191, is the error
Note: (Target – Output), i.e., (t – y), is the error in the output, not the error in the layer. The error in a layer is contributed by different factors like the weights and bias.
(b) Calculating the error between the hidden and input layer
Error between hidden and input layer is represented as δj, where j represents the
number of neurons in the hidden layer as mentioned above. The error is calculated
as:
δj = δinj * f'(zinj) eq. (8)
where,
δinj = ∑k=1 to n (δk * wjk) eq. (9)
f'(zinj) = f(zinj)[1 – f(zinj)] eq. (10)
Since k = 1(Assumed above) eq. (9) becomes:
δinj = δ * wj1 eq. (11)
As j = 1, 2, we will have one error value for each neuron, giving a total of 2 error values.
δ1 = δin1 * f'(zin1) eq. (12), From (8)
δin1 = δ * w11 From (11)
δin1 = 0.1191 * 0.4 From weights vectors
δin1 = 0.04764
f'(zin1) = f(zin1)[1 – f(zin1)]
f'(zin1) = 0.5498[1 – 0.5498] As f(zin1) = z1
f'(zin1) = 0.2475
Substituting in (12)
δ1 = 0.04764 * 0.2475 = 0.0118
δ2 = δin2 * f'(zin2) eq. (13), From (8)
δin2 = δ * w21 From (11)
δin2 = 0.1191 * 0.1 From weights vectors
δin2 = 0.0119
f'(zin2) = f(zin2)[1 – f(zin2)]
f'(zin2) = 0.7109[1 – 0.7109] As f(zin2) = z2
f'(zin2) = 0.2055
Substituting in (13)
δ2 = 0.0119 * 0.2055 = 0.00245
The errors have been calculated, the weights have to be updated using these error
values.
Step 3:
The formula for updating weights for output layer is:
wjk(new) = wjk(old) + Δwjk eq. (14)
where, Δwjk = α * δk * zj eq. (15)
Since k = 1, (15) becomes:
Δwj1 = α * δ * zj eq. (16)
The formula for updating weights for hidden layer is:
vij(new) = vij(old) + Δvij eq. (17)
where, Δvij = α * δj * xi eq. (18)
From (14) and (16)
w11(new) = w11(old) + Δw11 = 0.4 + α * δ * z1 = 0.4 + 0.25 * 0.1191 * 0.5498 =
0.4164
w21(new) = w21(old) + Δw21 = 0.1 + α * δ * z2 = 0.1 + 0.25 * 0.1191 * 0.7109 =
0.12117
w01(new) = w01(old) + Δw01 = (-0.2) + α * δ * bias = (-0.2) + 0.25 * 0.1191 * 1 = -0.1702; kindly note the 1 taken here is the input considered for the bias as per the conditions.
These are the updated weights of the output layer.
From (17) and (18)
v11(new) = v11(old) + Δv11 = 0.6 + α * δ1 * x1 = 0.6 + 0.25 * 0.0118 * 0 = 0.6
v21(new) = v21(old) + Δv21 = (-0.1) + α * δ1 * x2 = (-0.1) + 0.25 * 0.0118 * 1 = -0.09705
v01(new) = v01(old) + Δv01 = 0.3 + α * δ1 * bias = 0.3 + 0.25 * 0.0118 * 1 = 0.30295; kindly note the 1 taken here is the input considered for the bias as per the conditions.
v12(new) = v12(old) + Δv12 = (-0.3) + α * δ2 * x1 = (-0.3) + 0.25 * 0.00245 * 0 =
-0.3
v22(new) = v22(old) + Δv22 = 0.4 + α * δ2 * x2 = 0.4 + 0.25 * 0.00245 * 1 =
0.400612
v02(new) = v02(old) + Δv02 = 0.5 + α * δ2 * bias = 0.5 + 0.25 * 0.00245 * 1 =
0.500612, kindly note the 1 taken here is input considered for bias as per the
conditions.
These are all the updated weights of the hidden layer.
These three steps are repeated until the output ‘y’ is sufficiently close to the target ‘t’.
This is how BPNs work. The term backpropagation in a BPN refers to the fact that the error computed at the present layer is propagated backwards and used to update the weights between the present and previous layers.
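To make the arithmetic above easy to reproduce, here is a small Python sketch (an illustrative addition that simply re-implements the worked example with the same inputs, weights, target and learning rate):

Python3

import math

def f(x):                       # binary sigmoid, eq. (1)
    return 1.0 / (1.0 + math.exp(-x))

# Inputs, target and learning rate from the worked example
x1, x2 = 0, 1
t, alpha = 1, 0.25

# Initial weights and biases
v11, v21, v01 = 0.6, -0.1, 0.3      # hidden neuron z1
v12, v22, v02 = -0.3, 0.4, 0.5      # hidden neuron z2
w11, w21, w01 = 0.4, 0.1, -0.2      # output neuron y

# Step 1: forward pass
zin1 = v01 + x1 * v11 + x2 * v21
zin2 = v02 + x1 * v12 + x2 * v22
z1, z2 = f(zin1), f(zin2)
yin = w01 + z1 * w11 + z2 * w21
y = f(yin)

# Step 2: backpropagate the errors
delta = (t - y) * y * (1 - y)             # output-layer error, eq. (7)
delta1 = delta * w11 * z1 * (1 - z1)      # hidden-layer error for z1, eq. (12)
delta2 = delta * w21 * z2 * (1 - z2)      # hidden-layer error for z2, eq. (13)

# Step 3: update the weights (the bias input is 1)
w11 += alpha * delta * z1
w21 += alpha * delta * z2
w01 += alpha * delta * 1
v11 += alpha * delta1 * x1
v21 += alpha * delta1 * x2
v01 += alpha * delta1 * 1
v12 += alpha * delta2 * x1
v22 += alpha * delta2 * x2
v02 += alpha * delta2 * 1

print(round(y, 4))                                      # about 0.5227
print(round(w11, 4), round(w21, 4), round(w01, 4))      # updated output-layer weights

Running the sketch reproduces the intermediate values computed by hand above (z1 ≈ 0.5498, z2 ≈ 0.7109, y ≈ 0.5227, δ ≈ 0.1191) and the updated weights.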

List of Deep Learning Layers

Deep learning (DL) is characterized by the use of neural networks with multiple
layers to model and solve complex problems. Each layer in the neural network plays
a unique role in the process of converting input data into meaningful and
insightful outputs. The article explores the layers that are used to construct a
neural network.

Table of Content
Role of Deep Learning Layers
MATLAB Input Layer
MATLAB Fully Connected Layers
MATLAB Convolution Layers
MATLAB Recurrent Layers
MATLAB Activation Layers
MATLAB Pooling and Unpooling Layers
MATLAB Normalization Layer and Dropout Layer
MATLAB Output Layers

Role of Deep Learning Layers
A layer in a deep learning model serves as a
fundamental building block in the model’s architecture. The structure of the
network is responsible for processing and transforming input data. The flow of
information through these layers is sequential, with each layer taking input from
the preceding layers and passing its transformed output to the subsequent layers.
This cascading process continues through the network until the final layer produces
the model’s ultimate output.
The input to a layer consists of features or representations derived from the data
processed by earlier layers. Each layer performs a specific computation or set of
operations on this input, introducing non-linearity and abstraction to the
information. The transformed output, often referred to as activations or feature
maps, encapsulates higher-level representations that capture complex patterns and
relationships within the data.
The nature and function of each layer vary based on its type within the neural network architecture. For instance:
Dense (Fully Connected) Layer: Neurons in this layer are connected to every neuron in the previous layer, creating a dense network of connections. This layer is effective in capturing global patterns in the data.
Convolutional Layer: Specialized for grid-like data, such as images, this layer employs convolution operations to detect spatial patterns and features.
Recurrent Layer: Suited for sequential data, recurrent layers utilize feedback loops to consider context from previous time steps, making them suitable for tasks like natural language processing.
Pooling Layer: Reduces spatial dimensions and focuses on retaining essential information, aiding in downsampling and feature selection.

MATLAB Input Layer
inputLayer: Input layer receives and processes data in a specialized format, serving as the initial stage for information entry into a neural network.
sequenceInputLayer: Sequence input layer receives sequential data for a neural network and incorporates normalization of the data during the input process.
featureInputLayer: Feature input layer processes feature data for a neural network and integrates data normalization. This layer is suitable when dealing with a dataset consisting of numerical scalar values that represent features, without spatial or temporal dimensions.
imageInputLayer: Image input layer processes 2-dimensional images in a neural network and applies data normalization during the input stage.
image3dInputLayer: 3-D image input layer receives 3-D images for a neural network.

MATLAB Fully Connected Layers
fullyConnectedLayer: Fully connected layer performs matrix multiplication with a weight matrix and subsequently adds a bias vector.

MATLAB Convolution Layers
convolution1dLayer: One-dimensional convolutional layer employs sliding convolutional filters on 1-D input data.
convolution2dLayer: Two-dimensional convolutional layer employs sliding convolutional filters on 2-D input data.
convolution3dLayer: Three-dimensional convolutional layer employs sliding convolutional filters on 3-D input data.
transposedConv2dLayer: Transposed two-dimensional convolutional layer increases the resolution of two-dimensional feature maps through upsampling.
transposedConv3dLayer: Transposed three-dimensional convolutional layer increases the resolution of three-dimensional feature maps through upsampling.

MATLAB Recurrent Layers
lstmLayer: LSTM layer is a type of recurrent neural network (RNN) layer specifically designed to capture and learn long-term dependencies among different time steps in time-series and sequential data.
lstmProjectedLayer: LSTM projected layer, within the realm of recurrent neural networks (RNNs), is adept at understanding and incorporating long-term dependencies among various time steps within time-series and sequential data. This is achieved through the utilization of learnable weights designed for projection.
bilstmLayer: Bidirectional LSTM (BiLSTM) layer, belonging to the family of recurrent neural networks (RNNs), is proficient in capturing long-term dependencies in both forward and backward directions among different time steps within time-series or sequential data. This bidirectional learning is valuable when the RNN needs to gather insights from the entire time series at each individual time step.
gruLayer: Gated Recurrent Unit (GRU) layer serves as a type of recurrent neural network (RNN) layer designed to capture dependencies among different time steps within time-series and sequential data.
gruProjectedLayer: A GRU projected layer, within the context of recurrent neural networks (RNNs), is specialized in understanding and incorporating dependencies among various time steps within time-series and sequential data. This is accomplished through the utilization of learnable weights designed for projection.

MATLAB Activation Layers
reluLayer: ReLU layer conducts a threshold operation on each element of the input, setting any value that is less than zero to zero.
leakyReluLayer: Leaky ReLU layer applies a threshold operation, where any input value that is less than zero is multiplied by a constant scalar.
clippedReluLayer: Clipped ReLU layer executes a threshold operation, setting any input value below zero to zero and capping any value surpassing the defined ceiling to that specific ceiling value.
eluLayer: Exponential Linear Unit (ELU) activation layer executes the identity operation for positive inputs and applies an exponential nonlinearity for negative inputs.
geluLayer: Gaussian Error Linear Unit (GELU) layer adjusts the input by considering its probability within a Gaussian distribution.
tanhLayer: Hyperbolic tangent (tanh) activation layer utilizes the tanh function to transform the inputs of the layer.
swishLayer: Swish activation layer employs the swish function to process the inputs of the layer.

MATLAB Pooling and Unpooling Layers
averagePooling1dLayer: One-dimensional average pooling layer accomplishes downsampling by segmenting the input into 1-D pooling regions and subsequently calculating the average within each region.
averagePooling2dLayer: Two-dimensional average pooling layer conducts downsampling by partitioning the input into rectangular pooling regions and subsequently determining the average value within each region.
averagePooling3dLayer: Three-dimensional average pooling layer achieves downsampling by partitioning the three-dimensional input into cuboidal pooling regions and then calculating the average values within each of these regions.
globalAveragePooling1dLayer: 1-D global average pooling layer achieves downsampling by generating the average output across the time or spatial dimensions of the input.
globalAveragePooling2dLayer: 2-D global average pooling layer accomplishes downsampling by determining the mean value across the height and width dimensions of the input.
globalAveragePooling3dLayer: 3-D global average pooling layer achieves downsampling by calculating the mean across the height, width, and depth dimensions of the input.
maxPooling1dLayer: 1-D max pooling layer achieves downsampling by dividing the input into 1-D pooling regions and producing the maximum value of each region.
maxUnpooling2dLayer: 2-D max unpooling layer reverses the pooling operation on the output of a 2-D max pooling layer.

MATLAB Normalization Layer and Dropout Layer
batchNormalizationLayer: Batch normalization layer normalizes a mini-batch of data independently across all observations for each channel. To enhance the training speed of a convolutional neural network and mitigate sensitivity to network initialization, incorporate batch normalization layers between convolutional layers and non-linearities, such as ReLU layers.
groupNormalizationLayer: Group normalization layer normalizes a mini-batch of data independently across distinct subsets of channels for each observation. To expedite the training of a convolutional neural network and minimize sensitivity to network initialization, integrate group normalization layers between convolutional layers and non-linearities, such as ReLU layers.
layerNormalizationLayer: Layer normalization layer normalizes a mini-batch of data independently across all channels for each observation. To accelerate the training of recurrent and multilayer perceptron neural networks and diminish sensitivity to network initialization, incorporate layer normalization layers after the learnable layers, such as LSTM and fully connected layers.
dropoutLayer: Dropout layer randomly zeros out input elements based on a specified probability.

MATLAB Output Layers
softmaxLayer: Softmax layer employs the softmax function on the input.
sigmoidLayer: Sigmoid layer utilizes a sigmoid function on the input, ensuring that the output is constrained within the range (0,1).
classificationLayer: Classification layer calculates the cross-entropy loss for tasks involving classification and weighted classification, specifically for scenarios with mutually exclusive classes.
regressionLayer: Regression layer calculates the loss using the half-mean-squared-error for tasks related to regression.

Activation Functions

To put in simple
terms, an artificial neuron calculates the ‘weighted sum’ of its inputs and adds a
bias, as shown in the figure below by the net input.

Mathematically,

net input = (w1*x1 + w2*x2 + … + wn*xn) + bias

Now the value of the net input can be anything from -inf to +inf. The neuron doesn’t really know how to bound the value and thus is not able to decide the firing pattern. Thus the activation function is an important part of an artificial neural network: it basically decides whether a neuron should be activated or not, and it bounds the value of the net input.
The activation function is a non-linear transformation that we do over the input
before sending it to the next layer of neurons or finalizing it as output.

Types of Activation Functions –


Several different types of activation functions are used in Deep Learning. Some of
them are explained below:

Step Function:
Step Function is one of the simplest kinds of activation functions. In this, we consider a threshold value, and if the value of the net input, say y, is greater than the threshold then the neuron is activated.

Mathematically,

f(y) = 1 if y ≥ threshold, and f(y) = 0 otherwise

Given below is the graphical representation of the step function.

Sigmoid Function:
Sigmoid function is a widely used activation function. It is defined as:

f(x) = 1 / (1 + e^(-x))

Graphically,
This is a smooth function and is continuously differentiable. The biggest advantage
that it has over step and linear function is that it is non-linear. This is an
incredibly cool feature of the sigmoid function. This essentially means that when I
have multiple neurons having sigmoid function as their activation function – the
output is non linear as well. The function ranges from 0-1 having an S shape.

ReLU:
The ReLU function is the Rectified Linear Unit. It is the most widely used activation function. It is defined as:

f(x) = max(0, x)

Graphically,

The main advantage of using the ReLU function over other activation functions is
that it does not activate all the neurons at the same time. What does this mean ?
If you look at the ReLU function if the input is negative it will convert it to
zero and the neuron does not get activated.

Leaky ReLU:
Leaky ReLU function is nothing but an improved version of the ReLU function. Instead of defining the ReLU function as 0 for x less than 0, we define it as a small linear component of x. It can be defined as:

f(x) = x for x ≥ 0, and f(x) = a*x for x < 0, where a is a small constant (for example 0.01)

Graphically,
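As a quick numerical illustration of the four activation functions described above, here is a small NumPy sketch (added for illustration; the sample input values are arbitrary):

Python3

import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])    # arbitrary sample inputs

step    = (x >= 0).astype(float)             # step function with threshold 0
sigmoid = 1.0 / (1.0 + np.exp(-x))           # smooth, outputs in (0, 1)
relu    = np.maximum(0.0, x)                 # zero for negative inputs
leaky   = np.where(x > 0, x, 0.01 * x)       # small linear slope for negative inputs

print(step)     # [0. 0. 1. 1. 1.]
print(sigmoid)  # values strictly between 0 and 1
print(relu)     # [0.  0.  0.  0.5 2. ]
print(leaky)    # [-0.02  -0.005  0.  0.5  2.]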

Types Of Activation Function in ANN

The biological neural network has been modeled in the form of Artificial Neural
Networks with artificial neurons simulating the function of a biological neuron.
The artificial neuron is depicted in the below picture:
Structure of an Artificial Neuron
Each neuron consists of three major components:
1. A set of ‘i’ synapses having weights wi. A signal xi forms the input to the i-th synapse having weight wi. The value of any weight may be positive or negative. A positive weight has an excitatory effect, while a negative weight has an inhibitory effect on the output of the summation junction.
2. A summation junction for the input signals, each weighted by its respective synaptic weight. Because it is a linear combiner or adder of the weighted input signals, the output of the summation junction can be expressed as: y_in = Σ (wi * xi).
3. A threshold activation function (or simply the activation function, also known as a squashing function), which produces an output signal only when the input signal exceeds a specific threshold value. It is similar in behaviour to the biological neuron, which transmits a signal only when the total input signal meets the firing threshold.
Types of Activation Function :
There are different types of activation functions. The most commonly used
activation function are listed below:
A. Identity Function: The identity function is used as an activation function for the input layer. It is a linear function having the form

f(x) = x for all x

Obviously, the output remains the same as the input.


B. Threshold/Step Function: It is a commonly used activation function. As depicted in the diagram, it gives 1 as output if the input is either 0 or positive. If the input is negative, it gives 0 as output. Expressing it mathematically,

f(x) = 1 if x ≥ 0, and f(x) = 0 if x < 0

The threshold function is almost like the step function, with the only difference being that a threshold value θ is used instead of 0. Expressing it mathematically,

f(x) = 1 if x ≥ θ, and f(x) = 0 if x < θ

C. ReLU (Rectified Linear Unit) Function: It is the most popularly used activation
function in the areas of convolutional neural networks and deep learning. It is of
the form:

f(x) = max(0, x)
This means that f(x) is zero when x is less than zero and f(x) is equal to x when x
is above or equal to zero. This function is differentiable, except at a single
point x = 0. In that sense, the derivative of a ReLU is actually a sub-derivative.
D. Sigmoid Function: It is by far the most commonly used activation function in
neural networks. The need for sigmoid function stems from the fact that many
learning algorithms require the activation function to be differentiable and hence
continuous. There are two types of sigmoid function:
1. Binary Sigmoid Function

A binary sigmoid function is of the form:

f(x) = 1 / (1 + e^(-kx)), where k is the steepness or slope parameter.

By varying the value of k, sigmoid functions with different slopes can be obtained. It has a range of (0,1). The slope at the origin is k/4. As the value of k becomes very large, the sigmoid function becomes a threshold function.
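To see the effect of the slope parameter k, the following short sketch (an illustrative addition) evaluates the binary sigmoid f(x) = 1 / (1 + e^(-kx)) near the origin for a few values of k; as k grows, the outputs move towards 0 or 1, i.e. towards a threshold function:

Python3

import numpy as np

def binary_sigmoid(x, k=1.0):
    # f(x) = 1 / (1 + e^(-k*x)); the slope at the origin is k/4
    return 1.0 / (1.0 + np.exp(-k * x))

x = np.array([-1.0, -0.1, 0.1, 1.0])
for k in (1, 5, 50):
    print(k, np.round(binary_sigmoid(x, k), 3))
# For k = 50 the outputs are already very close to 0 or 1, behaving like a threshold function.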
2. Bipolar Sigmoid Function

A bipolar sigmoid function is of the form

f(x) = (1 - e^(-kx)) / (1 + e^(-kx))
The range of values of sigmoid functions can be varied depending on the
application. However, the range of (-1,+1) is most commonly adopted.
E. Hyperbolic Tangent Function: It is bipolar in nature. It is a widely adopted
activation function for a special type of neural network known as Backpropagation
Network. The hyperbolic tangent function is of the form

f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
This function is similar to the bipolar sigmoid function.

Activation Functions in Pytorch

In this article, we will Understand PyTorch Activation Functions.


What is an activation function and why do we use them?
Activation functions are the building blocks of Pytorch. Before coming to types of
activation function, let us first understand the working of neurons in the human
brain. In the Artificial Neural Networks, we have an input layer which is the input
by the user in some format, a hidden layer that performs the hidden calculations
and identifies features and output is the result. So the whole structure is like a
network with neurons connected to one another. So we have artificial neurons which
are activated by these activation functions. The activation function is a function
that performs calculations to provide an output that may act as input for the next
neurons. An ideal activation function should handle non-linear relationships by
using the linear concepts and it should be differentiable so as to reduce the
errors and adjust the weights accordingly. All activation functions are present in
the torch.nn library.
Types of Pytorch Activation Function
Let us look at the different Pytorch Activation functions:
ReLU Activation FunctionLeaky ReLU Activation FunctionSigmoid Activation
FunctionTanh Activation FunctionSoftmax Activation FunctionReLU Activation
Function:
ReLU stands for Rectified Linear Activation function. It is a non-linear function
and, graphically ReLU has the following transformative behavior:

ReLU is a popular activation function since it is differentiable and nonlinear. However, if the inputs are negative its derivative becomes zero, which causes the ‘dying’ of neurons and learning doesn’t take place. Let us illustrate the use of ReLU with the help of a Python program.

Python3

import torch
import torch.nn as nn

# defining relu
r = nn.ReLU()

# Creating a Tensor with an array
input = torch.Tensor([1, -2, 3, -5])

# Passing the array to the relu function
output = r(input)
print(output)

Output:
tensor([1., 0., 3., 0.])

Leaky ReLU Activation Function:
Leaky ReLU Activation Function or LReLU is another type of activation function
which is similar to ReLU but solves the problem of ‘dying’ neurons and, graphically
Leaky ReLU has the following transformative behavior:

This function is very useful as when the input is negative the differentiation of
the function is not zero. Hence the learning of neurons doesn’t stop. Let us
illustrate the use of LReLU with the help of the Python program.

Python3

import torch
import torch.nn as nn

# defining Lrelu; the parameter 0.2 is passed to control the negative slope (a = 0.2)
r = nn.LeakyReLU(0.2)

# Creating a Tensor with an array
input = torch.Tensor([1, -2, 3, -5])
output = r(input)
print(output)

Output:
tensor([ 1.0000, -0.4000,  3.0000, -1.0000])

Sigmoid Activation Function:
Sigmoid Function is a non-linear and differentiable activation function. It is an
S-shaped curve that does not pass through the origin. It produces an output that
lies between 0 and 1. The output values are often treated as a probability. It is
often used for binary classification. It is slow in computation and, graphically
Sigmoid has the following transformative behavior:

The sigmoid activation function has the problem of a “vanishing gradient”. The vanishing gradient is a significant problem: as a large number of inputs are fed to the neural network and the number of hidden layers increases, the gradient or derivative becomes close to zero, thus leading to inaccuracy in the neural network.
Let us illustrate the use of the Sigmoid function with the help of a Python
Program.

Python3

import torch
import torch.nn as nn

# Calling the sigmoid function
sig = nn.Sigmoid()

# Defining tensor
input = torch.Tensor([1, -2, 3, -5])

# Applying sigmoid to the tensor
output = sig(input)
print(output)

Output:
tensor([0.7311, 0.1192, 0.9526, 0.0067])

Tanh Activation Function:
Tanh function is a non-linear and differentiable function similar to the sigmoid
function but output values range from -1 to +1. It is an S-shaped curve that passes
through the origin and, graphically Tanh has the following transformative behavior:

The problem with the Tanh Activation function is it is slow and the vanishing
gradient problem persists. Let us illustrate the use of the Tanh function with the
help of a Python Program.
Python3

import torch
import torch.nn as nn

# Calling the Tanh function
t = nn.Tanh()

# Defining tensor
input = torch.Tensor([1, -2, 3, -5])

# Applying Tanh to the tensor
output = t(input)
print(output)

Output:
tensor([ 0.7616, -0.9640,  0.9951, -0.9999])

Softmax Activation Function:
The softmax function is different from other activation functions as it is placed
at the last to normalize the output. We can use other activation functions in
combination with Softmax to produce the output in probabilistic form. It is used in
multiclass classification and generates an output of probabilities whose sum is 1.
The range of output lies between 0 and 1. Softmax has the following transformative
behavior:

Let us illustrate with the help of the Python Program:

Python3
import torch
import torch.nn as nn

# Calling the Softmax function with dimension = 0, as dimension starts from 0
sm = nn.Softmax(dim=0)

# Defining tensor
input = torch.Tensor([1, -2, 3, -5])

# Applying the function to the tensor
output = sm(input)
print(output)

Output:
tensor([1.1846e-01, 5.8980e-03, 8.7534e-01, 2.9369e-04])

Understanding Activation Functions in Depth

What is an Activation function ?


In artificial neural networks, the activation function of a node defines the output
of that node or neuron for a given input or set of inputs. This output is then used
as input for the next node and so on until a desired solution to the original
problem is found.

It maps the resulting values into the desired range such as between 0 to 1 or -1
to 1 etc. It depends upon the choice of the activation function. For example, the
use of the logistic activation function would map all inputs in the real number
domain into the range of 0 to 1.

Example of a binary classification problem:


In a binary classification problem, we have an input x, say an image, and we have
to classify it as having a correct object or not. If it is a correct object, we
will assign it a 1, else 0. So here, we have only two outputs – either the image
contains a valid object or it does not. This is an example of a binary
classification problem.
When we multiply each of the features with a weight (w1, w2, …, wm) and sum them all together,

node output = activation(weighted sum of inputs) = activation(w1*x1 + w2*x2 + … + wm*xm)     (1)

Some Important terminology and mathematical concept –

Forward propagation is the procedure of passing the inputs through the network, layer by layer, to compute the output.

Hidden Layers are neuron nodes stacked in between inputs and outputs, allowing neural networks to learn more complicated features (such as XOR logic).

Backpropagation is a procedure to repeatedly adjust the weights so as to minimize the difference between actual output and desired output.

It allows the information to go back from the cost backward through the network in
order to compute the gradient. Therefore, loop over the nodes starting from the
final node in reverse topological order to compute the derivative of the final node
output. Doing so will help us know who is responsible for the most error and change
the parameters appropriate in that direction.

Gradient Descent is used while training a machine learning model. It is an


optimization algorithm, based on a convex function, that tweaks its parameters
iteratively to minimize a given function to its local minimum. A gradient measures
how much the output of a function changes if you change the inputs a little bit.

Note: If gradient descent is working properly, the cost function should decrease
after every iteration.

Types of activation Functions:

The Activation Functions are basically two types:

1. Linear Activation Function –


Equation : f(x) = x
Range : (-infinity to infinity)

2. Non-linear Activation Functions –


It makes it easy for the model to generalize to a variety of data and to differentiate between the outputs. It has been found, by simulation, that ReLUs give much faster training for larger networks. Non-linear means that the output cannot be reproduced from a linear combination of the inputs.

The main terminologies needed to understand for nonlinear functions are:

1. Derivative: Change in y-axis w.r.t. change in x-axis. It is also known as slope.


2. Monotonic function: A function which is either entirely non-increasing or non-
decreasing.
The Nonlinear Activation Functions are mainly divided on the basis of their range
or curves as follows:

Let’s take a deeper insight in each Activations Functions-


1. Sigmoid:

It is also called the Binary classifier or Logistic Activation function because it is typically used to pick between the values 0 (False) and 1 (True).

The sigmoid function produces results similar to the step function in that the output is between 0 and 1. The curve crosses 0.5 at z = 0, so we can set up rules for the activation function, such as: if the sigmoid neuron’s output is larger than or equal to 0.5, it outputs 1; if the output is smaller than 0.5, it outputs 0.

The sigmoid function does not have a jerk on its curve. It is smooth and it has a
very nice and simple derivative, which is differentiable everywhere on the curve.

Derivation of Sigmoid:

Sigmoids saturate and kill gradients. A very common property of the sigmoid is that
when the neuron’s activation saturates at either 0 or 1, the gradient at these
regions is almost zero. Recall that during backpropagation, this local gradient
will be multiplied by the gradient of this gate’s output for the whole objective.
Therefore, if the local gradient is very small, it will effectively “kill” the
gradient and almost no signal will flow through the neuron to its weights and
recursively to its data. Additionally, extra care must be taken when initializing the weights of sigmoid neurons to prevent saturation. For example, if the initial weights are too large then most neurons will become saturated and the network will barely learn.
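The saturation effect described above is easy to check numerically; the following short sketch (added for illustration) computes the sigmoid's local gradient f'(x) = f(x)(1 - f(x)) at a few points:

Python3

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in (0.0, 2.0, 5.0, 10.0):
    s = sigmoid(x)
    grad = s * (1.0 - s)     # local gradient used during backpropagation
    print(x, round(grad, 6))
# The gradient is 0.25 at x = 0 but shrinks towards zero as the neuron saturates,
# e.g. about 0.000045 at x = 10.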

2. ReLU (Rectified Linear Unit):

It is the most widely used activation function, since it is used in almost all convolutional neural networks. ReLU is half rectified from the bottom. The function and its derivative are both monotonic.

f(x) = max(0, x)

The models that are close to linear are easy to optimize. Since ReLU shares a lot
of the properties of linear functions, it tends to work well on most of the
problems. The only issue is that the derivative is not defined at z = 0, which we
can overcome by assigning the derivative to 0 at z = 0. However, this means that
for z <= 0 the gradient is zero and again can’t learn.

3. Leaky ReLU:

Leaky ReLU is an improved version of the ReLU function. In the ReLU function, the gradient is 0 for x < 0, which makes the neurons die for activations in that region. Leaky ReLU is defined to address this problem: instead of defining the ReLU function as 0 for x less than 0, we define it as a small linear component of x.

Leaky ReLUs are one attempt to fix the Dying ReLU problem. Instead of the function
being zero when x < 0, a leaky ReLU will instead have a small negative slope (of
0.01, or so). That is, the function computes:

f(x) = x if x > 0, and f(x) = 0.01*x otherwise     (2)

4. Tanh or hyperbolic tangent:

It squashes a real-valued number to the range [-1, 1]. Like the sigmoid, its activations saturate, but unlike the sigmoid neuron, its output is zero-centred. Therefore the tanh non-linearity is usually preferred to the sigmoid nonlinearity. A tanh neuron is simply a scaled sigmoid neuron.
Tanh is also like logistic sigmoid but better. The advantage is that the negative
inputs will be mapped to strongly negative and the zero inputs will be mapped to
near zero in the tanh graph.

The function is differentiable and monotonic, but its derivative is not monotonic. Both tanh and logistic sigmoid activation functions are used in feed-forward nets.
It is actually just a scaled version of the sigmoid function: tanh(x) = 2*sigmoid(2x) - 1.

5. Softmax :

The sigmoid function can be applied easily, and ReLUs do not suffer from the vanishing-gradient effect during the training process. However, when you want to deal with multi-class classification problems, they cannot help much: the sigmoid function can only handle two classes, and we want something more. The softmax function squashes the output of each unit to be between 0 and 1, just like a sigmoid function, and it also divides each output such that the total sum of the outputs is equal to 1.

The output of the softmax function is equivalent to a categorical probability


distribution, it tells you the probability that any of the classes are true.

where z is a vector of the inputs to the output layer (if you have 10 output units, then there are 10 elements in z). And again, j indexes the output units, so j = 1, 2, …, K.

Properties of Softmax Function –


1. The calculated probabilities will be in the range of 0 to 1.
2. The sum of all the probabilities is equal to 1.

Softmax Function Usage –

1. Used in multi-class classification logistic regression models.
2. In building neural networks, softmax functions are used at different layer levels and in multilayer perceptrons.

Example:
The softmax function turns the logits [1.2, 0.9, 0.4] into the probabilities [0.46, 0.34, 0.20], and the probabilities sum to 1.
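The example can be reproduced with a few lines of NumPy (an illustrative sketch added here; the max-subtraction is a standard numerical-stability trick and does not change the result):

Python3

import numpy as np

def softmax(z):
    # subtract the maximum for numerical stability; the probabilities are unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([1.2, 0.9, 0.4])
probs = softmax(logits)
print(np.round(probs, 3))        # [0.457 0.338 0.205], i.e. approximately [0.46, 0.34, 0.20]
print(round(probs.sum(), 6))     # 1.0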
Artificial Neural Networks and its Applications


As you read this article, which organ in your body is thinking about it? It’s the
brain of course! But do you know how the brain works? Well, it has neurons or nerve
cells that are the primary units of both the brain and the nervous system. These
neurons receive sensory input from the outside world which they process and then
provide the output which might act as the input to the next neuron.
Each of these neurons is connected to other neurons in complex arrangements at
synapses. Now, are you wondering how this is related to Artificial Neural Networks?
Well, Artificial Neural Networks are modeled after the neurons in the human brain.
Let’s check out what they are in detail and how they learn information.
Artificial Neural Networks
Artificial Neural Networks contain artificial neurons which are called units. These
units are arranged in a series of layers that together constitute the whole
Artificial Neural Network in a system. A layer can have only a dozen units or
millions of units as this depends on how the complex neural networks will be
required to learn the hidden patterns in the dataset. Commonly, Artificial Neural
Network has an input layer, an output layer as well as hidden layers. The input
layer receives data from the outside world which the neural network needs to
analyze or learn about. Then this data passes through one or multiple hidden layers
that transform the input into data that is valuable for the output layer. Finally,
the output layer provides an output in the form of a response of the Artificial
Neural Networks to input data provided. In the majority of neural networks, units
are interconnected from one layer to another. Each of these connections has weights
that determine the influence of one unit on another unit. As the data transfers
from one unit to another, the neural network learns more and more about the data
which eventually results in an output from the output layer.
Neural Networks Architecture
The structures and operations of human neurons serve as the basis for artificial
neural networks. It is also known as neural networks or neural nets. The input
layer of an artificial neural network is the first layer, and it receives input
from external sources and releases it to the hidden layer, which is the second
layer. In the hidden layer, each neuron receives input from the previous layer
neurons, computes the weighted sum, and sends it to the neurons in the next layer.
These connections are weighted means effects of the inputs from the previous layer
are optimized more or less by assigning different-different weights to each input
and it is adjusted during the training process by optimizing these weights for
improved model performance.
Artificial neurons vs Biological neurons
The concept of artificial neural networks comes from biological neurons found in
animal brains So they share a lot of similarities in structure and function wise.
Structure: The structure of artificial neural networks is inspired by biological
neurons. A biological neuron has a cell body or soma to process the impulses,
dendrites to receive them, and an axon that transfers them to other neurons. The
input nodes of artificial neural networks receive input signals, the hidden layer
nodes compute these input signals, and the output layer nodes compute the final
output by processing the hidden layer’s results using activation
functions.
Biological Neuron → Artificial Neuron
Dendrite → Inputs
Cell nucleus or Soma → Nodes
Synapses → Weights
Axon → Output
Synapses: Synapses are the links between biological neurons that enable the transmission of impulses from the dendrites to the cell body. In artificial neurons, synapses are the weights that join the nodes of one layer to the nodes of the next layer. The strength of the links is determined by the weight value.
Learning: In biological neurons, learning happens in the cell body nucleus or soma, which has a nucleus that helps to process the impulses. An action potential is produced and travels through the axons if the impulses are powerful enough to reach the threshold. This becomes possible by synaptic plasticity, which represents the ability of synapses to become stronger or weaker over time in reaction to changes in their activity. In artificial neural networks, backpropagation is a technique used for learning, which adjusts the weights between nodes according to the error or difference between predicted and actual outcomes.
Biological Neuron → Artificial Neuron
Synaptic plasticity → Backpropagation
Activation: In biological neurons, activation is the firing rate of the neuron, which happens when the impulses are strong enough to reach the threshold. In artificial neural networks, a mathematical function known as an activation function maps the input to the output and performs the activation.
Biological neurons to Artificial neurons
How do Artificial Neural Networks learn?
Artificial neural networks are trained using a training set. For example, suppose
you want to teach an ANN to recognize a cat. Then it is shown thousands of
different images of cats so that the network can learn to identify a cat. Once the
neural network has been trained enough using images of cats, then you need to check
if it can identify cat images correctly. This is done by making the ANN classify
the images it is provided by deciding whether they are cat images or not. The
output obtained by the ANN is corroborated by a human-provided description of
whether the image is a cat image or not. If the ANN identifies incorrectly then
back-propagation is used to adjust whatever it has learned during training.
Backpropagation is done by fine-tuning the weights of the connections in ANN units
based on the error rate obtained. This process continues until the artificial
neural network can correctly recognize a cat in an image with minimal possible
error rates.
What are the types of Artificial Neural Networks?
Feedforward Neural Network: The feedforward neural network is one of the most basic artificial neural networks. In this ANN, the data or the input provided travels in a single direction. It enters the ANN through the input layer and exits through the output layer, while hidden layers may or may not exist. So the feedforward neural network has a front-propagated wave only and usually does not have backpropagation.
Convolutional Neural Network: A convolutional neural network has some similarities to the feed-forward neural network, where the connections between units have weights that determine the influence of one unit on another unit. But a CNN has one or more convolutional layers that use a convolution operation on the input and then pass the result obtained in the form of output to the next layer. CNNs have applications in speech and image processing, which is particularly useful in computer vision.
Modular Neural Network: A modular neural network contains a collection of different neural networks that work independently towards obtaining the output, with no interaction between them. Each of the different neural networks performs a different sub-task by obtaining unique inputs compared to other networks. The advantage of this modular neural network is that it breaks down a large and complex computational process into smaller components, thus decreasing its complexity while still obtaining the required output.
Radial basis function Neural Network: Radial basis functions are functions that consider the distance of a point with respect to the center. RBF networks have two layers. In the first layer, the input is mapped into all the radial basis functions in the hidden layer, and then the output layer computes the output in the next step. Radial basis function nets are normally used to model data that represents an underlying trend or function.
Recurrent Neural Network: The recurrent neural network saves the output of a layer and feeds this output back to the input to better predict the outcome of the layer. The first layer in the RNN is quite similar to the feed-forward neural network, and the recurrent behaviour starts once the output of the first layer is computed. After this layer, each unit will remember some information from the previous step so that it can act as a memory cell in performing computations.
Applications of Artificial Neural Networks
Social Media:
Artificial Neural Networks are used heavily in Social Media. For example, let’s
take the ‘People you may know’ feature on Facebook that suggests people that you
might know in real life so that you can send them friend requests. Well, this
magical effect is achieved by using Artificial Neural Networks that analyze your
profile, your interests, your current friends, and also their friends and various
other factors to calculate the people you might potentially know. Another common
application of Machine Learning in social media is facial recognition. This is done
by finding around 100 reference points on the person’s face and then matching them
with those already available in the database using convolutional neural
networks.
Marketing and Sales: When you log onto E-commerce sites like Amazon and
Flipkart, they will recommend your products to buy based on your previous browsing
history. Similarly, suppose you love Pasta, then Zomato, Swiggy, etc. will show you
restaurant recommendations based on your tastes and previous order history. This is
true across all new-age marketing segments like Book sites, Movie services,
Hospitality sites, etc. and it is done by implementing personalized marketing. This
uses Artificial Neural Networks to identify the customer likes, dislikes, previous
shopping history, etc., and then tailor the marketing campaigns
accordingly.
Healthcare: Artificial Neural Networks are used in Oncology to train
algorithms that can identify cancerous tissue at the microscopic level at the same
accuracy as trained physicians. Various rare diseases may manifest in physical
characteristics and can be identified in their premature stages by using Facial
Analysis on the patient photos. So the full-scale implementation of Artificial
Neural Networks in the healthcare environment can only enhance the diagnostic
abilities of medical experts and ultimately lead to the overall improvement in the
quality of medical care all over the world.
Personal Assistants: I am sure you all
have heard of Siri, Alexa, Cortana, etc., and also heard them based on the phones
you have!!! These are personal assistants and an example of speech recognition that
uses Natural Language Processing to interact with the users and formulate a
response accordingly. Natural Language Processing uses artificial neural networks
that are made to handle many tasks of these personal assistants such as managing
the language syntax, semantics, correct speech, the conversation that is going on,
etc.

Gradient Descent Optimization in Tensorflow

Gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of a function (f) that minimize a cost function. In other words, gradient descent is an iterative algorithm that helps to find the optimal solution to a given problem.
In this blog, we will discuss gradient descent optimization in TensorFlow, a
popular deep-learning framework. TensorFlow provides several optimizers that
implement different variations of gradient descent, such as stochastic gradient
descent and mini-batch gradient descent.
Before diving into the details of gradient descent in TensorFlow, let’s first
understand the basics of gradient descent and how it works.
What is Gradient Descent?
Gradient descent is an iterative optimization algorithm that is used to minimize a
function by iteratively moving in the direction of the steepest descent as defined
by the negative of the gradient. In other words, the gradient descent algorithm
takes small steps in the direction opposite to the gradient of the function at the
current point, with the goal of reaching a global minimum.
The gradient of a function tells us the direction in which the function is
increasing or decreasing the most. For example, if the gradient of a function is
positive at a certain point, it means that the function is increasing at that
point, and if the gradient is negative, it means that the function is decreasing at
that point.
The gradient descent algorithm starts with an initial guess for the parameters of
the function and then iteratively improves these guesses by taking small steps in
the direction opposite to the gradient of the function at the current point. This
process continues until the algorithm reaches a local or global minimum, where the
gradient is zero (i.e., the function is not increasing or decreasing).
How does Gradient Descent work?
The gradient descent algorithm is an iterative algorithm that updates the
parameters of a function by taking steps in the opposite direction of the gradient
of the function. The gradient of a function tells us the direction in which the
function is increasing or decreasing the most. The gradient descent algorithm uses
the gradient to update the parameters in the direction that reduces the value of
the cost function.
The gradient descent algorithm works in the following way:
Initialize the parameters of the function with some random values.Calculate the
gradient of the cost function with respect to the parameters.Update the parameters
by taking a small step in the opposite direction of the gradient.Repeat steps 2 and
3 until the algorithm reaches a local or global minimum, where the gradient is
zero.Here is a simple example to illustrate the gradient descent algorithm in
action. Let’s say we have a function f(x) = x², and we want to find the value of x
that minimizes the function. We can use the gradient descent algorithm to find this
value.
First, we initialize the value of x with some random value, say x = 3. Next, we
calculate the gradient of the function with respect to x, which is 2x. In this
case, the gradient is 6 (2 * 3). Since the gradient is positive, it means that the
function is increasing at x = 3, and we need to take a step in the opposite
direction to reduce the value of the function.
We update the value of x by subtracting a small step size (called the learning
rate) from the current value of x. For example, if the learning rate is 0.1, we can
update the value of x as follows:
x = x - 0.1 * gradient
  = 3 - 0.1 * 6
  = 2.4
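These updates take only a few lines of Python; the following minimal sketch (an illustrative addition, assuming f(x) = x², gradient 2x, a learning rate of 0.1 and a starting point of x = 3) runs a few iterations:

Python3

# Minimize f(x) = x**2 by gradient descent, starting from x = 3
x = 3.0
learning_rate = 0.1

for step in range(25):
    gradient = 2 * x                   # derivative of x**2 at the current point
    x = x - learning_rate * gradient   # move against the gradient
    print(step, round(x, 4))
# x shrinks towards 0, the minimum of f(x) = x**2 (2.4, 1.92, 1.536, ...)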
We repeat this process until the algorithm reaches a local or global minimum. To
implement gradient descent in TensorFlow, we first need to define the cost function
that we want to minimize. In this example, we will use a simple linear regression
model to illustrate how gradient descent works.
Linear regression is a popular machine learning algorithm that is used to model the
relationship between a dependent variable (y) and one or more independent variables
(x). In a linear regression model, we try to find the best-fit line that describes
the relationship between the dependent and independent variables. To create a
linear regression model in TensorFlow, we first need to define the placeholders for
the input and output data. A placeholder is a TensorFlow variable that we can use
to feed data into our model.
Here is the code to define the placeholders for the input and output data:

Python3
# Import TensorFlow 2 with TensorFlow 1 behaviour
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Define the placeholders for the input and output data
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)

Next, we need to define the variables that represent the parameters of our linear
regression model. In this example, we will use a single variable (w) to represent
the slope of the best-fit line. We initialize the value of w with a random value,
say 0.5.
Here is the code to define the variable for the model parameters:

Python3

# Define the model parameters
w = tf.Variable(0.5, name="weights")


Once we have defined the placeholders and the model parameters, we can define the
linear regression model by using the TensorFlow tf.add() and tf.multiply()
functions. The tf.add() function is used to add the bias term to the model, and the
tf.multiply() function is used to multiply the input data (x) and the model
parameters (w).
Here is the code to define the linear regression model:

Python3

# Define the linear regression model
model = tf.add(tf.multiply(x, w), 0.5)


Once we have defined the linear regression model, we need to define the cost
function that we want to minimize. In this example, we will use the mean squared
error (MSE) as the cost function. The MSE is a popular metric that is used to
evaluate the performance of a linear regression model. It measures the average
squared difference between the predicted values and the actual values.
To define the cost function, we first need to calculate the difference between the
predicted values and the actual values using the TensorFlow tf.square() function.
The tf.square() function squares each element in the input tensor and returns the
squared values.
Here is the code to define the cost function using the MSE:

Python3

# Define the cost function (MSE)
cost = tf.reduce_mean(tf.square(model - y))

Once we have defined the cost function, we can use the TensorFlow
tf.train.GradientDescentOptimizer() function to create an optimizer that uses the
gradient descent algorithm to minimize the cost function. The
tf.train.GradientDescentOptimizer() function takes the learning rate as an input
parameter. The learning rate is a hyperparameter that determines the size of the
steps that the algorithm takes to reach the minimum of the cost function.
Here is the code to create the gradient descent optimizer:
Python3

# Create the gradient descent optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)

Once we have defined the optimizer, we can use the minimize() method of the
optimizer to minimize the cost function. The minimize() method takes the cost
function as an input parameter and returns an operation that, when executed,
performs one step of gradient descent on the cost function.
Here is the code to minimize the cost function using the gradient descent
optimizer:

Python3
# Minimize the cost function
train = optimizer.minimize(cost)

Once we have defined the gradient descent optimizer and the train operation, we can
use the TensorFlow Session class to train our model. The Session class provides a
way to execute TensorFlow operations. To train the model, we need to initialize the
variables that we have defined earlier (i.e., the model parameters and the
optimizer) and then run the train operation in a loop for a specified number of
iterations.
Here is the code to train the linear regression model using the gradient descent
optimizer:

Python3

# Define the toy dataset
x_train = [1, 2, 3, 4]
y_train = [2, 4, 6, 8]

# Create a TensorFlow session
with tf.Session() as sess:
    # Initialize the variables
    sess.run(tf.global_variables_initializer())

    # Training loop
    for i in range(1000):
        sess.run(train, feed_dict={x: x_train, y: y_train})

    # Evaluate the model: the learned weight w minimizes the mean squared
    # error (MSE) between the predicted and true output values
    w_val = sess.run(w)
    print(w_val)

In the above code, we have defined a Session object and used the
global_variables_initializer() method to initialize the variables. Next, we have
run the train operation in a loop for 1000 iterations. In each iteration, we have
fed the input and output data to the train operation using the feed_dict parameter.
Finally, we evaluated the trained model by running the w variable to get the value
of the model parameters. This will train a linear regression model on the toy
dataset using gradient descent. The model will learn the weights w that minimizes
the mean squared error between the predicted and true output values.
Visualizing the convergence of Gradient Descent using Linear Regression
Linear regression is a method for modeling the linear relationship between a
dependent variable (also known as the response or output variable) and one or more
independent variables (also known as the predictor or input variables). The goal of
linear regression is to find the values of the model parameters (coefficients) that
minimize the difference between the predicted values and the true values of the
dependent variable.
The linear regression model can be expressed as follows:

ŷ = w1*x1 + w2*x2 + … + wn*xn + b

where:
ŷ is the predicted value of the dependent variable, x1, …, xn are the independent variables, w1, …, wn are the coefficients (model parameters) associated with the independent variables, and b is the intercept (a constant term).
To train the linear regression model, you need a dataset with input features
(independent variables) and labels (dependent variables). You can then use an
optimization algorithm, such as gradient descent, to find the values of the model
parameters that minimize the loss function.
The loss function measures the difference between the predicted values and the true
values of the dependent variable. There are various loss functions that can be used
for linear regression, such as mean squared error (MSE) and mean absolute error
(MAE). The MSE loss function is defined as follows:

MSE = (1/N) * Σ (ŷ_i − y_i)²

where:
ŷ_i is the predicted value for the i-th sample, y_i is the true value for the i-th sample, and N is the total number of samples. The MSE loss function measures the average squared
difference between the predicted values and the true values. A lower MSE value
indicates that the model is performing better.

Python3
import tensorflow as tf
import matplotlib.pyplot as plt

# Set up the data and model
X = tf.constant([[1.], [2.], [3.], [4.]])
y = tf.constant([[2.], [4.], [6.], [8.]])
w = tf.Variable(0.)
b = tf.Variable(0.)

# Define the model and loss function
def model(x):
    return w * x + b

def loss(predicted_y, true_y):
    return tf.reduce_mean(tf.square(predicted_y - true_y))

# Set the learning rate
learning_rate = 0.001

# Training loop
losses = []
for i in range(250):
    with tf.GradientTape() as tape:
        predicted_y = model(X)
        current_loss = loss(predicted_y, y)
    gradients = tape.gradient(current_loss, [w, b])
    w.assign_sub(learning_rate * gradients[0])
    b.assign_sub(learning_rate * gradients[1])
    losses.append(current_loss.numpy())

# Plot the loss
plt.plot(losses)
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.show()

Output:
Loss vs Iteration
The loss function calculates the mean squared error (MSE) loss between the predicted values and the true labels. The model function defines the linear regression model, which is a linear function of the form y = w * x + b.
The training loop performs 250 iterations of gradient descent. At each iteration,
the with tf.GradientTape() as tape: block activates the gradient tape, which
records the operations for computing the gradients of the loss with respect to the
model parameters.
Inside the block, the predicted values are calculated using the model function and
the current values of the model parameters. The loss is then calculated using the
loss function and the predicted values and true labels.
After the loss has been calculated, the gradients of the loss with respect to the
model parameters are computed using the gradient method of the gradient tape. The
model parameters are then updated by subtracting the learning rate multiplied by
the gradients from the current values of the parameters. This process is repeated
until the training loop is completed.
Finally, the model parameters will contain the optimized values that minimize the
loss function, and the model will be trained to predict the dependent variable
given the independent variables.
A list called losses stores the loss at each iteration. After the training loop is completed, the losses list contains the loss values from every iteration.
The plt.plot function plots the losses list as a function of the iteration number,
which is simply the index of the loss in the list. The plt.xlabel and plt.ylabel
functions add labels to the x-axis and y-axis of the plot, respectively. Finally,
the plt.show function displays the plot.
The resulting plot shows how the loss changes over the course of the training
process. As the model is trained, the loss should decrease, indicating that the
model is learning and the model parameters are being optimized to minimize the
loss. Eventually, the loss should converge to a minimum value, indicating that the
model has reached a good solution. The rate at which the loss decreases and the
final value of the loss will depend on various factors, such as the learning rate,
the initial values of the model parameters, and the complexity of the model.
Visualizing the Gradient Descent
Gradient descent is an optimization algorithm that is used to find the values of
the model parameters that minimize the loss function. The algorithm works by
starting with initial values for the parameters and then iteratively updating the
values to minimize the loss.
The equation for the gradient descent algorithm for linear regression can be written as follows:

w_i = w_i - alpha * dMSE/dw_i

where:
w_i is the i-th model parameter, alpha is the learning rate (a hyperparameter that determines the step size of the update), and dMSE/dw_i is the partial derivative of the MSE loss function with respect to the i-th model parameter.
This equation updates the value of each parameter in the direction that reduces the
loss. The learning rate determines the size of the update, with a smaller learning
rate resulting in smaller steps and a larger learning rate resulting in larger
steps.
The process of performing gradient descent can be visualized as taking small steps
downhill on a loss surface, with the goal of reaching the global minimum of the
loss function. The global minimum is the point on the loss surface where the loss
is the lowest.
Here is an example of how to plot the loss surface and the trajectory of the
gradient descent algorithm:

Python3
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(42)
X = 2 * np.random.rand(100, 1) - 1
y = 4 + 3 * X + np.random.randn(100, 1)

# Initialize model parameters
w = np.random.randn(2, 1)
b = np.random.randn(1)[0]

alpha = 0.1           # learning rate
num_iterations = 20   # number of gradient descent steps

# Create a mesh to plot the loss surface
w1, w2 = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))

# Compute the loss for each point on the grid
loss = np.zeros_like(w1)
for i in range(w1.shape[0]):
    for j in range(w1.shape[1]):
        loss[i, j] = np.mean((y - w1[i, j] * X - w2[i, j] * X**2)**2)

# Perform gradient descent
for i in range(num_iterations):
    # Gradient of the loss with respect to the model parameters
    grad_w1 = -2 * np.mean(X * (y - w[0] * X - w[1] * X**2))
    grad_w2 = -2 * np.mean(X**2 * (y - w[0] * X - w[1] * X**2))
    # Update the model parameters
    w[0] -= alpha * grad_w1
    w[1] -= alpha * grad_w2

# Plot the loss surface
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(projection='3d')
ax.plot_surface(w1, w2, loss, cmap='coolwarm')
ax.set_xlabel('w1')
ax.set_ylabel('w2')
ax.set_zlabel('Loss')

# Mark the point reached by gradient descent on the loss surface
ax.plot(w[0], w[1], np.mean((y - w[0] * X - w[1] * X**2)**2),
        'o', c='red', markersize=10)
plt.show()

Output:
Gradient Descent finding global minima
This code generates synthetic data for a quadratic regression problem, initializes the model parameters, and performs gradient descent to find the values of the model parameters that minimize the mean squared error loss. The code also plots the loss surface and marks the point reached by gradient descent on that surface.
The resulting plot shows how the gradient descent algorithm takes small steps downhill on the loss surface and eventually approaches the global minimum of the loss function, the point on the loss surface where the loss is the lowest.
It is important to choose an appropriate learning rate for the gradient descent
algorithm. If the learning rate is too small, the algorithm will take a long time
to converge to the global minimum. On the other hand, if the learning rate is too
large, the algorithm may overshoot the global minimum and may not converge to a
good solution.
Another important consideration is the initialization of the model parameters. If
the initialization is too far from the global minimum, the gradient descent
algorithm may take a long time to converge. It is often helpful to initialize the
model parameters to small random values.
It is also important to choose an appropriate stopping criterion for the gradient
descent algorithm. One common stopping criterion is to stop the algorithm when the
loss function stops improving or when the improvement is below a certain threshold.
Another option is to stop the algorithm after a fixed number of iterations.
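As a rough illustration of the first criterion, here is a minimal NumPy sketch (the toy data and tolerance value are assumptions for illustration, not part of the original article) of a gradient descent loop that stops once the improvement in the loss falls below a threshold:

Python3

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w, b = 0.0, 0.0
alpha = 0.01        # learning rate
tol = 1e-6          # assumed improvement threshold
prev_loss = np.inf

for i in range(10000):
    y_pred = w * X + b
    loss = np.mean((y - y_pred) ** 2)
    # Stop when the loss stops improving by more than the tolerance
    if prev_loss - loss < tol:
        break
    prev_loss = loss
    # Gradients of the MSE loss with respect to w and b
    grad_w = -2 * np.mean(X * (y - y_pred))
    grad_b = -2 * np.mean(y - y_pred)
    w -= alpha * grad_w
    b -= alpha * grad_b

print(i, w, b, loss)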
Overall, gradient descent is a powerful optimization algorithm that can be used to
find the values of the model parameters that minimize the loss function for a wide
range of machine learning problems.
Conclusion
In this blog, we have discussed gradient descent optimization in TensorFlow and how
to implement it to train a linear regression model. We have seen that TensorFlow
provides several optimizers that implement different variations of gradient
descent, such as stochastic gradient descent and mini-batch gradient descent.
Gradient descent is a powerful optimization algorithm that is widely used in
machine learning and deep learning to find the optimal solution to a given problem.
It is an iterative algorithm that updates the parameters of a function by taking
steps in the opposite direction of the gradient of the function. TensorFlow makes
it easy to implement gradient descent by providing built-in optimizers and
functions for computing gradients.

Choose Optimal Number of Epochs to Train a Neural Network in Keras

One of the critical issues while training a neural network on the sample data is
Overfitting. When the number of epochs used to train a neural network model is more
than necessary, the training model learns patterns that are specific to sample data
to a great extent. This makes the model incapable of performing well on a new dataset.
This model gives high accuracy on the training set (sample data) but fails to
achieve good accuracy on the test set. In other words, the model loses
generalization capacity by overfitting the training data. To mitigate overfitting
and to increase the generalization capacity of the neural network, the model should
be trained for an optimal number of epochs. A part of the training data is
dedicated to the validation of the model, to check the performance of the model
after each epoch of training. Loss and accuracy on the training set as well as on the validation set are monitored to identify the epoch after which the model starts overfitting.
keras.callbacks.callbacks.EarlyStopping()
Either loss or accuracy values can be monitored by the EarlyStopping callback function. If the loss is being monitored, training halts when an increase in the loss value is observed. If accuracy is being monitored, training halts when a decrease in the accuracy value is observed.
Syntax with default values:

keras.callbacks.callbacks.EarlyStopping(monitor='val_loss', min_delta=0,
patience=0, verbose=0, mode='auto', baseline=None, restore_best_weights=False)

Understanding a few important arguments:

monitor: The value to be monitored by the function. It can be validation loss or validation accuracy.
mode: The mode in which the change in the monitored quantity should be observed. This can be 'min', 'max', or 'auto'. When the monitored value is loss, use 'min'; when it is accuracy, use 'max'. When the mode is set to 'auto', the function automatically chooses a suitable mode.
min_delta: The minimum change in the monitored value that counts as an improvement, i.e., the change must be greater than the 'min_delta' value.
patience: The number of epochs for which training continues after the monitored value stops improving. The model waits this many epochs for any improvement before stopping.
verbose: An integer value of 0, 1, or 2 that selects how training progress is displayed. Verbose = 0: silent mode, nothing is displayed. Verbose = 1: a bar depicting the progress of training is displayed. Verbose = 2: one line per epoch, showing the progress of training, is displayed.
restore_best_weights: A boolean value. If True, the weights from the epoch with the best monitored value are restored.
Finding the optimal number of epochs to avoid overfitting on the MNIST dataset.
Step 1: Loading dataset and preprocessing

Python3

import keras
from keras.utils.np_utils import to_categorical
from keras.datasets import mnist

# Loading data
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Reshaping data - adding the number of channels as 1 (grayscale images)
train_images = train_images.reshape((train_images.shape[0],
                                     train_images.shape[1],
                                     train_images.shape[2], 1))
test_images = test_images.reshape((test_images.shape[0],
                                   test_images.shape[1],
                                   test_images.shape[2], 1))

# Scaling down pixel values
train_images = train_images.astype('float32') / 255
test_images = test_images.astype('float32') / 255

# Encoding labels to a binary class matrix
y_train = to_categorical(train_labels)
y_test = to_categorical(test_labels)

Step 2: Building a CNN model

Python3
from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D(2, 2))
model.add(layers.Conv2D(64, (3, 3), activation="relu"))
model.add(layers.MaxPooling2D(2, 2))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation="relu"))
model.add(layers.Dense(10, activation="softmax"))

model.summary()

Output: Summary of the model

Step 4: Compiling the model with RMSprop optimizer, categorical cross-entropy loss function and accuracy as success metric

model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=['accuracy'])

Step 5: Creating a validation set and training set by partitioning the current
training set

Python3
val_images = train_images[:10000]
partial_images = train_images[10000:]
val_labels = y_train[:10000]
partial_labels = y_train[10000:]

Step 6: Initializing early stopping callback and training the model.

Python3

from keras import callbacks

earlystopping = callbacks.EarlyStopping(monitor="val_loss",
                                        mode="min",
                                        patience=5,
                                        restore_best_weights=True)

history = model.fit(partial_images, partial_labels,
                    batch_size=128,
                    epochs=25,
                    validation_data=(val_images, val_labels),
                    callbacks=[earlystopping])

Training stopped at the 11th epoch. Because the patience in the callback is set to 5, the model trained for 5 more epochs after the last improvement, so the lowest validation loss was achieved at epoch 6. Epoch 6 is therefore the optimal number of epochs for this run, i.e., the model starts overfitting after the 6th epoch, and since restore_best_weights=True the weights from that epoch are restored.
Observing loss values without using the EarlyStopping callback function: train the model for the full 25 epochs and plot the training loss values and validation loss values against the number of epochs.
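A minimal sketch of how such a plot could be produced from the History object returned by model.fit above (the plotting details are an assumption and are not shown in the original article):

Python3

import matplotlib.pyplot as plt

# history is the object returned by model.fit above
train_loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(train_loss) + 1)

plt.plot(epochs, train_loss, label='Training loss')
plt.plot(epochs, val_loss, label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()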

The plot looks like this:


Inference: As the number of epochs increases beyond 11, training set loss decreases
and becomes nearly zero. Whereas, validation loss increases depicting the
overfitting of the model on training data.

Python | Classify Handwritten Digits with Tensorflow

Classifying handwritten digits is a basic problem of machine learning and can be solved in many ways; here we will implement it using TensorFlow.
Using a Linear Classifier Algorithm with tf.contrib.learn
A linear classifier achieves the classification of handwritten digits by making a choice based on the value of a linear combination of the features, also known as feature values, which are typically presented to the machine in a vector called a feature vector.
Modules required:
NumPy:
$ pip install numpy
Matplotlib:
$ pip install matplotlib
Tensorflow:
$ pip install tensorflow

Steps to follow
Step 1 : Importing all dependence

Python3
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

learn = tf.contrib.learn
tf.logging.set_verbosity(tf.logging.ERROR)

Step 2 : Importing Dataset using MNIST Data

Python3
mnist = learn.datasets.load_dataset('mnist')
data = mnist.train.images
labels = np.asarray(mnist.train.labels, dtype=np.int32)
test_data = mnist.test.images
test_labels = np.asarray(mnist.test.labels, dtype=np.int32)

After this step, the MNIST dataset will be downloaded. Output:


Extracting MNIST-data/train-images-idx3-ubyte.gz
Extracting MNIST-data/train-labels-idx1-ubyte.gz
Extracting MNIST-data/t10k-images-idx3-ubyte.gz
Extracting MNIST-data/t10k-labels-idx1-ubyte.gz
Step 3 : Making dataset

Python3

max_examples = 10000
data = data[:max_examples]
labels = labels[:max_examples]


Step 4 : Displaying dataset using MatplotLib

Python3

def display(i):
    img = test_data[i]
    plt.title('label : {}'.format(test_labels[i]))
    plt.imshow(img.reshape((28, 28)))  # image in TensorFlow is 28 by 28 px

display(0)

To display the data, we can call this function, for example display(0). Output:

Step 5 : Fitting data, using linear classifier

Python3
feature_columns = learn.infer_real_valued_columns_from_input(data)
classifier = learn.LinearClassifier(n_classes=10,
                                    feature_columns=feature_columns)
classifier.fit(data, labels, batch_size=100, steps=1000)

Step 6 : Evaluate accuracy

Python3

classifier.evaluate(test_data, test_labels)
print(classifier.evaluate(test_data, test_labels)["accuracy"])
Output :
0.9137
Step 7 : Predicting data

Python3

prediction = classifier.predict(np.array([test_data[0]], dtype=float),
                                as_iterable=False)
print("prediction : {}, label : {}".format(prediction, test_labels[0]))
Output :
prediction : [7], label : 7
Full code for classifying handwritten digits

Python3

# importing libraries
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

learn = tf.contrib.learn
tf.logging.set_verbosity(tf.logging.ERROR)

# importing dataset using MNIST
# mnist contains train and test datasets
mnist = learn.datasets.load_dataset('mnist')
data = mnist.train.images
labels = np.asarray(mnist.train.labels, dtype=np.int32)
test_data = mnist.test.images
test_labels = np.asarray(mnist.test.labels, dtype=np.int32)

max_examples = 10000
data = data[:max_examples]
labels = labels[:max_examples]

# displaying dataset using Matplotlib
def display(i):
    img = test_data[i]
    plt.title('label : {}'.format(test_labels[i]))
    plt.imshow(img.reshape((28, 28)))  # img in tf is 28 by 28 px

# fitting linear classifier
feature_columns = learn.infer_real_valued_columns_from_input(data)
classifier = learn.LinearClassifier(n_classes=10,
                                    feature_columns=feature_columns)
classifier.fit(data, labels, batch_size=100, steps=1000)

# Evaluate accuracy
classifier.evaluate(test_data, test_labels)
print(classifier.evaluate(test_data, test_labels)["accuracy"])

prediction = classifier.predict(np.array([test_data[0]], dtype=float),
                                as_iterable=False)
print("prediction : {}, label : {}".format(prediction, test_labels[0]))

if prediction == test_labels[0]:
    display(0)
Using Deep Learning with tf.keras
Deep learning is a subfield of machine learning and artificial intelligence based on deep neural networks, which are capable of learning, even without supervision, from data that is unorganized or unlabeled. Today, we will implement a neural network in TensorFlow to classify handwritten digits.
Modules required:
NumPy:
$ pip install numpy
Matplotlib:
$ pip install matplotlib
Tensorflow:
$ pip install tensorflow

Steps to follow
Step 1 : Importing all dependence

Python3

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

Step 2 : Import data and normalize it

Python3
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = tf.keras.utils.normalize(x_train, axis=1)
x_test = tf.keras.utils.normalize(x_test, axis=1)

Step 3 : view data

Python3

def draw(n):
    plt.imshow(n, cmap=plt.cm.binary)
    plt.show()

draw(x_train[0])
Step 4 : make a neural network and train it

Python3

# there are two types of models;
# Sequential is the most common
model = tf.keras.models.Sequential()

model.add(tf.keras.layers.Flatten(input_shape=(28, 28)))  # reshape
model.add(tf.keras.layers.Dense(128, activation=tf.nn.relu))
model.add(tf.keras.layers.Dense(128, activation=tf.nn.relu))
model.add(tf.keras.layers.Dense(10, activation=tf.nn.softmax))

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=3)

Step 5 : check model accuracy and loss


Python3

val_loss, val_acc = model.evaluate(x_test, y_test)
print("loss-> ", val_loss, "\nacc-> ", val_acc)

Step 6 : prediction using model

Python3
predictions = model.predict([x_test])
print('label -> ', y_test[2])
print('prediction -> ', np.argmax(predictions[2]))

draw(x_test[2])

saving and testing model


saving the model

Python3

# saving the model
# a .h5 or .model extension can be used
model.save('epic_num_reader.h5')

loading the saved model


Python3

new_model = tf.keras.models.load_model('epic_num_reader.h5')

prediction using new model

Python3
predictions = new_model.predict([x_test])

print('label -> ', y_test[2])
print('prediction -> ', np.argmax(predictions[2]))

draw(x_test[2])

Train a Deep Learning Model With Pytorch

Deep learning is a powerful and flexible method for developing state-of-the-art ML models. PyTorch is a popular open-source deep learning framework that provides a seamless way to build, train, and evaluate neural networks in Python. In this article, we will go over the steps of training a deep learning model using PyTorch, along with an example.
A neural network is a type of machine learning model that is inspired by the
structure and function of the human brain. It consists of layers of interconnected
nodes, called neurons, which process and transmit information. Neural networks are
particularly well suited for tasks such as image and speech recognition, natural
language processing, and making predictions based on large amounts of data.
Below is an image of a neural network.

Installing PyTorch
To install PyTorch, you will need to have Python and pip (the package manager for
Python) installed on your computer.
You can install PyTorch and necessary libraries by running the following command in
your command prompt or terminal:
pip install torch
pip install torchvision
pip install torchsummary

MNIST Datasets
The MNIST dataset is a dataset of handwritten digits, consisting of 60,000 training
examples and 10,000 test examples. Each example is a 28×28 grayscale image of a
handwritten digit, with values ranging from 0 (black) to 255 (white). The label for
each example is the digit that the image represents, with values ranging from 0 to
9.
A sample image from the MNIST dataset
It is a dataset commonly used for training and evaluating image classification
models, particularly in the field of computer vision. It is considered a “Hello
World” dataset for deep learning because it is small and relatively simple, yet
still requires a non-trivial amount of preprocessing and model architecture design
to achieve good performance.
Step 1: Import the necessary Libraries

Python3
import torch
import torchvision
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torchsummary import summary

Step 2: Load the MNIST Datasets


First, we need to import the necessary libraries and load the dataset. We will be
using the built-in MNIST dataset in PyTorch, which can be easily loaded using the
torchvision library.

Python3
# Load the MNIST dataset
train_dataset = torchvision.datasets.MNIST(root='./data',
                                           train=True,
                                           transform=torchvision.transforms.ToTensor(),
                                           download=True)

test_dataset = torchvision.datasets.MNIST(root='./data',
                                          train=False,
                                          transform=torchvision.transforms.ToTensor(),
                                          download=True)

In the above code, the torchvision.datasets.MNIST function is used to load the dataset. It takes several arguments, such as:
root: The directory where the dataset will be saved.
train: A Boolean flag indicating whether to load the training set or the test set.
transform: A transformation to be applied to the data.
download: A Boolean flag indicating whether to download the dataset if it is not found in the root directory.
Step 3: Build the model
Next, we need to define our model. In this example, we will be using a simple convolutional neural network (CNN).

Python3

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.5)
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.dropout1(x)
        x = self.pool(F.relu(self.conv2(x)))
        x = self.dropout2(x)
        x = x.view(-1, 64 * 7 * 7)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

The Classifier class inherits from PyTorch’s nn.Module class and defines the
architecture of the CNN. The __init__ method is called when an instance of the
class is created and it sets up the layers of the network.
self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1): This line creates a 2D convolutional layer with 1 input channel, 32 output channels, a kernel size of 3, and padding of 1. The convolutional layer applies a set of filters (also called kernels) to the input image in order to extract features from it.
self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1): This line creates another 2D convolutional layer with 32 input channels, 64 output channels, a kernel size of 3, and padding of 1. This layer is connected to the output of the first convolutional layer, allowing the network to learn more complex features from the previous layer's output.
self.pool = nn.MaxPool2d(2, 2): This line creates a max pooling layer with a kernel size of 2 and a stride of 2. Max pooling is a down-sampling operation that selects the maximum value from a small neighborhood for each input channel. It helps to reduce the dimensionality of the data, reduce the computational cost, and prevent overfitting.
self.dropout1 = nn.Dropout2d(0.25): This line creates a dropout layer with a probability of 0.25. Dropout is a regularization technique that randomly drops out some neurons during training, which helps to reduce overfitting.
self.dropout2 = nn.Dropout2d(0.5): This line creates another dropout layer with a probability of 0.5.
self.fc1 = nn.Linear(64 * 7 * 7, 128): This line creates a fully connected (linear) layer with 64 * 7 * 7 input features and 128 output features. Fully connected layers are used to make the final predictions based on the features learned by the previous layers.
self.fc2 = nn.Linear(128, 10): This line creates another fully connected layer with 128 input features and 10 output features. This layer produces the final output of the network with 10 classes.
Next, there is the Forward pass method of the network. It takes an input x and
applies a series of operations defined by the layers in the __init__ method.
x = self.pool(F.relu(self.conv1(x))): This line applies the ReLU activation function (F.relu) to the output of the first convolutional layer (self.conv1), and then applies max pooling (self.pool) to the result.
x = self.dropout1(x): This line applies dropout to the output of the first pooling layer.
x = self.pool(F.relu(self.conv2(x))): This line applies the ReLU activation function to the output of the second convolutional layer (self.conv2), and then applies max pooling to the result.
x = self.dropout2(x): This line applies dropout to the output of the second pooling layer.
x = x.view(-1, 64 * 7 * 7): This line reshapes the tensor x into a 2D tensor of shape (batch_size, 64 * 7 * 7), with -1 indicating that the batch dimension is inferred from the other dimensions.
x = F.relu(self.fc1(x)): This line applies the ReLU activation function to the output of the first fully connected layer (self.fc1).
x = self.fc2(x): This line applies the final fully connected layer (self.fc2) to the output of the previous layer and returns the result, which will be the final output of the network.
This CNN architecture is a simple one, and it can be used as a starting point for more complex tasks. However, it could be improved by adding more layers, using different types of layers, or tuning the hyperparameters for better performance.
GPU vs CPU

Python3

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

Output:
device(type='cuda')
This piece of code is used to select the device on which we should train our model. If we are running our code in Google Colab, we can check whether a CUDA device is available; if it is, we use it, otherwise we fall back to the normal CPU.
CUDA is NVIDIA's platform for running computations on the GPU, which is well suited to training ML models.
Model Summary
After defining the model we can use the class to create a model object and view the
summary of the model. The summary option can be used to print the summary of the
model like below.

Python3
# Instantiate the model
model = Classifier()

# Move the model to the GPU if available
model.to(device)

summary(model, (1, 28, 28))

Output:
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 32, 28, 28] 320
MaxPool2d-2 [-1, 32, 14, 14] 0
Dropout2d-3 [-1, 32, 14, 14] 0
Conv2d-4 [-1, 64, 14, 14] 18,496
MaxPool2d-5 [-1, 64, 7, 7] 0
Dropout2d-6 [-1, 64, 7, 7] 0
Linear-7 [-1, 128] 401,536
Linear-8 [-1, 10] 1,290
================================================================
Total params: 421,642
Trainable params: 421,642
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.43
Params size (MB): 1.61
Estimated Total Size (MB): 2.04
----------------------------------------------------------------

Step 4: Define the loss function and optimizer
Now, we need to define a loss function and an optimizer. For this example, we will
be using the cross-entropy loss and the ADAM optimizer.
Python3

# Define a loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

The code defines the loss function and optimizer for the neural network.
nn.CrossEntropyLoss() is a PyTorch function that creates an instance of the cross-
entropy loss function. Cross-entropy loss is commonly used in classification
problems as it measures the dissimilarity between the predicted class probabilities
and the true class. It is calculated by taking the negative logarithm of the
predicted class probability for the true class.
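As a quick sanity check (a minimal sketch with hypothetical logits, not part of the original article), the cross-entropy loss for a single sample equals the negative log of the softmax probability assigned to the true class:

Python3

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # hypothetical raw model outputs
target = torch.tensor([0])                   # true class index

# nn.CrossEntropyLoss applies softmax internally, then takes -log(p_true)
loss = nn.CrossEntropyLoss()(logits, target)
manual = -torch.log(F.softmax(logits, dim=1)[0, 0])
print(loss.item(), manual.item())  # both values should match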
optimizer = optim.Adam(model.parameters(), lr=0.001): This line creates an instance
of the optim.Adam class, which is an optimization algorithm commonly used for deep
learning. The Adam optimizer is an extension of stochastic gradient descent that
uses moving averages of the parameters to provide a running estimate of the second
raw moments of the gradients; the term Adam is derived from adaptive moment
estimation. It requires the model’s parameters to be passed as the first argument
and the learning rate is set to 0.001. The learning rate is a hyperparameter that
controls the step size at which the optimizer makes updates to the model’s
parameters.
The optimizer and the loss function are used during the training process to update
the model’s parameters and to evaluate the model’s performance, respectively.
Step 5: Train the model
Now, we can train our model using the training dataset. We will be using a batch
size of 100 and will train the model for 10 epochs. The below code is training the
neural network on a dataset using a loop that iterates over the number of training
epochs and over the data in the training dataset.
batch_size = 100 and num_epochs = 10 define the batch size and number of epochs for the training process. The batch size is the number of samples from the training dataset that are used in one forward and backward pass of the neural network. The number of epochs is the number of times the entire training dataset is passed through the network.
torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True) creates a PyTorch DataLoader for the training dataset. The DataLoader takes the training dataset as an input and returns an iterator over the dataset. The iterator returns a set of samples (images and labels) in each iteration, where the number of samples is determined by the batch size. By setting shuffle=True, the DataLoader randomly shuffles the dataset before each epoch.
The outer loop, for epoch in range(num_epochs), iterates over the number of training epochs.
The inner loop, for i, (images, labels) in enumerate(train_loader), iterates over the DataLoader, which returns batches of images and labels. The images are passed through the model using outputs = model(images) to get the model's predictions.
The loss is calculated by passing the model's predictions and the true labels to the loss function using loss = criterion(outputs, labels).
The optimizer is used to update the model's parameters in the direction that minimizes the loss. This is done in the following 3 steps: optimizer.zero_grad() clears the gradients of all optimizable parameters; loss.backward() computes the gradients of the loss with respect to the model's parameters; optimizer.step() updates the model's parameters based on the computed gradients.
After the end of each epoch, the code prints the current epoch and the loss at the end of the epoch.
At the end of the training process, the model’s parameters will have been updated
to minimize the loss on the training dataset.
It’s worth noting that it’s also useful to use a validation set to evaluate the
model performance during training, so we can detect overfitting and adjust the
model accordingly. we can achieve this by splitting the training set into two
parts: training and validation. Then, use the training set for training, and use
the validation set for evaluating the model performance during training.

Python3

batch_size = 100
num_epochs = 10

# Split the training set into training and validation sets
val_percent = 0.2  # percentage of the data used for validation
val_size = int(val_percent * len(train_dataset))
train_size = len(train_dataset) - val_size
train_dataset, val_dataset = torch.utils.data.random_split(
    train_dataset, [train_size, val_size])

# Create DataLoaders for the training and validation sets
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=batch_size, shuffle=True, pin_memory=True)
val_loader = torch.utils.data.DataLoader(
    val_dataset, batch_size=batch_size, shuffle=False, pin_memory=True)

losses = []
accuracies = []
val_losses = []
val_accuracies = []

# Train the model
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Forward pass
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        loss = criterion(outputs, labels)
        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Metrics from the last training batch of this epoch
    _, predicted = torch.max(outputs.data, 1)
    acc = (predicted == labels).sum().item() / labels.size(0)
    accuracies.append(acc)
    losses.append(loss.item())

    # Evaluate the model on the validation set
    val_loss = 0.0
    val_acc = 0.0
    with torch.no_grad():
        for images, labels in val_loader:
            labels = labels.to(device)
            images = images.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            val_loss += loss.item()
            _, predicted = torch.max(outputs.data, 1)
            total = labels.size(0)
            correct = (predicted == labels).sum().item()
            val_acc += correct / total

        val_accuracies.append(acc)
        val_losses.append(loss.item())

    print('Epoch [{}/{}], Loss: {:.4f}, Validation Loss: {:.4f}, '
          'Accuracy: {:.2f}, Validation Accuracy: {:.2f}'.format(
              epoch + 1, num_epochs, loss.item(), val_loss, acc, val_acc))

Output:
Epoch [1/10], Loss:0.2086, Validation Loss:14.6681, Accuracy:0.99, Validation
Accuracy:0.94
Epoch [2/10], Loss:0.1703, Validation Loss:11.0446, Accuracy:0.95, Validation
Accuracy:0.94
Epoch [3/10], Loss:0.1617, Validation Loss:8.9060, Accuracy:0.98, Validation
Accuracy:0.97
Epoch [4/10], Loss:0.1670, Validation Loss:7.7104, Accuracy:0.98, Validation
Accuracy:0.97
Epoch [5/10], Loss:0.0723, Validation Loss:7.1193, Accuracy:1.00, Validation
Accuracy:0.96
Epoch [6/10], Loss:0.0970, Validation Loss:7.5116, Accuracy:1.00, Validation
Accuracy:0.98
Epoch [7/10], Loss:0.1623, Validation Loss:6.8909, Accuracy:0.99, Validation
Accuracy:0.96
Epoch [8/10], Loss:0.1251, Validation Loss:7.2684, Accuracy:1.00, Validation
Accuracy:0.97
Epoch [9/10], Loss:0.0874, Validation Loss:6.9928, Accuracy:1.00, Validation
Accuracy:0.98
Epoch [10/10], Loss:0.0405, Validation Loss:6.0112, Accuracy:0.99, Validation
Accuracy:0.99
In this example, we have covered the basic steps to train a deep-learning model
using PyTorch on the MNIST dataset. This model can be further improved by using
more complex architectures, data augmentation, and other techniques. PyTorch is a
powerful and flexible library that allows you to build and train a wide range of
models, and this example is just the beginning of what you can do with it.
Step 6: Plot Training and Validation curve to check overfitting or underfitting
Once the model is trained, We can plot the Training and Validation Loss and
accuracy curve. This can give us an idea of how the model is performing on unseen
data, and if it’s overfitting or underfitting.

Python3

import matplotlib.pyplot as plt

# Plot the training and validation loss over time
plt.plot(range(num_epochs), losses, color='red',
         label='Training Loss', marker='o')
plt.plot(range(num_epochs), val_losses, color='blue', linestyle='--',
         label='Validation Loss', marker='x')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and Validation Loss')
plt.legend()
plt.show()

# Plot the training and validation accuracy over time
plt.plot(range(num_epochs), accuracies, label='Training Accuracy',
         color='red', marker='o')
plt.plot(range(num_epochs), val_accuracies, label='Validation Accuracy',
         color='blue', linestyle=':', marker='x')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Training and Validation Accuracy')
plt.legend()
plt.show()
Output:
Training and Validation Loss / Training and Validation Accuracy (plots)
Note that the loss is generally decreasing with each epoch and the accuracy is increasing. This is the expected behavior.
Step 7: Evaluation
Another important aspect is the choice of the evaluation metric. In this example,
we used accuracy as the evaluation metric, which is a good starting point for many
problems. However, it’s important to be aware that accuracy can be misleading in
some cases, especially when the classes are imbalanced. In those cases, other
metrics such as precision, recall, F1-score, or AUC-ROC should be used.
After training the model, you can evaluate its performance on the test dataset by
making predictions and comparing them to the true labels. One way to evaluate the
performance of a classification model is to use a classification report, which is a
summary of the model’s performance across all classes.
The first thing is to evaluate the model on the test dataset and calculate its
overall accuracy by comparing the predicted labels to the true labels using the
torch.max() function.
Then, it generates a classification report using the classification_report function
from the scikit-learn library. The classification report gives you a summary of the
model’s performance across all classes by calculating several metrics such as
precision, recall, f1-score, and support.
Precision – Precision is the number of true positives divided by the number of true
positives plus the number of false positives. It is a measure of how many of the
positive predictions were correct.
Recall – Recall is the number of true positives divided by the number of true
positives plus the number of false negatives. It is a measure of how many of the
actual positive cases were correctly predicted.
F1-score – The F1-score is the harmonic mean of precision and recall. It is a
single number that represents the balance between precision and recall.
Support – Support is the number of instances in the test set that belong to a
specific class.
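Expressed as formulas (standard definitions added here for reference; TP, FP, and FN denote true positives, false positives, and false negatives):

Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1-score  = 2 * (Precision * Recall) / (Precision + Recall)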
It is important to note that the classification report is calculated based on the
predictions made on the entire test set, and not just a sample of the test set.
Here is an example of how to evaluate the model and generate a classification
report:

Python3
# Create a DataLoader for the test dataset
test_loader = torch.utils.data.DataLoader(
    test_dataset, batch_size=batch_size, shuffle=False)

# Evaluate the model on the test dataset
model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    y_true = []
    y_pred = []
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
        predicted = predicted.to('cpu')
        labels = labels.to('cpu')
        y_true.extend(labels)
        y_pred.extend(predicted)

print('Test Accuracy: {} %'.format(100 * correct / total))

# Generate a classification report
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred))

Output:
Test Accuracy: 99.1%
precision recall f1-score support

0 0.99 1.00 0.99 980


1 1.00 1.00 1.00 1135
2 0.99 0.99 0.99 1032
3 0.99 0.99 0.99 1010
4 0.99 0.99 0.99 982
5 0.99 0.99 0.99 892
6 1.00 0.99 0.99 958
7 0.98 0.99 0.99 1028
8 1.00 0.99 0.99 974
9 0.99 0.99 0.99 1009

accuracy 0.99 10000


macro avg 0.99 0.99 0.99 10000
weighted avg 0.99 0.99 0.99 10000
Finally, it’s important to keep in mind that deep learning models require a lot of
data and computational resources to train. Training a model on a large dataset
might take a long time and require a powerful GPU. There are also several cloud
providers such as AWS, GCP, and Azure that provide GPU instances that can be used
to train deep learning models, which can be a good option if you don’t have the
resources to train the model locally.
Summary
In summary, deep learning with PyTorch is a powerful tool that can be used to build
and train a wide range of models. The MNIST dataset is a good starting point to
learn the basics, but it’s important to keep in mind that there are many other
aspects to consider when working with real-world datasets and problems. With the
right knowledge and resources, deep learning can be a powerful tool to solve
complex problems and make predictions with high accuracy.

Linear Regression using PyTorch


Linear Regression is a very commonly used statistical method that allows us to determine and study the relationship between two continuous variables. The various properties of linear regression and its Python implementation have been covered in this article previously. Now, we shall find out how to implement this in PyTorch, a very popular deep learning library developed by Facebook.
Firstly, you will need to install PyTorch in your Python environment. The easiest way to do this is to use the pip or conda tool. Visit pytorch.org and install the version for your Python interpreter and the package manager that you would like to use.

Python3

# We can run this Python code on a Jupyter notebook
# to automatically install the correct version of PyTorch.
# http://pytorch.org/
from os import path
from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag

platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
accelerator = 'cu80' if path.exists('/opt/bin/nvidia-smi') else 'cpu'

!pip install -q http://download.pytorch.org/whl/{accelerator}/torch-1.3.1.post4-{platform}-linux_x86_64.whl torchvision
With PyTorch installed, let us now have a look at the code. Write the two lines
given below to import the necessary library functions and objects.

Python3

import torch
from torch.autograd import Variable

We also define some data and assign them to variables x_data and y_data as given
below:

Python3
x_data = Variable(torch.Tensor([[1.0], [2.0], [3.0]]))
y_data = Variable(torch.Tensor([[2.0], [4.0], [6.0]]))

Here, x_data is our independent variable and y_data is our dependent variable. This
will be our dataset for now. Next, we need to define our model. There are two main
steps associated with defining our model. They are:
Initializing our model.
Declaring the forward pass.
We use the class given below:

Python3

class LinearRegressionModel(torch.nn.Module):

    def __init__(self):
        super(LinearRegressionModel, self).__init__()
        self.linear = torch.nn.Linear(1, 1)  # One in and one out

    def forward(self, x):
        y_pred = self.linear(x)
        return y_pred
As you can see, our Model class is a subclass of torch.nn.Module. Also, since here we have only one input and one output, we use a Linear model with both the input and output dimensions equal to 1.
Next, we create an object of this model.

Python3

# our model
our_model = LinearRegressionModel()

After this, we select the optimizer and the loss criteria. Here, we will use the
mean squared error (MSE) as our loss function and stochastic gradient descent (SGD)
as our optimizer. Also, we arbitrarily fix a learning rate of 0.01.
Python3

criterion = torch.nn.MSELoss(size_average=False)
optimizer = torch.optim.SGD(our_model.parameters(), lr=0.01)

We now arrive at our training step. We perform the following tasks 500 times during
training:
Perform a forward pass by passing our data and finding out the predicted value of y.
Compute the loss using MSE.
Reset all the gradients to 0, perform a backpropagation, and then update the weights.

Python3
for epoch in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    pred_y = our_model(x_data)

    # Compute and print loss
    loss = criterion(pred_y, y_data)

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print('epoch {}, loss {}'.format(epoch, loss.item()))

Once the training is completed, we test if we are getting correct results using the
model that we defined. So, we test it for an unknown value of x_data, in this case,
4.0.

Python3

new_var = Variable(torch.Tensor([[4.0]]))
pred_y = our_model(new_var)
print("predict (after training)", 4, our_model(new_var).item())
If you performed all the steps correctly, you will see that for input 4.0, you are getting a value that is very close to 8.0, as below. So, our model inherently learns the relationship between the input data and the output data without being programmed explicitly.

predict (after training) 4 7.966438293457031

For your reference, you can find the entire code of this article given below:

Python3

import torch
from torch.autograd import Variable

x_data = Variable(torch.Tensor([[1.0], [2.0], [3.0]]))
y_data = Variable(torch.Tensor([[2.0], [4.0], [6.0]]))


class LinearRegressionModel(torch.nn.Module):

    def __init__(self):
        super(LinearRegressionModel, self).__init__()
        self.linear = torch.nn.Linear(1, 1)  # One in and one out

    def forward(self, x):
        y_pred = self.linear(x)
        return y_pred


# our model
our_model = LinearRegressionModel()

criterion = torch.nn.MSELoss(size_average=False)
optimizer = torch.optim.SGD(our_model.parameters(), lr=0.01)

for epoch in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    pred_y = our_model(x_data)
    # Compute and print loss
    loss = criterion(pred_y, y_data)
    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print('epoch {}, loss {}'.format(epoch, loss.item()))

new_var = Variable(torch.Tensor([[4.0]]))
pred_y = our_model(new_var)
print("predict (after training)", 4, our_model(new_var).item())
References: PyTorchZeroToAll; Penn State STAT 501

Linear Regression Using Tensorflow

We will briefly summarize Linear Regression before implementing it using TensorFlow. Since we will not get into the details of either Linear Regression or Tensorflow, please read the following articles for more details:
Linear Regression (Python Implementation)
Introduction to TensorFlow
Introduction to Tensor with Tensorflow
Brief Summary of Linear Regression
Linear Regression is a very common statistical method that allows us to learn a
function or relationship from a given set of continuous data. For example, we are
given some data points of x and corresponding y and we need to learn the
relationship between them which is called a hypothesis.
In the case of Linear Regression, the hypothesis is a straight line, i.e.,

h(x) = wx + b

where w is a vector called Weights and b is a scalar called Bias. The Weights and Bias are called the parameters of the model.
All we need to do is estimate the values of w and b from the given set of data such that the resultant hypothesis produces the least cost J, which is defined by the following cost function:

J = (1 / 2m) * sum_i (y_i - h(x_i))^2

where m is the number of data points in the given dataset. This cost function is also called Mean Squared Error.
For finding the optimized value of the parameters for which J is minimum, we will
be using a commonly used optimizer algorithm called Gradient Descent. Following is
the pseudo-code for Gradient Descent:
Repeat until convergence {
    w = w - α * ∂J/∂w
    b = b - α * ∂J/∂b
}

where α is a hyperparameter called the Learning Rate.
Linear regression is a widely used statistical method for modeling the relationship
between a dependent variable and one or more independent variables. TensorFlow is a
popular open-source software library for data processing, machine learning, and
deep learning applications. Here are some advantages and disadvantages of using
Tensorflow for linear regression:
Advantages:
Scalability: Tensorflow is designed to handle large datasets and can easily scale up to handle more data and more complex models.
Flexibility: Tensorflow provides a flexible API that allows users to customize their models and optimize their algorithms.
Performance: Tensorflow can run on multiple GPUs and CPUs, which can significantly speed up the training process and improve performance.
Integration: Tensorflow can be integrated with other open-source libraries like Numpy, Pandas, and Matplotlib, which makes it easier to preprocess and visualize data.
Disadvantages:
Complexity: Tensorflow has a steep learning curve and requires a good understanding of machine learning and deep learning concepts.
Computational resources: Running Tensorflow on large datasets requires high computational resources, which can be expensive.
Debugging: Debugging errors in Tensorflow can be challenging, especially when working with complex models.
Overkill for simple models: Tensorflow can be overkill for simple linear regression models and may not be necessary for smaller datasets.
Overall, using Tensorflow for linear regression has many advantages, but it also has some disadvantages. When deciding whether to use Tensorflow or not, it is essential to consider the complexity of the model, the size of the dataset, and the available computational resources.
Tensorflow
Tensorflow is an open-source computation library made by Google. It is a popular
choice for creating applications that require high-end numerical computations
and/or need to utilize Graphics Processing Units for computation purposes. These
are the main reasons due to which Tensorflow is one of the most popular choices for
Machine Learning applications, especially Deep Learning. It also has APIs like
Estimator which provide a high level of abstraction while building Machine Learning
Applications. In this article, we will not be using any high-level APIs, rather we
will be building the Linear Regression model using low-level Tensorflow in the Lazy
Execution Mode during which Tensorflow creates a Directed Acyclic Graph or DAG
which keeps track of all the computations, and then executes all the computations
done inside a Tensorflow Session.
Implementation
We will start by importing the necessary libraries. We will use Numpy along with
Tensorflow for computations and Matplotlib for plotting.

Python3

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

In order to make the random numbers predictable, we will define fixed seeds for
both Numpy and Tensorflow.
Python3

np.random.seed(101)
tf.set_random_seed(101)

Now, let us generate some random data for training the Linear Regression Model.

Python3
# Generating random linear data
# There will be 50 data points ranging from 0 to 50
x = np.linspace(0, 50, 50)
y = np.linspace(0, 50, 50)

# Adding noise to the random linear data
x += np.random.uniform(-4, 4, 50)
y += np.random.uniform(-4, 4, 50)

n = len(x)  # Number of data points

Let us visualize the training data.

Python3

# Plot of Training Data
plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.title("Training Data")
plt.show()

Output:
Now we will start creating our model by defining the placeholders X and Y, so that
we can feed our training examples X and Y into the optimizer during the training
process.

Python3

X = tf.placeholder("float")
Y = tf.placeholder("float")

Now we will declare two trainable Tensorflow Variables for the Weights and Bias and initialize them randomly using np.random.randn().

Python3
W = tf.Variable(np.random.randn(), name="W")
b = tf.Variable(np.random.randn(), name="b")

Now we will define the hyperparameters of the model, the Learning Rate and the
number of Epochs.

Python3

learning_rate = 0.01
training_epochs = 1000


Now, we will be building the Hypothesis, the Cost Function, and the Optimizer. We
won’t be implementing the Gradient Descent Optimizer manually since it is built
inside Tensorflow. After that, we will be initializing the Variables.

Python3

# Hypothesis
y_pred = tf.add(tf.multiply(X, W), b)

# Mean Squared Error Cost Function
cost = tf.reduce_sum(tf.pow(y_pred - Y, 2)) / (2 * n)

# Gradient Descent Optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# Global Variables Initializer
init = tf.global_variables_initializer()

Now we will begin the training process inside a Tensorflow Session.

Python3
# Starting the Tensorflow Session
with tf.Session() as sess:
    # Initializing the Variables
    sess.run(init)

    # Iterating through all the epochs
    for epoch in range(training_epochs):
        # Feeding each data point into the optimizer using the Feed Dictionary
        for (_x, _y) in zip(x, y):
            sess.run(optimizer, feed_dict={X: _x, Y: _y})

        # Displaying the result after every 50 epochs
        if (epoch + 1) % 50 == 0:
            # Calculating the cost at the current epoch
            c = sess.run(cost, feed_dict={X: x, Y: y})
            print("Epoch", (epoch + 1), ": cost =", c,
                  "W =", sess.run(W), "b =", sess.run(b))

    # Storing necessary values to be used outside the Session
    training_cost = sess.run(cost, feed_dict={X: x, Y: y})
    weight = sess.run(W)
    bias = sess.run(b)

Output:

Now let us look at the result.

Python3
# Calculating the predictions
predictions = weight * x + bias
print("Training cost =", training_cost, "Weight =", weight, "bias =", bias, '\n')

Output:

Note that in this case both the Weight and bias are scalars. This is because we have considered only one independent variable (feature) in our training data. If we have m independent variables in our training dataset, the Weight will be an m-dimensional vector while the bias will be a scalar.
Finally, we will plot our result.

Python3

# Plotting the Results
plt.plot(x, y, 'ro', label='Original data')
plt.plot(x, predictions, label='Fitted line')
plt.title('Linear Regression Result')
plt.legend()
plt.show()
Output:

Hyperparameter tuning

A Machine Learning model is defined as a mathematical model with several parameters that need to be learned from the data. By training a model with existing data, we can fit the model parameters. However, there is another kind of parameter, known as Hyperparameters, that cannot be directly learned from the regular training process. They are usually fixed before the actual training process begins. These parameters express important properties of the model such as its complexity or how fast it should learn. This article aims to explore various strategies to tune hyperparameters for Machine learning models.
Hyperparameter Tuning
Hyperparameter tuning is the process of selecting the optimal
values for a machine learning model’s hyperparameters. Hyperparameters are settings
that control the learning process of the model, such as the learning rate, the
number of neurons in a neural network, or the kernel size in a support vector
machine. The goal of hyperparameter tuning is to find the values that lead to the
best performance on a given task.
What are Hyperparameters?
In the context of machine learning, hyperparameters are
configuration variables that are set before the training process of a model begins.
They control the learning process itself, rather than being learned from the data.
Hyperparameters are often used to tune the performance of a model, and they can
have a significant impact on the model’s accuracy, generalization, and other
metrics.
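To make the idea concrete, here is a minimal sketch of hyperparameter tuning with scikit-learn's GridSearchCV (scikit-learn, the SVM model, and the chosen parameter grid are assumptions for illustration; they are not part of the original article):

Python3

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Hypothetical grid of hyperparameter values to try
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 0.001],
}

# Exhaustively evaluate every combination with 5-fold cross-validation
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # best hyperparameter combination found
print(search.best_score_)    # its mean cross-validated accuracy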
Different Ways of Hyperparameters Tuning
Hyperparameters are configuration variables
that control the learning process of a machine learning model. They are distinct
from model parameters, which are the weights and biases that are learned from the
data. There are several different types of hyperparameters:
Hyperparameters in Neural NetworksNeural networks have several essential
hyperparameters that need to be adjusted, including:
Learning rate: This hyperparameter controls the step size taken by the optimizer during each iteration of training. Too small a learning rate can result in slow convergence, while too large a learning rate can lead to instability and divergence.
Epochs: This hyperparameter represents the number of times the entire training dataset is passed through the model during training. Increasing the number of epochs can improve the model's performance but may lead to overfitting if not done carefully.
Number of layers: This hyperparameter determines the depth of the model, which can have a significant impact on its complexity and learning ability.
Number of nodes per layer: This hyperparameter determines the width of the model, influencing its capacity to represent complex relationships in the data.
Architecture: This hyperparameter determines the overall structure of the neural network, including the number of layers, the number of neurons per layer, and the connections between layers. The optimal architecture depends on the complexity of the task and the size of the dataset.
Activation function: This hyperparameter introduces non-linearity into the model, allowing it to learn complex decision boundaries. Common activation functions include sigmoid, tanh, and Rectified Linear Unit (ReLU).
Hyperparameters in Support Vector Machines
We take into account some essential hyperparameters for fine-tuning SVMs:
C: The regularization parameter that controls the trade-off between the margin and the number of training errors. A larger value of C penalizes training errors more heavily, resulting in a smaller margin but potentially better generalization performance. A smaller value of C allows for more training errors but may lead to overfitting.
Kernel: The kernel function that defines the similarity between data points. Different kernels can capture different relationships between data points, and the choice of kernel can significantly impact the performance of the SVM. Common kernels include linear, polynomial, radial basis function (RBF), and sigmoid.
Gamma: The parameter that controls the influence of support vectors on the decision boundary. A larger value of gamma indicates that nearby support vectors have a stronger influence, while a smaller value indicates that distant support vectors have a weaker influence. The choice of gamma is particularly important for RBF kernels.
Hyperparameters in XGBoost
The following essential XGBoost hyperparameters need to be adjusted:
learning_rate: This hyperparameter determines the step size taken by the optimizer during each iteration of training. A larger learning rate can lead to faster convergence, but it may also increase the risk of overfitting. A smaller learning rate may result in slower convergence but can help prevent overfitting.
n_estimators: This hyperparameter determines the number of boosting trees to be trained. A larger number of trees can improve the model's accuracy, but it can also increase the risk of overfitting. A smaller number of trees may result in lower accuracy but can help prevent overfitting.
max_depth: This hyperparameter determines the maximum depth of each tree in the ensemble. A larger max_depth can allow the trees to capture more complex relationships in the data, but it can also increase the risk of overfitting. A smaller max_depth may result in less complex trees but can help prevent overfitting.
min_child_weight: This hyperparameter determines the minimum sum of instance weight (hessian) needed in a child node. A larger min_child_weight can help prevent overfitting by requiring more data to influence the splitting of trees. A smaller min_child_weight may allow for more aggressive tree splitting but can increase the risk of overfitting.
subsample: This hyperparameter determines the percentage of rows used for each tree construction. A smaller subsample can improve the efficiency of training but may reduce the model's accuracy. A larger subsample can increase the accuracy but may make training more computationally expensive.
Some other examples of model hyperparameters include:
The penalty in the Logistic Regression classifier, i.e. L1 or L2 regularization
The number of trees and the depth of trees for Random Forests
The learning rate for training a neural network
The number of clusters for clustering algorithms
The k in k-nearest neighbors
Hyperparameter Tuning Techniques
Models can have many hyperparameters, and finding the best combination of parameters can be treated as a search problem. Three widely used strategies for hyperparameter tuning are:
GridSearchCV
RandomizedSearchCV
Bayesian Optimization
1. GridSearchCV
Grid search can be considered a "brute force" approach to hyperparameter optimization. We fit the model using all possible combinations after creating a grid of potential discrete hyperparameter values. We log each set's model performance and then choose the combination that produces the best results. This approach is called GridSearchCV because it searches for the best set of hyperparameters from a grid of hyperparameter values.
Grid search is an exhaustive approach that can identify the ideal hyperparameter combination, but its slowness is a disadvantage: fitting the model with every potential combination often takes more processing power and time than is available.
For example, suppose we want to set the two hyperparameters C and Alpha of the Logistic Regression classifier, each with a set of candidate values. The grid search technique will construct many versions of the model with all possible combinations of these hyperparameters and will return the best one.
In the original article's illustration, for C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4], the combination C = 0.3 and Alpha = 0.2 gives the highest performance score (0.726) and is therefore selected.

The following code illustrates how to use GridSearchCV

Python3
# Necessary imports
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification
import numpy as np

# Generate a synthetic dataset for illustration
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, n_classes=2, random_state=42)

# Creating the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiating the logistic regression classifier
logreg = LogisticRegression()

# Instantiating the GridSearchCV object
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit the GridSearchCV object to the data
logreg_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))

Output:
Tuned Logistic Regression Parameters: {'C': 0.006105402296585327}
Best score is 0.853

Drawback: GridSearchCV will go through all the intermediate combinations of hyperparameters, which makes grid search computationally very expensive.
2. RandomizedSearchCV
As the name suggests, the random search method selects values
at random as opposed to the grid search method’s use of a predetermined set of
numbers. Every iteration, random search attempts a different set of hyperparameters
and logs the model’s performance. It returns the combination that provided the best
outcome after several iterations. This approach reduces unnecessary computation.
RandomizedSearchCV solves the drawbacks of GridSearchCV, as it goes through only a
fixed number of hyperparameter settings. It moves within the grid in a random
fashion to find the best set of hyperparameters. The advantage is that, in most
cases, a random search will produce a comparable result faster than a grid search.
The following code illustrates how to use RandomizedSearchCV

Python3

import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

# Generate a synthetic dataset for illustration
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, n_classes=2, random_state=42)

param_dist = {
    "max_depth": [3, None],
    "max_features": randint(1, 9),
    "min_samples_leaf": randint(1, 9),
    "criterion": ["gini", "entropy"],
}

tree = DecisionTreeClassifier()
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)
tree_cv.fit(X, y)

print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))

Output:
Tuned Decision Tree Parameters: {'criterion': 'entropy', 'max_depth': None,
'max_features': 8, 'min_samples_leaf': 7}
Best score is 0.842
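Note that RandomizedSearchCV samples only a fixed number of candidate settings per run (10 by default); this can be raised through the n_iter argument, for example RandomizedSearchCV(tree, param_dist, cv=5, n_iter=20), trading extra computation for a more thorough search.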

Drawback: Because only a random subset of combinations is evaluated, the returned result may not be the ideal hyperparameter combination.
3. Bayesian Optimization
Grid search and random search are often inefficient because
they evaluate many unsuitable hyperparameter combinations without considering the
previous iterations’ results. Bayesian optimization, on the other hand, treats the
search for optimal hyperparameters as an optimization problem. It considers the
previous evaluation results when selecting the next hyperparameter combination and
applies a probabilistic function to choose the combination that will likely yield
the best results. This method discovers a good hyperparameter combination in
relatively few iterations.
Data scientists use a probabilistic model when the objective function is unknown.
The probabilistic model estimates the probability of a hyperparameter combination’s
objective function result based on past evaluation results.
P(score(y)|hyperparameters(x))
It is a “surrogate” of the objective function, which can be the root-mean-square
error (RMSE), for example. The objective function is calculated using the training
data with the hyperparameter combination, and we try to optimize it (maximize or
minimize, depending on the objective function selected).
Applying the probabilistic model to the hyperparameters is computationally
inexpensive compared to the objective function. Therefore, this method typically
updates and improves the surrogate probability model every time the objective
function runs. Better hyperparameter predictions decrease the number of objective
function evaluations needed to achieve a good result. Gaussian processes, random
forest regression, and tree-structured Parzen estimators (TPE) are examples of
surrogate models.
The Bayesian optimization model is complex to implement, but off-the-shelf
libraries like Ray Tune can simplify the process. It’s worth using this type of
model because it finds an adequate hyperparameter combination in relatively few
iterations. However, compared to grid search or random search, we must compute
Bayesian optimization sequentially, so it doesn’t allow distributed processing.
Therefore, Bayesian optimization takes longer yet uses fewer computational
resources.
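As a rough illustration of this idea (not code from the original article), a Bayesian-style search can be run with the Optuna library, whose default sampler is a TPE-based method. The sketch below re-tunes the same logistic-regression C parameter used in the GridSearchCV example; the search range and trial count are arbitrary choices, and it assumes optuna and scikit-learn are installed:

import optuna
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, n_classes=2, random_state=42)

def objective(trial):
    # Sample the regularization strength C on a log scale (assumed search range)
    C = trial.suggest_float("C", 1e-5, 1e2, log=True)
    model = LogisticRegression(C=C, max_iter=1000)
    # 5-fold cross-validated accuracy is the objective to maximize
    return cross_val_score(model, X, y, cv=5).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=25)
print("Best parameters:", study.best_params)
print("Best CV accuracy:", study.best_value)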
Drawback: Requires an understanding of the underlying probabilistic model.
Challenges in Hyperparameter Tuning
Dealing with High-Dimensional Hyperparameter Spaces: Efficient Exploration and Optimization
Handling Expensive Function Evaluations: Balancing Computational Efficiency and Accuracy
Incorporating Domain Knowledge: Utilizing Prior Information for Informed Tuning
Developing Adaptive Hyperparameter Tuning Methods: Adjusting Parameters During Training
Applications of Hyperparameter Tuning
Model Selection: Choosing the Right Model Architecture for the Task
Regularization Parameter Tuning: Controlling Model Complexity for Optimal Performance
Feature Preprocessing Optimization: Enhancing Data Quality and Model Performance
Algorithmic Parameter Tuning: Adjusting Algorithm-Specific Parameters for Optimal Results
Advantages of Hyperparameter Tuning:
Improved model performance
Reduced overfitting and underfitting
Enhanced model generalizability
Optimized resource utilization
Improved model interpretability
Disadvantages of Hyperparameter Tuning:
Computational cost
Time-consuming process
Risk of overfitting
No guarantee of optimal performance
Requires expertise
Frequently Asked Questions (FAQs)
1. What are the methods of hyperparameter tuning?
There are several methods for hyperparameter tuning, including grid search,
random search, and Bayesian optimization. Grid search exhaustively evaluates all
possible combinations of hyperparameter values, while random search randomly
samples combinations. Bayesian optimization uses a probabilistic model to guide the
search for optimal hyperparameters.
2. What is the difference between parameter tuning and hyperparameter tuning?
Parameters are the coefficients or weights learned during the training process of a
machine learning model, while hyperparameters are settings that control the
training process itself. For example, the learning rate is a hyperparameter that
controls how quickly the model learns from the data.
3. What is the purpose of hyperparameter tuning?
The purpose of hyperparameter
tuning is to find the best set of hyperparameters for a given machine learning
model. This can improve the model’s performance on unseen data, prevent
overfitting, and reduce training time.
4. Which hyperparameter to tune first?
The order in which you tune hyperparameters
depends on the specific model and dataset. However, a good rule of thumb is to
start with the most important hyperparameters, such as the learning rate, and then
move on to less important ones.
5. What is hyperparameter tuning and cross validation?
Cross validation is a
technique used to evaluate the performance of a machine learning model.
Hyperparameter tuning is often performed within a cross-validation loop to ensure
that the selected hyperparameters generalize well to unseen data.
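For instance, the GridSearchCV object from the earlier example can itself be evaluated with scikit-learn's cross_val_score, giving a nested cross-validation estimate of how well the tuned hyperparameters generalize; this short sketch reuses the logreg, param_grid, X, and y objects defined in the GridSearchCV example above:

from sklearn.model_selection import GridSearchCV, cross_val_score

# outer 5-fold cross-validation wrapped around the inner 5-fold grid search;
# logreg, param_grid, X, and y come from the GridSearchCV example earlier
nested_scores = cross_val_score(GridSearchCV(logreg, param_grid, cv=5), X, y, cv=5)
print("Nested CV accuracy: {:.3f}".format(nested_scores.mean()))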

Introduction to Convolution Neural Network

A Convolutional Neural Network (CNN) is a type of Deep Learning neural network architecture commonly used in Computer Vision. Computer vision is a field of Artificial Intelligence that enables a computer to understand and interpret images and other visual data.
When it comes to Machine Learning, Artificial Neural Networks perform really well.
Neural Networks are used in various datasets like images, audio, and text.
Different types of Neural Networks are used for different purposes, for example for
predicting the sequence of words we use Recurrent Neural Networks more precisely an
LSTM, similarly for image classification we use Convolution Neural networks. In
this blog, we are going to build a basic building block for CNN.
In a regular Neural Network there are three types of layers:
Input Layer: It's the layer in which we give input to our model. The number of neurons in this layer is equal to the total number of features in our data (the number of pixels in the case of an image).
Hidden Layer: The input from the input layer is then fed into the hidden layer. There can be many hidden layers depending on our model and data size. Each hidden layer can have a different number of neurons, which is generally greater than the number of features. The output from each layer is computed by matrix multiplication of the output of the previous layer with the learnable weights of that layer, followed by the addition of learnable biases and an activation function, which makes the network nonlinear.
Output Layer: The output from the hidden layer is then fed into a logistic function like sigmoid or softmax, which converts the output for each class into its probability score.
Feeding the data through the network and obtaining the output of each layer as described above is called feedforward. We then calculate the error using an error function; some common error functions are cross-entropy, square loss error, etc. The error function measures how well the network is performing. After that, we backpropagate through the model by calculating the derivatives. This step, called backpropagation, is basically used to minimize the loss.
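To make the feedforward step concrete, here is a small NumPy sketch of one forward pass through a network with a single hidden layer; the layer sizes and random weights below are purely illustrative:

import numpy as np

rng = np.random.default_rng(0)

# toy input: 4 features, and a network with 8 hidden neurons and 3 output classes
x = rng.random(4)
W1, b1 = rng.random((8, 4)), rng.random(8)   # learnable weights/biases of the hidden layer
W2, b2 = rng.random((3, 8)), rng.random(3)   # learnable weights/biases of the output layer

# hidden layer: matrix multiplication + bias, followed by a nonlinear activation (ReLU)
h = np.maximum(0, W1 @ x + b1)

# output layer: logits passed through softmax to get per-class probability scores
logits = W2 @ h + b2
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs)  # probabilities over the 3 classes, summing to 1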
Convolution Neural Network
Convolutional Neural Network (CNN) is the extended version of artificial neural networks (ANN) which is predominantly used to extract features from grid-like matrix datasets, for example visual datasets like images or videos, where data patterns play an extensive role.
CNN Architecture
A Convolutional Neural Network consists of multiple layers: the input layer, convolutional layers, pooling layers, and fully connected layers.
(Figure: Simple CNN architecture)
The Convolutional layer applies filters to the input image to extract features, the
Pooling layer downsamples the image to reduce computation, and the fully connected
layer makes the final prediction. The network learns the optimal filters through
backpropagation and gradient descent.
How Convolutional Layers Work
Convolution Neural Networks, or covnets, are neural networks that share their parameters. Imagine you have an image. It can be represented as a cuboid having its length and width (the dimensions of the image) and height (i.e. the channels, as images generally have red, green, and blue channels).

Now imagine taking a small patch of this image and running a small neural network,
called a filter or kernel on it, with say, K outputs and representing them
vertically. Now slide that neural network across the whole image, as a result, we
will get another image with different widths, heights, and depths. Instead of just
R, G, and B channels now we have more channels but lesser width and height. This
operation is called Convolution. If the patch size is the same as that of the image
it will be a regular neural network. Because of this small patch, we have fewer
weights.
Image source: Deep Learning Udacity
Now let’s talk about a bit of mathematics that is involved in the whole convolution
process.
Convolution layers consist of a set of learnable filters (or kernels) having small widths and heights and the same depth as that of the input volume (3 if the input layer is an image input).
For example, if we have to run convolution on an image with dimensions 34x34x3, the possible size of the filters can be a x a x 3, where 'a' can be anything like 3, 5, or 7, but smaller than the image dimension.
During the forward pass, we slide each filter across the whole input volume step by step, where each step is called a stride (which can have a value of 2, 3, or even 4 for high-dimensional images), and compute the dot product between the kernel weights and the patch from the input volume.
As we slide our filters we'll get a 2-D output for each filter, and we'll stack them together; as a result, we'll get an output volume having a depth equal to the number of filters. The network will learn all the filters.
Layers Used to Build ConvNets
A complete Convolution Neural Network architecture is also known as a covnet. A covnet is a sequence of layers, and every layer transforms one volume to another through a differentiable function.
Types of layers: Let's take an example by running a covnet on an image of dimension 32 x 32 x 3.
Input Layer: It's the layer in which we give input to our model. In a CNN, the input will generally be an image or a sequence of images. This layer holds the raw input of the image with width 32, height 32, and depth 3.
Convolutional Layers: This is the layer which is used to extract features from the input dataset. It applies a set of learnable filters, known as kernels, to the input images. The filters/kernels are smaller matrices, usually of 2x2, 3x3, or 5x5 shape. Each kernel slides over the input image data and computes the dot product between the kernel weights and the corresponding input image patch. The output of this layer is referred to as feature maps. Suppose we use a total of 12 filters for this layer; we'll get an output volume of dimension 32 x 32 x 12.
Activation Layer: By adding an activation function to the output of the preceding layer, activation layers add nonlinearity to the network. It applies an element-wise activation function to the output of the convolution layer. Some common activation functions are ReLU: max(0, x), Tanh, Leaky ReLU, etc. The volume remains unchanged, hence the output volume will have dimensions 32 x 32 x 12.
Pooling Layer: This layer is periodically inserted in the covnet, and its main function is to reduce the size of the volume, which makes the computation fast, reduces memory, and also prevents overfitting. Two common types of pooling layers are max pooling and average pooling. If we use a max pool with 2 x 2 filters and stride 2, the resultant volume will be of dimension 16 x 16 x 12.
Image source: cs231n.stanford.edu
Flattening: The resulting feature maps are flattened into a one-dimensional vector after the convolution and pooling layers so they can be passed into a fully connected layer for classification or regression.
Fully Connected Layers: This layer takes the input from the previous layer and computes the final classification or regression task.
Image source: cs231n.stanford.edu
Output Layer: The output from the fully connected layers is then fed into a logistic function for classification tasks, like sigmoid or softmax, which converts the output for each class into its probability score.
Example: Let's consider an image and apply the convolution layer, activation layer, and pooling layer operations to extract the inner features.
Input image:
(Figure: Input image)
Steps:
Import the necessary libraries
Set the parameters
Define the kernel
Load the image and plot it
Reformat the image
Apply the convolution layer operation and plot the output image
Apply the activation layer operation and plot the output image
Apply the pooling layer operation and plot the output image

Python3

# import the necessary libraries
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from itertools import product

# set the params
plt.rc('figure', autolayout=True)
plt.rc('image', cmap='magma')

# define the kernel
kernel = tf.constant([[-1, -1, -1],
                      [-1,  8, -1],
                      [-1, -1, -1],
                      ])

# load the image
image = tf.io.read_file('Ganesh.jpg')
image = tf.io.decode_jpeg(image, channels=1)
image = tf.image.resize(image, size=[300, 300])

# plot the image
img = tf.squeeze(image).numpy()
plt.figure(figsize=(5, 5))
plt.imshow(img, cmap='gray')
plt.axis('off')
plt.title('Original Gray Scale image')
plt.show()

# Reformat
image = tf.image.convert_image_dtype(image, dtype=tf.float32)
image = tf.expand_dims(image, axis=0)
kernel = tf.reshape(kernel, [*kernel.shape, 1, 1])
kernel = tf.cast(kernel, dtype=tf.float32)

# convolution layer
conv_fn = tf.nn.conv2d
image_filter = conv_fn(
    input=image,
    filters=kernel,
    strides=1,  # or (1, 1)
    padding='SAME',
)

plt.figure(figsize=(15, 5))

# Plot the convolved image
plt.subplot(1, 3, 1)
plt.imshow(tf.squeeze(image_filter))
plt.axis('off')
plt.title('Convolution')

# activation layer
relu_fn = tf.nn.relu
# Image detection
image_detect = relu_fn(image_filter)

plt.subplot(1, 3, 2)
# Reformat for plotting
plt.imshow(tf.squeeze(image_detect))
plt.axis('off')
plt.title('Activation')

# Pooling layer
pool = tf.nn.pool
image_condense = pool(input=image_detect,
                      window_shape=(2, 2),
                      pooling_type='MAX',
                      strides=(2, 2),
                      padding='SAME',
                      )

plt.subplot(1, 3, 3)
plt.imshow(tf.squeeze(image_condense))
plt.axis('off')
plt.title('Pooling')
plt.show()
Output:
(Figures: the original grayscale image, followed by the convolution, activation, and pooling outputs)
Advantages of Convolutional Neural Networks (CNNs):
Good at detecting patterns and features in images, videos, and audio signals.
Robust to translation, rotation, and scaling invariance.
End-to-end training, no need for manual feature extraction.
Can handle large amounts of data and achieve high accuracy.
Disadvantages of Convolutional Neural Networks (CNNs):
Computationally expensive to train and require a lot of memory.
Can be prone to overfitting if not enough data or proper regularization is used.
Require large amounts of labeled data.
Interpretability is limited; it's hard to understand what the network has learned.
Frequently Asked Questions (FAQs)
1: What is a Convolutional Neural Network (CNN)?
A Convolutional Neural Network (CNN) is a type of deep learning neural network that is well-suited for image and video analysis. CNNs use a series of convolution and pooling layers to extract features from images and videos, and then use these features to classify or detect objects or scenes.
2: How do CNNs work?
CNNs work by applying a series of convolution and pooling
layers to an input image or video. Convolution layers extract features from the
input by sliding a small filter, or kernel, over the image or video and computing
the dot product between the filter and the input. Pooling layers then downsample
the output of the convolution layers to reduce the dimensionality of the data and
make it more computationally efficient.
3: What are some common activation functions used in CNNs?
Some common activation functions used in CNNs include:
Rectified Linear Unit (ReLU): ReLU is a non-saturating activation function that is computationally efficient and easy to train.
Leaky Rectified Linear Unit (Leaky ReLU): Leaky ReLU is a variant of ReLU that allows a small amount of negative gradient to flow through the network. This can help to prevent the network from dying during training.
Parametric Rectified Linear Unit (PReLU): PReLU is a generalization of Leaky ReLU that allows the slope of the negative gradient to be learned.
4: What is the purpose of using multiple convolution layers in a CNN?
Using
multiple convolution layers in a CNN allows the network to learn increasingly
complex features from the input image or video. The first convolution layers learn
simple features, such as edges and corners. The deeper convolution layers learn
more complex features, such as shapes and objects.
5: What are some common regularization techniques used in CNNs?
Regularization techniques are used to prevent CNNs from overfitting the training data. Some common regularization techniques used in CNNs include:
Dropout: Dropout randomly drops out neurons from the network during training. This forces the network to learn more robust features that are not dependent on any single neuron.
L1 regularization: L1 regularization penalizes the absolute value of the weights in the network. This can help to reduce the number of weights and make the network more efficient.
L2 regularization: L2 regularization penalizes the square of the weights in the network. This can also help to reduce the number of weights and make the network more efficient.
6: What is the difference between a convolution layer and a pooling layer?
A convolution layer extracts features from an
input image or video, while a pooling layer downsamples the output of the
convolution layers. Convolution layers use a series of filters to extract features,
while pooling layers use a variety of techniques to downsample the data, such as
max pooling and average pooling.
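Tying the regularization techniques from FAQ 5 to code, the following sketch shows how dropout and L2 weight regularization might be attached to a small Keras model; the layer sizes and regularization strength are arbitrary example values, not settings from this article:

from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', padding='same',
                  kernel_regularizer=regularizers.l2(1e-4),  # L2 penalty on the conv weights
                  input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dropout(0.5),                                     # randomly drops 50% of activations during training
    layers.Dense(10, activation='softmax'),
])
model.summary()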
Digital Image Processing Basics


Digital Image Processing means processing digital images by means of a digital computer. We can also say that it is the use of computer algorithms in order to get an enhanced image or to extract some useful information from it.
Digital image processing is the use of algorithms and mathematical models to process and analyze digital images. The goal of digital image processing is to enhance the quality of images, extract meaningful information from images, and automate image-based tasks.
The basic steps involved in digital image processing are:
Image acquisition: This involves capturing an image using a digital camera or scanner, or importing an existing image into a computer.
Image enhancement: This involves improving the visual quality of an image, such as increasing contrast, reducing noise, and removing artifacts.
Image restoration: This involves removing degradation from an image, such as blurring, noise, and distortion.
Image segmentation: This involves dividing an image into regions or segments, each of which corresponds to a specific object or feature in the image.
Image representation and description: This involves representing an image in a way that can be analyzed and manipulated by a computer, and describing the features of an image in a compact and meaningful way.
Image analysis: This involves using algorithms and mathematical models to extract information from an image, such as recognizing objects, detecting patterns, and quantifying features.
Image synthesis and compression: This involves generating new images or compressing existing images to reduce storage and transmission requirements.
Digital image processing is widely used in a variety of applications, including medical imaging, remote sensing, computer vision, and multimedia. Image processing mainly includes the following steps:
1. Importing the image via image acquisition tools;
2. Analysing and manipulating the image;
3. Output, in which the result can be an altered image or a report based on analysing that image.
What is an image?
An image is defined as a two-dimensional function,F(x,y), where x and y are spatial
coordinates, and the amplitude of F at any pair of coordinates (x,y) is called the
intensity of that image at that point. When x,y, and amplitude values of F are
finite, we call it a digital image. In other words, an image can be defined by a
two-dimensional array specifically arranged in rows and columns. A digital image is composed of a finite number of elements, each of which has a particular value at a particular location. These elements are referred to as picture elements, image elements, or pixels. "Pixel" is the term most widely used to denote the elements of a digital image.
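As a small illustration of this definition, a digital image can be stored in a NumPy array whose entries are the pixel intensities; the tiny 3 x 3 "image" below is made up purely for demonstration:

import numpy as np

# a 3 x 3 grayscale "image": rows and columns of intensity values in [0, 255]
f = np.array([[  0, 128, 255],
              [ 64, 192,  32],
              [255,   0, 127]], dtype=np.uint8)

print(f.shape)   # (3, 3): spatial dimensions of the image
print(f[1, 2])   # intensity of the pixel at row 1, column 2 -> 32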
Types of an Image
BINARY IMAGE - The binary image, as its name suggests, contains only two pixel values, i.e. 0 and 1, where 0 refers to black and 1 refers to white. This image is also known as monochrome.
BLACK AND WHITE IMAGE - The image which consists of only black and white color is called a black and white image.
8-BIT COLOR FORMAT - It is the most famous image format. It has 256 different shades of colors in it and is commonly known as a grayscale image. In this format, 0 stands for black, 255 stands for white, and 127 stands for gray.
16-BIT COLOR FORMAT - It is a color image format. It has 65,536 different colors in it and is also known as High Color Format. In this format the distribution of color is not the same as in a grayscale image.
A 16-bit format is actually divided into three further channels, Red, Green, and Blue. That is the famous RGB format.
Image as a Matrix
As we know, images are represented in rows and columns, so we have the following form in which images are represented:

    f(x, y) = [ f(0, 0)      f(0, 1)      ...  f(0, N-1)
                f(1, 0)      f(1, 1)      ...  f(1, N-1)
                ...          ...          ...  ...
                f(M-1, 0)    f(M-1, 1)    ...  f(M-1, N-1) ]

The right side of this equation is a digital image by definition. Every element of this matrix is called an image element, picture element, or pixel.
DIGITAL IMAGE REPRESENTATION IN MATLAB:

In MATLAB the start index is 1 instead of 0. Therefore, f(1,1) in MATLAB corresponds to f(0,0) above; hence the two representations of the image are identical, except for the shift in origin. In MATLAB, matrices are stored in variables such as X, x, input_image, and so on. Variable names must start with a letter, as in other programming languages.
PHASES OF IMAGE PROCESSING:
1. ACQUISITION - It could be as simple as being given an image which is already in digital form. The main work involves: a) scaling, b) color conversion (RGB to grayscale or vice versa).
2. IMAGE ENHANCEMENT - It is amongst the simplest and most appealing areas of image processing. It is also used to extract some hidden details from an image and is subjective.
3. IMAGE RESTORATION - It also deals with improving the appearance of an image, but it is objective (restoration is based on a mathematical or probabilistic model of image degradation).
4. COLOR IMAGE PROCESSING - It deals with pseudocolor and full color image processing; color models are applicable to digital image processing.
5. WAVELETS AND MULTI-RESOLUTION PROCESSING - It is the foundation for representing images in various degrees of resolution.
6. IMAGE COMPRESSION - It involves developing functions to perform this operation. It mainly deals with image size or resolution.
7. MORPHOLOGICAL PROCESSING - It deals with tools for extracting image components that are useful in the representation and description of shape.
8. SEGMENTATION PROCEDURE - It includes partitioning an image into its constituent parts or objects. Autonomous segmentation is the most difficult task in image processing.
9. REPRESENTATION & DESCRIPTION - It follows the output of the segmentation stage; choosing a representation is only part of the solution for transforming raw data into processed data.
10. OBJECT DETECTION AND RECOGNITION - It is a process that assigns a label to an object based on its descriptor.
OVERLAPPING FIELDS WITH IMAGE PROCESSING

According to block 1, if the input is an image and we get an image as output, then it is termed Digital Image Processing. According to block 2, if the input is an image and we get some kind of information or description as output, then it is termed Computer Vision. According to block 3, if the input is some description or code and we get an image as output, then it is termed Computer Graphics. According to block 4, if the input is a description, some keywords, or some code and we get a description or some keywords as output, then it is termed Artificial Intelligence.
Advantages of Digital Image Processing:
Improved image quality: Digital image processing algorithms can improve the visual quality of images, making them clearer, sharper, and more informative.
Automated image-based tasks: Digital image processing can automate many image-based tasks, such as object recognition, pattern detection, and measurement.
Increased efficiency: Digital image processing algorithms can process images much faster than humans, making it possible to analyze large amounts of data in a short amount of time.
Increased accuracy: Digital image processing algorithms can provide more accurate results than humans, especially for tasks that require precise measurements or quantitative analysis.
Disadvantages of Digital Image Processing:
High computational cost: Some digital image processing algorithms are computationally intensive and require significant computational resources.
Limited interpretability: Some digital image processing algorithms may produce results that are difficult for humans to interpret, especially for complex or sophisticated algorithms.
Dependence on quality of input: The quality of the output of digital image processing algorithms is highly dependent on the quality of the input images. Poor quality input images can result in poor quality output.
Limitations of algorithms: Digital image processing algorithms have limitations, such as the difficulty of recognizing objects in cluttered or poorly lit scenes, or the inability to recognize objects with significant deformations or occlusions.
Dependence on good training data: The performance of many digital image processing algorithms is dependent on the quality of the training data used to develop the algorithms. Poor quality training data can result in poor performance of the algorithms.
REFERENCES
"Digital Image Processing" by Rafael C. Gonzalez and Richard E. Woods.
"Computer Vision: Algorithms and Applications" by Richard Szeliski.
"Digital Image Processing Using MATLAB" by Rafael C. Gonzalez, Richard E. Woods, and Steven L. Eddins.

Difference between Image Processing and Computer Vision

Image Processing and Computer Vision are both very exciting fields of Computer Science.
Computer Vision: In Computer Vision, computers or machines are made to gain a high-level understanding from input digital images or videos, with the purpose of automating tasks that the human visual system can do. It uses many techniques, and Image Processing is just one of them.
Image Processing: Image Processing is the field of enhancing images by tuning many parameters and features of the images. So Image Processing is a subset of Computer Vision. Here, transformations are applied to an input image and the resultant output image is returned. Some of these transformations are sharpening, smoothing, stretching, etc.
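As a concrete illustration, a typical image-processing transformation such as smoothing takes only a couple of lines with OpenCV; the filename and kernel size below are placeholder choices, and the snippet assumes the opencv-python package is installed:

import cv2

# read an input image (placeholder filename) and apply Gaussian smoothing
img = cv2.imread('input.jpg')
smoothed = cv2.GaussianBlur(img, (5, 5), 0)  # 5x5 kernel, sigma derived automatically
cv2.imwrite('smoothed.jpg', smoothed)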
Now, as both fields deal with visual data, i.e., images and videos, there seems to be a lot of confusion about the difference between these fields of computer science. In this article we will discuss the difference between them.
Difference between Image Processing and Computer Vision:
Image Processing: Mainly focused on processing the raw input images to enhance them or prepare them for other tasks.
Computer Vision: Focused on extracting information from the input images or videos to have a proper understanding of them, to predict the visual input like the human brain.
Image Processing: Uses methods like anisotropic diffusion, Hidden Markov models, independent component analysis, different filtering techniques, etc.
Computer Vision: Uses image processing as one of its methods, along with other machine learning techniques, CNNs, etc.
Image Processing: Is a subset of Computer Vision.
Computer Vision: Is a superset of Image Processing.
Image Processing: Examples of applications are rescaling an image (digital zoom), correcting illumination, changing tones, etc.
Computer Vision: Examples of applications are object detection, face detection, handwriting recognition, etc.

CNN | Introduction to Pooling Layer

The pooling operation involves sliding a two-dimensional filter over each channel
of feature map and summarising the features lying within the region covered by the
filter. For a feature map having dimensions nh x nw x nc, the dimensions of the output obtained after a pooling layer with an f x f filter and stride s are
floor((nh - f) / s + 1) x floor((nw - f) / s + 1) x nc
where,
-> nh - height of feature map
-> nw - width of feature map
-> nc - number of channels in the feature map
-> f - size of filter
-> s - stride length
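For example, for the 4 x 4 single-channel input used in the Keras code samples below, a 2 x 2 pooling filter (f = 2) with stride s = 2 gives floor((4 - 2) / 2 + 1) = 2, so the output is a 2 x 2 x 1 feature map; similarly, a 32 x 32 x 12 volume pooled with f = 2 and s = 2 becomes 16 x 16 x 12.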
A common CNN model architecture is to have a number of convolution and pooling
layers stacked one after the other.
Why Use Pooling Layers?
Pooling layers are used to reduce the dimensions of the feature maps. Thus, they reduce the number of parameters to learn and the amount of computation performed in the network.
The pooling layer summarises the features present in a region of the feature map generated by a convolution layer. So, further operations are performed on summarised features instead of precisely positioned features generated by the convolution layer. This makes the model more robust to variations in the position of the features in the input image.
Types of Pooling Layers
Max Pooling
Max pooling is a pooling operation that selects the maximum element from the region of the feature map covered by the filter. Thus, the output after a max-pooling layer would be a feature map containing the most prominent features of the previous feature map.
This can be achieved using the MaxPooling2D layer in Keras as follows.
Code #1: Performing Max Pooling using Keras

Python3

import numpy as np
from keras.models import Sequential
from keras.layers import MaxPooling2D

# define input image
image = np.array([[2, 2, 7, 3],
                  [9, 4, 6, 1],
                  [8, 5, 2, 4],
                  [3, 1, 2, 6]])
image = image.reshape(1, 4, 4, 1)

# define model containing just a single max pooling layer
model = Sequential(
    [MaxPooling2D(pool_size=2, strides=2)])

# generate pooled output
output = model.predict(image)

# print output image
output = np.squeeze(output)
print(output)

Output:
[[9. 7.]
 [8. 6.]]
Average Pooling
Average pooling computes the average of the elements present in the region of the feature map covered by the filter. Thus, while max pooling gives the most prominent feature in a particular patch of the feature map, average pooling gives the average of features present in a patch.
Code #2 : Performing Average Pooling using keras

Python3

import numpy as np
from keras.models import Sequential
from keras.layers import AveragePooling2D

# define input image
image = np.array([[2, 2, 7, 3],
                  [9, 4, 6, 1],
                  [8, 5, 2, 4],
                  [3, 1, 2, 6]])
image = image.reshape(1, 4, 4, 1)

# define model containing just a single average pooling layer
model = Sequential(
    [AveragePooling2D(pool_size=2, strides=2)])

# generate pooled output
output = model.predict(image)

# print output image
output = np.squeeze(output)
print(output)

Output:
[[4.25 4.25]
 [4.25 3.5 ]]
Global Pooling
Global pooling reduces each channel in the feature map to a single value. Thus, an nh x nw x nc feature map is reduced to a 1 x 1 x nc feature map. This is equivalent to using a filter of dimensions nh x nw, i.e. the dimensions of the feature map. Further, it can be either global max pooling or global average pooling.
Code #3: Performing Global Pooling using Keras

Python3
import numpy as np
from keras.models import Sequential
from keras.layers import GlobalMaxPooling2D
from keras.layers import GlobalAveragePooling2D

# define input image
image = np.array([[2, 2, 7, 3],
                  [9, 4, 6, 1],
                  [8, 5, 2, 4],
                  [3, 1, 2, 6]])
image = image.reshape(1, 4, 4, 1)

# define gm_model containing just a single global-max pooling layer
gm_model = Sequential(
    [GlobalMaxPooling2D()])

# define ga_model containing just a single global-average pooling layer
ga_model = Sequential(
    [GlobalAveragePooling2D()])

# generate pooled output
gm_output = gm_model.predict(image)
ga_output = ga_model.predict(image)

# print output image
gm_output = np.squeeze(gm_output)
ga_output = np.squeeze(ga_output)
print("gm_output: ", gm_output)
print("ga_output: ", ga_output)

Output:
gm_output: 9.0
ga_output: 4.0625
In convolutional neural networks (CNNs), the pooling layer is a common type of
layer that is typically added after convolutional layers. The pooling layer is used
to reduce the spatial dimensions (i.e., the width and height) of the feature maps,
while preserving the depth (i.e., the number of channels).
The pooling layer works by dividing the input feature map into a set of non-
overlapping regions, called pooling regions. Each pooling region is then
transformed into a single output value, which represents the presence of a
particular feature in that region. The most common types of pooling operations are
max pooling and average pooling.
In max pooling, the output value for each pooling region is simply the maximum value of the input values within that region. This has the effect of preserving the most salient features in each pooling region, while discarding less relevant information. Max pooling is often used in CNNs for object recognition tasks, as it helps to identify the most distinctive features of an object, such as its edges and corners.
In average pooling, the output value for each pooling region is the average of the input values within that region. This has the effect of preserving more information than max pooling, but may also dilute the most salient features. Average pooling is often used in CNNs for tasks such as image segmentation and object detection, where a more fine-grained representation of the input is required.
Pooling layers are typically used in conjunction with convolutional layers in a
CNN, with each pooling layer reducing the spatial dimensions of the feature maps,
while the convolutional layers extract increasingly complex features from the
input. The resulting feature maps are then passed to a fully connected layer, which
performs the final classification or regression task.
Advantages of Pooling Layers:
Dimensionality reduction: The main advantage of pooling layers is that they help in reducing the spatial dimensions of the feature maps. This reduces the computational cost and also helps in avoiding overfitting by reducing the number of parameters in the model.
Translation invariance: Pooling layers are also useful in achieving translation invariance in the feature maps. This means that the position of an object in the image does not affect the classification result, as the same features are detected regardless of the position of the object.
Feature selection: Pooling layers can also help in selecting the most important features from the input, as max pooling selects the most salient features and average pooling preserves more information.
Disadvantages of Pooling Layers:
Information loss: One of the main disadvantages of pooling layers is that they discard some information from the input feature maps, which can be important for the final classification or regression task.
Over-smoothing: Pooling layers can also cause over-smoothing of the feature maps, which can result in the loss of some fine-grained details that are important for the final classification or regression task.
Hyperparameter tuning: Pooling layers also introduce hyperparameters such as the size of the pooling regions and the stride, which need to be tuned in order to achieve optimal performance. This can be time-consuming and requires some expertise in model building.

CIFAR-10 Image Classification in TensorFlow

Prerequisites:
Image Classification
Convolution Neural Networks, including basic pooling and convolution layers with normalization in neural networks, and dropout
Data Augmentation
Neural Networks
Numpy arrays
In this article, we are going to discuss how to classify images using TensorFlow.
Image Classification is a method to classify the images into their respective
category classes. CIFAR-10 Dataset as it suggests has 10 different categories of
images in it. There is a total of 60000 images of 10 different classes naming
Airplane, Automobile, Bird, Cat, Deer, Dog, Frog, Horse, Ship, Truck. All the
images are of size 32×32. There are in total 50000 train images and 10000 test
images.
To build an image classifier we make use of TensorFlow's Keras API. In order to build a model, it is recommended to have GPU support, or you may use Google Colab notebooks as well.
Stepwise Implementation
The first step towards writing any code is to import all the required libraries and modules. This includes importing tensorflow and other modules like numpy. If the module is not present then you can download it using pip install tensorflow on the command prompt (for Windows), or if you are using a Jupyter notebook then simply type !pip install tensorflow in the cell and run it in order to download the module. Other modules can be imported similarly.

Python3
import tensorflow as tf

# Display the version
print(tf.__version__)

# other imports
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.layers import Input, Conv2D, Dense, Flatten, Dropout
from tensorflow.keras.layers import GlobalMaxPooling2D, MaxPooling2D
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.models import Model

Output:
2.4.1
The output of the above code should display the version of tensorflow you are using
eg 2.4.1 or any other.
Now we have the required module support so let’s load in our data. The dataset of
CIFAR-10 is available on tensorflow keras API, and we can download it on our local
machine using tensorflow.keras.datasets.cifar10 and then distribute it to train and
test set using load_data() function.

Python3
# Load in the data
cifar10 = tf.keras.datasets.cifar10

# Distribute it to train and test set
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)

Output:
The output of the above code will display the shape of all four partitions and will
look something like this

Here we can see that we have 50000 training images and 10000 test images as specified above, and all the images are of size 32 by 32 and have 3 color channels, i.e. the images are color images. It is also visible that there is only a single label assigned to each image.
Until now, we have our data with us. But still, we cannot send it directly to our neural network. We need to process the data in order to send it to the network. The first thing in the process is to reduce the pixel values. Currently, all the image pixels are in the range 0-255, and we need to reduce those values to a value ranging between 0 and 1. This enables our model to easily track trends and train efficiently. We can do this simply by dividing all pixel values by 255.0. Another thing we want to do is to flatten (in simple words, rearrange them in the form of a row) the label values using the flatten() function.

Python3
# Reduce pixel values
x_train, x_test = x_train / 255.0, x_test / 255.0

# flatten the label values
y_train, y_test = y_train.flatten(), y_test.flatten()

Now is a good time to see few images of our dataset. We can visualize it in a
subplot grid form. Since the image size is just 32×32 so don’t expect much from the
image. It would be a blurred one. We can do the visualization using the subplot()
function from matplotlib and looping over the first 25 images from our training
dataset portion.

Python3

# visualize data by plotting images
fig, ax = plt.subplots(5, 5)
k = 0

for i in range(5):
    for j in range(5):
        ax[i][j].imshow(x_train[k], aspect='auto')
        k += 1

plt.show()
Output:

Though the images are not clear there are enough pixels for us to specify which
object is there in those images.
After completing all the steps, now is the time to build our model. We are going to use a Convolution Neural Network (CNN) to train our model. It uses convolution layers (the Conv2D layer) together with pooling and normalization methods. Finally, we'll pass it into a dense layer and the final dense layer, which is our output layer. We are using the 'relu' activation function. The output layer uses a 'softmax' function.

Python3

# number of classes
K = len(set(y_train))

# calculate total number of classes for output layer
print("number of classes:", K)

# Build the model using the functional API
# input layer
i = Input(shape=x_train[0].shape)
x = Conv2D(32, (3, 3), activation='relu', padding='same')(i)
x = BatchNormalization()(x)
x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = MaxPooling2D((2, 2))(x)

x = Conv2D(64, (3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = Conv2D(64, (3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = MaxPooling2D((2, 2))(x)

x = Conv2D(128, (3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = Conv2D(128, (3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = MaxPooling2D((2, 2))(x)

x = Flatten()(x)
x = Dropout(0.2)(x)

# Hidden layer
x = Dense(1024, activation='relu')(x)
x = Dropout(0.2)(x)

# last hidden layer i.e. output layer
x = Dense(K, activation='softmax')(x)

model = Model(i, x)

# model description
model.summary()
Output:

Our model is now ready; it's time to compile it. We are using the model.compile() function to compile our model. For the parameters, we are using:
the adam optimizer
sparse_categorical_crossentropy as the loss function
metrics=['accuracy']

Python3

# Compile
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

Now let’s fit our model using model.fit() passing all our data to it. We are going
to train our model till 50 epochs, it gives us a fair result though you can tweak
it if you want.

Python3

# Fit
r = model.fit(
    x_train, y_train, validation_data=(x_test, y_test), epochs=50)

Output:
The model will start training, and it will look something like this

After this, our model is trained. Though it will work fine, to make our model much more accurate we can add data augmentation to our data and then train it again. Calling model.fit() again on augmented data will continue training where it left off. We are going to fit our data with a batch size of 32, shift the width and height ranges by 0.1, and flip the images horizontally. Then we call model.fit again for 50 epochs.

Python3
# Fit with data augmentation
# Note: if you run this AFTER calling
# the previous model.fit()
# it will CONTINUE training where it left off
batch_size = 32
data_generator = tf.keras.preprocessing.image.ImageDataGenerator(
    width_shift_range=0.1, height_shift_range=0.1, horizontal_flip=True)

train_generator = data_generator.flow(x_train, y_train, batch_size)
steps_per_epoch = x_train.shape[0] // batch_size

r = model.fit(train_generator,
              validation_data=(x_test, y_test),
              steps_per_epoch=steps_per_epoch,
              epochs=50)

Output:
The model will start training for 50 epochs. Though it is running on GPU it will
take at least 10 to 15 minutes.

Now we have trained our model, before making any predictions from it let’s
visualize the accuracy per iteration for better analysis. Though there are other
methods that include confusion matrix for better analysis of the model.

Python3
# Plot accuracy per iteration
plt.plot(r.history['accuracy'], label='acc', color='red')
plt.plot(r.history['val_accuracy'], label='val_acc', color='green')
plt.legend()

Output:

Let’s make a prediction over an image from our model using model.predict()
function. Before sending the image to our model we need to again reduce the pixel
values between 0 and 1 and change its shape to (1,32,32,3) as our model expects the
input to be in this form only. To make things easy let us take an image from the
dataset itself. It is already in the reduced-pixel format, but we still have to reshape it to (1, 32, 32, 3) using the reshape() function. Since we are using data from the dataset, we can compare the predicted output and the original output.

Python3

# label mapping
labels = '''airplane automobile bird cat deer
dog frog horse ship truck'''.split()

# select the image from our test dataset
image_number = 0

# display the image
plt.imshow(x_test[image_number])

# load the image in an array
n = np.array(x_test[image_number])

# reshape it
p = n.reshape(1, 32, 32, 3)

# pass in the network for prediction and
# save the predicted label
predicted_label = labels[model.predict(p).argmax()]

# load the original label
original_label = labels[y_test[image_number]]

# display the result
print("Original label is {} and predicted label is {}".format(
    original_label, predicted_label))

Output:

Now we have the output as Original label is cat and the predicted label is also
cat.
Let’s check it for some label which was misclassified by our model, e.g. for image
number 5722 we receive something like this:

Finally, let’s save our model using model.save() function as an h5 file. If you are
using Google colab you can download your model from the files section.

Python3

# save the model
model.save('geeksforgeeks.h5')
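To reuse the saved model later, for example in a new session, it can be loaded back with Keras' load_model; this short sketch assumes the geeksforgeeks.h5 file from the step above is present:

from tensorflow.keras.models import load_model

# load the model saved above and confirm its architecture
loaded_model = load_model('geeksforgeeks.h5')
loaded_model.summary()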


Hence, in this way, one can classify images using Tensorflow.


Introduction:
Introduced in the 1980s by Yann LeCun, Convolution Neural Networks (also called CNNs or ConvNets) have come a long way. From being employed for simple digit classification tasks, CNN-based architectures are now used extensively across many Deep Learning and Computer Vision-related tasks like object detection, image segmentation, and gaze tracking, among others. Using the PyTorch framework, this
article will implement a CNN-based image classifier on the popular CIFAR-10
dataset.
Before going ahead with the code and installation, the reader is expected to
understand how CNNs work theoretically and with various related operations like
convolution, pooling, etc. The article also assumes a basic familiarity with the
PyTorch workflow and its various utilities, like Dataloaders, Datasets, Tensor
transforms, and CUDA operations. For a quick refresher of these concepts, the
reader is encouraged to go through the following articles:
Introduction to Convolutional Neural NetworkTraining Neural Networks with
Validation using PyTorchHow to set up and Run CUDA Operations in Pytorch?
Installation
For the implementation of the CNN and downloading the CIFAR-10 dataset, we’ll be
requiring the torch and torchvision modules. Apart from that, we’ll be using numpy
and matplotlib for data analysis and plotting. The required libraries can be
installed using the pip package manager through the following command:

pip install torch torchvision torchaudio numpy matplotlib


Stepwise implementation
Step 1: Downloading data and printing some sample images from the training set.
Before starting our journey to implementing a CNN, we first need to download the dataset onto our local machine, which we'll be training our model on. We'll be using the torchvision utility for this purpose and will download the CIFAR-10 dataset into training and testing sets in the directories "./CIFAR10/train" and "./CIFAR10/test", respectively. We also apply a normalized transform, where the normalization is done over the three channels for all the images.
Now, we have a training dataset and a test dataset with 50000 and 10000 images, respectively, of dimension 32x32x3. After that, we convert these datasets into data loaders with a batch size of 128 for better generalization and a faster training process.
Finally, we plot some sample images from the 1st training batch to get an idea of the images we're dealing with, using the make_grid utility from torchvision.
Code:

Python3
import torch
import torchvision
import matplotlib.pyplot as plt
import numpy as np

# The below two lines are optional and are just there to avoid any SSL
# related errors while downloading the CIFAR-10 dataset
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

# Defining plotting settings
plt.rcParams['figure.figsize'] = 14, 6

# Initializing normalizing transform for the dataset
normalize_transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(mean=(0.5, 0.5, 0.5),
                                     std=(0.5, 0.5, 0.5))])

# Downloading the CIFAR10 dataset into train and test sets
train_dataset = torchvision.datasets.CIFAR10(
    root="./CIFAR10/train", train=True,
    transform=normalize_transform, download=True)

test_dataset = torchvision.datasets.CIFAR10(
    root="./CIFAR10/test", train=False,
    transform=normalize_transform, download=True)

# Generating data loaders from the corresponding datasets
batch_size = 128
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size)

# Plotting 25 images from the 1st batch
dataiter = iter(train_loader)
images, labels = next(dataiter)
plt.imshow(np.transpose(torchvision.utils.make_grid(
    images[:25], normalize=True, padding=1, nrow=5).numpy(), (1, 2, 0)))
plt.axis('off')

Output:
Figure 1: Some sample images from the training dataset
Step-2: Plotting class distribution of the dataset
It’s generally a good idea to plot out the class distribution of the training set.
This helps in checking whether the provided dataset is balanced or not. To do this,
we iterate over the entire training set in batches and collect the respective
classes of each instance. Finally, we calculate the counts of the unique classes
and plot them.
Code:
Python3

# Iterating over the training dataset and storing the target class for each sample
classes = []
for batch_idx, data in enumerate(train_loader, 0):
    x, y = data
    classes.extend(y.tolist())

# Calculating the unique classes and the respective counts and plotting them
unique, counts = np.unique(classes, return_counts=True)
names = list(test_dataset.class_to_idx.keys())
plt.bar(names, counts)
plt.xlabel("Target Classes")
plt.ylabel("Number of training instances")

Output:
Figure 2: Class distribution of the training set
As shown in Figure 2, each of the ten classes has almost the same number of
training samples. Thus we don’t need to take additional steps to rebalance the
dataset.
Step-3: Implementing the CNN architecture
On the architecture side, we’ll be using a simple model that employs three
convolution layers with depths 32, 64, and 64, respectively, followed by two fully
connected layers for performing classification.
Each convolutional layer involves a convolution with a 3×3 filter and is followed by a ReLU activation, which introduces nonlinearity into the system, and a max-pooling operation with a 2×2 filter to reduce the dimensionality of the feature map. After the convolutional blocks, we flatten the multidimensional output into a low-dimensional structure to start the classification block. After the first linear layer, the last output layer (also a linear layer) has ten neurons, one for each of the ten unique classes in our dataset.
The architecture is as follows:
Figure 3: Architecture of the CNN
For building our model, we’ll make a CNN class inherited from the torch.nn.Module
class for taking advantage of the Pytorch utilities. Apart from that, we’ll be
using the torch.nn.Sequential container to combine our layers one after the other.
The Conv2D(), ReLU(), and MaxPool2D() layers perform the convolution, activation, and pooling operations. We use padding of 1 to give the kernel sufficient learning space, as padding gives the image more coverage area, especially for the pixels in the outer frame. After the convolutional blocks, the Linear() fully connected layers perform the classification.
Code:

Python3

class CNN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Sequential(
            # Input = 3 x 32 x 32, Output = 32 x 32 x 32
            torch.nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1),
            torch.nn.ReLU(),
            # Input = 32 x 32 x 32, Output = 32 x 16 x 16
            torch.nn.MaxPool2d(kernel_size=2),

            # Input = 32 x 16 x 16, Output = 64 x 16 x 16
            torch.nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1),
            torch.nn.ReLU(),
            # Input = 64 x 16 x 16, Output = 64 x 8 x 8
            torch.nn.MaxPool2d(kernel_size=2),

            # Input = 64 x 8 x 8, Output = 64 x 8 x 8
            torch.nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, padding=1),
            torch.nn.ReLU(),
            # Input = 64 x 8 x 8, Output = 64 x 4 x 4
            torch.nn.MaxPool2d(kernel_size=2),

            torch.nn.Flatten(),
            torch.nn.Linear(64 * 4 * 4, 512),
            torch.nn.ReLU(),
            torch.nn.Linear(512, 10)
        )

    def forward(self, x):
        return self.model(x)
Step-4: Defining the training parameters and beginning the training process
We begin the training process by selecting the device to train our model onto,
i.e., CPU or a GPU. Then, we define our model hyperparameters which are as follows:
We train our model over 50 epochs, and since we have a multiclass problem, we use the Cross-Entropy loss as our objective function. To optimize the objective function, we use the popular Adam optimizer with a learning rate of 0.001 and a weight_decay of 0.01 to prevent overfitting through regularization.
Finally, we begin our training loop, which involves calculating outputs for each
batch and the loss by comparing the predicted labels with the true labels. In the
end, we’ve plotted the training loss for each respective epoch to ensure the
training process went as per the plan.
Code:

Python3

# Selecting the appropriate training device
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = CNN().to(device)

# Defining the model hyper parameters
num_epochs = 50
learning_rate = 0.001
weight_decay = 0.01
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate,
                             weight_decay=weight_decay)

# Training process begins
train_loss_list = []
for epoch in range(num_epochs):
    print(f'Epoch {epoch+1}/{num_epochs}:', end=' ')
    train_loss = 0

    # Iterating over the training dataset in batches
    model.train()
    for i, (images, labels) in enumerate(train_loader):
        # Extracting images and target labels for the batch being iterated
        images = images.to(device)
        labels = labels.to(device)

        # Calculating the model output and the cross entropy loss
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Updating weights according to calculated loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    # Printing loss for each epoch
    train_loss_list.append(train_loss / len(train_loader))
    print(f"Training loss = {train_loss_list[-1]}")

# Plotting loss for all epochs
plt.plot(range(1, num_epochs + 1), train_loss_list)
plt.xlabel("Number of epochs")
plt.ylabel("Training loss")
Output:
Figure 4: Plot of training loss vs. number of epochs
From Figure 4, we can see that the loss decreases as the epochs increase, indicating a successful training procedure.
Step-5: Calculating the model’s accuracy on the test set
Now that our model’s trained, we need to check its performance on the test set. To
do that, we iterate over the entire test set in batches and calculate the accuracy
score by comparing the true and predicted labels for each batch.
Code:

Python3

test_acc = 0
model.eval()

with torch.no_grad():
    # Iterating over the test dataset in batches
    for i, (images, labels) in enumerate(test_loader):
        images = images.to(device)
        y_true = labels.to(device)

        # Calculating outputs for the batch being iterated
        outputs = model(images)

        # Calculated prediction labels from models
        _, y_pred = torch.max(outputs.data, 1)

        # Comparing predicted and true labels
        test_acc += (y_pred == y_true).sum().item()

    print(f"Test set accuracy = {100 * test_acc / len(test_dataset)} %")
Output:
Figure 5: Accuracy on the test set
Step 6: Generating predictions for sample images in the test set
As shown in Figure 5, our model has achieved an accuracy of nearly 72%. To validate
its performance, we can generate some predictions for some sample images. To do
that, we take the first five images of the last batch of the test set and plot them
using the make_grid utility from torchvision. We then collect their true labels and
predictions from the model and show them in the plot’s title.
Code:

Python3

# Generating predictions for 'num_images' amount of images from the last batch of test set
num_images = 5
y_true_name = [names[y_true[idx]] for idx in range(num_images)]
y_pred_name = [names[y_pred[idx]] for idx in range(num_images)]

# Generating the title for the plot
title = f"Actual labels: {y_true_name}, Predicted labels: {y_pred_name}"

# Finally plotting the images with their actual and predicted labels in the title
plt.imshow(np.transpose(torchvision.utils.make_grid(
    images[:num_images].cpu(), normalize=True, padding=1).numpy(), (1, 2, 0)))
plt.title(title)
plt.axis("off")
Output:
Figure 6: Actual vs. Predicted labels for 5 sample images from the test set. Note
that the labels are in the same order as the respective images, from left to right.
As can be seen from Figure 6, the model is producing correct predictions for all
the images except the 2nd one as it misclassifies the dog as a cat!
Conclusion:
This article covered the PyTorch implementation of a simple CNN on the popular
CIFAR-10 dataset. The reader is encouraged to play around with the network
architecture and model hyperparameters to increase the model accuracy even more!
References
https://cs231n.github.io/convolutional-networks/
https://pytorch.org/docs/stable/index.html
https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html

Convolutional Neural Network (CNN) is a neural network architecture in Deep Learning used to recognize patterns in structured arrays. Over the years, CNN architectures have evolved, and many variants of the fundamental CNN architecture have been developed, leading to amazing advances in the growing deep learning field.
Let's discuss how CNN architectures have developed and grown over time.
1. LeNet-5
LeNet-5 is the most widely known of the early CNN architectures. It was introduced in 1998 and is widely used for handwritten digit recognition. LeNet-5 has 2 convolutional and 3 fully connected layers, and about 60,000 parameters.
LeNet-5 can in principle be scaled to process higher-resolution images, but doing so requires larger and more numerous convolutional layers, so the technique is constrained by the available computing resources.
Example Model of LeNet-5

Python3
import torch
from torchsummary import summary
import torch.nn as nn
import torch.nn.functional as F


class LeNet5(nn.Module):
    def __init__(self):
        # Call the parent class's init method
        super(LeNet5, self).__init__()

        # First Convolutional Layer
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1)

        # Max Pooling Layer
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

        # Second Convolutional Layer
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5, stride=1)

        # First Fully Connected Layer
        self.fc1 = nn.Linear(in_features=16 * 5 * 5, out_features=120)

        # Second Fully Connected Layer
        self.fc2 = nn.Linear(in_features=120, out_features=84)

        # Output Layer
        self.fc3 = nn.Linear(in_features=84, out_features=10)

    def forward(self, x):
        # First convolutional layer, activation function, then pooling
        x = self.pool(F.relu(self.conv1(x)))

        # Second convolutional layer, activation function, then pooling
        x = self.pool(F.relu(self.conv2(x)))

        # Reshape the output to be passed through the fully connected layers
        x = x.view(-1, 16 * 5 * 5)

        # First fully connected layer and activation function
        x = F.relu(self.fc1(x))

        # Second fully connected layer and activation function
        x = F.relu(self.fc2(x))

        # Output layer
        x = self.fc3(x)

        # Return the final output
        return x


lenet5 = LeNet5()
print(lenet5)

Output:
LeNet5(
(conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
(pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1,
ceil_mode=False)
(conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=400, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
Model Summary:
Print the summary of the lenet5 to check the params

Python3
# add cuda to the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
lenet5.to(device)

# Print the summary of the model
summary(lenet5, (1, 32, 32))

Output:
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 6, 28, 28] 156
MaxPool2d-2 [-1, 6, 14, 14] 0
Conv2d-3 [-1, 16, 10, 10] 2,416
MaxPool2d-4 [-1, 16, 5, 5] 0
Linear-5 [-1, 120] 48,120
Linear-6 [-1, 84] 10,164
Linear-7 [-1, 10] 850
================================================================
Total params: 61,706
Trainable params: 61,706
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.06
Params size (MB): 0.24
Estimated Total Size (MB): 0.30
----------------------------------------------------------------
2. AlexNet
The AlexNet CNN architecture won the 2012 ImageNet ILSVRC challenge by a large margin: it achieved a top-5 error rate of 17%, while the second-best entry achieved 26%. It was introduced by Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton. AlexNet is quite similar to LeNet-5, only much bigger and deeper, and it was the first to stack convolutional layers directly on top of each other instead of placing a pooling layer on top of every convolutional layer.
AlexNet has about 60 million parameters across 8 layers in total: 5 convolutional and 3 fully connected. AlexNet was the first to use Rectified Linear Units (ReLUs) as activation functions, and it was the first CNN architecture to use GPUs to improve performance.
Example Model of AlexNet

Python3

import torch
from torchsummary import summary
import torch.nn as nn
import torch.nn.functional as F


class AlexNet(nn.Module):
    def __init__(self, num_classes=1000):
        # Call the parent class's init method to initialize the base class
        super(AlexNet, self).__init__()

        # First Convolutional Layer with 11x11 filters, stride of 4, and 2 padding
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4, padding=2)

        # Max Pooling Layer with a kernel size of 3 and stride of 2
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2)

        # Second Convolutional Layer with 5x5 filters and 2 padding
        self.conv2 = nn.Conv2d(in_channels=96, out_channels=256, kernel_size=5, padding=2)

        # Third Convolutional Layer with 3x3 filters and 1 padding
        self.conv3 = nn.Conv2d(in_channels=256, out_channels=384, kernel_size=3, padding=1)

        # Fourth Convolutional Layer with 3x3 filters and 1 padding
        self.conv4 = nn.Conv2d(in_channels=384, out_channels=384, kernel_size=3, padding=1)

        # Fifth Convolutional Layer with 3x3 filters and 1 padding
        self.conv5 = nn.Conv2d(in_channels=384, out_channels=256, kernel_size=3, padding=1)

        # First Fully Connected Layer with 4096 output features
        self.fc1 = nn.Linear(in_features=256 * 6 * 6, out_features=4096)

        # Second Fully Connected Layer with 4096 output features
        self.fc2 = nn.Linear(in_features=4096, out_features=4096)

        # Output Layer with `num_classes` output features
        self.fc3 = nn.Linear(in_features=4096, out_features=num_classes)

    def forward(self, x):
        # First convolutional layer, ReLU activation, then pooling
        x = self.pool(F.relu(self.conv1(x)))

        # Second convolutional layer, ReLU activation, then pooling
        x = self.pool(F.relu(self.conv2(x)))

        # Third convolutional layer and ReLU activation
        x = F.relu(self.conv3(x))

        # Fourth convolutional layer and ReLU activation
        x = F.relu(self.conv4(x))

        # Fifth convolutional layer, ReLU activation, then pooling
        x = self.pool(F.relu(self.conv5(x)))

        # Reshape the output to be passed through the fully connected layers
        x = x.view(-1, 256 * 6 * 6)

        # First fully connected layer, ReLU activation and dropout
        x = F.relu(self.fc1(x))
        x = F.dropout(x, 0.5)

        # Second fully connected layer and ReLU activation
        x = F.relu(self.fc2(x))

        # Output layer
        x = self.fc3(x)

        # Return the final output
        return x


alexnet = AlexNet()
print(alexnet)

Output:
AlexNet(
(conv1): Conv2d(3, 96, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
(pool): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1,
ceil_mode=False)
(conv2): Conv2d(96, 256, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
(conv3): Conv2d(256, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(conv4): Conv2d(384, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(conv5): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(fc1): Linear(in_features=9216, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=4096, bias=True)
(fc3): Linear(in_features=4096, out_features=1000, bias=True)
)
Model Summary:
Print the summary of the alexnet to check the params

Python3
# add cuda to the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
alexnet.to(device)

# Print the summary of the model
summary(alexnet, (3, 224, 224))

Output:
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 96, 55, 55] 34,944
MaxPool2d-2 [-1, 96, 27, 27] 0
Conv2d-3 [-1, 256, 27, 27] 614,656
MaxPool2d-4 [-1, 256, 13, 13] 0
Conv2d-5 [-1, 384, 13, 13] 885,120
Conv2d-6 [-1, 384, 13, 13] 1,327,488
Conv2d-7 [-1, 256, 13, 13] 884,992
MaxPool2d-8 [-1, 256, 6, 6] 0
Linear-9 [-1, 4096] 37,752,832
Linear-10 [-1, 4096] 16,781,312
Linear-11 [-1, 1000] 4,097,000
================================================================
Total params: 62,378,344
Trainable params: 62,378,344
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.57
Forward/backward pass size (MB): 5.96
Params size (MB): 237.95
Estimated Total Size (MB): 244.49
----------------------------------------------------------------
Output as in the Google Colab link:
https://colab.research.google.com/drive/1kicnALE1T2c28hHPYeyFwNaOpkl_nFpQ?usp=sharing
3. GoogleNet (Inception v1)
The GoogleNet architecture was created by Christian Szegedy from Google Research and achieved a breakthrough result by lowering the top-5 error rate to below 7% in the ILSVRC 2014 challenge. This success was largely attributed to its deeper architecture, enabled by its inception modules, which allow a much more efficient use of parameters than preceding architectures. GoogleNet has fewer parameters than AlexNet, with a ratio of roughly 10:1 (about 6 million instead of 60 million).
The architecture of the inception module is shown in the figure (GoogleNet Inception Module). The notation "3 x 3 + 2(S)" means that the layer uses a 3 x 3 kernel, a stride of 2, and SAME padding. The input signal is fed to four different branches, each with a ReLU activation function and a stride of 1. These convolutional layers have varying kernel sizes (1 x 1, 3 x 3, and 5 x 5) to capture patterns at different scales. Additionally, each layer uses SAME padding, so all outputs have the same height and width as their inputs. This allows the feature maps from all four top convolutional layers to be concatenated along the depth dimension in the final depth concat layer.
The overall GoogleNet architecture is 22 layers deep.
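To make the branch structure concrete, here is a minimal PyTorch sketch of an inception-style module; the channel counts are illustrative and do not reproduce the exact GoogleNet configuration:

Python3

import torch
import torch.nn as nn


class InceptionModule(nn.Module):
    # A simplified inception block: four parallel branches whose outputs
    # share the same spatial size and are concatenated along the channel axis.
    def __init__(self, in_channels):
        super().__init__()
        # 1x1 branch
        self.branch1 = nn.Sequential(nn.Conv2d(in_channels, 64, kernel_size=1), nn.ReLU())
        # 1x1 reduction followed by a 3x3 convolution
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, 96, kernel_size=1), nn.ReLU(),
            nn.Conv2d(96, 128, kernel_size=3, padding=1), nn.ReLU())
        # 1x1 reduction followed by a 5x5 convolution
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU())
        # 3x3 max pooling followed by a 1x1 convolution
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, 32, kernel_size=1), nn.ReLU())

    def forward(self, x):
        # Every branch keeps the same height/width, so we can concatenate on depth
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)


# quick shape check: 192 input channels -> 64 + 128 + 32 + 32 = 256 output channels
out = InceptionModule(192)(torch.randn(1, 192, 28, 28))
print(out.shape)  # torch.Size([1, 256, 28, 28])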
4. ResNet (Residual Network)
Residual Network (ResNet), the winner of the ILSVRC 2015 challenge, was developed by Kaiming He et al. and delivered an impressive top-5 error rate of 3.6% with an extremely deep CNN composed of 152 layers. An essential factor enabling the training of such a deep network is the use of skip connections (also known as shortcut connections): the signal that enters a layer is also added to the output of a layer located higher up in the stack. Let's explore why this is beneficial.
When training a neural network, the goal is to make it model a target function h(x). By adding the input x to the output of the network (a skip connection), the network is instead made to model f(x) = h(x) - x, a technique known as residual learning:
F(x) = H(x) - x, which gives H(x) := F(x) + x.
Skip (Shortcut) connection
When a regular neural network is initialized, its weights are close to zero, so it outputs values close to zero. With skip connections added, the resulting network instead outputs a copy of its inputs, effectively modeling the identity function. If the target function is fairly close to the identity function (which is often the case), this will speed up training considerably. Moreover, if many skip connections are added, the network can start making progress even if several layers have not yet started learning.
The deep residual network can be viewed as a stack of residual units, each of which is a small neural network with a skip connection.
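A minimal PyTorch sketch of one residual unit is shown below; this is a simplified version with two 3x3 convolutions and an identity shortcut, while real ResNets also handle changes in channel count and stride with a projection shortcut:

Python3

import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualUnit(nn.Module):
    # A basic residual unit: out = ReLU(F(x) + x), so the block learns the residual F(x) = H(x) - x
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        # the skip connection: add the block input back onto the block output
        return F.relu(residual + x)


# quick shape check
out = ResidualUnit(64)(torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])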
5. DenseNet
The DenseNet model introduced the concept of a densely connected convolutional network, where the output of each layer is connected to the input of every subsequent layer. This design principle was developed to address the issue of accuracy decline caused by vanishing and exploding gradients in very deep neural networks. In simpler terms, because of the long distance between the input and output layers, information can be lost before it reaches its destination.
All convolutions in a dense block are ReLU-activated and use batch normalization. Channel-wise concatenation is only possible if the height and width dimensions of the data remain unchanged, so convolutions in a dense block all use stride 1. Pooling layers are inserted between dense blocks for further dimensionality reduction.
Intuitively, one might think that concatenating all previously seen outputs would make the number of channels and parameters grow explosively. However, DenseNet is surprisingly economical in terms of learnable parameters. This is because each concatenated block, which may have a relatively large number of channels, is first fed through a 1×1 convolution that reduces it to a small number of channels, and 1×1 convolutions are cheap in terms of parameters. Then a 3×3 convolution with the same number of channels is applied.
The resulting channels from each step of the DenseNet are concatenated onto the collection of all previously generated outputs. Each step, which uses a pair of 1×1 and 3×3 convolutions, adds K channels to the data. Consequently, the number of channels increases linearly with the number of convolutional steps in the dense block. The growth rate K remains constant throughout the network, and DenseNet has demonstrated good performance with K values between 12 and 40.
Dense blocks and pooling layers are combined to form a full DenseNet network. DenseNet-121 has 121 layers, but the structure is adjustable and can readily be extended to more than 200 layers.
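The dense-block idea can be sketched in PyTorch as follows: a simplified layer with the 1×1 bottleneck followed by a 3×3 convolution that adds K new channels, which are then concatenated onto the running feature map. Layer counts and channel sizes here are illustrative, not the published DenseNet configuration:

Python3

import torch
import torch.nn as nn


class DenseLayer(nn.Module):
    # BN -> ReLU -> 1x1 conv (bottleneck) -> BN -> ReLU -> 3x3 conv producing K new channels
    def __init__(self, in_channels, growth_rate=12):
        super().__init__()
        self.layer = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(),
            nn.Conv2d(in_channels, 4 * growth_rate, kernel_size=1, bias=False),
            nn.BatchNorm2d(4 * growth_rate), nn.ReLU(),
            nn.Conv2d(4 * growth_rate, growth_rate, kernel_size=3, padding=1, bias=False))

    def forward(self, x):
        # concatenate the K new channels onto everything produced so far
        return torch.cat([x, self.layer(x)], dim=1)


class DenseBlock(nn.Module):
    def __init__(self, in_channels, num_layers=4, growth_rate=12):
        super().__init__()
        layers = []
        for i in range(num_layers):
            layers.append(DenseLayer(in_channels + i * growth_rate, growth_rate))
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)


# channels grow linearly: 24 input channels + 4 layers * 12 = 72 output channels
out = DenseBlock(24)(torch.randn(1, 24, 32, 32))
print(out.shape)  # torch.Size([1, 72, 32, 32])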

Object Recognition:
Object recognition is the technique of identifying the object present in images and
videos. It is one of the most important applications of machine learning and deep
learning. The goal of this field is to teach machines to understand (recognize) the
content of an image just like humans do.
Object Recognition
Object Recognition Using Machine Learning
HOG (Histogram of Oriented Gradients) feature extractor and SVM (Support Vector Machine) model: Before the era of deep learning, this was a state-of-the-art method for object detection. It takes histogram descriptors of both positive (images that contain objects) and negative (images that do not contain objects) samples and trains an SVM model on them; a minimal sketch of this pipeline is shown after the deep-learning discussion below.
Bag of features model: Just like bag of words considers a document as an orderless collection of words, this approach represents an image as an orderless collection of image features. Examples of this are SIFT, MSER, etc.
Viola-Jones algorithm: This algorithm is widely used for face detection in images or in real time. It performs Haar-like feature extraction from the image, which generates a large number of features. These features are then passed into a boosting classifier, producing a cascade of boosted classifiers that performs the detection. An image needs to pass each of the classifiers to generate a positive (face found) result. The advantage of Viola-Jones is that it has a detection time of 2 fps, which can be used in a real-time face recognition system.
Object Recognition Using Deep Learning
Convolutional Neural Network (CNN) is one of the most popular ways of doing object recognition. It is widely used, and most state-of-the-art neural networks use this method for various object recognition related tasks such as image classification. A CNN takes an image as input and outputs the probabilities of the different classes. If an object is present in the image, its output probability is high; otherwise, the output probabilities of the remaining classes are either negligible or low. The advantage of deep learning is that we don't need to do feature extraction from the data, as compared to machine learning.
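As a rough illustration of the classical HOG + SVM pipeline mentioned above, here is a minimal sketch using scikit-image and scikit-learn; the random arrays stand in for real positive and negative image crops, which you would load from a dataset in practice:

Python3

import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

# Placeholder data: 20 "positive" and 20 "negative" grayscale crops of size 64x64.
# In a real detector these would be crops that do / do not contain the object.
rng = np.random.default_rng(0)
positives = rng.random((20, 64, 64))
negatives = rng.random((20, 64, 64))

# Extract a HOG descriptor for every crop
def hog_features(images):
    return np.array([hog(img, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for img in images])

X = np.vstack([hog_features(positives), hog_features(negatives)])
y = np.array([1] * len(positives) + [0] * len(negatives))

# Train a linear SVM on the HOG descriptors
clf = LinearSVC()
clf.fit(X, y)

# Classify a new crop by extracting its HOG descriptor and scoring it
test_crop = rng.random((64, 64))
print(clf.predict(hog_features([test_crop])))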
Challenges of Object Recognition:
The output generated by the last (fully connected) layer of the CNN model is a single class label, so a simple CNN approach will not work if more than one class label is present in the image.
If we want to localize the presence of an object with a bounding box, we need to try a different approach that outputs not only the class label but also the bounding box locations.
Overview of tasks related to Object Recognition
Image Classification :
In image classification, an algorithm takes an image as input and outputs the classification label of that image with some metric (probability, loss, accuracy, etc.). For example, an image of a cat can be classified with the class label "cat", or an image of a dog can be classified with the class label "dog" with some probability.
Image Classification
Object Localization: This algorithm locates the presence of an object in the image
and represents it with a bounding box. It takes an image as input and outputs the
location of the bounding box in the form of (position, height, and width).
Object Detection:
Object Detection algorithms act as a combination of image classification and
object localization. It takes an image as input and produces one or more bounding
boxes with the class label attached to each bounding box. These algorithms are
capable enough to deal with multi-class classification and localization as well as
to deal with the objects with multiple occurrences.
Challenges of Object Detection:
In object detection, the bounding boxes are always rectangular, so they do not help in determining the shape of an object if the object contains curved parts.
Object detection cannot accurately estimate some measurements, such as the area or perimeter of an object, from the image.
Difference between Classification, Localization and Detection (Source: Link)
Image Segmentation:
Image segmentation is a further extension of object detection, in which we mark the presence of an object through pixel-wise masks generated for each object in the image. This technique is more granular than bounding box generation because it helps us determine the shape of each object present in the image: instead of drawing bounding boxes, segmentation figures out which pixels make up each object. This granularity helps in various fields such as medical image processing, satellite imaging, etc. Many image segmentation approaches have been proposed recently. One of the most popular is Mask R-CNN, proposed by K. He et al. in 2017.
Object Detection vs Segmentation (Source: Link)
There are primarily two types of segmentation:
Instance Segmentation: Multiple instances of the same class are separate segments, i.e. objects of the same class are treated as distinct. Therefore, all objects are coloured with different colours even if they belong to the same class.
Semantic Segmentation: All objects of the same class form a single segment; therefore, all objects of the same class are coloured with the same colour.
Semantic vs Instance Segmentation (Source: Link)
Applications:
The above-discussed object recognition techniques can be utilized in many fields
such as:
Driverless cars: Object recognition is used for detecting road signs, other vehicles, etc.
Medical image processing: Object recognition and image processing techniques can help detect disease more accurately. Image segmentation helps to detect the shape of the defect present in the body. For example, Google AI for breast cancer detection detects more accurately than doctors.
Surveillance and security: such as face recognition, object tracking, activity recognition, etc.
References:
Ross Girshick's RCNN paper
Mathworks: Object Recognition vs Object Detection
CS231n Stanford Slides


In terms of speed, YOLO is one of the best models in object recognition, able to recognize objects and process frames at rates of up to 150 FPS for small networks. However, in terms of accuracy (mAP), YOLO was not the state-of-the-art model, but it has a fairly good mean Average Precision (mAP) of 63% when trained on PASCAL VOC 2007 and PASCAL VOC 2012. However, Fast R-CNN, which was the state of the art at that time, has an mAP of 71%.

YOLO v2 and YOLO 9000 were proposed by J. Redmon and A. Farhadi in 2016 in the paper titled YOLO9000: Better, Faster, Stronger. At 67 FPS, YOLOv2 gives an mAP of 76.8%, and at 40 FPS it gives an mAP of 78.6% on the VOC 2007 dataset, bettering models like Faster R-CNN and SSD. YOLO 9000 uses the YOLO v2 architecture but is able to detect more than 9000 classes. YOLO 9000, however, has an mAP of 19.7%.

Let’s look at the architecture and working of YOLO v2:

Architecture Changes vs YOLOv1:


The previous YOLO architecture has a lot of problems when compared to the state-of-
the-art method like Fast R-CNN. It made a lot of localization errors and has a low
recall. So, the goal of this paper is not only to improve these shortcomings of
YOLO but also to maintain the speed of the architecture. There are some incremental
improvements that are made in basic YOLO. Let’s discuss these changes below:
Darknet-19 simplified

Batch Normalization:
By adding batch normalization to the architecture, we can increase the convergence of the model, which leads to faster training. It also eliminates the need for other forms of regularization, such as Dropout, without the model overfitting. It is also observed that adding batch normalization alone can increase mAP by 2% as compared to basic YOLO.

High Resolution Classifier:


The previous version of YOLO uses 224 *224 as input size during training but at the
time of detection, it takes an image up to size 448*448. This causes the model to
adjust to a new resolution that in turn causes a decrease in mAP.
The YOLOv2 version trains on higher resolution (448 * 448) for 10 epochs on
ImageNet data. This gives network time to adjust the filters for higher resolution.
By training on 448*448 images size the mAP increased by 4%.

Use Anchor Boxes For Bounding Boxes:


The original YOLO predicts bounding box coordinates using fully connected layers on top of the convolutional feature extractor, rather than predicting offsets relative to prior (anchor) boxes as Fast R-CNN and Faster R-CNN do. In this version, we remove the fully connected layer and instead add anchor boxes to predict the bounding boxes. We make the following changes in the architecture:
Bounding boxes with more than one anchor (that will provide more accurate localisation)

We remove the fully connected layer responsible for predicting bounding boxes and
replace it with anchor boxes prediction.
YOLOv1 with layers removed (in filled red color)
We change the input size from 448 * 448 to 416 * 416. This creates a feature map of size 13 * 13 when we downsample by 32x. The idea behind this is that large objects tend to occupy the center of the image, and an odd-sized feature map has a single center cell that can be responsible for predicting them.
Remove one pooling layer to get 13 * 13 spatial network instead of 7*7
With these changes, the mAP of the model is slightly decreased (from 69.5% to
69.2%) however recall increases from 81% to 88%.
Output of each object proposal

Dimensionality clusters:
We need to identify the number of anchors (priors) so that they provide the best results; let's call this number K. Our task is to identify the top-K bounding box shapes that best cover the boxes in the dataset. We use the K-means clustering algorithm for this purpose, but instead of minimizing the Euclidean distance, we use a distance based on IOU, so that boxes with high overlap with a cluster centroid are considered close to it.
YOLO v2 uses K = 5 as a good trade-off for the algorithm. We can conclude from the graph below that as we increase the value of K beyond 5, accuracy does not change significantly.
IOU based clustering on K = 5 gives mAP of 61%.
Dimension clusters(number of dimension for each anchors) vs mAP
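To make the idea concrete, here is a minimal sketch of k-means over box widths and heights using 1 - IOU as the distance; the box data below is random placeholder data, whereas in practice you would use the width and height of every ground-truth box in the training set:

Python3

import numpy as np

def iou_wh(box, clusters):
    # IOU between one (w, h) box and each cluster, assuming the boxes share the same corner
    inter = np.minimum(box[0], clusters[:, 0]) * np.minimum(box[1], clusters[:, 1])
    union = box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    clusters = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # assign every box to the cluster with the smallest 1 - IOU distance
        dists = np.array([1 - iou_wh(b, clusters) for b in boxes])
        nearest = dists.argmin(axis=1)
        # move each cluster to the median width/height of its assigned boxes
        for c in range(k):
            if np.any(nearest == c):
                clusters[c] = np.median(boxes[nearest == c], axis=0)
    return clusters

# placeholder ground-truth box sizes (width, height), normalized to the image size
boxes = np.random.default_rng(1).random((1000, 2))
print(kmeans_anchors(boxes, k=5))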

Direct Location Problem:


The previous version of YOLO does not have a constraint on location prediction, which makes it unstable during early iterations. YOLOv2 predicts 5 parameters (tx, ty, tw, th, and to, the objectness score) for each box and applies the sigmoid function to constrain the location and objectness values to fall between 0 and 1.

This direct location constraint increases the mAP by 5%.
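As described in the YOLO9000 paper, the constrained predictions are decoded relative to the grid cell offset (cx, cy) and the anchor (prior) size (pw, ph); a small sketch of that decoding:

Python3

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_box(tx, ty, tw, th, to, cx, cy, pw, ph):
    # the center is constrained to stay inside the (cx, cy) grid cell
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    # width/height scale the anchor (prior) dimensions
    bw = pw * np.exp(tw)
    bh = ph * np.exp(th)
    # objectness score, constrained to (0, 1)
    objectness = sigmoid(to)
    return bx, by, bw, bh, objectness

# example: raw predictions for the cell at grid offset (6, 6) with a 2x3 anchor
print(decode_box(0.2, -0.1, 0.5, 0.3, 1.2, cx=6, cy=6, pw=2.0, ph=3.0))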


Fine Grained Features :
The 13 * 13 feature map that YOLOv2 generates is sufficient for detecting large objects. However, to detect finer objects, the architecture adds a passthrough layer: the output of an earlier layer of size 26 * 26 * 512 is reshaped to 13 * 13 * 2048 and concatenated with the original 13 * 13 * 1024 output layer, making the combined output of size 13 * 13 * 3072.
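The reshaping used by that passthrough layer is a space-to-depth rearrangement; a small sketch of the idea (shapes only, not the actual YOLO implementation):

Python3

import torch

# a fake 26 x 26 x 512 feature map (batch, channels, height, width)
x = torch.randn(1, 512, 26, 26)

# space-to-depth with block size 2: every 2x2 spatial patch is stacked into the channel axis,
# turning 26 x 26 x 512 into 13 x 13 x 2048
x = x.view(1, 512, 13, 2, 13, 2).permute(0, 1, 3, 5, 2, 4).reshape(1, 512 * 4, 13, 13)
print(x.shape)  # torch.Size([1, 2048, 13, 13])

# it can then be concatenated with the 13 x 13 x 1024 output along the channel axis
y = torch.randn(1, 1024, 13, 13)
print(torch.cat([x, y], dim=1).shape)  # torch.Size([1, 3072, 13, 13])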

Multi-Scale Training :
YOLO v2 has been trained on different input sizes, from 320 * 320 to 608 * 608, in steps of 32. The architecture randomly chooses a new image dimension every 10 batches. This establishes a trade-off between accuracy and image size. For example, YOLOv2 with an image size of 288 * 288 runs at 90 FPS and gives an mAP comparable to Fast R-CNN.

Architecture:
YOLO v2 was trained on top of different architectures, such as VGG-16 and GoogleNet. The paper also proposed an architecture called Darknet-19. The reason for choosing the Darknet-19 architecture is its lower processing requirement compared to other architectures: 5.58 billion FLOPs (as compared to 30.69 billion FLOPs for VGG-16 on a 224 * 224 image, and 8.52 billion FLOPs for the customized GoogleNet). The structure of Darknet-19 is given below:

For detection purposes, we remove the last convolution layer of this architecture and instead add three 3 * 3 convolution layers with 1024 filters each, followed by a 1 * 1 convolution with the number of outputs we need for detection.

For VOC we predict 5 boxes, each with 5 coordinates (tx, ty, tw, th, and to, the objectness score) and 20 classes per box, so the total number of filters is 5 * (5 + 20) = 125.
Darknet-19 architecture
Training:
The YOLOv2 is trained for two purposes :

For the classification task, the model is trained on the ImageNet-1000 classification task for 160 epochs with a starting learning rate of 0.1, weight decay of 0.0005 and momentum of 0.9, using the Darknet-19 architecture. Some standard data augmentation techniques are applied during this training.

For detection, the modifications discussed above are made to the Darknet-19 architecture. The model is then trained for 160 epochs with a starting learning rate of 10^-3, weight decay of 0.0005 and momentum of 0.9. The same strategy is used for training the model on both COCO and VOC.

Results and Conclusion:


Results of Different object detection frameworks
YOLOv2 gives state-of-the-art detection accuracy on PASCAL VOC and COCO. It can run on varying input sizes, offering a trade-off between speed and accuracy. At 67 FPS, YOLOv2 can give an mAP of 76.8, while at 40 FPS the detector gives an accuracy of 78.6 mAP, better than state-of-the-art models such as Faster R-CNN and SSD while running significantly faster than those models.
Speed vs Accuracy Curve for different object detection
This model has also been the basis of the YOLO9000 model which is able to detect
more than 9000 classes in real-time.

Reference:

YOLO9000: Better, Faster, Stronger




Learn the basics and advanced concepts of natural language processing (NLP) with
our complete NLP tutorial and get ready to explore the vast and exciting field of
NLP, where technology meets human language.
NLP tutorial is designed for both beginners and professionals. Whether you’re a
data scientist, a developer, or someone curious about the power of language, our
tutorial will provide you with the knowledge and skills you need to take your
understanding of NLP to the next level.
What is NLP?
NLP stands for Natural Language Processing. It is the branch of Artificial Intelligence that gives machines the ability to understand and process human languages. Human languages can be in text or audio format.
History of NLP
Natural Language Processing started in 1950, when Alan Mathison Turing published his article "Computing Machinery and Intelligence". It is based on Artificial Intelligence and talks about the automatic interpretation and generation of natural language. As the technology evolved, different approaches have emerged to deal with NLP tasks.
Heuristics-Based NLP: This is the initial approach to NLP. It is based on defined rules, which come from domain knowledge and expertise. Example: regular expressions (a tiny regex illustration is shown right after this list).
Statistical and Machine-Learning-Based NLP: This approach is based on statistical rules and machine learning algorithms. Algorithms are applied to the data, learn from the data, and are then applied to various tasks. Examples: Naive Bayes, support vector machine (SVM), hidden Markov model (HMM), etc.
Neural-Network-Based NLP: This is the latest approach, which comes with the evolution of neural-network-based learning, known as deep learning. It provides good accuracy, but it is a very data-hungry and time-consuming approach. It requires high computational power to train the model. Furthermore, it is based on neural network architectures. Examples: Recurrent neural networks (RNNs), Long short-term memory networks (LSTMs), Convolutional neural networks (CNNs), Transformers, etc.
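As a tiny illustration of the rule-based (heuristics) approach mentioned in the first item above, a regular expression can pull email addresses out of raw text; the pattern and the sample sentence here are made up for illustration:

Python3

import re

text = "Contact us at support@example.com or sales@example.org for help."

# a simple (not fully RFC-compliant) email pattern
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
print(emails)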
Advantages of NLP
NLP helps us to analyse data from both structured and unstructured sources.
NLP is very fast and time efficient.
NLP offers end-to-end exact answers to a question, so it saves the time that would otherwise be spent consuming unnecessary and unwanted information.
NLP allows users to ask questions about any subject and get a direct response within milliseconds.
Disadvantages of NLP
Training an NLP model requires a lot of data and computation.
Many issues arise for NLP when dealing with informal expressions, idioms, and cultural jargon.
NLP results are sometimes not accurate, and accuracy is directly proportional to the accuracy of the data.
NLP is designed for a single, narrow job, since it cannot adapt to new domains and has limited functionality.
Components of NLP
There are two components of Natural Language Processing:
Natural Language Understanding
Natural Language Generation
Applications of NLP
The applications of Natural Language Processing are as follows:
Text and speech processing, like voice assistants – Alexa, Siri, etc.
Text classification, like Grammarly, Microsoft Word, and Google Docs
Information extraction, like search engines – DuckDuckGo, Google
Chatbots and question answering, like website bots
Language translation, like Google Translate
Text summarization
Phases of Natural Language Processing

NLP Libraries: NLTK, Spacy, Gensim, fastText, Stanford toolkit (GloVe), Apache OpenNLP
Classical Approaches
Classical Approaches to Natural Language Processing:
Text Preprocessing: Regular Expressions, How to write Regular Expressions?, Properties of Regular Expressions, Text Preprocessing using RE, Email Extraction using RE
Tokenization: White Space Tokenization, Dictionary Based Tokenization, Rule-Based Tokenization, Regular Expression Tokenizer, Penn Treebank Tokenization, Spacy Tokenizer, Subword Tokenization, Tokenization with Textblob, Tokenize text using NLTK in Python, How tokenizing text, sentences, and words works
Lemmatization
Stemming – Types: Porter Stemmer, Lovins Stemmer, Dawson Stemmer, Krovetz Stemmer, Xerox Stemmer
Stopwords removal: Removing stop words with NLTK in Python
Parts of Speech (POS): Part of Speech – Default Tagging, Part of Speech tagging – word corpus, Part of Speech Tagging with Stop words using NLTK in Python, Part of Speech Tagging using TextBlob
Text Normalization
Text Vectorization or Encoding: vector space model (VSM), Words and vectors, Cosine similarity
Basic Text Vectorization approaches: One-Hot Encoding, Byte-Pair Encoding (BPE), Bag of Words (BoW), N-Grams, Term Frequency Inverse Document Frequency (TF-IDF), N-Gram Language Modelling with NLTK
Distributed Representations: Word Embeddings, Pre-Trained Word Embeddings, Word Embedding using Word2Vec, Finding the Word Analogy from given words using Word2Vec embeddings, GloVe, fastText, Train Own Word Embeddings, Continuous Bag of Words (CBOW), SkipGram, Doc2Vec
Universal Text Representations: Embeddings from Language Models (ELMo), Bidirectional Encoder Representations from Transformers (BERT)
Embeddings Visualization: t-SNE (t-distributed Stochastic Neighbouring Embedding), TextEvaluator, Embeddings semantic properties
Semantic Analysis: What is Sentiment Analysis?, Understanding Semantic Analysis
Sentiment classification: Naive Bayes Classifiers, Logistic Regression, Sentiment Classification Using BERT, Twitter Sentiment Analysis using TextBlob
Parts of Speech tagging and Named Entity Recognition: Parts of Speech tagging with NLTK, Parts of Speech tagging with Spacy, Hidden Markov Model for POS tagging, Markov Chains, Hidden Markov Model, Viterbi Algorithm, Conditional Random Fields (CRFs), Conditional Random Fields (CRFs) for POS tagging, Named Entity Recognition, Rule Based Approach, Named Entity Recognitions
Neural Networks for NLP: Feedforward networks for NLP, Recurrent Neural Networks, RNN for Text Classification, RNN for Sequence Labeling, Stacked RNNs, Bidirectional RNNs, Long Short-Term Memory (LSTM), LSTM with Tensorflow, Bidirectional LSTM, Gated Recurrent Unit (GRU), Sentiment Analysis with RNN, LSTM, GRU, Emotion Detection using Bidirectional LSTM & GRU, Transformers for NLP
Transfer Learning for NLP: Bidirectional Encoder Representations from Transformers, RoBERTa, SpanBERT, Transfer Learning with Fine-tuning
Information Extraction: Keyphrase Extraction, Named Entity Recognition, Relationship Extraction, Information Retrieval
Text Generation: Text Generation introduction, Text summarization, Extractive Text Summarization using Gensim, Question Answering
Chatbot & Dialogue Systems: Simple Chat Bot using ChatterBot, GUI chat application using Tkinter
Machine Translation: Machine translation introduction, Statistical Machine Translation introduction
Phonetics: Implement Phonetic Search in Python with Soundex Algorithm, Convert English text into Phonetics
Speech Recognition and Text-to-Speech: Convert Text to Speech, Convert Speech to Text and Text to Speech, Speech Recognition using Google Speech API
Empirical and Statistical Approaches: Treebank Annotation, Fundamental Statistical Techniques for NLP, Part-of-Speech Tagging, Rules-based system, Statistical Parsing, Multiword Expressions, Normalized Web Distance and Word Similarity, Word Sense Disambiguation
FAQs on Natural Language Processing
What is the most difficult part of natural language processing?
Ambiguity is the main challenge of natural language processing because in natural
language, words are unique, but they have different meanings depending upon the
context which causes ambiguity on lexical, syntactic, and semantic levels.
What are the 4 pillars of NLP?
The four main pillars of NLP are 1) outcomes, 2) sensory acuity, 3) behavioural flexibility, and 4) rapport.
What language is best for natural language processing?
Python is considered the best programming language for NLP because of its numerous libraries, simple syntax, and ability to integrate easily with other programming languages.
What is the life cycle of NLP?
There are four stages included in the life cycle of NLP – development, validation,
deployment, and monitoring of the models.


Natural Language Toolkit (NLTK) is one of the largest Python libraries for
performing various Natural Language Processing tasks. From rudimentary tasks such
as text pre-processing to tasks like vectorized representation of text – NLTK’s API
has covered everything. In this article, we will accustom ourselves to the basics
of NLTK and perform some crucial NLP tasks: Tokenization, Stemming, Lemmatization,
and POS Tagging.
Table of Content
What is the Natural Language Toolkit (NLTK)?
Tokenization
Stemming and Lemmatization
Stemming
Lemmatization
Part of Speech Tagging
What is the Natural Language Toolkit (NLTK)?
As discussed earlier, NLTK is Python's API library for performing an array of tasks in human language. It can perform a variety of operations on textual data, such as classification, tokenization, stemming, tagging, parsing, semantic reasoning, etc.
Installation:
NLTK can be installed simply using pip or by running the following command.
! pip install nltk
Accessing Additional Resources:
To incorporate the usage of additional resources, such as resources for languages other than English, you can run the following in a Python script. It has to be done only once, when you are running it for the first time on your system.

Python3

import nltk
nltk.download('all')
Now, having installed NLTK successfully in our system, let’s perform some basic
operations on text data using NLTK.
Tokenization
Tokenization refers to breaking down the text into smaller units. It entails splitting paragraphs into sentences and sentences into words. It is one of the initial steps of any NLP pipeline. Let us have a look at the two major kinds of tokenization that NLTK provides:
Word Tokenization
It involves breaking down the text into words.
Example: "I study Machine Learning on GeeksforGeeks." will be word-tokenized as ['I', 'study', 'Machine', 'Learning', 'on', 'GeeksforGeeks', '.'].
Sentence Tokenization
It involves breaking down the text into individual sentences.
Example: "I study Machine Learning on GeeksforGeeks. Currently, I'm studying NLP" will be sentence-tokenized as ['I study Machine Learning on GeeksforGeeks.', 'Currently, I'm studying NLP.'].
In Python, both these tokenizations can be implemented in NLTK as follows:

Python3

# Tokenization using NLTK
from nltk import word_tokenize, sent_tokenize

sent = "GeeksforGeeks is a great learning platform. It is one of the best for Computer Science students."
print(word_tokenize(sent))
print(sent_tokenize(sent))

Output:
['GeeksforGeeks', 'is', 'a', 'great', 'learning', 'platform', '.','It', 'is',
'one', 'of', 'the', 'best', 'for', 'Computer', 'Science', 'students', '.']
['GeeksforGeeks is a great learning platform.', 'It is one of the best for Computer
Science students.']
Stemming and Lemmatization
When working with Natural Language, we are not much interested in the form of words – rather, we are concerned with the meaning that the words intend to convey. Thus, we try to map every word of the language to its root/base form. This process is called canonicalization.
E.g. the words 'play', 'plays', 'played', and 'playing' convey the same action – hence, we can map them all to their base form, i.e. 'play'.
Now, there are two widely used canonicalization techniques: Stemming and Lemmatization.
Stemming
Stemming generates the base word from the inflected word by removing the affixes of the word. It has a set of pre-defined rules that govern the dropping of these affixes. It must be noted that stemmers might not always result in semantically meaningful base words. Stemmers are faster and computationally less expensive than lemmatizers.
In the following code, we will be stemming words using Porter Stemmer – one of the
most widely used stemmers:

Python3

from nltk.stem import PorterStemmer

# create an object of class PorterStemmer
porter = PorterStemmer()
print(porter.stem("play"))
print(porter.stem("playing"))
print(porter.stem("plays"))
print(porter.stem("played"))

Output:
play
play
play
play
We can see that all the variations of the word 'play' have been reduced to the same word – 'play'. In this case, the output is a meaningful word, 'play'. However, this is not always the case. Let us take an example.
(Note that a lemmatizer, in contrast, stores groups of inflected forms; there is no removal of affixes as in the case of a stemmer.)

Python3

from nltk.stem import PorterStemmer

# create an object of class PorterStemmer
porter = PorterStemmer()
print(porter.stem("Communication"))

Output:
commun
The stemmer reduces the word 'communication' to the base word 'commun', which is meaningless in itself.
Lemmatization
Lemmatization involves grouping together the inflected forms of the same word. This way, we can reach the base form of any word, which will be meaningful in nature. The base form here is called the Lemma.
Lemmatizers are slower and computationally more expensive than stemmers.
Example: 'play', 'plays', 'played', and 'playing' have 'play' as the lemma.
In Python, lemmatization can be implemented in NLTK as follows:

Python3
from nltk.stem import WordNetLemmatizer

# create an object of class WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("plays", 'v'))
print(lemmatizer.lemmatize("played", 'v'))
print(lemmatizer.lemmatize("play", 'v'))
print(lemmatizer.lemmatize("playing", 'v'))

Output:
play
play
play
play
Please note that in lemmatizers, we need to pass the Part of Speech of the word along with the word as a function argument.
Also, unlike stemmers, lemmatizers always result in meaningful base words. Let us take the same example as we took in the case of stemmers.

Python3
from nltk.stem import WordNetLemmatizer # create an object of class
WordNetLemmatizerlemmatizer =
WordNetLemmatizer()print(lemmatizer.lemmatize("Communication", 'v'))

Output:
Communication
Part of Speech Tagging
Part of Speech (POS) tagging refers to assigning each word of a sentence to its part of speech. It is significant as it helps to give a better syntactic overview of a sentence.
Example: "GeeksforGeeks is a Computer Science platform."
Let's see how NLTK's POS tagger will tag this sentence. In Python, POS tagging can be implemented in NLTK as follows:

Python3

from nltk import pos_tag
from nltk import word_tokenize

text = "GeeksforGeeks is a Computer Science platform."
tokenized_text = word_tokenize(text)
tags = tokens_tag = pos_tag(tokenized_text)
tags
Output:
[('GeeksforGeeks', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('Computer', 'NNP'), ('Science', 'NNP'), ('platform', 'NN'), ('.', '.')]
Conclusion
In conclusion, the Natural Language Toolkit (NLTK) is a powerful Python library that offers a wide range of tools for Natural Language Processing (NLP). From fundamental tasks like text pre-processing to more advanced operations such as semantic reasoning, NLTK provides a versatile API that caters to the diverse needs of language-related tasks.


Word Embeddings are numeric representations of words in a lower-dimensional space, capturing semantic and syntactic information. They play a vital role in Natural Language Processing (NLP) tasks. This article explores traditional and neural approaches, such as TF-IDF, Word2Vec, and GloVe, offering insights into their advantages and disadvantages, and discusses the importance of pre-trained word embeddings, providing a comprehensive understanding of their applications in various NLP scenarios.
What is Word Embedding in NLP?
Word Embedding is an approach for representing words and documents. A Word Embedding or Word Vector is a numeric vector input that represents a word in a lower-dimensional space. It allows words with similar meanings to have a similar representation.
Word Embeddings are a method of extracting features out of text so that we can input those features into a machine learning model to work with text data. They try to preserve syntactic and semantic information. Methods such as Bag of Words (BOW), CountVectorizer and TF-IDF rely on the word count in a sentence but do not save any syntactic or semantic information. In these algorithms, the size of the vector is the number of elements in the vocabulary, so we get a sparse matrix if most of the elements are zero. Large input vectors mean a huge number of weights, which results in high computation being required for training. Word Embeddings give a solution to these problems.
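To see the sparsity problem that these count-based methods run into, here is a minimal sketch using scikit-learn's CountVectorizer and TfidfVectorizer on a toy corpus; the sentences are made up for illustration:

Python3

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs are pets"]

# Bag of Words: one dimension per vocabulary word, mostly zeros
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)
print(bow.get_feature_names_out())
print(X_bow.toarray())

# TF-IDF: same dimensionality, but counts are re-weighted by document frequency
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.shape)  # (3, vocabulary size) - the vector length grows with the vocabulary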
Need for Word Embedding?
To reduce dimensionality.
To use a word to predict the words around it.
Inter-word semantics must be captured.

How are Word Embeddings used?
They are used as input to machine learning models: take the words, give their numeric representation, and use it in training or inference.
They can also represent or visualize any underlying patterns of usage in the corpus that was used to train them.

Let's take an example to understand how word vectors are generated: take emojis that are most frequently used in certain conditions, transform each emoji into a vector, and treat the conditions as our features. A small illustrative sketch of such hand-crafted feature vectors is given below.
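
As a rough illustration (not from the original article), the sketch below hand-crafts a few "feature" dimensions and shows that words used in similar conditions end up with similar vectors; real embeddings learn these dimensions automatically. The words, features, and numbers here are made up purely for illustration.

Python3

# Hand-crafted "feature" dimensions, purely for illustration.
word_vectors = {
    #          [royalty, gender, food]
    "king":  [0.95, 0.90, 0.05],
    "queen": [0.95, 0.10, 0.05],
    "apple": [0.02, 0.50, 0.98],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

print(cosine_similarity(word_vectors["king"], word_vectors["queen"]))  # higher: similar usage
print(cosine_similarity(word_vectors["king"], word_vectors["apple"]))  # lower: different usage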

In a similar way, we can create word vectors for different words as well on the
basis of given features. The words with similar vectors are most likely to have the
same meaning or are used to convey the same sentiment.
Approaches for Text Representation
1. Traditional Approach
The conventional method involves compiling a list of distinct terms, giving each one a unique integer value, or id, and then replacing each word in the sentence with its distinct id. Every vocabulary word is handled as a feature in this instance. Thus, a large vocabulary will result in an extremely large feature size. Common traditional methods include:
1.1. One-Hot Encoding
One-hot encoding is a simple method for representing words in natural language processing (NLP). In this encoding scheme, each word in the vocabulary is represented as a unique vector, where the dimensionality of the vector is equal to the size of the vocabulary. The vector has all elements set to 0, except for the element corresponding to the index of the word in the vocabulary, which is set to 1.

Python3

def one_hot_encode(text):
    words = text.split()
    vocabulary = set(words)
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    one_hot_encoded = []
    for word in words:
        one_hot_vector = [0] * len(vocabulary)
        one_hot_vector[word_to_index[word]] = 1
        one_hot_encoded.append(one_hot_vector)
    return one_hot_encoded, word_to_index, vocabulary

# sample
example_text = "cat in the hat dog on the mat bird in the tree"

one_hot_encoded, word_to_index, vocabulary = one_hot_encode(example_text)

print("Vocabulary:", vocabulary)
print("Word to Index Mapping:", word_to_index)
print("One-Hot Encoded Matrix:")
for word, encoding in zip(example_text.split(), one_hot_encoded):
    print(f"{word}: {encoding}")

Output:
Vocabulary: {'mat', 'the', 'bird', 'hat', 'on', 'in', 'cat', 'tree', 'dog'}
Word to Index Mapping: {'mat': 0, 'the': 1, 'bird': 2, 'hat': 3, 'on': 4, 'in': 5, 'cat': 6, 'tree': 7, 'dog': 8}
One-Hot Encoded Matrix:
cat: [0, 0, 0, 0, 0, 0, 1, 0, 0]
in: [0, 0, 0, 0, 0, 1, 0, 0, 0]
the: [0, 1, 0, 0, 0, 0, 0, 0, 0]
hat: [0, 0, 0, 1, 0, 0, 0, 0, 0]
dog: [0, 0, 0, 0, 0, 0, 0, 0, 1]
on: [0, 0, 0, 0, 1, 0, 0, 0, 0]
the: [0, 1, 0, 0, 0, 0, 0, 0, 0]
mat: [1, 0, 0, 0, 0, 0, 0, 0, 0]
bird: [0, 0, 1, 0, 0, 0, 0, 0, 0]
in: [0, 0, 0, 0, 0, 1, 0, 0, 0]
the: [0, 1, 0, 0, 0, 0, 0, 0, 0]
tree: [0, 0, 0, 0, 0, 0, 0, 1, 0]

While one-hot encoding is a simple and intuitive method for representing words in NLP, it has several disadvantages, which may limit its effectiveness in certain applications.
One-hot encoding results in high-dimensional vectors, making it computationally expensive and memory-intensive, especially with large vocabularies.
It does not capture semantic relationships between words; each word is treated as an isolated entity without considering its meaning or context.
It is restricted to the vocabulary seen during training, making it unsuitable for handling out-of-vocabulary words.

1.2. Bag of Words (BoW)
Bag-of-Words (BoW) is a text representation technique that represents a document as an unordered set of words and their respective frequencies. It discards the word order and captures the frequency of each word in the document, creating a vector representation.

Python3

from sklearn.feature_extraction.text import CountVectorizer

documents = ["This is the first document.",
             "This document is the second document.",
             "And this is the third one.",
             "Is this the first document?"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()

print("Bag-of-Words Matrix:")
print(X.toarray())
print("Vocabulary (Feature Names):", feature_names)

Output:
Bag-of-Words Matrix:
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
Vocabulary (Feature Names): ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']

While BoW is a simple and interpretable representation, the disadvantages below highlight its limitations in capturing certain aspects of language structure and semantics:
BoW ignores the order of words in the document, leading to a loss of sequential information and context, making it less effective for tasks where word order is crucial, such as natural language understanding.
BoW representations are often sparse, with many elements being zero, resulting in increased memory requirements and computational inefficiency, especially when dealing with large datasets.

1.3. Term Frequency-Inverse Document Frequency (TF-IDF)
Term Frequency-Inverse Document Frequency, commonly known as TF-IDF, is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents (corpus). It is widely used in natural language processing and information retrieval to evaluate the significance of a term within a specific document in a larger corpus.
TF-IDF consists of two components:
Term Frequency (TF): Term Frequency measures how often a term (word) appears in a document. It is calculated as:
TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)
Inverse Document Frequency (IDF): Inverse Document Frequency measures the importance of a term across a collection of documents. It is calculated as:
IDF(t) = log(total number of documents / number of documents containing term t)
The TF-IDF score for a term t in a document d is then given by multiplying the TF and IDF values:
TF-IDF(t, d) = TF(t, d) * IDF(t)

The higher the TF-IDF score for a term in a document, the more important that term
is to that document within the context of the entire corpus. This weighting scheme
helps in identifying and extracting relevant information from a large collection of
documents, and it is commonly used in text mining, information retrieval, and
document clustering.
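
Before moving to scikit-learn, here is a minimal from-scratch sketch of the two formulas above. It is illustrative only: scikit-learn's TfidfVectorizer uses a smoothed IDF and L2 normalisation, so its numbers will differ from this plain textbook version, and the sample documents are made up.

Python3

import math

# Made-up sample documents, for illustration only.
docs = ["the quick brown fox", "the lazy dog"]
tokenized = [doc.split() for doc in docs]

def tf(term, doc_tokens):
    # fraction of the document's terms that are this term
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # log of (number of documents / number of documents containing the term)
    docs_with_term = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / docs_with_term)

for doc_tokens in tokenized:
    scores = {t: tf(t, doc_tokens) * idf(t, tokenized) for t in set(doc_tokens)}
    print(scores)
# "the" appears in every document, so its IDF (and hence its TF-IDF score) is 0.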
Let's implement Term Frequency-Inverse Document Frequency (TF-IDF) using Python with the scikit-learn library. The code below begins by defining a set of sample documents. The
TfidfVectorizer is employed to transform these documents into a TF-IDF matrix. The
code then extracts and prints the TF-IDF values for each word in each document.
This statistical measure helps assess the importance of words in a document
relative to their frequency across a collection of documents, aiding in information
retrieval and text analysis tasks.

Python3

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "A journey of a thousand miles begins with a single step.",
]

# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()

tfidf_values = {}
for doc_index, doc in enumerate(documents):
    feature_index = tfidf_matrix[doc_index, :].nonzero()[1]
    tfidf_doc_values = zip(feature_index, [tfidf_matrix[doc_index, x] for x in feature_index])
    tfidf_values[doc_index] = {feature_names[i]: value for i, value in tfidf_doc_values}

# let's print
for doc_index, values in tfidf_values.items():
    print(f"Document {doc_index + 1}:")
    for word, tfidf_value in values.items():
        print(f"{word}: {tfidf_value}")
    print("\n")

Output:
Document 1:
dog: 0.3404110310756642
lazy: 0.3404110310756642
over: 0.3404110310756642
jumps: 0.3404110310756642
fox: 0.3404110310756642
brown: 0.3404110310756642
quick: 0.3404110310756642
the: 0.43455990318254417

Document 2:
step: 0.3535533905932738
single: 0.3535533905932738
with: 0.3535533905932738
begins: 0.3535533905932738
miles: 0.3535533905932738
thousand: 0.3535533905932738
of: 0.3535533905932738
journey: 0.3535533905932738

TF-IDF is a widely used technique in information retrieval and text mining, but its limitations should be considered, especially when dealing with tasks that require a deeper understanding of language semantics. For example:
TF-IDF treats words as independent entities and doesn't consider semantic relationships between them. This limitation hinders its ability to capture contextual information and word meanings.
Sensitivity to document length: longer documents tend to have higher overall term frequencies, potentially biasing TF-IDF towards longer documents.

2. Neural Approach
2.1. Word2Vec
Word2Vec is a neural approach for generating word embeddings. It belongs to the family of neural word embedding techniques and specifically falls under the category of distributed representation models. It is a popular technique in natural language processing (NLP) that is used to represent words as vectors in a continuous vector space. Developed by a team at Google, Word2Vec aims to capture the semantic relationships between words by mapping them to dense, real-valued vectors. The underlying idea is that words with similar meanings should have similar vector representations. In Word2Vec every word is assigned a vector; we start with either a random vector or a one-hot vector.
There are two neural embedding methods for Word2Vec: Continuous Bag of Words (CBOW) and Skip-gram.

2.2. Continuous Bag of Words (CBOW)
Continuous Bag of Words (CBOW) is a type of neural network architecture used in the
Word2Vec model. The primary objective of CBOW is to predict a target word based on
its context, which consists of the surrounding words in a given window. Given a
sequence of words in a context window, the model is trained to predict the target
word at the center of the window.
CBOW is a feedforward neural network with a single hidden layer. The input layer
represents the context words, and the output layer represents the target word. The
hidden layer contains the learned continuous vector representations (word
embeddings) of the input words.
The architecture is useful for learning distributed representations of words in a
continuous vector space.

The hidden layer contains the continuous vector representations (word embeddings) of the input words.
The weights between the input layer and the hidden layer are learned during training.
The dimensionality of the hidden layer represents the size of the word embeddings (the continuous vector space).

Python3

import torch
import torch.nn as nn
import torch.optim as optim

# Define CBOW model
class CBOWModel(nn.Module):
    def __init__(self, vocab_size, embed_size):
        super(CBOWModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_size)
        self.linear = nn.Linear(embed_size, vocab_size)

    def forward(self, context):
        # sum the embeddings of the context words into one vector
        context_embeds = self.embeddings(context).sum(dim=0)
        output = self.linear(context_embeds)
        return output

# Sample data
context_size = 2
raw_text = "word embeddings are awesome"
tokens = raw_text.split()
vocab = set(tokens)
word_to_index = {word: i for i, word in enumerate(vocab)}

# Build (context, target) pairs with 2 words on each side of the target.
# Note: with this short 4-token sample no pairs are produced, so the loop below
# only illustrates the training procedure.
data = []
for i in range(2, len(tokens) - 2):
    context = [word_to_index[word] for word in tokens[i - 2:i] + tokens[i + 1:i + 3]]
    target = word_to_index[tokens[i]]
    data.append((torch.tensor(context), torch.tensor(target)))

# Hyperparameters
vocab_size = len(vocab)
embed_size = 10
learning_rate = 0.01
epochs = 100

# Initialize CBOW model
cbow_model = CBOWModel(vocab_size, embed_size)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(cbow_model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(epochs):
    total_loss = 0
    for context, target in data:
        optimizer.zero_grad()
        output = cbow_model(context)
        loss = criterion(output.unsqueeze(0), target.unsqueeze(0))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch + 1}, Loss: {total_loss}")

# Example usage: Get embedding for a specific word
word_to_lookup = "embeddings"
word_index = word_to_index[word_to_lookup]
embedding = cbow_model.embeddings(torch.tensor([word_index]))
print(f"Embedding for '{word_to_lookup}': {embedding.detach().numpy()}")
Output:
Embedding for 'embeddings': [[-2.7053456   2.1384873   0.6417674   1.2882394   0.53470695  0.5651745
   0.64166373 -1.1691749   0.32658175 -0.99961764]]

2.3. Skip-Gram
The Skip-Gram model learns distributed representations of words in a continuous vector space. The main objective of Skip-Gram is to predict context words (words surrounding a target word) given a target word. This is the opposite of the Continuous Bag of Words (CBOW) model, where the objective is to predict the target word based on its context. It has been shown that this method produces more meaningful embeddings.
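
To make the contrast with CBOW concrete, the small sketch below (an illustrative addition, not part of the original article) generates the (target, context) training pairs that Skip-Gram is trained on; CBOW would instead group the context words as the input and the centre word as the label.

Python3

tokens = "word embeddings are dense vector representations".split()
window = 2  # take context words from 2 positions on each side of the target

pairs = []
for i, target in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((target, tokens[j]))

print(pairs[:6])
# [('word', 'embeddings'), ('word', 'are'), ('embeddings', 'word'), ...]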

After applying the above neural embedding methods we get trained vectors of each
word after many iterations through the corpus. These trained vectors preserve
syntactical or semantic information and are converted to lower dimensions. The
vectors with similar meaning or semantic information are placed close to each other
in space.
Let's understand with a basic example. The Python code below contains a vector_size parameter that controls the dimensionality of the word vectors; you can adjust other parameters, such as window, based on your specific needs.
Note: Word2Vec models can perform better with larger datasets. If you have a large
corpus, you might achieve more meaningful word embeddings.

Python3

!pip install gensim
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk

nltk.download('punkt')  # Download the tokenizer models if not already downloaded

sample = "Word embeddings are dense vector representations of words."
tokenized_corpus = word_tokenize(sample.lower())  # Lowercasing for consistency

skipgram_model = Word2Vec(sentences=[tokenized_corpus],
                          vector_size=100,  # Dimensionality of the word vectors
                          window=5,         # Maximum distance between the current and predicted word within a sentence
                          sg=1,             # Skip-Gram model (1 for Skip-Gram, 0 for CBOW)
                          min_count=1,      # Ignores all words with a total frequency lower than this
                          workers=4)        # Number of CPU cores to use for training the model

# Training
skipgram_model.train([tokenized_corpus], total_examples=1, epochs=10)

skipgram_model.save("skipgram_model.model")
loaded_model = Word2Vec.load("skipgram_model.model")

vector_representation = loaded_model.wv['word']
print("Vector representation of 'word':", vector_representation)

Output:
Vector representation of 'word': [-9.5800208e-03  8.9437785e-03  4.1664648e-03  9.2367809e-03
  6.6457358e-03  2.9233587e-03  9.8055992e-03 -4.4231843e-03
 -6.8048164e-03  4.2256550e-03  3.7299085e-03 -5.6668529e-03
 ...
  2.8835384e-03 -1.5386029e-03  9.9318363e-03  8.3507905e-03
  2.4184163e-03  7.1170190e-03  5.8888551e-03 -5.5787875e-03]

In practice, the choice between CBOW and Skip-gram often depends on the specific characteristics of the data and the task at hand. CBOW might be preferred when training resources are limited and capturing syntactic information is important. Skip-gram, on the other hand, might be chosen when semantic relationships and the representation of rare words are crucial.
3. Pre-trained Word Embeddings
Pre-trained word embeddings are representations of words that are learned from large corpora and are made available for reuse in various natural language processing (NLP) tasks. These embeddings capture semantic relationships between words, allowing the model to understand similarities and relationships between different words in a meaningful way.
3.1. GloVe
GloVe is trained on global word co-occurrence statistics. It leverages the global context to create word embeddings that reflect the overall meaning of words based on their co-occurrence probabilities. In this method, we take the corpus, iterate through it, and get the co-occurrence of each word with the other words in the corpus. This gives us a co-occurrence matrix. Words that occur next to each other get a value of 1; if they are one word apart, 1/2; if two words apart, 1/3; and so on.
Let us take an example to understand how the matrix is created. We have a small corpus:

Corpus:
It is a nice evening.
Good Evening!
Is it a nice evening?

            it          is          a           nice      evening   good
it          0
is          1+1         0
a           1/2+1       1+1/2       0
nice        1/3+1/2     1/2+1/3     1+1         0
evening     1/4+1/3     1/3+1/4     1/2+1/2     1+1       0
good        0           0           0           0         1         0

The upper half of the matrix will be a reflection of the lower half. We can consider a window frame as well to calculate the co-occurrences by shifting the frame till the end of the corpus. This helps gather information about the context in which the word is used. A minimal code sketch of such a distance-weighted co-occurrence count is shown below.
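
The short sketch below (an illustrative addition, not from the original article) builds such a distance-weighted co-occurrence count for the toy corpus above, using weight = 1 / distance within a window.

Python3

from collections import defaultdict

corpus = ["it is a nice evening", "good evening", "is it a nice evening"]
window = 4  # look this many words ahead of each word

cooccurrence = defaultdict(float)
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(len(tokens), i + window + 1)):
            weight = 1.0 / (j - i)                     # 1 if adjacent, 1/2 if one apart, ...
            cooccurrence[(word, tokens[j])] += weight
            cooccurrence[(tokens[j], word)] += weight  # keep the matrix symmetric

print(cooccurrence[("nice", "evening")])  # adjacent in two sentences -> 1 + 1 = 2.0
print(cooccurrence[("it", "evening")])    # farther apart -> smaller value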
Initially, the vectors for each word are assigned randomly. Then we take two pairs of vectors and see how close they are to each other in space. If two words occur together more often, i.e. have a higher value in the co-occurrence matrix, but are far apart in space, then they are brought closer to each other. If they are close to each other in space but are rarely or never used together, then they are moved further apart.
After many iterations of the above process, we’ll get a vector space representation
that approximates the information from the co-occurrence matrix. The performance of
GloVe is better than Word2Vec in terms of both semantic and syntactic capturing.

Python3

from gensim.models import KeyedVectors
from gensim.downloader import load

glove_model = load('glove-wiki-gigaword-50')

word_pairs = [('learn', 'learning'), ('india', 'indian'), ('fame', 'famous')]

# Compute similarity for each pair of words
for pair in word_pairs:
    similarity = glove_model.similarity(pair[0], pair[1])
    print(f"Similarity between '{pair[0]}' and '{pair[1]}' using GloVe: {similarity:.3f}")

Output:
Similarity between 'learn' and 'learning' using GloVe: 0.802
Similarity between 'india' and 'indian' using GloVe: 0.865
Similarity between 'fame' and 'famous' using GloVe: 0.589

3.2. FastText
Developed by Facebook, FastText extends Word2Vec by representing words as bags of character n-grams. This approach is particularly useful for handling out-of-vocabulary words and capturing morphological variations.

Python3
import gensim.downloader as api

# Load the pre-trained fastText model
fasttext_model = api.load("fasttext-wiki-news-subwords-300")

# Define word pairs to compute similarity for
word_pairs = [('learn', 'learning'), ('india', 'indian'), ('fame', 'famous')]

# Compute similarity for each pair of words
for pair in word_pairs:
    similarity = fasttext_model.similarity(pair[0], pair[1])
    print(f"Similarity between '{pair[0]}' and '{pair[1]}' using FastText: {similarity:.3f}")

Output:
Similarity between 'learn' and 'learning' using FastText: 0.642
Similarity between 'india' and 'indian' using FastText: 0.708
Similarity between 'fame' and 'famous' using FastText: 0.519

3.3. BERT (Bidirectional Encoder Representations from Transformers)
BERT is a transformer-based model that learns contextualized embeddings for words. It considers the entire context of a word by looking at both the left and right contexts, resulting in embeddings that capture rich contextual information.

Python3
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

word_pairs = [('learn', 'learning'), ('india', 'indian'), ('fame', 'famous')]

# Compute similarity for each pair of words
for pair in word_pairs:
    tokens = tokenizer(pair, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**tokens)

    # Extract embeddings for the [CLS] token
    cls_embedding = outputs.last_hidden_state[:, 0, :]

    similarity = torch.nn.functional.cosine_similarity(cls_embedding[0], cls_embedding[1], dim=0)
    print(f"Similarity between '{pair[0]}' and '{pair[1]}' using BERT: {similarity:.3f}")

Output:
Similarity between 'learn' and 'learning' using BERT: 0.930
Similarity between 'india' and 'indian' using BERT: 0.957
Similarity between 'fame' and 'famous' using BERT: 0.956

Considerations for Deploying Word Embedding Models
You need to use the exact same preprocessing pipeline when deploying your model as was used to create the training data for the word embedding. If you use a different tokenizer or a different method of handling whitespace, punctuation, etc., you might end up with incompatible inputs.
Words in your input that don't have a pre-trained vector are known as Out of Vocabulary (OOV) words. What you can do is replace those words with "UNK", which means unknown, and then handle them separately. A minimal sketch of this is shown below.
Dimension mismatch: vectors can be of many lengths. If you train a model with vectors of length, say, 400 and then try to apply vectors of length 1000 at inference time, you will run into errors. So make sure to use the same dimensions throughout.
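
As referenced above, here is a minimal sketch of OOV handling with an "UNK" token; the vocabulary and sentence below are assumed purely for illustration.

Python3

# Assumed toy vocabulary, for illustration only.
vocabulary = {"word", "embeddings", "are", "useful"}

def replace_oov(tokens, vocab, unk_token="UNK"):
    # Map any token not present in the vocabulary to the unknown token.
    return [tok if tok in vocab else unk_token for tok in tokens]

print(replace_oov("word embeddings are tremendously useful".split(), vocabulary))
# ['word', 'embeddings', 'are', 'UNK', 'useful']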
Advantages and Disadvantages of Word Embeddings
Advantages
It is much faster to train than hand-built models like WordNet (which uses graph embeddings).
Almost all modern NLP applications start with an embedding layer.
It stores an approximation of meaning.
Disadvantages
It can be memory intensive.
It is corpus dependent; any underlying bias will have an effect on your model.
It cannot distinguish between homophones, e.g. brake/break, cell/sell, weather/whether.

Conclusion
In conclusion, word embedding techniques such as TF-IDF, Word2Vec, and GloVe play a crucial role in natural language processing by representing words in a lower-dimensional space, capturing semantic and syntactic information.
Frequently Asked Questions (FAQs)
1. Does GPT use word embeddings?
GPT uses context-based embeddings rather than traditional word embeddings. It captures word meaning in the context of the entire sentence.
2. What is the difference between BERT and word embeddings?
BERT is contextually aware, considering the entire sentence, while traditional word embeddings, like Word2Vec, treat each word independently.
3. What are the two types of word embedding?
Word embeddings can be broadly evaluated in two categories, intrinsic and extrinsic. For intrinsic evaluation, word embeddings are used to calculate or predict semantic similarity between words, terms, or sentences.
4. How does word vectorization work?
Word vectorization converts words into numerical vectors, capturing semantic relationships. Techniques like TF-IDF, Word2Vec, and GloVe are common.
5. What are the benefits of word embeddings?
Word embeddings offer semantic understanding, capture context, and enhance NLP tasks. They reduce dimensionality, speed up training, and aid in language pattern recognition.

Introduction to Recurrent Neural Network

In this article, we will introduce a new variation of neural network, the Recurrent Neural Network (RNN), which works better than a simple neural network when the data is sequential, like time-series data and text data.

What is Recurrent Neural Network (RNN)?
A Recurrent Neural Network (RNN) is a type of neural network where the output from the previous step is fed as input to the
current step. In traditional neural networks, all the inputs and outputs are
independent of each other. Still, in cases when it is required to predict the next
word of a sentence, the previous words are required and hence there is a need to
remember the previous words. Thus RNN came into existence, which solved this issue
with the help of a Hidden Layer. The main and most important feature of RNN is its
Hidden state, which remembers some information about a sequence. The state is also
referred to as Memory State since it remembers the previous input to the network.
It uses the same parameters for each input as it performs the same task on all the
inputs or hidden layers to produce the output. This reduces the complexity of
parameters, unlike other neural networks.
How RNN differs from Feedforward Neural Network?
Artificial neural networks that do not have looping nodes are called feedforward neural networks. Because all information is only passed forward, this kind of neural network is also referred to as a multi-layer neural network.
Information moves from the input layer to the output layer – if any hidden layers
are present – unidirectionally in a feedforward neural network. These networks are
appropriate for image classification tasks, for example, where input and output are
independent. Nevertheless, their inability to retain previous inputs automatically
renders them less useful for sequential data analysis.
Recurrent Neuron and RNN Unfolding
The fundamental processing unit in a Recurrent Neural Network (RNN) is a Recurrent Unit, which is not explicitly called a "Recurrent Neuron." This unit has the unique ability to maintain a hidden state, allowing the network to capture sequential dependencies by remembering previous inputs while processing. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) versions improve the RNN's ability to handle long-term dependencies.
Types Of RNN
There are four types of RNNs based on the number of inputs and outputs in the network:
One to One
One to Many
Many to One
Many to Many

One to One
This type of RNN behaves the same as any simple neural network; it is also known as a Vanilla Neural Network. In this network, there is only one input and one output.
One To Many
In this type of RNN, there is one input and many outputs associated with it. One of the most used examples of this network is image captioning, where, given an image, we predict a sentence having multiple words.
Many to One
In this type of network, many inputs are fed to the network at several states of the network, generating only one output. This type of network is used in problems like sentiment analysis, where we give multiple words as input and predict only the sentiment of the sentence as output.
Many to Many
In this type of neural network, there are multiple inputs and multiple outputs corresponding to a problem. One example of this is language translation, where we provide multiple words from one language as input and predict multiple words from the second language as output.
Recurrent Neural Network Architecture
RNNs have the same input and output architecture as any other deep neural architecture. However, differences arise in the way information flows from input to output. Unlike deep neural networks, where we have different weight matrices for each dense layer, in an RNN the weights across the network remain the same. It calculates the hidden state h_i for every input x_i using the following formulas:
h_t = σ(U x_t + W h_{t-1} + B)
y_t = O(V h_t + C)
Hence,
y_t = f(x_t, h_{t-1}, W, U, V, B, C)
Here, S is the state matrix which has element s_i as the state of the network at timestep i. The parameters W, U, V, B, C are shared across timesteps.
How does RNN work?
The Recurrent Neural Network consists of multiple fixed activation function units, one for each time step. Each unit has an internal state which is called the hidden state of the unit. This hidden state signifies the past knowledge that the network currently holds at a given time step. This hidden state is updated at every time step to signify the change in the knowledge of the network about the past. The hidden state is updated using the following recurrence relation:

The formula for calculating the current state:
h_t = f(h_{t-1}, x_t)
where,
h_t -> current state
h_{t-1} -> previous state
x_t -> input state

Formula for applying the activation function (tanh):
h_t = tanh(W_hh h_{t-1} + W_xh x_t)
where,
W_hh -> weight at the recurrent neuron
W_xh -> weight at the input neuron

The formula for calculating the output:
y_t = W_hy h_t
where,
y_t -> output
W_hy -> weight at the output layer

These parameters are updated using backpropagation. However, since an RNN works on sequential data, we use an updated form of backpropagation known as backpropagation through time. A minimal NumPy sketch of a single forward pass with these formulas is shown below.
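
The sketch referenced above is given here: a minimal NumPy forward pass over a short sequence using exactly these update and output formulas. The sizes and random weights are illustrative assumptions, not part of the original article.

Python3

import numpy as np

# Illustrative sizes (assumed): 4 input features, 3 hidden units, 2 outputs, 5 time steps.
input_size, hidden_size, output_size, n_steps = 4, 3, 2, 5
rng = np.random.default_rng(0)

W_xh = rng.normal(size=(hidden_size, input_size))    # weights at the input neuron
W_hh = rng.normal(size=(hidden_size, hidden_size))   # weights at the recurrent neuron
W_hy = rng.normal(size=(output_size, hidden_size))   # weights at the output layer

x_sequence = rng.normal(size=(n_steps, input_size))  # a random input sequence
h = np.zeros(hidden_size)                            # initial hidden state h_0

for t, x_t in enumerate(x_sequence):
    h = np.tanh(W_hh @ h + W_xh @ x_t)               # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
    y_t = W_hy @ h                                   # y_t = W_hy h_t
    print(f"step {t}: output {y_t}")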
Backpropagation Through Time (BPTT)
In an RNN the network is ordered: each variable is computed one at a time in a specified order, first h1, then h2, then h3, and so on. Hence we apply backpropagation through all these hidden time states sequentially.
L(θ), the loss function, depends on h3; h3 in turn depends on h2 and W; h2 in turn depends on h1 and W; and h1 in turn depends on h0 and W, where h0 is a constant starting state.

For simplicity, we will apply backpropagation with respect to only one weight matrix W:
dL(θ)/dW = (dL(θ)/dh3) * (dh3/dW)
We already know how to compute the first factor, as it is the same as in any simple deep neural network backpropagation. However, we will see how to apply backpropagation to the term dh3/dW.
As we know, h3 = σ(W h2 + b). In such an ordered network, we can't compute dh3/dW by simply treating h2 as a constant, because h2 also depends on W. The total derivative has two parts:
Explicit: the direct dependence of h3 on W, treating all other inputs as constant.
Implicit: summing over all indirect paths from h3 to W (through h2 and h1).
Combining the explicit and implicit parts (and short-circuiting the repeated paths), we finally have:
dh3/dW = sum over k = 1..3 of (dh3/dh_k) * (∂h_k/∂W), where ∂h_k/∂W denotes the explicit (direct) derivative at step k.
Hence,
dL(θ)/dW = (dL(θ)/dh3) * sum over k = 1..3 of (dh3/dh_k) * (∂h_k/∂W)
This algorithm is called backpropagation through time (BPTT), as we backpropagate over all previous time steps.
Issues of Standard RNNs
Vanishing Gradient: Text generation, machine translation, and stock market prediction are just a few examples of the time-dependent and sequential data problems that can be modelled with recurrent neural networks. You will discover, though, that the vanishing gradient problem makes training RNNs difficult.
Exploding Gradient: An exploding gradient occurs when a neural network is being trained and the slope tends to grow exponentially rather than decay. Large error gradients that build up during training lead to very large updates to the neural network model weights, which is the source of this issue.

Training through RNN
A single time step of the input is provided to the network.
Then calculate its current state using the current input and the previous state.
The current h_t becomes h_{t-1} for the next time step.
One can go through as many time steps as the problem requires and join the information from all the previous states.
Once all the time steps are completed, the final current state is used to calculate the output.
The output is then compared to the actual output, i.e. the target output, and the error is generated.
The error is then back-propagated to the network to update the weights, and hence the network (RNN) is trained using backpropagation through time.

Advantages and Disadvantages of Recurrent Neural Network
Advantages
An RNN remembers each and every piece of information through time. It is useful in time series prediction only because of this ability to remember previous inputs as well. This is called Long Short-Term Memory.
Recurrent neural networks are even used with convolutional layers to extend the effective pixel neighborhood.
Disadvantages
Gradient vanishing and exploding problems.
Training an RNN is a very difficult task.
It cannot process very long sequences if using tanh or relu as an activation function.

Applications of Recurrent Neural Network
Language modelling and generating text
Speech recognition
Machine translation
Image recognition, face detection
Time series forecasting

Variation Of Recurrent Neural Network (RNN)
To overcome problems like vanishing and exploding gradients, several new advanced versions of RNNs have been formed; some of these are:
Bidirectional Neural Network (BiNN)
Long Short-Term Memory (LSTM)

Bidirectional Neural Network (BiNN)
A BiNN is a variation of a Recurrent Neural Network in which the input information flows in both directions and the outputs of both directions are combined to produce the final output. A BiNN is useful in situations where the context of the input is more important, such as NLP tasks and time-series analysis problems.
Long Short-Term Memory (LSTM)
Long Short-Term Memory works on a read-write-and-forget principle: given the input information, the network reads and writes the most useful information from the data and forgets the information which is not important for predicting the output. To do this, three new gates are introduced into the RNN. In this way, only the selected information is passed through the network. A minimal Keras sketch using an LSTM layer is shown below.
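
As referenced above, here is a minimal Keras sketch (an illustrative addition with assumed shapes, not part of the original article) in which an LSTM layer, whose gates are handled internally, replaces a simple RNN layer.

Python3

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Illustrative shapes only (assumed): sequences of 3 time steps
# over a 20-symbol one-hot vocabulary.
seq_length, vocab_size = 3, 20

lstm_model = Sequential()
# The LSTM layer manages its input, forget and output gates internally.
lstm_model.add(LSTM(50, input_shape=(seq_length, vocab_size)))
lstm_model.add(Dense(vocab_size, activation='softmax'))
lstm_model.compile(optimizer='adam', loss='categorical_crossentropy',
                   metrics=['accuracy'])
lstm_model.summary()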
Difference between RNN and Simple Neural Network
An RNN is considered to be the better version of a deep neural network when the data is sequential. There are significant differences between RNNs and deep neural networks; they are listed below:

Recurrent Neural Network | Deep Neural Network
Weights are the same across all the layers of the network. | Weights are different for each layer of the network.
Used when the data is sequential and the number of inputs is not predefined. | Has no special method for sequential data; the number of inputs is fixed.
The number of parameters is higher than in a simple DNN. | The number of parameters is lower than in an RNN.
Exploding and vanishing gradients are the major drawback. | These problems also occur, but they are not the major problem with DNNs.

RNN Code Implementation
Imported libraries:
Imported some necessary libraries such as numpy and tensorflow for numerical calculation and model building.

Python3

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

Input Generation:
Generated some example data using text.

Python3
text = "This is GeeksforGeeks a software training institute"chars =
sorted(list(set(text)))char_to_index = {char: i for i, char in
enumerate(chars)}index_to_char = {i: char for i, char in enumerate(chars)}

Created input sequences and corresponding labels for further implementation.

Python3

seq_length = 3
sequences = []
labels = []

for i in range(len(text) - seq_length):
    seq = text[i:i + seq_length]
    label = text[i + seq_length]
    sequences.append([char_to_index[char] for char in seq])
    labels.append(char_to_index[label])
Converted sequences and labels into numpy arrays and used one-hot encoding to
convert text into vector.

Python3

X = np.array(sequences)
y = np.array(labels)

X_one_hot = tf.one_hot(X, len(chars))
y_one_hot = tf.one_hot(y, len(chars))

Model Building:
Build RNN Model using ‘relu’ and ‘softmax‘ activation function.

Python3
model = Sequential()
model.add(SimpleRNN(50, input_shape=(seq_length, len(chars)), activation='relu'))
model.add(Dense(len(chars), activation='softmax'))

Model Compilation:
The model.compile line builds the neural network for training by specifying the
optimizer (Adam), the loss function (categorical crossentropy), and the training
metric (accuracy).

Python3
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Model Training:
Using the input sequences (X_one_hot) and corresponding labels (y_one_hot) for 100
epochs, the model is trained using the model.fit line, which optimises the model
parameters to minimise the categorical crossentropy loss.

Python3

model.fit(X_one_hot, y_one_hot, epochs=100)

Output:
Epoch 1/100
2/2 [==============================] - 2s 54ms/step - loss: 2.8327 - accuracy:
0.0000e+00
Epoch 2/100
2/2 [==============================] - 0s 16ms/step - loss: 2.8121 - accuracy:
0.0000e+00
Epoch 3/100
2/2 [==============================] - 0s 16ms/step - loss: 2.7944 - accuracy:
0.0208
Epoch 4/100
2/2 [==============================] - 0s 16ms/step - loss: 2.7766 - accuracy:
0.0208
Epoch 5/100
2/2 [==============================] - 0s 15ms/step - loss: 2.7596 - accuracy:
0.0625
Epoch 6/100
2/2 [==============================] - 0s 13ms/step - loss: 2.7424 - accuracy:
0.0833
Epoch 7/100
2/2 [==============================] - 0s 13ms/step - loss: 2.7254 - accuracy:
0.1042
Epoch 8/100
2/2 [==============================] - 0s 12ms/step - loss: 2.7092 - accuracy:
0.1042
Epoch 9/100
2/2 [==============================] - 0s 11ms/step - loss: 2.6917 - accuracy:
0.1458
Epoch 10/100
2/2 [==============================] - 0s 12ms/step - loss: 2.6742 - accuracy:
0.1667
Epoch 11/100
2/2 [==============================] - 0s 10ms/step - loss: 2.6555 - accuracy:
0.1667
Epoch 12/100
2/2 [==============================] - 0s 16ms/step - loss: 2.6369 - accuracy:
0.1667
Epoch 13/100
2/2 [==============================] - 0s 11ms/step - loss: 2.6179 - accuracy:
0.1667
Epoch 14/100
2/2 [==============================] - 0s 11ms/step - loss: 2.5993 - accuracy:
0.1875
Epoch 15/100
2/2 [==============================] - 0s 17ms/step - loss: 2.5789 - accuracy:
0.2083
Epoch 16/100
2/2 [==============================] - 0s 11ms/step - loss: 2.5593 - accuracy:
0.2083
Epoch 17/100
2/2 [==============================] - 0s 16ms/step - loss: 2.5397 - accuracy:
0.2083
Epoch 18/100
2/2 [==============================] - 0s 20ms/step - loss: 2.5182 - accuracy:
0.2292
Epoch 19/100
2/2 [==============================] - 0s 18ms/step - loss: 2.4979 - accuracy:
0.2292
Epoch 20/100
2/2 [==============================] - 0s 11ms/step - loss: 2.4761 - accuracy:
0.2292
Epoch 21/100
2/2 [==============================] - 0s 13ms/step - loss: 2.4536 - accuracy:
0.2292
Epoch 22/100
2/2 [==============================] - 0s 17ms/step - loss: 2.4299 - accuracy:
0.2292
Epoch 23/100
2/2 [==============================] - 0s 10ms/step - loss: 2.4067 - accuracy:
0.2708
Epoch 24/100
2/2 [==============================] - 0s 27ms/step - loss: 2.3824 - accuracy:
0.2917
Epoch 25/100
2/2 [==============================] - 0s 22ms/step - loss: 2.3582 - accuracy:
0.2917
Epoch 26/100
2/2 [==============================] - 0s 10ms/step - loss: 2.3324 - accuracy:
0.2917
Epoch 27/100
2/2 [==============================] - 0s 10ms/step - loss: 2.3068 - accuracy:
0.3125
Epoch 28/100
2/2 [==============================] - 0s 10ms/step - loss: 2.2819 - accuracy:
0.3125
Epoch 29/100
2/2 [==============================] - 0s 11ms/step - loss: 2.2535 - accuracy:
0.3333
Epoch 30/100
2/2 [==============================] - 0s 10ms/step - loss: 2.2278 - accuracy:
0.3333
Epoch 31/100
2/2 [==============================] - 0s 12ms/step - loss: 2.1992 - accuracy:
0.3333
Epoch 32/100
2/2 [==============================] - 0s 12ms/step - loss: 2.1719 - accuracy:
0.3333
Epoch 33/100
2/2 [==============================] - 0s 13ms/step - loss: 2.1434 - accuracy:
0.3333
Epoch 34/100
2/2 [==============================] - 0s 14ms/step - loss: 2.1134 - accuracy:
0.3542
Epoch 35/100
2/2 [==============================] - 0s 14ms/step - loss: 2.0852 - accuracy:
0.3542
Epoch 36/100
2/2 [==============================] - 0s 15ms/step - loss: 2.0547 - accuracy:
0.3958
Epoch 37/100
2/2 [==============================] - 0s 18ms/step - loss: 2.0240 - accuracy:
0.4167
Epoch 38/100
2/2 [==============================] - 0s 24ms/step - loss: 1.9933 - accuracy:
0.5000
Epoch 39/100
2/2 [==============================] - 0s 14ms/step - loss: 1.9626 - accuracy:
0.5000
Epoch 40/100
2/2 [==============================] - 0s 14ms/step - loss: 1.9306 - accuracy:
0.5000
Epoch 41/100
2/2 [==============================] - 0s 16ms/step - loss: 1.9002 - accuracy:
0.5000
Epoch 42/100
2/2 [==============================] - 0s 15ms/step - loss: 1.8669 - accuracy:
0.5000
Epoch 43/100
2/2 [==============================] - 0s 14ms/step - loss: 1.8353 - accuracy:
0.5208
Epoch 44/100
2/2 [==============================] - 0s 22ms/step - loss: 1.8029 - accuracy:
0.5417
Epoch 45/100
2/2 [==============================] - 0s 15ms/step - loss: 1.7708 - accuracy:
0.5625
Epoch 46/100
2/2 [==============================] - 0s 10ms/step - loss: 1.7373 - accuracy:
0.5625
Epoch 47/100
2/2 [==============================] - 0s 12ms/step - loss: 1.7052 - accuracy:
0.6042
Epoch 48/100
2/2 [==============================] - 0s 12ms/step - loss: 1.6737 - accuracy:
0.6042
Epoch 49/100
2/2 [==============================] - 0s 14ms/step - loss: 1.6388 - accuracy:
0.6250
Epoch 50/100
2/2 [==============================] - 0s 12ms/step - loss: 1.6071 - accuracy:
0.6458
Epoch 51/100
2/2 [==============================] - 0s 10ms/step - loss: 1.5737 - accuracy:
0.6667
Epoch 52/100
2/2 [==============================] - 0s 12ms/step - loss: 1.5386 - accuracy:
0.6667
Epoch 53/100
2/2 [==============================] - 0s 11ms/step - loss: 1.5059 - accuracy:
0.6875
Epoch 54/100
2/2 [==============================] - 0s 17ms/step - loss: 1.4727 - accuracy:
0.6875
Epoch 55/100
2/2 [==============================] - 0s 14ms/step - loss: 1.4381 - accuracy:
0.6667
Epoch 56/100
2/2 [==============================] - 0s 13ms/step - loss: 1.4039 - accuracy:
0.6667
Epoch 57/100
2/2 [==============================] - 0s 15ms/step - loss: 1.3718 - accuracy:
0.6667
Epoch 58/100
2/2 [==============================] - 0s 10ms/step - loss: 1.3391 - accuracy:
0.6667
Epoch 59/100
2/2 [==============================] - 0s 11ms/step - loss: 1.3059 - accuracy:
0.6875
Epoch 60/100
2/2 [==============================] - 0s 11ms/step - loss: 1.2751 - accuracy:
0.6875
Epoch 61/100
2/2 [==============================] - 0s 10ms/step - loss: 1.2426 - accuracy:
0.6875
Epoch 62/100
2/2 [==============================] - 0s 10ms/step - loss: 1.2123 - accuracy:
0.6875
Epoch 63/100
2/2 [==============================] - 0s 9ms/step - loss: 1.1822 - accuracy:
0.6875
Epoch 64/100
2/2 [==============================] - 0s 10ms/step - loss: 1.1520 - accuracy:
0.7083
Epoch 65/100
2/2 [==============================] - 0s 11ms/step - loss: 1.1232 - accuracy:
0.7500
Epoch 66/100
2/2 [==============================] - 0s 13ms/step - loss: 1.0940 - accuracy:
0.7500
Epoch 67/100
2/2 [==============================] - 0s 13ms/step - loss: 1.0677 - accuracy:
0.7500
Epoch 68/100
2/2 [==============================] - 0s 11ms/step - loss: 1.0388 - accuracy:
0.7500
Epoch 69/100
2/2 [==============================] - 0s 10ms/step - loss: 1.0130 - accuracy:
0.7500
Epoch 70/100
2/2 [==============================] - 0s 12ms/step - loss: 0.9862 - accuracy:
0.7917
Epoch 71/100
2/2 [==============================] - 0s 12ms/step - loss: 0.9619 - accuracy:
0.8125
Epoch 72/100
2/2 [==============================] - 0s 11ms/step - loss: 0.9377 - accuracy:
0.8333
Epoch 73/100
2/2 [==============================] - 0s 11ms/step - loss: 0.9114 - accuracy:
0.8542
Epoch 74/100
2/2 [==============================] - 0s 12ms/step - loss: 0.8882 - accuracy:
0.8542
Epoch 75/100
2/2 [==============================] - 0s 11ms/step - loss: 0.8656 - accuracy:
0.8750
Epoch 76/100
2/2 [==============================] - 0s 11ms/step - loss: 0.8423 - accuracy:
0.8750
Epoch 77/100
2/2 [==============================] - 0s 19ms/step - loss: 0.8214 - accuracy:
0.8750
Epoch 78/100
2/2 [==============================] - 0s 13ms/step - loss: 0.7991 - accuracy:
0.8750
Epoch 79/100
2/2 [==============================] - 0s 14ms/step - loss: 0.7781 - accuracy:
0.8750
Epoch 80/100
2/2 [==============================] - 0s 13ms/step - loss: 0.7568 - accuracy:
0.8750
Epoch 81/100
2/2 [==============================] - 0s 15ms/step - loss: 0.7386 - accuracy:
0.8750
Epoch 82/100
2/2 [==============================] - 0s 20ms/step - loss: 0.7178 - accuracy:
0.8750
Epoch 83/100
2/2 [==============================] - 0s 17ms/step - loss: 0.7001 - accuracy:
0.8750
Epoch 84/100
2/2 [==============================] - 0s 21ms/step - loss: 0.6814 - accuracy:
0.8750
Epoch 85/100
2/2 [==============================] - 0s 20ms/step - loss: 0.6641 - accuracy:
0.8750
Epoch 86/100
2/2 [==============================] - 0s 18ms/step - loss: 0.6464 - accuracy:
0.8750
Epoch 87/100
2/2 [==============================] - 0s 18ms/step - loss: 0.6290 - accuracy:
0.8750
Epoch 88/100
2/2 [==============================] - 0s 13ms/step - loss: 0.6108 - accuracy:
0.8750
Epoch 89/100
2/2 [==============================] - 0s 16ms/step - loss: 0.5958 - accuracy:
0.8750
Epoch 90/100
2/2 [==============================] - 0s 15ms/step - loss: 0.5799 - accuracy:
0.8750
Epoch 91/100
2/2 [==============================] - 0s 17ms/step - loss: 0.5656 - accuracy:
0.8750
Epoch 92/100
2/2 [==============================] - 0s 31ms/step - loss: 0.5499 - accuracy:
0.8750
Epoch 93/100
2/2 [==============================] - 0s 15ms/step - loss: 0.5347 - accuracy:
0.8750
Epoch 94/100
2/2 [==============================] - 0s 17ms/step - loss: 0.5215 - accuracy:
0.8750
Epoch 95/100
2/2 [==============================] - 0s 16ms/step - loss: 0.5077 - accuracy:
0.8958
Epoch 96/100
2/2 [==============================] - 0s 15ms/step - loss: 0.4954 - accuracy:
0.9583
Epoch 97/100
2/2 [==============================] - 0s 11ms/step - loss: 0.4835 - accuracy:
0.9583
Epoch 98/100
2/2 [==============================] - 0s 12ms/step - loss: 0.4715 - accuracy:
0.9583
Epoch 99/100
2/2 [==============================] - 0s 15ms/step - loss: 0.4588 - accuracy:
0.9583
Epoch 100/100
2/2 [==============================] - 0s 10ms/step - loss: 0.4469 - accuracy:
0.9583
<keras.src.callbacks.History at 0x7bab7ab127d0>

Model Prediction:
Generated text using pre-trained model.

Python3

start_seq = "This is G"generated_text = start_seq for i in range(50): x =


np.array([[char_to_index[char] for char in generated_text[-
seq_length:]]]) x_one_hot = tf.one_hot(x, len(chars)) prediction =
model.predict(x_one_hot) next_index = np.argmax(prediction) next_char =
index_to_char[next_index] generated_text += next_char print("Generated
Text:")print(generated_text)

Output:
1/1 [==============================] - 1s 517ms/step
1/1 [==============================] - 0s 75ms/step
1/1 [==============================] - 0s 101ms/step
1/1 [==============================] - 0s 93ms/step
1/1 [==============================] - 0s 132ms/step
1/1 [==============================] - 0s 143ms/step
1/1 [==============================] - 0s 140ms/step
1/1 [==============================] - 0s 144ms/step
1/1 [==============================] - 0s 125ms/step
1/1 [==============================] - 0s 60ms/step
1/1 [==============================] - 0s 38ms/step
1/1 [==============================] - 0s 34ms/step
1/1 [==============================] - 0s 29ms/step
1/1 [==============================] - 0s 34ms/step
1/1 [==============================] - 0s 32ms/step
1/1 [==============================] - 0s 32ms/step
1/1 [==============================] - 0s 38ms/step
1/1 [==============================] - 0s 32ms/step
1/1 [==============================] - 0s 36ms/step
1/1 [==============================] - 0s 31ms/step
1/1 [==============================] - 0s 31ms/step
1/1 [==============================] - 0s 31ms/step
1/1 [==============================] - 0s 31ms/step
1/1 [==============================] - 0s 32ms/step
1/1 [==============================] - 0s 31ms/step
1/1 [==============================] - 0s 23ms/step
1/1 [==============================] - 0s 27ms/step
1/1 [==============================] - 0s 21ms/step
1/1 [==============================] - 0s 23ms/step
1/1 [==============================] - 0s 20ms/step
1/1 [==============================] - 0s 20ms/step
1/1 [==============================] - 0s 22ms/step
1/1 [==============================] - 0s 24ms/step
1/1 [==============================] - 0s 21ms/step
1/1 [==============================] - 0s 22ms/step
1/1 [==============================] - 0s 21ms/step
1/1 [==============================] - 0s 23ms/step
1/1 [==============================] - 0s 24ms/step
1/1 [==============================] - 0s 23ms/step
1/1 [==============================] - 0s 22ms/step
1/1 [==============================] - 0s 22ms/step
1/1 [==============================] - 0s 23ms/step
1/1 [==============================] - 0s 21ms/step
1/1 [==============================] - 0s 22ms/step
1/1 [==============================] - 0s 25ms/step
1/1 [==============================] - 0s 24ms/step
1/1 [==============================] - 0s 21ms/step
1/1 [==============================] - 0s 21ms/step
1/1 [==============================] - 0s 20ms/step
1/1 [==============================] - 0s 20ms/step
Generated Text:
This is Geeks a software training instituteais is is is is
Frequently Asked Questions (FAQs):
Q. 1 What is an RNN?
Ans. Recurrent neural networks (RNNs) are a type of artificial neural network that are primarily utilised in NLP (natural language processing) and speech recognition. RNNs are utilised in deep learning and in the creation of models that simulate neuronal activity in the human brain.
Q. 2 Which type of problem can be solved by an RNN?
Ans. Modelling time-dependent and sequential data problems, like text generation, machine translation, and stock market prediction, is possible with recurrent neural networks. Nevertheless, you will discover that the gradient problem makes RNNs difficult to train. The vanishing gradients issue affects RNNs.
Q. 3 What are the types of RNN?
Ans. There are four types of RNN:
One to One
One to Many
Many to One
Many to Many
Q. 4 What are the differences between RNN and CNN?
Ans. The following are the key distinctions between CNNs and RNNs: CNNs are frequently employed in the solution of problems involving spatial data, like images. Text and video data that is temporally and sequentially organised is better analysed by RNNs. RNNs and CNNs are not designed alike.

Recurrent Neural Networks Explanation

Today, different Machine Learning techniques are used to handle different types of
data. One of the most difficult types of data to handle and the forecast is
sequential data. Sequential data is different from other types of data in the sense
that while all the features of a typical dataset can be assumed to be order-
independent, this cannot be assumed for a sequential dataset. To handle such type
of data, the concept of Recurrent Neural Networks was conceived. It is different
from other Artificial Neural Networks in its structure. While other networks
“travel” in a linear direction during the feed-forward process or the back-
propagation process, the Recurrent Network follows a recurrence relation instead of
a feed-forward pass and uses Back-Propagation through time to learn. The Recurrent
Neural Network consists of multiple fixed activation function units, one for each
time step. Each unit has an internal state which is called the hidden state of the
unit. This hidden state signifies the past knowledge that the network currently
holds at a given time step. This hidden state is updated at every time step to
signify the change in the knowledge of the network about the past. The hidden state
is updated using the following recurrence relation:
h_t = f_W(h_{t-1}, x_t)
where:
h_t - the new hidden state
h_{t-1} - the old hidden state
x_t - the current input
f_W - the fixed function with trainable weights
Note: Typically, to understand the concepts of a Recurrent Neural Network, it is often illustrated in its unrolled form, and this norm will be followed in this post.
At each time step, the new hidden state is calculated using the recurrence relation as given above. This newly generated hidden state is then used to generate the next hidden state, and so on. The basic workflow of a Recurrent Neural Network is as follows:

Note that h_0 is the initial hidden state of the network. Typically, it is a vector of zeros, but it can have other values also. One method is to encode the presumptions about the data into the initial hidden state of the network. For example, for a problem to determine the tone of a speech given by a renowned person, the person's past speeches' tones may be encoded into the initial hidden state. Another technique is to make the initial hidden state a trainable parameter. Although these techniques add little nuances to the network, initializing the hidden state vector to zeros is typically an effective choice.

Working of each Recurrent Unit:
Take as input the previously hidden state vector and the current input vector. Note that since the hidden state and current input are treated as vectors, each element in the vector is placed in a different dimension which is orthogonal to the other dimensions. Thus each element, when multiplied by another element, only gives a non-zero value when the elements involved are non-zero and the elements are in the same dimension.
Element-wise multiply the hidden state vector by the hidden state weights and similarly perform the element-wise multiplication of the current input vector and the current input weights. This generates the parameterized hidden state vector and the parameterized current input vector. Note that weights for different vectors are stored in the trainable weight matrix.
Perform the vector addition of the two parameterized vectors and then calculate the element-wise hyperbolic tangent to generate the new hidden state vector.

During the training of the recurrent network, the network also generates an output
at each time step. This output is used to train the network using gradient
descent.
The back-propagation involved is similar to the one used in a typical artificial neural network, with some minor changes. These changes are noted as follows:
Let the predicted output of the network at any time step t be y̅_t and the actual output be y_t. Then the error at each time step is given by:
E_t = (y_t - y̅_t)^2
The total error is given by the summation of the errors at all the time steps:
E = Σ_t E_t
Similarly, the value ∂E/∂W can be calculated as the summation of the gradients at each time step:
∂E/∂W = Σ_t ∂E_t/∂W
Using the chain rule of calculus and using the fact that the output at a time step t is a function of the current hidden state of the recurrent unit, the following expression arises:
∂E_t/∂W = ∂E_t/∂y̅_t * ∂y̅_t/∂h_t * ∂h_t/∂W
Note that the weight matrix W used in the above expression is different for the input vector and the hidden state vector and is only used in this manner for notational convenience. Since h_t itself depends on all the earlier hidden states, expanding ∂h_t/∂W through those states gives:
∂E_t/∂W = Σ_{k=0..t} ∂E_t/∂y̅_t * ∂y̅_t/∂h_t * ∂h_t/∂h_k * ∂h_k/∂W
Thus, Back-Propagation Through Time only differs from a typical back-propagation in the fact that the errors at each time step are summed up to calculate the total error.

Although the basic Recurrent Neural Network is fairly effective, it can suffer from a significant problem. For deep networks, the back-propagation process can lead to the following issues:
Vanishing Gradients: this occurs when the gradients become very small and tend towards zero.
Exploding Gradients: this occurs when the gradients become too large due to back-propagation.
The problem of exploding gradients may be solved by a simple hack: putting a threshold on the gradients being passed back in time (gradient clipping; a minimal sketch is shown below). But this is not seen as a full solution to the problem and may also reduce the efficiency of the network. To deal with such problems, two main variants of Recurrent Neural Networks were developed: Long Short Term Memory networks and Gated Recurrent Unit networks.
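
As referenced above, here is a minimal sketch of that thresholding idea using gradient clipping in Keras; the optimizer settings are illustrative assumptions.

Python3

from tensorflow.keras.optimizers import Adam

# Gradient clipping: rescale any gradient whose norm exceeds the threshold (1.0 here).
# `clipvalue` could be used instead to clip each gradient element to a fixed range.
optimizer = Adam(learning_rate=0.001, clipnorm=1.0)

# The optimizer is then passed to model.compile(...) as usual, e.g.:
# model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])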
Recurrent Neural Networks (RNNs) are a type of artificial neural network that is
designed to process sequential data. Unlike traditional feedforward neural
networks, RNNs can take into account the previous state of the sequence while
processing the current state, allowing them to model temporal dependencies in data.
The key feature of RNNs is the presence of recurrent connections between the hidden
units, which allow information to be passed from one time step to the next. This
means that the hidden state at each time step is not only a function of the input
at that time step, but also a function of the previous hidden state.
In an RNN, the input at each time step is typically a vector representing the
current state of the sequence, and the output at each time step is a vector
representing the predicted value or classification at that time step. The hidden
state is also a vector, which is updated at each time step based on the current
input and the previous hidden state.
The basic RNN architecture suffers from the vanishing gradient problem, which can
make it difficult to train on long sequences. To address this issue, several
variants of RNNs have been developed, such as Long Short-Term Memory (LSTM) and
Gated Recurrent Unit (GRU) networks, which use specialized gates to control the
flow of information through the network and address the vanishing gradient problem.
Applications of RNNs include speech recognition, language modeling, machine
translation, sentiment analysis, and stock prediction, among others. Overall, RNNs
are a powerful tool for processing sequential data and modeling temporal
dependencies, making them an important component of many machine learning
applications.
The advantages of Recurrent Neural Networks (RNNs) are:
Ability to Process Sequential Data: RNNs can process sequential data of varying lengths, making them useful in applications such as natural language processing, speech recognition, and time-series analysis.
Memory: RNNs have the ability to retain information about the previous inputs in the sequence through the use of hidden states. This enables RNNs to perform tasks such as predicting the next word in a sentence or forecasting stock prices.
Versatility: RNNs can be used for a wide variety of tasks, including classification, regression, and sequence-to-sequence mapping.
Flexibility: RNNs can be combined with other neural network architectures, such as Convolutional Neural Networks (CNNs) or feedforward neural networks, to create hybrid models for specific tasks.
However, there are also some disadvantages of RNNs:
Vanishing Gradient Problem: the vanishing gradient problem can occur in RNNs, particularly in those with many layers or long sequences, making it difficult to learn long-term dependencies.
Computationally Expensive: RNNs can be computationally expensive, particularly when processing long sequences or using complex architectures.
Lack of Interpretability: RNNs can be difficult to interpret, particularly in terms of understanding how the network is making predictions or decisions.
Overall, while RNNs have some disadvantages, their ability to process sequential data and to retain memory of previous inputs makes them a powerful tool for many machine learning applications.
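As a small illustration of the flexibility point above, here is a hypothetical Keras sketch of a Conv1D + LSTM hybrid for sequence classification; the layer sizes, vocabulary size and binary-classification head are assumptions made only for this example.

Python3

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dense

# The CNN front-end extracts local n-gram-like features, and the LSTM
# then models longer-range order over those features.
hybrid = Sequential([
    Embedding(input_dim=5000, output_dim=32),   # assumed vocabulary/embedding sizes
    Conv1D(filters=64, kernel_size=5, activation='relu'),
    MaxPooling1D(pool_size=2),
    LSTM(64),
    Dense(1, activation='sigmoid'),              # assumed binary target
])
hybrid.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])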

Sentiment Analysis with Recurrent Neural Networks (RNN)
Recurrent Neural Networks (RNNs) come to the rescue when the sequence of information needs to be captured (other use cases include time series and next-word prediction). Because of their internal memory, they remember past elements of the sequence along with the current input, which lets them capture context rather than just individual words. For better understanding, please read the article Introduction to Recurrent Neural Network and related articles on GeeksforGeeks.
We will conduct Sentiment Analysis to understand text classification using TensorFlow!
Importing Libraries and Dataset

Python3

from tensorflow.keras.layers import SimpleRNN, LSTM, GRU, Bidirectional, Dense, Embedding
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential
import numpy as np
We will be using the Keras IMDB dataset. The vocabulary size is a parameter used to get data containing only the given number of most frequently occurring words in the entire corpus of textual data.

Python3

# Getting reviews with words that come under 5000
# most occurring words in the entire
# corpus of textual review data
vocab_size = 5000
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

print(x_train[0])

Output:
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66,3941, 4, 173, 36, 256,
5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172,
112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50,
16, 6, 147, 2025, 19, 14, 22,
4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22,
17, 515, 17, 12, 16, 626, 18,
2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 2, 16, 480, 66, 3785, 33, 4, 130,
12, 16, 38, 619, 5, 25, 124,
..]
These are the index values of the words, hence we do not see the actual review text.

Python3

# Getting all the words from the word_index dictionary
word_idx = imdb.get_word_index()

# Originally the word is the key and the index is the value,
# hence converting the index into the key and the word into the value
word_idx = {i: word for word, i in word_idx.items()}

# again printing the review
print([word_idx[i] for i in x_train[0]])

Output:
['the', 'as', 'you', 'with', 'out', 'themselves', 'powerful', 'lets', 'loves',
'their', 'becomes', 'reaching', 'had', 'journalist', 'of', 'lot', 'from', 'anyone',
'to', 'have', 'after', 'out', 'atmosphere', 'never', 'more', 'room', 'and', 'it',
'so', 'heart', 'shows', 'to', 'years', 'of', 'every', 'never', 'going', 'and',
'help', 'moments', 'or', 'of', 'every', 'chest', 'visual', 'movie', 'except',
'her', 'was', 'several', 'of', 'enough', 'more', 'with', 'is', 'now', 'current',
'film', 'as', 'you', 'of', 'mine', 'potentially', 'unfortunately', 'of', 'you',
'than', 'him', 'that', 'with', 'out', 'themselves', 'her', 'get', 'for', 'was',
'camp', 'of', 'you', 'movie', 'sometimes', 'movie', 'that', 'with', 'scary', 'but',
'and', 'to', 'story', 'wonderful', 'that', 'in', 'seeing', 'in', 'character', 'to',
'of', '70s', 'and', 'with', 'heart', 'had', 'shadows', 'they', 'of', 'here',
'that', 'with', 'her', 'serious', 'to', 'have', 'does', 'when', 'from', 'why',
'what', 'have', 'critics', 'they', 'is', 'you', 'that', "isn't", 'one', 'will',
'very', 'to', 'as', 'itself', 'with', 'other', 'and', 'in', 'of', 'seen', 'over',
'and', 'for', 'anyone', 'of', 'and', 'br', "show's", 'to', 'whether', 'from',
'than', 'out', 'themselves', 'history', 'he', 'name', 'half', 'some', 'br', 'of',
'and', 'odd', 'was', 'two', 'most', 'of', 'mean', 'for', '1', 'any', 'an', 'boat',
'she', 'he', 'should', 'is', 'thought', 'and', 'but', 'of', 'script', 'you', 'not',
'while', 'history', 'he', 'heart', 'to', 'real', 'at', 'and', 'but', 'when',
'from', 'one', 'bit', 'then', 'have', 'two', 'of', 'script', 'their', 'with',
'her', 'nobody', 'most', 'that', 'with', "wasn't", 'to', 'with', 'armed', 'acting',
'watch', 'an', 'for', 'with', 'and', 'film', 'want', 'an']
Let’s check the range of the reviews we have in this dataset.

Python3

# Get the minimum and the maximum length of reviews
print("Max length of a review:: ", len(max((x_train + x_test), key=len)))
print("Min length of a review:: ", len(min((x_train + x_test), key=len)))

Output:
Max length of a review:: 2697
Min length of a review:: 70
We see that the longest review available is 2697 words and the shortest one is 70. While working with neural networks, it is important to make all the inputs a fixed size. To achieve this, we will pad the review sentences.
Python3

from tensorflow.keras.preprocessing import sequence

# Keeping a fixed length of all reviews to max 400 words
max_words = 400

x_train = sequence.pad_sequences(x_train, maxlen=max_words)
x_test = sequence.pad_sequences(x_test, maxlen=max_words)

x_valid, y_valid = x_train[:64], y_train[:64]
x_train_, y_train_ = x_train[64:], y_train[64:]

SimpleRNN (also called Vanilla RNN)


SimpleRNNs are the most basic form of Recurrent Neural Networks, which try to memorize sequential information. However, they suffer from the inherent problems of exploding and vanishing gradients. For a detailed understanding of how RNNs work and their limitations, please read the article Recurrent Neural Networks Explanation.

Python3
# fixing every word's embedding size to be 32
embd_len = 32

# Creating a RNN model
RNN_model = Sequential(name="Simple_RNN")
RNN_model.add(Embedding(vocab_size, embd_len, input_length=max_words))

# In case of a stacked (more than one layer of RNN)
# use return_sequences=True
RNN_model.add(SimpleRNN(128, activation='tanh', return_sequences=False))
RNN_model.add(Dense(1, activation='sigmoid'))

# printing model summary
print(RNN_model.summary())

# Compiling model
RNN_model.compile(loss="binary_crossentropy",
                  optimizer='adam',
                  metrics=['accuracy'])

# Training the model
history = RNN_model.fit(x_train_, y_train_,
                        batch_size=64,
                        epochs=5,
                        verbose=1,
                        validation_data=(x_valid, y_valid))

# Printing model score on test data
print()
print("Simple_RNN Score---> ", RNN_model.evaluate(x_test, y_test, verbose=0))

Output:

The vanilla form of RNN gave us a test accuracy of 64.95%. A limitation of the Simple RNN is that it cannot handle long sentences well because of its vanishing gradient problem.
Gated Recurrent Units (GRU)
GRUs are a lesser-known but equally robust solution to the limitations of simple RNNs. Please read the article Gated Recurrent Unit Networks for a better understanding of how they work.

Python3
# Defining GRU model
gru_model = Sequential(name="GRU_Model")
gru_model.add(Embedding(vocab_size, embd_len, input_length=max_words))
gru_model.add(GRU(128, activation='tanh', return_sequences=False))
gru_model.add(Dense(1, activation='sigmoid'))

# Printing the Summary
print(gru_model.summary())

# Compiling the model
gru_model.compile(loss="binary_crossentropy",
                  optimizer='adam',
                  metrics=['accuracy'])

# Training the GRU model
history2 = gru_model.fit(x_train_, y_train_,
                         batch_size=64,
                         epochs=5,
                         verbose=1,
                         validation_data=(x_valid, y_valid))

# Printing model score on test data
print()
print("GRU model Score---> ", gru_model.evaluate(x_test, y_test, verbose=0))

Output:

The test accuracy of the GRU was found to be 88.14%. GRUs are a form of RNN that perform better than simple RNNs and are often faster to train than LSTMs because they have relatively fewer trainable parameters.
Long Short Term Memory (LSTM)
LSTMs are better than simple RNNs at capturing the memory of sequential information. To understand the theoretical aspects of LSTM, please visit the article Long Short Term Memory Networks Explanation. Because of its increased complexity compared with the GRU, the LSTM is slower to train, but in general LSTMs can give better accuracy than GRUs.

Python3
# Defining LSTM model
lstm_model = Sequential(name="LSTM_Model")
lstm_model.add(Embedding(vocab_size, embd_len, input_length=max_words))
lstm_model.add(LSTM(128, activation='relu', return_sequences=False))
lstm_model.add(Dense(1, activation='sigmoid'))

# Printing Model Summary
print(lstm_model.summary())

# Compiling the model
lstm_model.compile(loss="binary_crossentropy",
                   optimizer='adam',
                   metrics=['accuracy'])

# Training the model
history3 = lstm_model.fit(x_train_, y_train_,
                          batch_size=64,
                          epochs=5,
                          verbose=2,
                          validation_data=(x_valid, y_valid))

# Displaying the model accuracy on test data
print()
print("LSTM model Score---> ", lstm_model.evaluate(x_test, y_test, verbose=0))

Output:

The LSTM model provided a test accuracy of 81.95%.


Bi-directional LSTM Model
Bidirectional LSTMs are a derivative of traditional LSTMs. Here, two LSTMs are used to capture both the forward and backward sequences of the input. This helps in capturing the context better than a normal LSTM. For more information on bidirectional LSTMs, please read the article Emotion Detection using Bidirectional LSTM.

Python3
# Defining Bidirectional LSTM model
bi_lstm_model = Sequential(name="Bidirectional_LSTM")
bi_lstm_model.add(Embedding(vocab_size, embd_len, input_length=max_words))
bi_lstm_model.add(Bidirectional(LSTM(128, activation='tanh', return_sequences=False)))
bi_lstm_model.add(Dense(1, activation='sigmoid'))

# Printing model summary
print(bi_lstm_model.summary())

# Compiling the model
bi_lstm_model.compile(loss="binary_crossentropy",
                      optimizer='adam',
                      metrics=['accuracy'])

# Training the model
history4 = bi_lstm_model.fit(x_train_, y_train_,
                             batch_size=64,
                             epochs=5,
                             verbose=2,
                             validation_data=(x_test, y_test))

# Printing model score on test data
print()
print("Bidirectional LSTM model Score---> ",
      bi_lstm_model.evaluate(x_test, y_test, verbose=0))

Output:

Bidirectional LSTM gave a test score of 87.48%.


Conclusion
All the major flavors of Recurrent Neural Networks were tested in their base forms, keeping all the common hyperparameters, like the number of layers, activation function, batch size, and epochs, the same across all the above models. The model complexity increases as we go from SimpleRNN to Bidirectional LSTM, as the number of trainable parameters goes up. Out of all the models, for the given dataset of IMDB reviews, the GRU model gave the best result in terms of accuracy.
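As a quick way to see the growth in trainable parameters mentioned above, the four models built earlier can be compared with count_params(); this small snippet assumes RNN_model, gru_model, lstm_model and bi_lstm_model from the sections above are still in memory.

Python3

# Compare model sizes (assumes the four models above have been built).
for m in (RNN_model, gru_model, lstm_model, bi_lstm_model):
    print(f"{m.name:20s} trainable parameters: {m.count_params():,}")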

Autocorrector Feature Using NLP In Python
Autocorrect is a way of predicting and correcting wrong spellings, which makes tasks like writing paragraphs, reports, and articles easier. Today there are a lot of websites and social media platforms that use this concept to make their web apps user-friendly.
So, here we are using Machine Learning and NLP to make an autocorrection generator
that will suggest to us the correct spellings for the input word. We will be using
Python Programming Language for this.
Let’s move ahead with the project.
We will be using the NLTK library for the implementation of NLP-related tasks.
To import NLTK use the below command
import nltk
nltk.download('all')
Then the first task is to import the text file we will be using to create the word
list of correct words.
You can download the text file from this link.

Python3

# importing regular expression
import re

# words
w = []

# reading text file
with open('final.txt', 'r', encoding="utf8") as f:
    file_name_data = f.read()
    file_name_data = file_name_data.lower()
    w = re.findall(r'\w+', file_name_data)

# vocabulary
main_set = set(w)

Now we have to count the words and store their frequency. For that we will use
dictionary.
Python3

# Function to count the frequency
# of the words in the whole text file
def counting_words(words):
    word_count = {}
    for word in words:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
    return word_count

Then to calculate the probability of the words prob_cal function is used.

Python3
# Calculating the probability of each word
def prob_cal(word_count_dict):
    probs = {}
    m = sum(word_count_dict.values())
    for key in word_count_dict.keys():
        probs[key] = word_count_dict[key] / m
    return probs

The further code is divided into 5 main parts, which cover the creation of all the different candidate words that are possible. To do this, we can use:
Lemmatization
Deletion of a letter
Switching of letters
Replacing a letter
Inserting a new letter
Let’s see the code implementation of each point
To do lemmatization, we will be using the pattern module. You can install it using the command below:
pip install pattern
Then you can use the code below.

Python3

# LemmWord: extracting and adding
# the root word (i.e. lemma) using the pattern module
import pattern
from pattern.en import lemma, lexeme
from nltk.stem import WordNetLemmatizer

def LemmWord(word):
    return list(lexeme(wd) for wd in word.split())[0]

DeleteLetter : Function that Removes a letter from a given word.

Python3

# Deleting letters from the words
def DeleteLetter(word):
    delete_list = []
    split_list = []

    # considering letters 0 to i then i to -1
    # Leaving the ith letter
    for i in range(len(word)):
        split_list.append((word[0:i], word[i:]))

    for a, b in split_list:
        delete_list.append(a + b[1:])

    return delete_list
Switch_ : This function swaps two letters of the word.

Python3

# Switching two adjacent letters in a word
def Switch_(word):
    split_list = []
    switch_l = []

    # creating pairs of the word (and breaking them)
    for i in range(len(word)):
        split_list.append((word[0:i], word[i:]))

    # keeping the first part (i.e. a) as it is,
    # then swapping the first and second characters of b
    switch_l = [a + b[1] + b[0] + b[2:] for a, b in split_list if len(b) >= 2]

    return switch_l

Replace_ : It changes one letter to another.

Python3
# Replacing a letter one-by-one from the list of alphabets
def Replace_(word):
    split_l = []
    replace_list = []

    for i in range(len(word)):
        split_l.append((word[0:i], word[i:]))

    alphs = 'abcdefghijklmnopqrstuvwxyz'
    replace_list = [a + l + (b[1:] if len(b) > 1 else '')
                    for a, b in split_l if b for l in alphs]

    return replace_list

insert_: It adds additional characters from the bunch of alphabets (one-by-one).

Python3

def insert_(word):
    split_l = []
    insert_list = []

    # Making pairs of the split words
    for i in range(len(word) + 1):
        split_l.append((word[0:i], word[i:]))

    # Storing new words in a list,
    # with one new character at each location
    alphs = 'abcdefghijklmnopqrstuvwxyz'
    insert_list = [a + l + b for a, b in split_l for l in alphs]

    return insert_list

Now, we have implemented all the five steps. It’s time to merge all the words (i.e.
all functions) formed by those steps.
To implement that we will be using 2 different functions

Python3

# Collecting all the words
# in a set (so that no word will repeat)
def colab_1(word, allow_switches=True):
    colab_1 = set()
    colab_1.update(DeleteLetter(word))
    if allow_switches:
        colab_1.update(Switch_(word))
    colab_1.update(Replace_(word))
    colab_1.update(insert_(word))
    return colab_1

# collecting words two edits away, by allowing switches
def colab_2(word, allow_switches=True):
    colab_2 = set()
    edit_one = colab_1(word, allow_switches=allow_switches)
    for w in edit_one:
        if w:
            edit_two = colab_1(w, allow_switches=allow_switches)
            colab_2.update(edit_two)
    return colab_2
Now, the main task is to extract the correct words among all the candidates. To do so, we will use the get_corrections function.

Python3

# Only storing those values which are in the vocab
def get_corrections(word, probs, vocab, n=2):
    suggested_word = []
    best_suggestion = []

    suggested_word = list(
        (word in vocab and word) or
        colab_1(word).intersection(vocab) or
        colab_2(word).intersection(vocab))

    # finding out the words with high frequencies
    best_suggestion = [[s, probs[s]] for s in list(reversed(suggested_word))]

    return best_suggestion

Now that the code is ready, we can test it for any user input with the code below.
Let's print the top 3 suggestions made by the autocorrector.
Python3

# Input
my_word = input("Enter any word: ")

# Counting word frequencies
word_count = counting_words(main_set)

# Calculating probability
probs = prob_cal(word_count)

# only storing correct words
tmp_corrections = get_corrections(my_word, probs, main_set, 2)
for i, word_prob in enumerate(tmp_corrections):
    if i < 3:
        print(word_prob[0])
    else:
        break

Output :
Enter any word:daedd
dared
daned
died
Conclusion
So, we have implemented a basic auto-corrector using the NLTK library and Python. As a further step, we could work on a higher-level auto-corrector system that uses a larger dataset and works more efficiently. To enhance accuracy, we could also use transformers and more NLP-related techniques like n-grams, TF-IDF, and so on.

Python | NLP analysis of Restaurant reviews

Natural language processing (NLP) is an area of computer science and artificial


intelligence concerned with the interactions between computers and human (natural)
languages, in particular how to program computers to process and analyze large
amounts of natural language data. It is the branch of machine learning which is
about analyzing text and handling predictive analysis.
Scikit-learn is a free software machine learning library for the Python programming language. Scikit-learn is largely written in Python, with some core algorithms written in Cython to achieve performance. Cython is a superset of the Python programming language, designed to give C-like performance with code that is written mostly in Python.
Let's understand the various steps involved in text processing and the flow of NLP. This approach can easily be applied to any other kind of text, such as classifying a book into Romance or Fiction, but for now let's use a restaurant review dataset to classify feedback as negative or positive.
Steps involved:
Step 1: Import the dataset, setting the delimiter to '\t', as the columns are separated by tab space. Reviews and their category (0 or 1) are not separated by any other symbol because most other symbols can appear inside the review itself (like $ for the price, '...!', etc.), and the algorithm might use them as a delimiter, which would lead to strange behavior (like errors or weird output).

Python3

# Importing Libraries
import numpy as np
import pandas as pd

# Import dataset
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t')

To download the Restaurant_Reviews.tsv dataset used, click here.
Step 2: Text Cleaning or Preprocessing
Remove punctuation and numbers: punctuation and numbers don't help much in processing the given text; if included, they just increase the size of the bag of words that we will create in the last step and decrease the efficiency of the algorithm.
Stemming: take the root of each word.
Convert each word into lower case: for example, it is useless to keep the same word in different cases (e.g. 'good' and 'GOOD').

Python3

# library to clean data
import re

# Natural Language Tool Kit
import nltk
nltk.download('stopwords')

# to remove stopwords
from nltk.corpus import stopwords

# for stemming purposes
from nltk.stem.porter import PorterStemmer

# Initialize empty array
# to append clean text
corpus = []

# 1000 (reviews) rows to clean
for i in range(0, 1000):
    # column : "Review", row ith
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])

    # convert all cases to lower case
    review = review.lower()

    # split to array (default delimiter is " ")
    review = review.split()

    # creating PorterStemmer object to
    # take the main stem of each word
    ps = PorterStemmer()

    # loop for stemming each word
    # in the string array at the ith row
    review = [ps.stem(word) for word in review
              if not word in set(stopwords.words('english'))]

    # rejoin all string array elements
    # to create back into a string
    review = ' '.join(review)

    # append each string to create
    # an array of clean text
    corpus.append(review)
Examples: Before and after applying above code (reviews = > before, corpus =>
after)

Step 3: Tokenization involves splitting sentences and words from the body of the text.
Step 4: Making the bag of words via a sparse matrix
Take all the different words of the reviews in the dataset, without repeating words.
One column for each word; therefore, there are going to be many columns.
Rows are reviews.
If a word appears in a review, the count of that word is placed in that review's row, under the column of that word.
Example: let's take a dataset of only two reviews.
Input : "dam good steak", "good food good service"
Output :
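The output here showed the resulting bag-of-words count matrix. A minimal sketch (not from the article) that reproduces it for the two example reviews is shown below; the alphabetical feature ordering is CountVectorizer's default behaviour.

Python3

from sklearn.feature_extraction.text import CountVectorizer

docs = ["dam good steak", "good food good service"]
cv = CountVectorizer()
bow = cv.fit_transform(docs).toarray()

# Older scikit-learn versions expose get_feature_names() instead.
print(cv.get_feature_names_out())  # ['dam' 'food' 'good' 'service' 'steak']
print(bow)
# [[1 0 1 0 1]
#  [0 1 2 1 0]]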

For this purpose we need the CountVectorizer class from sklearn.feature_extraction.text. We can also set a maximum number of features (the maximum number of features that help the most, via the "max_features" attribute). Fit on the corpus and apply the transformation in one step with ".fit_transform(corpus)", then convert the result into an array. Whether a review is positive or negative is stored in the second column of the dataset, dataset[:, 1]: all rows and the column at index 1 (indexing from zero).

Python3

# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer

# To extract a maximum of 1500 features.
# "max_features" is the attribute to
# experiment with to get better results
cv = CountVectorizer(max_features=1500)

# X contains the bag-of-words features (independent variables)
X = cv.fit_transform(corpus).toarray()

# y contains the answer whether a review
# is positive or negative (dependent variable)
y = dataset.iloc[:, 1].values
Description of the dataset to be used:
Columns are separated by \t (tab space).
The first column contains the reviews of people.
In the second column, 0 is for a negative review and 1 is for a positive review.
Step 5: Splitting the corpus into training and test sets. For this, we need the train_test_split function from sklearn.model_selection (in older scikit-learn versions it lived in sklearn.cross_validation). The split can be 70/30, 80/20, 85/15 or 75/25; here I choose 75/25 via "test_size". X is the bag of words, y is 0 or 1 (positive or negative).

Python3

# Splitting the dataset into
# the Training set and Test set
from sklearn.model_selection import train_test_split

# experiment with "test_size"
# to get better results
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

Step 6: Fitting a predictive model (here, random forest)
Since random forest is an ensemble model (made of many trees), import the RandomForestClassifier class from sklearn.ensemble.
Use 501 trees, i.e. "n_estimators" = 501, and criterion = 'entropy'.
Fit the model via the .fit() method with arguments X_train and y_train.
Python3

# Fitting Random Forest Classification
# to the Training set
from sklearn.ensemble import RandomForestClassifier

# n_estimators is the number of trees;
# experiment with n_estimators
# to get better results
model = RandomForestClassifier(n_estimators=501, criterion='entropy')
model.fit(X_train, y_train)

Step 7: Predicting the final results via the .predict() method with argument X_test.

Python3
# Predicting the Test set results
y_pred = model.predict(X_test)
y_pred

Note: Accuracy with the random forest was 72%. (It may differ when the experiment is performed with a different test size; here test_size = 0.25.)
Step 8: To know the accuracy, a confusion matrix is needed. The confusion matrix is a 2x2 matrix.
TRUE POSITIVE: actual positives that are correctly identified as positive.
TRUE NEGATIVE: actual negatives that are correctly identified as negative.
FALSE POSITIVE: actual negatives that are incorrectly identified as positive.
FALSE NEGATIVE: actual positives that are incorrectly identified as negative.

Note: True or False refers to the assigned classification being Correct or


Incorrect, while Positive or Negative refers to assignment to the Positive or the
Negative Category

Python3
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
cm
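As a small follow-up not in the original article, the accuracy mentioned above can be read directly off the confusion matrix; this sketch assumes cm was computed as shown just above.

Python3

# scikit-learn's convention for binary labels [0, 1]:
# cm = [[TN, FP],
#       [FN, TP]]
tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"Accuracy from the confusion matrix: {accuracy:.2%}")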

Restaurant Review Analysis Using NLP and SQLite

Normally, a lot of businesses fail due to lack of profit and lack of proper improvement measures. Restaurant owners in particular face many difficulties in improving their productivity. This project helps those who want to increase their productivity, which in turn increases their business profits; that is the main objective of this project.
What the project does: the restaurant owner gets to know the drawbacks of his restaurant, such as the most disliked food items, from customers' text reviews, which are processed with an ML classification algorithm (Naive Bayes), and the results are stored in a database using SQLite.
Tools & Technologies Used: NLTK, Machine Learning, Python, Tkinter, Sqlite3, Pandas
Step-by-step Implementation:
Step 1: Importing Libraries and Initialization of Data
Firstly, we import the NumPy, matplotlib, pandas, nltk, re, sklearn, Tkinter and sqlite3 libraries, which are used for data manipulation, text data processing, pattern recognition, training the data, graphical user interfaces and manipulation of data in the database.

Python3
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from tkinter import *
# messagebox is not pulled in by the star import above
from tkinter import messagebox
import sqlite3

dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t', quoting=3)
corpus = []
rras_code = "Wyd^H3R"
food_rev = {}
food_perc = {}

conn = sqlite3.connect('Restaurant_food_data.db')
c = conn.cursor()

for i in range(0, 1000):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    all_stopwords = stopwords.words('english')
    all_stopwords.remove('not')
    review = [ps.stem(word) for word in review
              if not word in set(all_stopwords)]
    review = ' '.join(review)
    corpus.append(review)

cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, -1].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

classifier = GaussianNB()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

variables = []
clr_variables = []

foods = ["Idly", "Dosa", "Vada", "Roti", "Meals", "Veg Biryani",
         "Egg Biryani", "Chicken Biryani", "Mutton Biryani", "Ice Cream",
         "Noodles", "Manchooriya", "Orange juice", "Apple Juice",
         "Pineapple juice", "Banana juice"]

for i in foods:
    food_rev[i] = []
    food_perc[i] = [0.0, 0.0]


def init_data():
    conn = sqlite3.connect('Restaurant_food_data.db')
    c = conn.cursor()
    for i in range(len(foods)):
        c.execute(
            "INSERT INTO item VALUES(:item_name,:no_of_customers,"
            ":no_of_positives,:no_of_negatives,:pos_perc,:neg_perc)",
            {
                'item_name': foods[i],
                'no_of_customers': "0",
                'no_of_positives': "0",
                'no_of_negatives': "0",
                'pos_perc': "0.0%",
                'neg_perc': "0.0%"
            }
        )
    conn.commit()
    conn.close()

Step 2: Clarifying the user


Initially, our GUI application asks the user whether he is an owner or a customer
to decide what action to be performed.

Python3
root1 = Tk()
main = "Restaurant Review Analysis System/"
root1.title(main + "Welcome Page")

label = Label(root1, text="RESTAURANT REVIEW ANALYSIS SYSTEM",
              bd=2, font=('Arial', 47, 'bold', 'underline'))
ques = Label(root1, text="Are you a Customer or Owner ???")
cust = Button(root1, text="Customer", font=('Arial', 20),
              padx=80, pady=20, command=take_review)
owner = Button(root1, text="Owner", font=('Arial', 20),
               padx=100, pady=20, command=login)

'''conn = sqlite3.connect('Restaurant_food_data.db')
c = conn.cursor()
c.execute("CREATE TABLE item (Item_name text,No_of_customers text,\
    No_of_positive_reviews text,No_of_negative_reviews text,\
    Positive_percentage text,Negative_percentage text) ")
conn.commit()
conn.close()'''
# c.execute("DELETE FROM item")

root1.attributes("-zoomed", True)
label.grid(row=0, column=0)
ques.grid(row=1, column=0, sticky=W+E)
ques.config(font=("Helvetica", 30))
cust.grid(row=2, column=0)
owner.grid(row=3, column=0)
conn.commit()
conn.close()
root1.mainloop()

Clarifying the user
Step 3: Collecting Data
Once the system ensures that the user is a customer, it asks for a food review in text format. The customer must select the food items which he has taken from the restaurant and then give his review on the selected foods. When he clicks on the submit button, the text review is run through the ML algorithm to predict whether it is a positive or negative review. Then, the entire data is inserted into the database.

Python3
def take_review():
    root2 = Toplevel()
    root2.title(main + "give review")
    label = Label(root2, text="RESTAURANT REVIEW ANALYSIS SYSTEM",
                  bd=2, font=('Arial', 47, 'bold', 'underline'))
    req1 = Label(root2, text="Select the item(s) you have taken.....")
    conn = sqlite3.connect('Restaurant_food_data.db')
    c = conn.cursor()
    chk_btns = []
    selected_foods = []
    req2 = Label(root2, text="Give your review below....")
    rev_tf = Entry(root2, width=125, borderwidth=5)
    req3 = Label(root2, text="NOTE : Use not instead of n't.")
    global variables
    variables = []
    chk_btns = []
    for i in range(len(foods)):
        var = IntVar()
        chk = Checkbutton(root2, text=foods[i], variable=var)
        variables.append(var)
        chk_btns.append(chk)
    label.grid(row=0, column=0, columnspan=4)
    req1.grid(row=1, column=0, columnspan=4, sticky=W+E)
    req1.config(font=("Helvetica", 30))
    for i in range(4):
        for j in range(4):
            c = chk_btns[i*4+j]
            c.grid(row=i+3, column=j, columnspan=1, sticky=W)
    selected_foods = []
    submit_review = Button(root2, text="Submit Review",
                           font=('Arial', 20), padx=100, pady=20,
                           command=lambda: [estimate(rev_tf.get()),
                                            root2.destroy()])
    root2.attributes("-zoomed", True)
    req2.grid(row=7, column=0, columnspan=4, sticky=W+E)
    req2.config(font=("Helvetica", 20))
    rev_tf.grid(row=8, column=1, rowspan=3, columnspan=2, sticky=S)
    req3.grid(row=11, column=1, columnspan=2)
    submit_review.grid(row=12, column=0, columnspan=4)
    conn.commit()
    conn.close()


# Processing and storing the data
def estimate(s):
    conn = sqlite3.connect('Restaurant_food_data.db')
    c = conn.cursor()
    review = re.sub('[^a-zA-Z]', ' ', s)
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    all_stopwords = stopwords.words('english')
    all_stopwords.remove('not')
    review = [ps.stem(word) for word in review
              if not word in set(all_stopwords)]
    review = ' '.join(review)
    X = cv.transform([review]).toarray()
    res = classifier.predict(X)  # list
    if "not" in review:
        res[0] = abs(res[0]-1)
    selected_foods = []
    for i in range(len(foods)):
        if variables[i].get() == 1:
            selected_foods.append(foods[i])
    c.execute("SELECT *,oid FROM item")
    records = c.fetchall()
    for i in records:
        rec = list(i)
        if rec[0] in selected_foods:
            n_cust = int(rec[1]) + 1
            n_pos = int(rec[2])
            n_neg = int(rec[3])
            if res[0] == 1:
                n_pos += 1
            else:
                n_neg += 1
            pos_percent = round((n_pos/n_cust)*100, 1)
            neg_percent = round((n_neg/n_cust)*100, 1)
            c.execute(
                """UPDATE item SET Item_name=:item_name,
                   No_of_customers=:no_of_customers,
                   No_of_positive_reviews=:no_of_positives,
                   No_of_negative_reviews=:no_of_negatives,
                   Positive_percentage=:pos_perc,
                   Negative_percentage=:neg_perc WHERE oid=:Oid""",
                {
                    'item_name': rec[0],
                    'no_of_customers': str(n_cust),
                    'no_of_positives': str(n_pos),
                    'no_of_negatives': str(n_neg),
                    'pos_perc': str(pos_percent) + "%",
                    'neg_perc': str(neg_percent) + "%",
                    'Oid': foods.index(rec[0]) + 1
                }
            )
    selected_foods = []
    conn.commit()
    conn.close()

Taking the review
Step 4: Verifying Ownership
If the current user of the system is the owner of the restaurant, then the system verifies the owner by asking for the rras_code (i.e., a code that uniquely identifies a restaurant all over the world; it is highly confidential, and one should not share this code with anyone except co-owners of the restaurant).

Python3

def login():
    root3 = Toplevel()
    root3.title(main + "owner verification")
    label = Label(root3, text="RESTAURANT REVIEW ANALYSIS SYSTEM",
                  bd=2, font=('Arial', 47, 'bold', 'underline'))
    label2 = Label(root3, text="VERIFY OWNERSHIP", bd=1,
                   font=('Helvetica', 30, 'bold', 'underline'))
    label3 = Label(root3, text="To verify your ownership, please enter "
                               "your restaurant's private rras code....",
                   bd=1, font=('Helvetica', 20, 'bold'))
    ent = Entry(root3, show="*", borderwidth=2)
    submit_code = Button(root3, text="Submit", font=('Arial', 20),
                         padx=80, pady=20,
                         command=lambda: [view_details(ent.get()),
                                          root3.destroy()])
    root3.attributes("-zoomed", True)
    label.grid(row=0, column=0, columnspan=3)
    label2.grid(row=1, column=0, sticky=W+E, columnspan=3)
    label3.grid(row=2, column=0, sticky=W, columnspan=3)
    ent.grid(row=3, column=1, columnspan=1)
    submit_code.grid(row=4, column=1, columnspan=1)

Verifying ownership
Step 5: Accessing Data
When the ownership is verified, the owner has 3 options, described on a new page as shown below:

Python3

def popup():
    messagebox.showerror("Error Message!", "Incorrect code!")


def view_details(s):
    if s != rras_code:
        popup()
    else:
        root4 = Toplevel()
        root4.title(main + "view_details")
        label = Label(root4, text="RESTAURANT REVIEW ANALYSIS SYSTEM",
                      bd=2, font=('Arial', 47, 'bold', 'underline'))
        sug1 = Label(root4, text="Click the below button, if you want to view "
                                 "the data from your database....")
        acc_btn = Button(root4, text="View Data", font=('Arial', 20),
                         padx=100, pady=20, command=access_data)
        sug2 = Label(root4, text="Click the below button, if you want "
                                 "to clear specific item data...")
        itemclr_btn = Button(root4, text="Clear Item Data", font=('Arial', 20),
                             padx=100, pady=20, command=clr_itemdata)
        sug3 = Label(root4, text="Click the below button, if you want to "
                                 "clear all item data...")
        allclr_btn = Button(root4, text="Clear All Data", font=('Arial', 20),
                            padx=100, pady=20, command=clr_alldata)
        exit_btn = Button(root4, text="Exit", command=root4.destroy)
        root4.attributes("-zoomed", True)
        label.grid(row=0, column=0)
        sug1.grid(row=1, column=0)
        sug1.config(font=("Helvetica", 30))
        acc_btn.grid(row=2, column=0)
        sug2.grid(row=3, column=0)
        sug2.config(font=("Helvetica", 30))
        itemclr_btn.grid(row=4, column=0)
        sug3.grid(row=5, column=0)
        sug3.config(font=("Helvetica", 30))
        allclr_btn.grid(row=6, column=0)
        exit_btn.grid(row=9, column=0, sticky=S)

Options
Step 6: Viewing the Data
The owner can view the data in the database, where each food item has attributes like number of customers, number of positive reviews, number of negative reviews, positive rate, and negative rate. The highest positive-rated food is labelled in green, and the lowest positive-rated food item is labelled in red, to make the summary of the data easy to understand. The owner can then identify low-rated food items and try to improve their taste by taking measures such as bringing in new chefs, which helps the business.

Python3
def access_data():
    root5 = Toplevel()
    root5.title(main + "Restaurant_Database")
    label = Label(root5, text="RESTAURANT REVIEW ANALYSIS SYSTEM",
                  bd=2, font=('Arial', 47, 'bold', 'underline'))
    title1 = Label(root5, text="S.NO", font=('Arial', 10, 'bold', 'underline'))
    title2 = Label(root5, text="FOOD ITEM",
                   font=('Arial', 10, 'bold', 'underline'))
    title3 = Label(root5, text="NO.OF CUSTOMERS",
                   font=('Arial', 10, 'bold', 'underline'))
    title4 = Label(root5, text="NO.OF POSITIVE REVIEWS",
                   font=('Arial', 10, 'bold', 'underline'))
    title5 = Label(root5, text="NO.OF NEGATIVE REVIEWS",
                   font=('Arial', 10, 'bold', 'underline'))
    title6 = Label(root5, text="POSITIVE RATE",
                   font=('Arial', 10, 'bold', 'underline'))
    title7 = Label(root5, text="NEGATIVE RATE",
                   font=('Arial', 10, 'bold', 'underline'))
    label.grid(row=0, column=0, columnspan=7)
    title1.grid(row=1, column=0)
    title2.grid(row=1, column=1)
    title3.grid(row=1, column=2)
    title4.grid(row=1, column=3)
    title5.grid(row=1, column=4)
    title6.grid(row=1, column=5)
    title7.grid(row=1, column=6)
    conn = sqlite3.connect('Restaurant_food_data.db')
    c = conn.cursor()
    c.execute("SELECT *,oid from item")
    records = c.fetchall()
    pos_rates = []
    for record in records:
        record = list(record)
        pos_rates.append(float(record[-3][:-1]))
    max_pos = max(pos_rates)
    min_pos = min(pos_rates)
    for i in range(len(records)):
        rec_list = list(records[i])
        if str(max_pos)+"%" == rec_list[-3]:
            rec_lab = [Label(root5, text=str(rec_list[-1]), fg="green")]
            for item in rec_list[:-1]:
                lab = Label(root5, text=item, fg="green")
                rec_lab.append(lab)
        elif str(min_pos)+"%" == rec_list[-3]:
            rec_lab = [Label(root5, text=str(rec_list[-1]), fg="red")]
            for item in rec_list[:-1]:
                lab = Label(root5, text=item, fg="red")
                rec_lab.append(lab)
        else:
            rec_lab = [Label(root5, text=str(rec_list[-1]))]
            for item in rec_list[:-1]:
                lab = Label(root5, text=item)
                rec_lab.append(lab)
        for j in range(len(rec_lab)):
            rec_lab[j].grid(row=i+2, column=j)
    exit_btn = Button(root5, text="Exit", command=root5.destroy)
    exit_btn.grid(row=len(records)+5, column=3)
    conn.commit()
    conn.close()
    root5.attributes("-zoomed", True)

Database content
Step 7: Clearing the Data
When adjustments or modifications have been made, the owner can clear the data of those specific items so that he can quickly observe the performance of the food item afterwards. If the owner wants, he can also clear all food item data.

Python3
def clr_itemdata():
    root6 = Toplevel()
    root6.title(main + "clear_item_data")
    label = Label(root6, text="RESTAURANT REVIEW ANALYSIS SYSTEM",
                  bd=2, font=('Arial', 47, 'bold', 'underline'))
    req1 = Label(root6, text="Pick the items to clear their corresponding "
                             "item data....")
    chk_list = []
    global clr_variables
    clr_variables = []
    for i in range(len(foods)):
        var = IntVar()
        chk = Checkbutton(root6, text=foods[i], variable=var)
        clr_variables.append(var)
        chk_list.append(chk)
    label.grid(row=0, column=0, columnspan=4)
    req1.grid(row=1, column=0, columnspan=4, sticky=W+E)
    req1.config(font=("Helvetica", 30))
    for i in range(4):
        for j in range(4):
            c = chk_list[i*4+j]
            c.grid(row=i+3, column=j, columnspan=1, sticky=W)
    clr_item = Button(root6, text="Clear", font=('Arial', 20),
                      padx=100, pady=20,
                      command=lambda: [clr_data(), root6.destroy()])
    clr_item.grid(row=8, column=0, columnspan=4)
    root6.attributes("-zoomed", True)


def clr_alldata():
    confirm = messagebox.askquestion(
        "Confirmation", "Are you sure to delete all data??")
    if confirm == "yes":
        conn = sqlite3.connect('Restaurant_food_data.db')
        c = conn.cursor()
        for i in range(len(foods)):
            c.execute(
                """UPDATE item SET Item_name=:item_name,
                   No_of_customers=:no_of_customers,
                   No_of_positive_reviews=:no_of_positives,
                   No_of_negative_reviews=:no_of_negatives,
                   Positive_percentage=:pos_perc,
                   Negative_percentage=:neg_perc WHERE oid=:Oid""",
                {
                    'item_name': foods[i],
                    'no_of_customers': "0",
                    'no_of_positives': "0",
                    'no_of_negatives': "0",
                    'pos_perc': "0.0%",
                    'neg_perc': "0.0%",
                    'Oid': i + 1
                }
            )
        conn.commit()
        conn.close()
Clearing food item data
Finally, this is my idea for increasing the productivity of businesses with technology. With this, business problems are reduced through improved productivity.
Project Applications in Real Life:
It can be used in any food restaurant/hotel.
Effective for food improvement measures that directly improve one's business.
User-friendly.
Reduces the chance of business loss.

Customer Segmentation using Unsupervised Machine Learning in Python

In today’s era, companies work hard to make their customers happy. They launch new
technologies and services so that customers can use their products more. They try
to be in touch with each of their customers so that they can provide goods
accordingly. But practically, it’s very difficult and non-realistic to keep in
touch with everyone. So, here comes the usage of Customer Segmentation.
Customer Segmentation means the segmentation of customers on the basis of their
similar characteristics, behavior, and needs. This will eventually help the company
in many ways. Like, they can launch the product or enhance the features
accordingly. They can also target a particular sector as per their behaviors. All
of these lead to an enhancement in the overall market value of the company.
Today we will be using Machine Learning to implement the task of Customer
Segmentation.
Import Libraries
The libraries we will require are:
Pandas – This library helps to load the data frame in a 2D array format.
Numpy – Numpy arrays are very fast and can perform large computations.
Matplotlib / Seaborn – These libraries are used to draw visualizations.
Sklearn – This module contains multiple libraries with pre-implemented functions to perform tasks from data preprocessing to model development and evaluation.

Python3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.cluster import KMeans

import warnings
warnings.filterwarnings('ignore')

Importing Dataset
The dataset taken for the task includes details of customers, such as their marital status, their income, the number of items purchased, the types of items purchased, and so on.

Python3

df = pd.read_csv('new.csv')
df.head()
Output:

To check the shape of the dataset, we can use df.shape.

Python3

df.shape

Output:
(2240, 25)
To get the information of the dataset like checking the null values, count of
values, etc. we will use .info() method.
Data Preprocessing

Python3
df.info()

Output:

Python3

df.describe().T
Output:

Improving the values in the Accepted column.

Python3

df['Accepted'] = df['Accepted'].str.replace('Accepted', '')

To check the null values in the dataset.

Python3
for col in df.columns:
    temp = df[col].isnull().sum()
    if temp > 0:
        print(f'Column {col} contains {temp} null values.')

Output:
Column Income contains 24 null values.
Now that we have the count of the null values and we know there are very few of them, we can drop those rows (it will not affect the dataset much).

Python3
df = df.dropna()
print("Total rows remaining are:", len(df))

Output:
Total rows remaining are: 2216
To find the total number of unique values in each column, we can use the df.nunique() method.

Python3

df.nunique()
Output:

Here we can observe that there are columns which contain a single value in the whole column, so they have no relevance to model development.
Also, the dataset has a column Dt_Customer containing a date, which we can convert into 3 columns, i.e. day, month, and year.

Python3

parts = df["Dt_Customer"].str.split("-", n=3, expand=True)
df["day"] = parts[0].astype('int')
df["month"] = parts[1].astype('int')
df["year"] = parts[2].astype('int')

Now we have all the important features, we can now drop features like
Z_CostContact, Z_Revenue, Dt_Customer.

Python3
df.drop(['Z_CostContact', 'Z_Revenue', 'Dt_Customer'],
        axis=1, inplace=True)

Data Visualization and Analysis


Data visualization is the graphical representation of information and data in a
pictorial or graphical format. Here we will be using bar plot and count plot for
better visualization.

Python3

floats, objects = [], []
for col in df.columns:
    if df[col].dtype == object:
        objects.append(col)
    elif df[col].dtype == float:
        floats.append(col)

print(objects)
print(floats)

Output:
['Education', 'Marital_Status', 'Accepted']
['Income']
To get the count plot for the columns of the datatype – object, refer the code
below.

Python3

plt.subplots(figsize=(15, 10))
for i, col in enumerate(objects):
    plt.subplot(2, 2, i + 1)
    sb.countplot(df[col])
plt.show()
Output:

Let’s check the value_counts of the Marital_Status of the data.

Python3

df['Marital_Status'].value_counts()

Output:

Now lets see the comparison of the features with respect to the values of the
responses.

Python3
plt.subplots(figsize=(15, 10))
for i, col in enumerate(objects):
    plt.subplot(2, 2, i + 1)
    sb.countplot(df[col], hue=df['Response'])
plt.show()

Output:
Label Encoding
Label Encoding is used to convert categorical values into numerical values so that the model can understand them.

Python3

for col in df.columns:
    if df[col].dtype == object:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])
A heatmap is a good way to visualize the correlation among the different features of the dataset. Let's highlight correlations above a threshold of 0.8.

Python3

plt.figure(figsize=(15, 15))
sb.heatmap(df.corr() > 0.8, annot=True, cbar=False)
plt.show()

Output:
Standardization
Standardization is a method of feature scaling, which is an integral part of feature engineering. It scales the data, making it easier for the machine learning model to learn from it. It shifts the mean to 0 and the standard deviation to 1.

Python3

scaler = StandardScaler()
data = scaler.fit_transform(df)
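As a quick, optional check not in the original article, you can verify that the scaled features really do end up with mean ≈ 0 and standard deviation ≈ 1; this assumes data was produced by the fit_transform call above.

Python3

import numpy as np

# Each column of the scaled array should now be centred and unit-variance.
print("per-feature means:", np.round(data.mean(axis=0), 3))
print("per-feature stds: ", np.round(data.std(axis=0), 3))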

Segmentation
We will be using T-distributed Stochastic Neighbor Embedding (t-SNE). It helps in visualizing high-dimensional data. It converts similarities between data points to joint probabilities and tries to preserve those similarities in a low-dimensional embedding.

Python3
from sklearn.manifold import TSNE

model = TSNE(n_components=2, random_state=0)
tsne_data = model.fit_transform(df)

plt.figure(figsize=(7, 7))
plt.scatter(tsne_data[:, 0], tsne_data[:, 1])
plt.show()

Output:

There are certainly some clusters which are clearly visible in the 2-D representation of the given data. Let's use the KMeans algorithm to find those clusters in the high-dimensional space itself.
KMeans clustering can also be used to cluster the different points in a plane.

Python3

error = []
for n_clusters in range(1, 21):
    model = KMeans(init='k-means++',
                   n_clusters=n_clusters,
                   max_iter=500,
                   random_state=22)
    model.fit(df)
    error.append(model.inertia_)

Here inertia is nothing but the sum of squared distances within the clusters.
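To make the meaning of inertia concrete, here is a small sketch (not from the article) that recomputes it by hand for one fitted model and compares it with model.inertia_; it assumes df and the KMeans settings from the loop above, and uses k = 6 purely for illustration.

Python3

import numpy as np
from sklearn.cluster import KMeans

km = KMeans(init='k-means++', n_clusters=6, max_iter=500, random_state=22).fit(df)

# Inertia = sum over all points of the squared distance to their assigned centroid.
centers = km.cluster_centers_[km.labels_]          # centroid of each point's cluster
manual_inertia = ((np.asarray(df) - centers) ** 2).sum()

print(km.inertia_, manual_inertia)   # the two values should match closely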

Python3

plt.figure(figsize=(10, 5))
sb.lineplot(x=range(1, 21), y=error)
sb.scatterplot(x=range(1, 21), y=error)
plt.show()

Output:
Here by using the elbow method we can say that k = 6 is the optimal number of
clusters that should be made as after k = 6 the value of the inertia is not
decreasing drastically.

Python3

# create clustering model with the optimal k = 6 found above
model = KMeans(init='k-means++',
               n_clusters=6,
               max_iter=500,
               random_state=22)
segments = model.fit_predict(df)

Scatterplot will be used to see all the 6 clusters formed by KMeans Clustering.

Python3
plt.figure(figsize=(7, 7))
sb.scatterplot(tsne_data[:, 0], tsne_data[:, 1], hue=segments)
plt.show()

Output:

Music Recommendation System Using Machine Learning

Say you watched a funny video on YouTube; the next time you open the YouTube app, you get recommendations for more funny videos in your feed. Ever thought about how? This is nothing but an application of Machine Learning, in which recommender systems are built to provide a personalized experience and increase customer engagement.
In this article, we will try to build a very basic recommender system that can recommend songs based on which songs you listen to.
Importing Libraries & Dataset
Python libraries make it very easy for us to handle the data and perform typical
and complex tasks with a single line of code.
Pandas – This library helps to load the data frame in a 2D array format and has multiple functions to perform analysis tasks in one go.
Numpy – Numpy arrays are very fast and can perform large computations in a very short time.
Matplotlib/Seaborn – These libraries are used to draw visualizations.
Sklearn – This module contains multiple libraries with pre-implemented functions to perform tasks from data preprocessing to model development and evaluation.

Python3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE

import warnings
warnings.filterwarnings('ignore')

The dataset we are going to use contains data about songs released in the span of
around 100 years. Along with some general information about songs some scientific
measures of sound are also provided like loudness, acoustics, speechiness, and so
on.

Python3

tracks = pd.read_csv('tracks_records.csv')
tracks.head()
Output:
First five rows of the dataset
Data Cleaning
Data cleaning is one of the important steps without which the data will be of little use, because raw data contains a lot of noise that must be removed; otherwise the observations made from it will be inaccurate, and if we build a model upon it, its performance will be poor as well. Steps included in data cleaning are outlier removal, null value imputation, and fixing the skewness of the data.

Python3

tracks.shape
Output:
(586672, 19)

Python3

tracks.info()

Output:
Basic information about the columns of the dataset
Now. let’s check if there are null values in the columns of our data frame.

Python3
tracks.isnull().sum()

Output:
Number of null values in each column
The genre is a very important indicator of the type of music, which is why we will remove the rows with null genre values. We could have imputed them as well, but we have a huge dataset of around 6 lakh (600,000) rows, so removing about 50,000 won't affect it much (depending upon the case).

Python3

tracks.dropna(inplace=True)
tracks.isnull().sum().plot.bar()
plt.show()
Output:
After removing rows containing null values
Now let’s remove some columns which we won’t be using to build our recommender
system.

Python3

tracks = tracks.drop(['id', 'id_artists'], axis = 1)

Exploratory Data Analysis


EDA is an approach to analyzing the data using visual techniques. It is used to
discover trends, and patterns, or to check assumptions with the help of statistical
summaries and graphical representations.
The dataset we have contains around 14 numerical columns, but we cannot visualize such high-dimensional data directly. To solve this problem, t-SNE comes to the rescue. t-SNE is an algorithm that can convert high-dimensional data to low dimensions, using a non-linear method that is not a concern of this article.

Python3

model = TSNE(n_components=2, random_state=0)

# t-SNE works on the numeric audio features; here we use the first 500 rows
tsne_data = model.fit_transform(
    tracks.select_dtypes(include=np.number).head(500))

plt.figure(figsize=(7, 7))
plt.scatter(tsne_data[:, 0], tsne_data[:, 1])
plt.show()

Output:
Scatter plot of the output of t-SNE
Here we can observe some clusters.
Formation of clusters in 2-D space
As we know, multiple versions of the same song are released, hence we need to remove the different versions of the same song. We are building a content-based recommender system whose main worker is the cosine similarity function, so otherwise our system would recommend the other versions of the same song if available, and that is not what we want.

Python3
tracks['name'].nunique(), tracks.shape

Output:
(408902, (536847, 17))
So, our concern was right so, let’s remove the duplicate rows based upon the song
names.

Python3
tracks = tracks.sort_values(by=['popularity'], ascending=False)
tracks.drop_duplicates(subset=['name'], keep='first', inplace=True)

Let’s visualize the number of songs released each year.

Python3

plt.figure(figsize=(10, 5))
sb.countplot(tracks['release_year'])
plt.axis('off')
plt.show()

Output:
Countplot of the number of songs in subsequent years
Here we can see a boom in the music industry from the year 1900 to somewhere around
1990.

Python3

floats = []
for col in tracks.columns:
    if tracks[col].dtype == 'float':
        floats.append(col)

len(floats)

Output:
10
There is a total of 10 such columns with float values in them. Let’s draw their
distribution plot to get insights into the distribution of the data.

Python3
plt.subplots(figsize=(15, 5))
for i, col in enumerate(floats):
    plt.subplot(2, 5, i + 1)
    sb.distplot(tracks[col])
plt.tight_layout()
plt.show()

Output:
Distribution plot of the continuous features
Some of the features have normal distribution while some data distribution is
skewed as well.

Python3

%%capture
song_vectorizer = CountVectorizer()
song_vectorizer.fit(tracks['genres'])
As the dataset is very large, the computation cost/time would be too high, so we will show the implementation of the recommender system using only the most popular 10,000 songs.

Python3

tracks = tracks.sort_values(by=['popularity'], ascending=False).head(10000)

Below is a helper function to get similarities for the input song with each song in
the dataset.

Python3
def get_similarities(song_name, data):
    # Getting vector for the input song.
    text_array1 = song_vectorizer.transform(
        data[data['name'] == song_name]['genres']).toarray()
    num_array1 = data[data['name'] == song_name].select_dtypes(
        include=np.number).to_numpy()

    # We will store similarity for each row of the dataset.
    sim = []
    for idx, row in data.iterrows():
        name = row['name']

        # Getting vector for the current song.
        text_array2 = song_vectorizer.transform(
            data[data['name'] == name]['genres']).toarray()
        num_array2 = data[data['name'] == name].select_dtypes(
            include=np.number).to_numpy()

        # Calculating similarities for text as well as numeric features
        text_sim = cosine_similarity(text_array1, text_array2)[0][0]
        num_sim = cosine_similarity(num_array1, num_array2)[0][0]
        sim.append(text_sim + num_sim)

    return sim

To calculate the similarity between the two vectors we have used the concept of
cosine similarity.
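For reference, cosine similarity is the dot product of two vectors divided by the product of their norms; the tiny sketch below (not from the article) checks a manual computation against scikit-learn's cosine_similarity on two made-up vectors.

Python3

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 1.0, 1.0]])

manual = (a @ b.T) / (np.linalg.norm(a) * np.linalg.norm(b))
print(manual[0][0], cosine_similarity(a, b)[0][0])  # both ≈ 0.7303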

Python3
def recommend_songs(song_name, data=tracks):
    # Base case
    if tracks[tracks['name'] == song_name].shape[0] == 0:
        print('This song is either not so popular or you have entered '
              'an invalid name.\n Some songs you may like:\n')
        for song in data.sample(n=5)['name'].values:
            print(song)
        return

    data['similarity_factor'] = get_similarities(song_name, data)

    data.sort_values(by=['similarity_factor', 'popularity'],
                     ascending=[False, False],
                     inplace=True)

    # The first song will be the input song itself, as its similarity is highest.
    display(data[['name', 'artists']][2:7])

Now, it’s time to see the recommender system at work. Let’s see which songs are
recommender system will recommend if he/she listens to the famous song ‘Shape of
you’.

Python3
recommend_songs('Shape of You')

Output:
Recommended songs if you hear ‘Shape of you’
Let’s try this on one more song.

Python3

recommend_songs('Love Someone')
Output:
Recommended songs if you hear ‘Love Someone’
Below shown is the case if the song name entered is incorrect.

Python3

recommend_songs('Love me like you do')

Output:
If the input song name is not in the dataset
Conclusion
This model requires a lot of changes before it can be used in any real-world music app or website, but it gives an overview of how recommendation systems are built and used.

K means Clustering - Introduction
Please Login to comment...

K-Means Clustering is an Unsupervised Machine Learning algorithm, which groups the


unlabeled dataset into different clusters. The article aims to explore the fundamentals and working of k-means clustering along with its implementation.
Table of Content
What is K-means Clustering?
What is the objective of k-means clustering?
How does k-means clustering work?
Implementation of K-Means Clustering in Python
What is K-means Clustering?
Unsupervised Machine Learning is the process of teaching a computer to use unlabeled, unclassified data and enabling the algorithm to operate on that data without supervision. Without any previous training data, the machine's job in this case is to organize unsorted data according to parallels, patterns, and variations.
What is the objective of k-means clustering?
The goal of clustering is to divide the population or set of data points into a number of groups so that the data points within each group are more comparable to one another and different from the data points in the other groups. It is essentially a grouping of things based on how similar and different they are to one another.
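Concretely, k-means minimizes the within-cluster sum of squared distances to the cluster centers (what scikit-learn calls inertia). A small illustrative sketch with made-up points, labels, and centroids, not tied to any dataset used later:

Python3
import numpy as np

# Hypothetical points, cluster assignments, and centroids for k = 2
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.25, 1.5], [8.5, 8.75]])

# Within-cluster sum of squared distances: the quantity k-means tries to minimize
inertia = sum(np.sum((points[labels == i] - centroids[i]) ** 2)
              for i in range(len(centroids)))
print(inertia)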
How k-means clustering works?
We are given a data set of items, with certain features, and values for these features (like a vector). The task is to categorize those items into groups. To achieve this, we will use the K-means algorithm, an unsupervised learning algorithm. ‘K’ in the name of the algorithm represents the number of groups/clusters we want to classify our items into.
(It will help if you think of items as points in an n-dimensional space.) The algorithm will categorize the items into k groups or clusters of similarity. To calculate that similarity, we will use the Euclidean distance as a measurement.
The algorithm works as follows:
First, we randomly initialize k points, called means or cluster centroids.
We categorize each item to its closest mean, and we update the mean’s coordinates, which are the averages of the items categorized in that cluster so far.
We repeat the process for a given number of iterations and at the end, we have our clusters.
The “points” mentioned above are called means because they are the mean values of the items categorized in them. To initialize these means, we have a lot of options. An intuitive method is to initialize the means at random items in the data set. Another method is to initialize the means at random values between the boundaries of the data set (if for a feature x, the items have values in [0,3], we will initialize the means with values for x in [0,3]).
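Both initialization strategies described above can be sketched in a few lines of NumPy (a rough illustration, assuming X is an array of shape (n_samples, n_features)):

Python3
import numpy as np

def init_from_random_items(X, k, seed=0):
    # Pick k distinct items from the data set and use them as the initial means
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=k, replace=False)
    return X[idx].copy()

def init_from_feature_bounds(X, k, seed=0):
    # Draw each mean uniformly between the per-feature minimum and maximum of the data
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return lo + rng.random((k, X.shape[1])) * (hi - lo)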
The above algorithm in pseudocode is as follows:
Initialize k means with random values

--> For a given number of iterations:
    --> Iterate through items:
        --> Find the mean closest to the item by calculating the euclidean distance of the item with each of the means
        --> Assign item to mean
    --> Update mean by shifting it to the average of the items in that cluster

Implementation of K-Means Clustering in Python
Example 1
Import the necessary Libraries
We are importing Numpy for statistical computations, Matplotlib to plot the graph, and make_blobs from sklearn.datasets.

Python3
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

Create the custom dataset with make_blobs and plot it

Python3

X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)

fig = plt.figure(0)
plt.grid(True)
plt.scatter(X[:, 0], X[:, 1])
plt.show()
Output:
Clustering dataset

Initialize the random centroids
The code initializes three clusters for K-means clustering. It sets a random seed, generates random cluster centers within a specified range, and creates an empty list of points for each cluster.

Python3

k = 3
clusters = {}
np.random.seed(23)

for idx in range(k):
    center = 2 * (2 * np.random.random((X.shape[1],)) - 1)
    points = []
    cluster = {
        'center': center,
        'points': []
    }
    clusters[idx] = cluster

clusters

Output:
{0: {'center': array([0.06919154, 1.78785042]), 'points': []},
 1: {'center': array([ 1.06183904, -0.87041662]), 'points': []},
 2: {'center': array([-1.11581855, 0.74488834]), 'points': []}}

Plot the randomly initialized centers with the data points

Python3
plt.scatter(X[:, 0], X[:, 1])
plt.grid(True)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='*', c='red')
plt.show()

Output:
Data points with random centers

The plot displays a scatter plot of the data points (X[:,0], X[:,1]) with grid lines. It also marks the initial cluster centers (red stars) generated for K-means clustering.
Define Euclidean distance

Python3

def distance(p1, p2):
    return np.sqrt(np.sum((p1 - p2)**2))


Create the functions to Assign and Update the cluster centers
The E-step assigns data points to the nearest cluster center, and the M-step updates the cluster centers based on the mean of the assigned points in K-means clustering.

Python3

# Implementing the E-step
def assign_clusters(X, clusters):
    for idx in range(X.shape[0]):
        dist = []
        curr_x = X[idx]

        for i in range(k):
            dis = distance(curr_x, clusters[i]['center'])
            dist.append(dis)
        curr_cluster = np.argmin(dist)
        clusters[curr_cluster]['points'].append(curr_x)
    return clusters

# Implementing the M-step
def update_clusters(X, clusters):
    for i in range(k):
        points = np.array(clusters[i]['points'])
        if points.shape[0] > 0:
            new_center = points.mean(axis=0)
            clusters[i]['center'] = new_center
            clusters[i]['points'] = []
    return clusters
Create the function to predict the cluster for the data points

Python3

def pred_cluster(X, clusters):
    pred = []
    for i in range(X.shape[0]):
        dist = []
        for j in range(k):
            dist.append(distance(X[i], clusters[j]['center']))
        pred.append(np.argmin(dist))
    return pred

Assign, Update, and predict the cluster center

Python3
clusters = assign_clusters(X, clusters)
clusters = update_clusters(X, clusters)
pred = pred_cluster(X, clusters)

Plot the data points with their predicted cluster center

Python3

plt.scatter(X[:, 0], X[:, 1], c=pred)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='^', c='red')
plt.show()
Output:
K-means Clustering

The plot shows the data points colored by their predicted clusters. The red markers represent the updated cluster centers after the E-M steps of the K-means clustering algorithm.

Example 2
Import the necessary libraries

Python3

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

Load the Dataset

Python3
X, y = load_iris(return_X_y=True)

Elbow Method
Finding the ideal number of groups to divide the data into is a basic step in any unsupervised algorithm. One of the most common techniques for figuring out this ideal value of k is the elbow method.

Python3

# Find the optimum number of clusters
sse = []  # sum of squared errors
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=2)
    km.fit(X)
    sse.append(km.inertia_)
Plot the Elbow graph to find the optimum number of clusters

Python3

sns.set_style("whitegrid")g=sns.lineplot(x=range(1,11), y=sse) g.set(xlabel


="Number of cluster (k)", ylabel = "Sum Squared Error", title ='Elbow
Method') plt.show()

Output:
Elbow Method

From the above graph, we can observe an elbow-like bend at k=2 and k=3, so we consider k=3.

Build the KMeans clustering model

Python3
kmeans = KMeans(n_clusters=3, random_state=2)
kmeans.fit(X)

Output:
KMeans(n_clusters=3, random_state=2)

Find the cluster centers

Python3
kmeans.cluster_centers_

Output:
array([[5.006     , 3.428     , 1.462     , 0.246     ],
       [5.9016129 , 2.7483871 , 4.39354839, 1.43387097],
       [6.85      , 3.07368421, 5.74210526, 2.07105263]])

Predict the cluster group:

Python3

pred = kmeans.fit_predict(X)pred

Output:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,
1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1,
1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1],
dtype=int32)

Plot the cluster centers with the data points

Python3

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=pred, cmap=cm.Accent)
plt.grid(True)
for center in kmeans.cluster_centers_:
    center = center[:2]
    plt.scatter(center[0], center[1], marker='^', c='red')
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")

plt.subplot(1, 2, 2)
plt.scatter(X[:, 2], X[:, 3], c=pred, cmap=cm.Accent)
plt.grid(True)
for center in kmeans.cluster_centers_:
    center = center[2:4]
    plt.scatter(center[0], center[1], marker='^', c='red')
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.show()

Output:
K-means clustering

The subplot on the left displays sepal length vs. sepal width (the first two iris features) with the data points colored by cluster, and the red markers indicate the K-means cluster centers. The subplot on the right shows petal length vs. petal width in the same way.
Conclusion

In conclusion, K-means clustering is a powerful unsupervised machine learning algorithm for grouping unlabeled datasets. Its objective is to divide data into clusters, making similar data points part of the same group. The algorithm initializes cluster centroids and iteratively assigns data points to the nearest centroid, updating centroids based on the mean of points in each cluster.
Frequently Asked Questions (FAQs)

1. What is k-means clustering for data analysis?
K-means is a partitioning method that divides a dataset into ‘k’ distinct, non-overlapping subsets (clusters) based on similarity, aiming to minimize the variance within each cluster.

2. What is an example of k-means in real life?
Customer segmentation in marketing, where k-means groups customers based on purchasing behavior, allowing businesses to tailor marketing strategies for different segments.

3. What type of data is the k-means clustering model suited for?
K-means works well with numerical data, where the concept of distance between data points is meaningful. It’s commonly applied to continuous variables.

4. Is K-means used for prediction?
K-means is primarily used for clustering and grouping similar data points. It does not predict labels for new data; it assigns them to existing clusters based on similarity (see the sketch after these FAQs).
5. What is the objective of k-means clustering?
The objective is to partition data into ‘k’ clusters, minimizing the intra-cluster variance. It seeks to form groups where data points within each cluster are more similar to each other than to those in other clusters.
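As an example for FAQ 4, a fitted scikit-learn KMeans model (such as the kmeans object from Example 2 above) can assign new observations to the existing clusters with predict; the sample measurements below are made up for illustration:

Python3
import numpy as np

# Hypothetical new iris-like measurements
# (sepal length, sepal width, petal length, petal width)
new_points = np.array([[5.1, 3.5, 1.4, 0.2],
                       [6.7, 3.0, 5.2, 2.3]])

# Each new point is assigned to the nearest existing cluster center
print(kmeans.predict(new_points))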


Image Segmentation: In computer vision, image segmentation is the process of partitioning an image into multiple segments. The goal of segmenting an image is to change the representation of an image into something that is more meaningful and easier to analyze. It is usually used for locating objects and creating boundaries.
It is not a great idea to process an entire image because many parts in an image
may not contain any useful information. Therefore, by segmenting the image, we can
make use of only the important segments for processing.
An image is basically a set of given pixels. In image segmentation, pixels which
have similar attributes are grouped together. Image segmentation creates a pixel-
wise mask for objects in an image which gives us a more comprehensive and granular
understanding of the object.
Uses:
Used in self-driving cars. Autonomous driving is not possible without object detection, which involves segmentation.
Used in the healthcare industry. Helpful in segmenting cancer cells and tumours, using which their severity can be gauged.
There are many more uses of image segmentation.
In this article, we will perform segmentation on an image of the monarch butterfly
using a clustering method called K Means Clustering.
K Means Clustering Algorithm:
K Means is a clustering algorithm. Clustering algorithms are unsupervised
algorithms which means that there is no labelled data available. It is used to
identify different classes or clusters in the given data based on how similar the
data is. Data points in the same group are more similar to other data points in
that same group than those in other groups.
K-means clustering is one of the most commonly used clustering algorithms. Here, k
represents the number of clusters.
Let’s see how K-means clustering works:
1. Choose the number of clusters you want to find, which is k.
2. Randomly assign the data points to any of the k clusters.
3. Then calculate the center of the clusters.
4. Calculate the distance of the data points from the centers of each of the clusters.
5. Depending on the distance of each data point from the cluster, reassign the data points to the nearest clusters.
6. Again calculate the new cluster centers.
7. Repeat steps 4, 5 and 6 until the data points don’t change clusters, or until we reach the assigned number of iterations.
Requirements:
Make sure you have Python, Numpy, Matplotlib and OpenCV installed.
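If any of these libraries are missing, they can usually be installed with pip (assuming the standard PyPI package names):

pip install numpy matplotlib opencv-python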
Code: Read in the image and convert it to an RGB image.

python3
import numpy as np
import matplotlib.pyplot as plt
import cv2
%matplotlib inline

# Read in the image
image = cv2.imread('images/monarch.jpg')

# Change color to RGB (from BGR)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

plt.imshow(image)

Now we have to prepare the data for K-means. The image is a 3-dimensional array, but to apply k-means clustering on it we need to reshape it into a 2-dimensional array.
Code:

python3

# Reshaping the image into a 2D array of pixels and 3 color values (RGB)
pixel_vals = image.reshape((-1, 3))

# Convert to float type
pixel_vals = np.float32(pixel_vals)

Now we will implement the K-means algorithm for segmenting an image.
Code: Taking k = 3, which means that the algorithm will identify 3 clusters in the
image.

python3

# The below line of code defines the criteria for the algorithm to stop running,
# which will happen if 100 iterations are run or the epsilon (the required
# accuracy) reaches 0.85
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 100, 0.85)

# Then perform k-means clustering with the number of clusters defined as 3;
# random centres are initially chosen for k-means clustering
k = 3
retval, labels, centers = cv2.kmeans(pixel_vals, k, None, criteria, 10,
                                     cv2.KMEANS_RANDOM_CENTERS)

# Convert data back into 8-bit values
centers = np.uint8(centers)
segmented_data = centers[labels.flatten()]

# Reshape data into the original image dimensions
segmented_image = segmented_data.reshape((image.shape))

plt.imshow(segmented_image)

Output:

Now if we change the value of k to 6, we get the following Output:

As you can see, with an increase in the value of k, the segments become clearer and more distinct because the K-means algorithm can classify more classes/clusters of colors.
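As mentioned earlier, segmentation also gives us a pixel-wise mask for the objects in an image. A minimal sketch, assuming the labels, image, and plt variables from the code above and that we want to isolate the pixels of cluster 0:

python3
# Reshape the flat label array back to the image's height and width
label_map = labels.reshape(image.shape[0], image.shape[1])

# Boolean pixel-wise mask for one chosen cluster (cluster 0 here)
mask = (label_map == 0)

# Keep only the pixels belonging to that cluster; everything else becomes black
masked_image = image.copy()
masked_image[~mask] = 0

plt.imshow(masked_image)
plt.show()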
K-means clustering works well on small datasets and can segment objects in images with good results. But when it is applied to large datasets (a larger number of images), it looks at all the samples in every iteration, which takes up a lot of time.
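One common way to reduce this cost (not used in this article, so treat it as an optional variation) is scikit-learn's MiniBatchKMeans, which updates the centroids from small random batches of samples instead of the whole dataset on every pass. A rough sketch on the same reshaped pixel data:

python3
from sklearn.cluster import MiniBatchKMeans
import numpy as np

# batch_size is an assumed, tunable value
mbk = MiniBatchKMeans(n_clusters=3, batch_size=1000, random_state=0)
mbk_labels = mbk.fit_predict(pixel_vals)

# Rebuild a segmented image from the mini-batch cluster centers
mbk_centers = np.uint8(mbk.cluster_centers_)
mbk_segmented = mbk_centers[mbk_labels].reshape(image.shape)
plt.imshow(mbk_segmented)
plt.show()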
