
Machine Learning

Machine learning is a growing technology that enables computers to learn automatically from past data. It uses various algorithms to build mathematical models and make predictions using historical data or information. Currently, it is used for tasks such as image recognition, speech recognition, email filtering, Facebook auto-tagging, recommender systems, and many more.

Machine learning enables a machine to automatically learn from data, improve performance from
experiences, and predict things without being explicitly programmed.

Need for Machine Learning


The need for machine learning is increasing day by day, because it can handle tasks that are too complex for a person to implement directly. As humans, we have limitations: we cannot manually process huge amounts of data. For this, we need computer systems, and machine learning makes things easy for us.

We can train machine learning algorithms by providing them with huge amounts of data, letting them explore the data, construct models, and predict the required output automatically. The performance of a machine learning algorithm depends on the amount of data, and it can be measured with a cost function. With the help of machine learning, we can save both time and money.

The importance of machine learning can be easily understood from its use cases. Currently, machine learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestions on Facebook, and more. Top companies such as Netflix and Amazon have built machine learning models that use vast amounts of data to analyze user interests and recommend products accordingly.

History of Machine Learning


A few decades ago (about 40-50 years), machine learning was science fiction, but today it is part of our daily life. Machine learning is making our day-to-day life easier, from self-driving cars to Amazon's virtual assistant "Alexa". The idea behind machine learning, however, is old and has a long history. Below are some milestones in the history of machine learning:
The early history of Machine Learning (Pre-1940):
o 1834: Charles Babbage, the father of the computer, conceived a device that could be programmed with punch cards. The machine was never built, but all modern computers rely on its logical structure.
o 1936: Alan Turing gave a theory of how a machine can determine and execute a set of instructions.

The era of stored program computers:


o 1940s: "ENIAC", the first electronic general-purpose computer, was built (completed in 1945). It was followed by stored-program computers such as EDSAC in 1949 and EDVAC in 1951.
o 1943: A neural network was modeled with an electrical circuit. Around 1950, scientists started putting the idea to work and analyzed how human neurons might function.

Computing machinery and intelligence:


o 1950: Alan Turing published a seminal paper, "Computing Machinery and Intelligence," on the topic of artificial intelligence. In the paper, he asked, "Can machines think?"

Machine intelligence in Games:


o 1952: Arthur Samuel, a pioneer of machine learning, created a program that helped an IBM computer play checkers. It performed better the more it played.
o 1959: The term "machine learning" was first coined by Arthur Samuel.

The first "AI" winter:


o The period from 1974 to 1980 was a tough time for AI and ML researchers; this period is called the AI winter.
o During this period, machine translation failed, people lost interest in AI, and government funding for research was reduced.

Machine Learning from theory to reality


o 1959: The first neural network was applied to a real-world problem: removing echoes over phone lines using an adaptive filter.
o 1985: Terry Sejnowski and Charles Rosenberg invented the neural network NETtalk, which taught itself how to correctly pronounce 20,000 words in one week.
o 1997: IBM's Deep Blue computer won a chess match against the world champion Garry Kasparov, becoming the first computer to beat a human chess expert.

Machine Learning in the 21st century


o 2006: Computer scientist Geoffrey Hinton gave neural-net research the new name "deep learning," and nowadays it has become one of the most trending technologies.
o 2012: Google created a deep neural network that learned to recognize images of humans and cats in YouTube videos.
o 2014: The chatbot "Eugene Goostman" passed the Turing Test. It was the first chatbot to convince 33% of the human judges that it was not a machine.
o 2014: DeepFace, a deep neural network created by Facebook, was claimed to recognize a person with the same precision as a human.
o 2016: AlphaGo beat the world's number two Go player, Lee Sedol. In 2017, it beat the number one player, Ke Jie.
o 2017: Alphabet's Jigsaw team built an intelligent system that learned to detect online trolling. It read millions of comments from different websites in order to learn to stop online trolling.
Machine Learning at present:
Machine learning research has advanced greatly, and it is present everywhere around us: in self-driving cars, Amazon Alexa, chatbots, recommender systems, and many more. It includes supervised, unsupervised, and reinforcement learning, with clustering, classification, decision tree, and SVM algorithms, etc.

Modern machine learning models can be used to make various predictions, including weather prediction, disease prediction, stock market analysis, etc.

Applications of Machine learning


Machine learning is a buzzword in today's technology, and it is growing very rapidly day by day. We use machine learning in our daily life even without knowing it, in Google Maps, Google Assistant, Alexa, etc. Below are some of the most trending real-world applications of machine learning:
1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, places, digital images, etc. A popular use case of image recognition and face detection is automatic friend tagging suggestion:

Facebook provides a feature of auto friend tagging suggestion. Whenever we upload a photo with our Facebook friends, we automatically get a tagging suggestion with names, and the technology behind this is machine learning's face detection and recognition algorithm.

It is based on the Facebook project named "DeepFace," which is responsible for face recognition and person identification in pictures.
2. Speech Recognition
While using Google, we get the option to "Search by voice"; this comes under speech recognition, a popular application of machine learning.

Speech recognition is the process of converting voice instructions into text, and it is also known as "speech to text" or "computer speech recognition." At present, machine learning algorithms are widely used in speech recognition applications. Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow voice instructions.

3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions.

It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, in two ways:

o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time of day

Everyone who uses Google Maps is helping to make the app better. It takes information from the user and sends it back to its database to improve performance.

4. Product recommendations:
Machine learning is widely used by e-commerce and entertainment companies such as Amazon and Netflix for product recommendations. Whenever we search for a product on Amazon, we start getting advertisements for the same product while surfing the internet in the same browser, and this is because of machine learning.

Google understands user interest using various machine learning algorithms and suggests products according to customer interest.

Similarly, when we use Netflix, we find recommendations for series, movies, and other entertainment, and this is also done with the help of machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars, where it plays a significant role. Tesla, the most popular car manufacturer in this area, is working on self-driving cars, using machine learning methods to train car models to detect people and objects while driving.

6. Email Spam and Malware Filtering:


Whenever we receive a new email, it is automatically filtered as important, normal, or spam. We receive important mail in our inbox, marked with the important symbol, and spam emails in our spam box, and the technology behind this is machine learning. Below are some spam filters used by Gmail:

o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters

Machine learning algorithms such as the Multi-Layer Perceptron, Decision Tree, and Naïve Bayes classifier are used for email spam filtering and malware detection.
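As a concrete illustration, here is a minimal sketch of a Naïve Bayes spam filter using scikit-learn; the tiny inline email list and its labels are invented purely for demonstration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training emails and labels (1 = spam, 0 = not spam)
emails = ["win a free prize now", "meeting agenda for monday",
          "free lottery winner claim now", "project report attached"]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()        # bag-of-words features
X = vectorizer.fit_transform(emails)

model = MultinomialNB()
model.fit(X, labels)

test = vectorizer.transform(["claim your free prize"])
print(model.predict(test))            # -> [1], i.e. spam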

7. Virtual Personal Assistant:


We have various virtual personal assistants, such as Google Assistant, Alexa, Cortana, and Siri. As the name suggests, they help us find information using voice instructions. These assistants can help us in various ways just through our voice instructions, such as playing music, calling someone, opening an email, scheduling an appointment, etc.

These virtual assistants use machine learning algorithms as an important part of how they work. They record our voice instructions, send them to a server in the cloud, decode them using ML algorithms, and act accordingly.
8. Online Fraud Detection:
Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, fraud can take place in various ways, such as fake accounts, fake IDs, and money being stolen in the middle of a transaction. To detect this, a feed-forward neural network helps by checking whether a transaction is genuine or fraudulent.

For each genuine transaction, the output is converted into hash values, and these values become the input for the next round. Genuine transactions follow a specific pattern, which changes for fraudulent transactions; the network detects the change and makes our online transactions more secure.

9. Stock Market trading:


Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in share prices, so machine learning's long short-term memory (LSTM) neural networks are used to predict stock market trends.

10. Medical Diagnosis:


In medical science, machine learning is used for disease diagnosis. With it, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain.

It helps in finding brain tumors and other brain-related diseases easily.

11. Automatic Language Translation:


Nowadays, if we visit a new place and are not aware of the language, it is not a problem at all: machine learning helps us by converting text into languages we know. Google's GNMT (Google Neural Machine Translation) provides this feature; it is a neural machine translation system that translates text into a familiar language, and this is called automatic translation.

The technology behind automatic translation is a sequence-to-sequence learning algorithm, which maps text in one language to text in another.
Advantages of Machine Learning

1. Automation

Machine learning is one of the driving forces behind automation, cutting down time and human workload. Automation can now be seen everywhere, with complex algorithms doing the hard work for the user. Automation is more reliable, efficient, and quick. With the help of machine learning, advanced computers are being designed that can handle several machine learning models and complex algorithms. However, while automation is spreading fast across industry, a lot of research and innovation are still required in this field.

2. Scope of Improvement

Machine learning is a field where things keep evolving. It offers many opportunities for improvement and can become a leading technology in the future. A lot of research and innovation is happening in this technology, which helps improve both software and hardware.

3. Enhanced Experience in Online Shopping and Quality Education

Machine learning is going to be used extensively in the education sector, where it will enhance the quality of education and the student experience; in China, for example, machine learning has been used to improve student focus. In e-commerce, machine learning studies your search feed and gives suggestions based on it. Depending on search and browsing history, it pushes targeted advertisements and notifications to users.

4. Wide Range of Applicability

This technology has a very wide range of applications. Machine learning plays a role in almost every field, such as hospitality, ed-tech, medicine, science, banking, and business, and it creates more opportunities.

Disadvantages of Machine Learning

Nothing in the world is perfect. Machine learning has some serious limitations, which can be worse than human errors.

1. Data Acquisition
The whole concept of machine learning is about identifying useful data. The outcome will be incorrect if a credible data source is not provided. Data quality is also significant: if high-quality data is not yet available, waiting for it delays the output. So, machine learning depends significantly on the data and its quality.

2. Time and Resources

The data that machines process is huge in quantity and varies greatly. Machines require time for their algorithms to adjust to the environment and learn from it. Trial runs are held to check the accuracy and reliability of the machine. Setting up that quality of infrastructure requires massive and expensive resources and high-quality expertise, and trial runs are costly in terms of both time and expense.

3. Results Interpretations

One of the biggest disadvantages of machine learning is that the results we interpret from a model cannot be one hundred percent accurate; they will have some degree of inaccuracy. For a high degree of accuracy, algorithms must be developed that give reliable results.

4. High Error Chances

Errors committed during the initial stages are large, and if not corrected in time, they create havoc. Bias and incorrectness have to be dealt with separately; they are not interconnected. Machine learning depends on two factors, data and algorithm, and all errors depend on these two variables. Any incorrectness in either variable has huge repercussions on the output.

5. Social Changes

Machine learning is bringing numerous social changes. The role of machine learning-based technology in society has increased manifold. It influences the thought processes of society and can create unwanted problems, such as character assassination and leaks of sensitive details, which disturb the social fabric.

6. Elimination of Human Interface

Automation, artificial intelligence, and machine learning have eliminated the human interface from some work, which has eliminated employment opportunities. Now, all of that work is conducted with the help of artificial intelligence and machine learning.
7. Changing Nature of Jobs

With the advancement of machine learning, the nature of jobs is changing. Much work formerly done by humans is now done by machines, eating up those jobs. It is difficult for people without technical education to adjust to these changes.

8. Highly Expensive

This software is highly expensive, and not everybody can own it. It is mostly owned by government agencies, big private firms, and enterprises. It needs to be made accessible to everybody for wide use.

9. Privacy Concern

One of the pillars of machine learning is data, and the collection of data has raised fundamental questions of privacy. The way data is collected and used for commercial purposes has always been a contentious issue. In India, the Supreme Court has declared privacy a fundamental right of Indians: without the user's permission, data cannot be collected, used, or stored. However, many cases have come up in which big firms collected data without the user's knowledge and used it for their commercial gains.

10. Research and Innovations

Machine learning is an evolving field. The area has not yet seen developments that fully revolutionize any economic sector, and it requires continuous research and innovation.

Challenges of Machine Learning


The advancement of machine learning technology in recent years has certainly improved our lives. However, the implementation of machine learning in companies has also raised several ethical issues regarding AI technology. A few of them are:

Technological Singularity:
Although this topic attracts lots of public attention, many scientists are not concerned with the notion of AI exceeding human intelligence anytime in the immediate future. This notion is often referred to as superintelligence, which Nick Bostrom defines as "any intelligence that far surpasses the top human brains in virtually every field, including general wisdom, scientific creativity, and social abilities." Even though superintelligence and strong AI are not yet a reality, the concept poses interesting questions as we contemplate the use of autonomous systems, such as self-driving vehicles. It is unrealistic to expect that a driverless car would never be involved in an accident, but who would be liable and accountable in those situations? Should we continue to pursue fully autonomous vehicles, or should we restrict the technology to semi-autonomous cars that promote driver safety? The jury is still out on this issue, but these kinds of ethical debates are taking place as new and innovative AI technology is developed.

AI Impact on Jobs:
While the majority of public opinion about artificial intelligence centers on job loss, the concern should probably be reframed. With every new, disruptive technology, we see shifts in demand for certain job roles. For instance, in the automotive industry, many manufacturers such as GM are focusing their efforts on electric vehicles to align with green policies; the energy sector is not going away, but its primary source is shifting from a fuel-based economy to an electrical one. Artificial intelligence should be viewed similarly: it is expected to shift the demand for jobs to other areas. There will need to be people who can manage these systems as data grows and changes each day, and there will still be a need for people to solve more complicated problems within the sectors most likely to be affected by demand shifts, such as customer service. The most important role of artificial intelligence with respect to the job market will be in helping individuals transition to these new areas of demand.

Privacy:
Privacy is frequently discussed in relation to data protection and data security, and these concerns have helped policymakers make progress in recent years. For instance, in 2016, GDPR legislation was introduced to safeguard the personal data of individuals in the European Union and European Economic Area, giving individuals more control over their data. In the United States, individual states are developing policies, such as the California Consumer Privacy Act (CCPA), that require businesses to inform consumers about the collection and processing of their data. Legislation like this is forcing companies to rethink how they store and use personally identifiable information (PII). As a result, security investments have become a business priority, aimed at eliminating vulnerabilities and opportunities for hacking, surveillance, and cyber-attacks.
Bias and Discrimination:
Discrimination and bias in intelligent machines have raised several ethical questions about the use of artificial intelligence. How can we guard against bias and discrimination when the training data itself could be biased? While most companies have well-meaning intentions regarding their automation initiatives, Reuters highlighted the unexpected effects of incorporating AI into hiring practices: in trying to automate and simplify the process, Amazon unintentionally biased potential candidates by gender for technical roles, and ultimately ended the project. As events like these come to light, Harvard Business Review (link located outside of IBM) has raised pertinent questions about the use of AI in hiring practices, for example: what kind of data should you be able to analyse when evaluating a candidate for a particular role?

Discrimination and bias are not limited to the human resources function; they appear in a variety of applications, ranging from facial recognition software to social media algorithms.

Accountability:
There is not yet significant legislation to regulate AI practices, and no enforcement mechanism ensures that ethical AI is used. Companies' primary incentive to adhere to these standards is the negative effect of an untrustworthy AI system on their bottom lines. To fill the gap, ethical frameworks have been developed in partnership between researchers and ethicists to govern the creation and use of AI models, but for the time being they only serve as guidance for developing AI models. Research has shown that shared responsibility and insufficient awareness of potential effects are not ideal for protecting society from harm.

Types of Machine Learning


Machine learning is a subset of AI that enables a machine to automatically learn from data, improve performance from past experiences, and make predictions. Machine learning contains a set of algorithms that work on huge amounts of data. Data is fed to these algorithms to train them, and on the basis of the training, they build a model and perform a specific task.
These ML algorithms help solve different business problems, such as regression, classification, forecasting, clustering, and association.

Based on the methods and ways of learning, machine learning is mainly divided into four types:

1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning

In this topic, we will provide a detailed description of the types of machine learning along with their respective algorithms:

1. Supervised Machine Learning


As its name suggests, supervised machine learning is based on supervision. In the supervised learning technique, we train the machines using a "labelled" dataset, and based on the training, the machine predicts the output. Here, the labelled data specifies that some of the inputs are already mapped to the output. More precisely, we first train the machine with input and corresponding output, and then we ask the machine to predict the output on a test dataset.

Let's understand supervised learning with an example. Suppose we have an input dataset of cat and dog images. First, we train the machine to understand the images: the shape and size of the tails of cats and dogs, the shape of the eyes, colour, height (dogs are taller, cats are smaller), etc. After training, we input a picture of a cat and ask the machine to identify the object and predict the output. Since the machine is well trained, it will check all the features of the object, such as height, shape, colour, eyes, ears, tail, etc., and find that it's a cat. It will put it in the cat category. This is how the machine identifies objects in supervised learning.

The main goal of the supervised learning technique is to map the input variable (x) to the output variable (y). Some real-world applications of supervised learning are risk assessment, fraud detection, spam filtering, etc.

Categories of Supervised Machine Learning


Supervised machine learning can be classified into two types of problems,
which are given below:

o Classification
o Regression

a) Classification

Classification algorithms are used to solve classification problems, in which the output variable is categorical, such as "Yes" or "No", Male or Female, Red or Blue, etc. Classification algorithms predict the categories present in the dataset. Some real-world examples of classification algorithms are spam detection, email filtering, etc.

Some popular classification algorithms are given below:

o Random Forest Algorithm
o Decision Tree Algorithm
o Logistic Regression Algorithm
o Support Vector Machine Algorithm
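To make classification concrete, here is a minimal sketch using one of the algorithms above (logistic regression) on scikit-learn's built-in Iris dataset; the train/test split and parameters are illustrative choices, not a prescribed setup:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Labelled data: flower measurements (X) and species categories (y)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)   # a classification algorithm
clf.fit(X_train, y_train)                 # train on labelled examples

# Predict categories for unseen test data and measure accuracy
print(accuracy_score(y_test, clf.predict(X_test)))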

b) Regression

Regression algorithms are used to solve regression problems, in which there is a relationship between the input and output variables and the output is continuous. They are used to predict continuous output variables, such as market trends, weather, etc.
Some popular Regression algorithms are given below:

o Simple Linear Regression Algorithm
o Multivariate Regression Algorithm
o Decision Tree Algorithm
o Lasso Regression
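A minimal simple-linear-regression sketch with scikit-learn, using synthetic data generated for illustration (the underlying relationship y ≈ 3x + 2 is an assumption of the example, not from the text):

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y ≈ 3x + 2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 2 + rng.normal(0, 1, size=100)

reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)   # recovered slope ≈ 3, intercept ≈ 2
print(reg.predict([[5.0]]))        # a continuous prediction for x = 5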

Advantages and Disadvantages of Supervised Learning


Advantages:

o Since supervised learning works with labelled datasets, we can have an exact idea about the classes of objects.
o These algorithms are helpful in predicting the output on the basis of prior experience.

Disadvantages:

o These algorithms are not able to solve complex tasks.
o They may predict the wrong output if the test data is different from the training data.
o They require a lot of computational time for training.

Applications of Supervised Learning


Some common applications of Supervised Learning are given below:

o Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this
process, image classification is performed on different image data with pre-
defined labels.
o Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis. This is done using medical images and past data labelled with disease conditions; with such a process, the machine can identify a disease for new patients.
o Fraud Detection - Supervised learning classification algorithms are used for identifying fraudulent transactions, fraudulent customers, etc. This is done by using historical data to identify patterns that can lead to possible fraud.
o Spam Detection - In spam detection and filtering, classification algorithms are used. These algorithms classify an email as spam or not spam, and spam emails are sent to the spam folder.
o Speech Recognition - Supervised learning algorithms are also used in speech recognition. The algorithm is trained with voice data, and various identifications can be done using it, such as voice-activated passwords, voice commands, etc.

2. Unsupervised Machine Learning


Unsupervised learning is different from the supervised learning technique; as its name suggests, there is no need for supervision. In unsupervised machine learning, the machine is trained using an unlabelled dataset and predicts the output without any supervision.

In unsupervised learning, the models are trained with the data that is neither
classified nor labelled, and the model acts on that data without any
supervision.

The main aim of the unsupervised learning algorithm is to group or categorize the unsorted dataset according to similarities, patterns, and differences. Machines are instructed to find the hidden patterns in the input dataset.

Let's take an example to understand it more precisely. Suppose there is a basket of fruit images, which we input into the machine learning model. The images are totally unknown to the model, and its task is to find the patterns and categories of the objects.

So the machine will discover the patterns and differences, such as colour and shape differences, and predict the output when tested with the test dataset.

Categories of Unsupervised Machine Learning


Unsupervised Learning can be further classified into two types, which are
given below:
o Clustering
o Association

1) Clustering

The clustering technique is used when we want to find the inherent groups in the data. It is a way to group objects into clusters such that the objects with the most similarities remain in one group and have few or no similarities with the objects of other groups. An example of a clustering application is grouping customers by their purchasing behaviour.

Some of the popular clustering algorithms are given below:

o K-Means Clustering algorithm
o Mean-shift algorithm
o DBSCAN Algorithm
o Principal Component Analysis
o Independent Component Analysis
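As a sketch of how clustering works in practice, here is K-Means from scikit-learn applied to a handful of made-up 2-D points that form two obvious groups; no labels are provided, yet the algorithm recovers the groups:

import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of 2-D points (invented for illustration)
points = np.array([[1, 2], [1.5, 1.8], [1, 0.6],
                   [8, 8], [9, 9], [8.5, 9.5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # the learned centroids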

2) Association

Association rule learning is an unsupervised learning technique that finds interesting relations among variables within a large dataset. The main aim of this learning algorithm is to find the dependency of one data item on another and map the variables accordingly so that maximum profit can be generated. This algorithm is mainly applied in market basket analysis, web usage mining, continuous production, etc.

Some popular association rule learning algorithms are the Apriori algorithm, Eclat, and the FP-growth algorithm.
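To show what association rule learning computes, here is a small hand-rolled sketch of the support and confidence of one candidate rule over a toy transaction list (the groceries and the rule are invented for illustration); real work would use a library implementation of Apriori or FP-growth:

# Invented market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    # fraction of transactions containing the whole itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

# Candidate rule: {bread} -> {milk}
sup = support({"bread", "milk"})
conf = sup / support({"bread"})
print(f"support={sup:.2f}, confidence={conf:.2f}")  # 0.50 and 0.67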

Advantages and Disadvantages of Unsupervised Learning Algorithms
Advantages:

o These algorithms can be used for more complicated tasks than supervised ones, because they work on unlabelled datasets.
o Unsupervised algorithms are preferable for many tasks, as obtaining an unlabelled dataset is easier than obtaining a labelled one.
Disadvantages:

o The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and the algorithms are not trained with the exact output in advance.
o Working with unsupervised learning is more difficult, as it works with unlabelled datasets that do not map to an output.

Applications of Unsupervised Learning


o Network Analysis: Unsupervised learning is used in document network analysis of text data, for example to identify plagiarism and copyright issues in scholarly articles.
o Recommendation Systems: Recommendation systems widely use
unsupervised learning techniques for building recommendation applications
for different web applications and e-commerce websites.
o Anomaly Detection: Anomaly detection is a popular application of
unsupervised learning, which can identify unusual data points within the
dataset. It is used to discover fraudulent transactions.
o Singular Value Decomposition: Singular Value Decomposition (SVD) is used to extract particular information from a database, for example, extracting the information of each user located at a particular location.

3. Semi-Supervised Learning
Semi-supervised learning is a type of machine learning algorithm that lies between supervised and unsupervised machine learning. It represents the intermediate ground between supervised algorithms (with labelled training data) and unsupervised algorithms (with no labelled training data), and it uses a combination of labelled and unlabelled datasets during the training period.

Although semi-supervised learning is the middle ground between supervised and unsupervised learning and operates on data that contains a few labels, it mostly consists of unlabelled data. Labels are costly, but for corporate purposes a few labels may be available. This is distinct from supervised and unsupervised learning, which are based on the presence or absence of labels.

The concept of semi-supervised learning was introduced to overcome the drawbacks of supervised and unsupervised learning algorithms. The main aim of semi-supervised learning is to effectively use all the available data, rather than only the labelled data as in supervised learning. Initially, similar data is clustered with an unsupervised learning algorithm, which then helps to label the unlabelled data, because labelled data is comparatively more expensive to acquire than unlabelled data.
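A minimal sketch of this idea using scikit-learn's LabelPropagation, which spreads a few known labels to nearby unlabelled points (marked -1); the points and labels are made up for illustration:

import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Six points in two clusters; only one label known per cluster (-1 = unlabelled)
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y = np.array([0, -1, -1, 1, -1, -1])

model = LabelPropagation().fit(X, y)
print(model.transduction_)  # labels propagated to the unlabelled points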

We can illustrate these algorithms with an example. Supervised learning is where a student is under the supervision of an instructor at home and college. If that student analyses the same concept by themselves without any help from the instructor, that is unsupervised learning. Under semi-supervised learning, the student revises on their own after analysing the concept under the guidance of an instructor at college.

Advantages and disadvantages of Semi-supervised Learning


Advantages:

o The algorithm is simple and easy to understand.
o It is highly efficient.
o It is used to address the drawbacks of supervised and unsupervised learning algorithms.

Disadvantages:

o Iteration results may not be stable.
o These algorithms cannot be applied to network-level data.
o Accuracy is low.

4. Reinforcement Learning
Reinforcement learning works on a feedback-based process, in which an AI agent (a software component) automatically explores its surroundings by trial and error, taking actions, learning from experience, and improving its performance. The agent gets rewarded for each good action and punished for each bad action; hence, the goal of a reinforcement learning agent is to maximize the rewards.

In reinforcement learning, there is no labelled data as in supervised learning; agents learn from their experience only.

The reinforcement learning process is similar to how a human being learns; for example, a child learns various things through experience in day-to-day life. An example of reinforcement learning is playing a game, where the game is the environment, the moves of the agent at each step define states, and the goal of the agent is to get a high score. The agent receives feedback in terms of punishment and rewards.

Due to the way it works, reinforcement learning is employed in different fields, such as game theory, operations research, information theory, and multi-agent systems.

A reinforcement learning problem can be formalized using a Markov Decision Process (MDP). In an MDP, the agent constantly interacts with the environment and performs actions; at each action, the environment responds and generates a new state.
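As a sketch of the feedback loop described above, here is tabular Q-learning on an invented five-state corridor where the agent must learn to move right to reach the goal; all states, rewards, and hyperparameters are illustrative:

import numpy as np

# Toy corridor: states 0..4, actions 0 = left, 1 = right, goal = state 4
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2
rng = np.random.default_rng(0)

for _ in range(500):                # episodes
    s = 0
    while s != n_states - 1:        # act until the goal state is reached
        # epsilon-greedy action selection
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0   # reward only at the goal
        # Q-update: nudge Q(s, a) toward reward + discounted future value
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))  # learned policy: move right (goal-state row is unused)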

Categories of Reinforcement Learning


Reinforcement learning is categorized mainly into two types of
methods/algorithms:

o Positive Reinforcement Learning: Positive reinforcement learning increases the tendency that the required behaviour will occur again by adding something. It strengthens the behaviour of the agent and has a positive impact on it.
o Negative Reinforcement Learning: Negative reinforcement learning works exactly opposite to positive RL. It increases the tendency that a specific behaviour will occur again by avoiding a negative condition.

Real-world Use cases of Reinforcement Learning


o Video Games:
RL algorithms are very popular in gaming applications and are used to achieve super-human performance. Some popular systems that use RL algorithms are AlphaGo and AlphaGo Zero.
o Resource Management:
The paper "Resource Management with Deep Reinforcement Learning" showed how to use RL to automatically learn to schedule computer resources across waiting jobs in order to minimize average job slowdown.
o Robotics:
RL is widely used in robotics applications. Robots are used in industrial and manufacturing areas, and these robots are made more capable with reinforcement learning. Different industries have their own vision of building intelligent robots using AI and machine learning technology.
o Text Mining:
Text mining, one of the great applications of NLP, is now being implemented with the help of reinforcement learning by Salesforce.

Advantages and Disadvantages of Reinforcement Learning


Advantages

o It helps in solving complex real-world problems that are difficult to solve with general techniques.
o The learning model of RL is similar to human learning; hence, highly accurate results can be obtained.
o It helps in achieving long-term results.

Disadvantages

o RL algorithms are not preferred for simple problems.
o RL algorithms require huge amounts of data and computation.
o Too much reinforcement learning can lead to an overload of states, which can weaken the results.

Mathematical foundations:

Machine Learning Math


We could learn many topics in math, but if we want to focus on the math used in machine learning, we need to narrow it down. In this case, I like to use the math references explained in the book Mathematics for Machine Learning by M. P. Deisenroth, A. A. Faisal, and C. S. Ong (2020), which lays out the math foundations that are important for machine learning.


Six math subjects form the foundation of machine learning. The subjects are intertwined in developing our machine learning model and reaching the "best" model for generalizing the dataset.

Let’s dive deeper for each subject to know what they are.

Linear Algebra
What is linear algebra? It is a branch of mathematics that concerns the study of vectors and certain rules to manipulate them. When we formalize intuitive concepts, the common approach is to construct a set of objects (symbols) and a set of rules to manipulate these objects. This is what we know as algebra.

If we talk about linear algebra in machine learning, it is defined as the part of mathematics that uses vector spaces and matrices to represent linear equations.

When talking about vectors, people might flash back to their high school study of vectors with direction, just like the image below.

[Figure: a geometric vector]

That is a vector, but not the kind of vector discussed in linear algebra for machine learning. Instead, we will talk about the kind shown in the image below.
[Figure: a vector as a 4×1 matrix]

What we have above is also a vector, just another kind. You might be familiar with matrix form (the image below). A vector is a matrix with only one column, known as a column vector; in other words, we can think of a matrix as a group of column vectors or row vectors. In summary, vectors are special objects that can be added together and multiplied by scalars to produce another object of the same kind, and many different objects can be called vectors.

[Figure: a matrix]


Linear algebra itself is a systematic representation of data that computers can understand, and all the operations in linear algebra are systematic rules. That is why linear algebra is an important study in modern machine learning.

An example of how linear algebra is used is the linear equation. Linear algebra is a tool for linear equations because so many problems can be presented systematically in a linear way. A typical system of linear equations has the form below.

a11·x1 + … + a1n·xn = b1
  ⋮
am1·x1 + … + amn·xn = bm

To solve the linear equation problem above, we use linear algebra to present the equations in a systematic representation. This way, we can use matrix properties to look for the optimal solution.

In matrix representation: Ax = b, where A is the coefficient matrix, x is the vector of unknowns, and b is the vector of constants.
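A small sketch of solving such a system numerically with NumPy; the coefficients are an invented example:

import numpy as np

# Solve the system  2x + 3y = 8,  x - y = -1  written as Ax = b
A = np.array([[2.0, 3.0],
              [1.0, -1.0]])
b = np.array([8.0, -1.0])

x = np.linalg.solve(A, b)
print(x)  # -> [1. 2.], i.e. x = 1, y = 2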


To summarize the linear algebra subject, there are three terms you might want to learn more about as a starting point:

 Vector

 Matrix

 Linear Equation

Analytic Geometry (Coordinate Geometry)


Analytic geometry is a study in which we learn the position of data (points) using an ordered pair of coordinates. This study is concerned with defining and representing geometric shapes numerically and extracting numerical information from those numerical definitions and representations. In simpler terms, we project the data onto a plane and obtain numerical information from there.
[Figure: Cartesian coordinates]

Above is an example of how we acquire information from data points by projecting the dataset onto the plane. How we acquire information from this representation is the heart of analytic geometry. To help you start learning this subject, here are some important terms you might need.

 Distance Function

A distance function is a function that provides numerical information about the distance between the elements of a set. If the distance is zero, the elements are equivalent; otherwise, they are different from each other.
An example of a distance function is the Euclidean distance, which calculates the linear distance between two data points.

d(x, y) = √( Σ_i (x_i − y_i)² )
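A quick sketch of computing the Euclidean distance with NumPy, on made-up points:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# Square root of the sum of squared coordinate differences
dist = np.sqrt(np.sum((x - y) ** 2))
print(dist)                      # 5.0
print(np.linalg.norm(x - y))     # same result via the vector norm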

 Inner Product

The inner product is a concept that introduces intuitive geometric notions, such as the length of a vector and the angle or distance between two vectors. It is often denoted ⟨x,y⟩ (or occasionally (x,y) or ⟨x|y⟩).

Matrix Decomposition
Matrix decomposition is a study concerning ways to reduce a matrix into its constituent parts. Matrix decomposition aims to simplify complex matrix operations by performing them on the decomposed matrix rather than on the original matrix.

A common analogy for matrix decomposition is factoring numbers, such as factoring 8 into 2 × 4; this is why matrix decomposition is also called matrix factorization. There are many ways to decompose a matrix, so there is a range of different matrix decomposition techniques. An example is the LU decomposition below.

LU decomposition: A = LU, where L is a lower-triangular matrix and U is an upper-triangular matrix.
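A short sketch of LU decomposition using SciPy; note that SciPy's lu returns a permutation matrix P along with L and U such that A = P·L·U. The matrix is an invented example:

import numpy as np
from scipy.linalg import lu

A = np.array([[4.0, 3.0],
              [6.0, 3.0]])

P, L, U = lu(A)                   # permuted LU factorization
print(L)                          # lower-triangular factor
print(U)                          # upper-triangular factor
print(np.allclose(P @ L @ U, A))  # True: the factors reproduce A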

Vector Calculus
Calculus is a mathematical study concerned with continuous change, consisting mainly of functions and limits. Vector calculus is concerned with the differentiation and integration of vector fields. Vector calculus is often called multivariate calculus, although multivariate calculus has a slightly different study case: it deals with the application of calculus to functions of multiple independent variables.

There are a few important terms I feel people need to know when starting to learn vector calculus. They are:

 Derivative and Differentiation

The derivative is a function of real numbers that measures the change in the function value (output value) with respect to a change in its argument (input value). Differentiation is the action of computing a derivative.

f′(x) = lim(h→0) [f(x + h) − f(x)] / h
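A tiny sketch approximating this limit numerically with a central finite difference, checked against the known fact that d/dx x² = 2x:

def derivative(f, x, h=1e-6):
    # Central finite-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

print(derivative(lambda x: x ** 2, 3.0))  # ≈ 6.0, since d/dx x^2 = 2x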

 Partial Derivative

The partial derivative is the derivative of a function of several variables with respect to one of those variables, with the other variables held constant (as opposed to the total derivative, in which all variables are allowed to vary).

 Gradient

The gradient is related to the derivative, or the rate of change, of a function; you might consider the gradient a fancy word for derivative. The term gradient is typically used for functions with several inputs and a single (scalar) output. The gradient gives a direction to move from the current location, e.g., up, down, right, left.
Probability and Distribution
Probability is, loosely speaking, the study of uncertainty. Probability can be thought of as the fraction of times an event occurs, or as a degree of belief about an event's occurrence. A probability distribution is a function that measures the probability that a particular outcome (or set of outcomes) associated with a random variable will occur. A common probability distribution function is shown below.

Normal distribution probability density: f(x) = (1 / (σ√(2π))) · exp(−(x − µ)² / (2σ²))
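A small sketch of evaluating this density in plain NumPy (one could equally use scipy.stats.norm); the inputs are illustrative:

import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    # Probability density of the normal distribution N(mu, sigma^2)
    coeff = 1.0 / (sigma * np.sqrt(2 * np.pi))
    return coeff * np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

print(normal_pdf(0.0))   # ≈ 0.3989, the peak of the standard normal
print(normal_pdf(1.96))  # ≈ 0.0584, far into the tail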

Probability theory and statistics are often associated with similar things, but they concern different aspects of uncertainty:

• In math, we define probability as a model of some process where random variables capture the underlying uncertainty, and we use the rules of probability to summarize what happens.
• In statistics, we observe something that has happened and try to figure out the underlying process that explains the observations.

When we talk about machine learning, it is close to statistics, because its goal is to construct a model that adequately represents the process that generated the data.

Optimization
In the learning objective, training a machine learning model is all about finding a good set of parameters. What we consider "good" is determined by the objective function or the probabilistic model. This is what optimization algorithms are for: given an objective function, we try to find the best value.

Commonly, objective functions in machine learning are minimized, meaning the best value is the minimum value. Intuitively, finding the best value is like finding the valleys of the objective function, where the gradient points uphill. That is why we move downhill (opposite to the gradient) and hope to find the lowest (deepest) point. This is the concept of gradient descent.
Gradient descent update: x(t+1) = x(t) − γ · ∇f(x(t)), where γ is the step size (learning rate).
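A minimal sketch of this update rule minimizing the invented function f(x) = (x − 3)², whose minimum is at x = 3; the starting point and learning rate are arbitrary choices:

# Gradient descent on f(x) = (x - 3)^2
def grad(x):
    return 2 * (x - 3)          # f'(x)

x, lr = 10.0, 0.1               # start point and learning rate
for _ in range(100):
    x -= lr * grad(x)           # step opposite to the gradient

print(round(x, 4))              # ≈ 3.0, the minimum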

There are a few terms to know as a starting point when learning optimization. They are:

 Local Minima and Global Minima

The point at which a function takes its minimum value is called the global minimum. However, when the goal is to minimize the function and it is solved using optimization algorithms such as gradient descent, the function can appear to have minimum values at several points. Points that appear to be minima but are not where the function actually takes its minimum value are called local minima.
[Figure: local and global minima]

 Unconstrained Optimization and Constrained Optimization

Unconstrained optimization finds the minimum of a function under the assumption that the parameters can take any possible value (no parameter limitation). Constrained optimization simply limits the possible values by introducing a set of constraints.

Gradient descent is an unconstrained optimization if there is no parameter limitation. If we set some limit, for example x > 1, it becomes a constrained optimization.

Conclusion
Machine learning is an everyday tool that data scientists use to obtain the valuable patterns we need. Learning the math behind machine learning can give you an edge in your work. There are many math subjects out there, but six matter the most when starting to learn machine learning math:

 Linear Algebra

 Analytic Geometry

 Matrix Decomposition

 Vector Calculus

 Probability and Distribution

 Optimization
What is Bayes Theorem?
Bayes' theorem is one of the most popular machine learning concepts. It helps to calculate the probability of one event occurring, under uncertain knowledge, given that another event has already occurred.

Bayes' theorem can be derived using the product rule and the conditional probability of event X given a known event Y:

o According to the product rule, we can express the probability of event X with known event Y as follows:

P(X ∩ Y) = P(X|Y) P(Y)    {equation 1}


o Further, the probability of event Y with known event X:

P(X ∩ Y) = P(Y|X) P(X)    {equation 2}


Mathematically, Bayes' theorem can be obtained by combining both equations, equating their right-hand sides and dividing by P(Y). We get:

P(X|Y) = P(Y|X) P(X) / P(Y)

Here, X and Y are events, and P(X ∩ Y) denotes the joint probability that both occur. The above equation is called Bayes' rule or Bayes' theorem.

o P(X|Y) is called the posterior, which we need to calculate. It is defined as the updated probability after considering the evidence.
o P(Y|X) is called the likelihood. It is the probability of the evidence given that the hypothesis is true.
o P(X) is called the prior probability: the probability of the hypothesis before considering the evidence.
o P(Y) is called the marginal probability. It is defined as the probability of the evidence under any consideration.

Hence, Bayes Theorem can be written as:

posterior = likelihood * prior / evidence
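A worked numeric sketch of this formula on an invented disease-testing scenario (all probabilities are made up for illustration):

# Bayes' theorem: posterior = likelihood * prior / evidence
p_disease = 0.01              # prior P(X): base rate of the disease
p_pos_given_disease = 0.95    # likelihood P(Y|X): test sensitivity
p_pos_given_healthy = 0.05    # false-positive rate

# Marginal probability of a positive test, P(Y)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

posterior = p_pos_given_disease * p_disease / p_pos
print(round(posterior, 3))    # ≈ 0.161: P(disease | positive test)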

What is decision theory in machine learning?

Decision theory is the study of an agent's rational choices, and it supports many kinds of progress in technology, such as work on machine learning and artificial intelligence. Decision theory looks at how decisions are made, how multiple decisions influence one another, and how decision-making parties deal with uncertainty.

There are two branches of decision theory: normative decision theory and descriptive decision theory.
There are four basic elements in decision theory: acts, events, outcomes, and payoffs.

A very fast intro to decision theory

There are four basic elements in decision theory: acts, events, outcomes, and payoffs. Acts are the actions being considered by the agent (in the example below, taking the raincoat or not); events are occurrences taking place outside the control of the agent (rain or lack thereof); outcomes are the results of the occurrence (or non-occurrence) of acts and events (staying dry or not; being burdened by the raincoat or not); payoffs are the values the decision maker places on the outcomes (for example, how much being free of the nuisance of carrying a raincoat is worth to one). Payoffs can be positive (staying dry) or negative (the raincoat nuisance). It is often useful to represent a decision problem by a tree, in which a square indicates a node where a decision is made and a circle a node where events take place. The tree does not contain payoffs yet, but they can easily be placed next to the outcomes.
In general, we can note two things. First, the nature of the payoffs depends on one's objectives. If one is interested only in making money, then payoffs are best accounted for in terms of money. However, if one is interested in, say, safety, then the payoffs are best accounted for in terms of risk of accident, for example. If any numerical approach is to be possible when disparate objectives are involved, there must be some universal measurable quantity making them comparable (in fact, utility, of which more later, is such a quantity). Second, decision trees can become unmanageable very fast if one tries to account for too many possibilities. For example, it would be physically impossible to account for all the possibilities involved in deciding which 50 out of 200 gadgets should be intensively marketed, as the number of possible combinations, 200!/(50! ∙ 150!), is simply astronomical. Hence one must use good judgment in limiting the options considered; this is potentially problematic, as one may unwittingly fail to consider a possible action that would produce very good outcomes.

What is the formula for decision theory?


The expected value of perfect information (EVPI), that is, the loss resulting from not having perfect information, is the difference between the expected value under certainty and the expected value under uncertainty. That is, EVPI = EMV_UC − EMV, where EMV_UC is the expected monetary value under certainty and EMV is the best expected monetary value under uncertainty.
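A toy sketch of this EVPI calculation, reusing the raincoat setting from above with invented payoffs and event probabilities:

# payoffs[act][event] and event probabilities are made up for illustration
payoffs = {
    "take raincoat": {"rain": 5, "dry": -1},
    "leave it":      {"rain": -10, "dry": 2},
}
p = {"rain": 0.5, "dry": 0.5}

# EMV: best expected payoff when choosing an act before knowing the event
emv = max(sum(p[e] * v[e] for e in p) for v in payoffs.values())

# EMV under certainty: expected payoff choosing the best act for each event
emv_uc = sum(p[e] * max(v[e] for v in payoffs.values()) for e in p)

print(emv_uc - emv)  # EVPI = EMV_UC - EMV = 1.5 here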
Information theory
Information theory is the mathematical treatment of the concepts, parameters and rules
governing the transmission of messages through communication systems.
Information theory is also the study of how much information is present in the signals or data we receive from our environment. AI/machine learning (ML) is about extracting interesting representations/information from data, which are then used for building models. Thus, information theory fundamentals are key to processing information while building machine learning models. In this blog post, we will provide examples of information theory and entropy concepts so that you can understand them better. We will also discuss how the concepts of information theory, entropy, etc., are related to machine learning.
What is information theory and what are its key concepts?
Information theory is the study of encoding, decoding, transmitting, and
manipulating information. Information theory provides tools & techniques to
compare and measure the information present in a signal. In simpler words,
how much information is present in one or more statements is a field of study
called information theory.

The greater the degree of surprise in a statement, the greater the information it contains. For example, say commuting from place A to B takes 3 hours on average and this is known to everyone. If somebody makes this statement, it provides no information at all, as it is already known to everyone. Now, if someone says that it takes 2 hours to go from place A to B provided a specific route is taken, then this statement carries a good deal of information, as there is an element of surprise in it.
The extent of information required to describe an event depends on the probability of that event occurring. If the event is common, not much information is required to describe it. However, for unusual events, a good amount of information is needed; unusual events carry a higher degree of surprise and hence more associated information.
The amount of information associated with event outcomes depends on the probability distribution associated with that event. Recall that an event and its outcomes can be represented as the different values of a random variable X over the given sample space, and that the random variable has an associated probability distribution, with common outcomes carrying less information and rare outcomes carrying a lot of information. The higher the probability of an outcome, the less information is conveyed if that outcome happens; the smaller the probability of an outcome, the greater the information conveyed if that lower-probability outcome happens.
How do we measure information?
There are the following requirements for measuring the information associated with events:

 Information (or degree of surprise) associated with a single discrete event: this can be measured in terms of a number of bits. Shannon introduced the bit as the unit of information, and this quantity is also called self-information.
 Information (or degree of surprise) associated with a random variable whose values represent different event outcomes, where the values can be discrete or continuous: information associated with a random variable is related to its probability distribution, as described in the previous section. The amount of information associated with a random variable is measured using entropy (or Shannon entropy).
 The entropy of the random variable equals the average self-information from observing each outcome of the event.

What is Entropy?
Entropy represents the amount of information associated with a random
variable as a function of the probability distribution for that random
variable, whether that distribution is a probability density function (PDF) or
a probability mass function (PMF). The following is the formula for the
entropy of a discrete random variable X:

H(X) = -\sum_{i} P_i \log_2 P_i

where P_i represents the probability of a specific value of the random
variable X. The entropy of a continuous random variable, also termed
differential entropy, replaces the sum with an integral over the PDF:

h(X) = -\int p(x) \log p(x) \, dx
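
As a quick illustration of the discrete formula above, the following sketch (with assumed example distributions, not from the original text) computes entropy as the average self-information over outcomes:

import math

def entropy(probs):
    # H(X) = -sum(P_i * log2(P_i)) over the outcomes of the random variable
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit: a fair coin is maximally uncertain
print(entropy([0.99, 0.01]))  # ~0.08 bits: a nearly certain outcome carries little information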
How are information theory, entropy, and
machine learning related?
Machine learning (ML) models are built using representations of the data that
contain a lot of information. These representations are also termed features.
They are crafted manually by data scientists or learned using deep learning
algorithms such as autoencoders. The goal is to come up with representations
that contain most of the information useful for building models that
generalize well by making accurate predictions on unseen data.

Information theory is the study of extracting information from the data or
signals related to an event. The information represents the degree of surprise
associated with the data, signal, or statements. The greater the degree of
surprise, the greater the information. The key concept in information theory is
entropy which represents the amount of information present in the data or
signals. The entropy is associated with the probability distribution related to
the outcomes of the event (random variable). The higher the probability of
occurrence of an event, the lesser the information from that outcome.

The performance of machine learning models depends upon how close the
estimated probability distribution of the random variable (representing the
response variable of the ML model) is to its true probability distribution.
This can be measured in terms of the entropy loss between the true probability
distribution and the estimated probability distribution of the response variable.
This is also termed cross-entropy loss as it represents entropy loss between
two probability distributions. Recall that entropy can be calculated as a
function of probability distribution related to different outcomes of the random
variables.
The goal of training a classification machine learning model is to come up with
a model which predicts the probability of the response variable belonging to
different classes, as close to the true probability as possible. If the model
predicts class 0 when the true class is 1, the cross-entropy loss is very high.
If the model predicts class 0 when the true class is 0, the loss is very low.
The goal is to
minimize the difference between the estimated probability and true probability
that a particular data set belongs to a specific class. In other words, the goal is
to minimize the cross-entropy loss – the difference between the true and
estimated probability distribution of the response variable (random variable).

The goal is to maximize the occurrence of the data set including the predictor
dataset and response data/label. In other words, the goal is to estimate the
parameters of the models that maximize the occurrence of the data set. The
occurrence of the dataset can be represented in terms of probability. Thus,
maximizing the occurrence of the data set can be represented as maximizing
the probability of occurrence of the data set including class labels and
predictor dataset. The following represents the probability that needs to be
maximized based on the estimation of parameters; this is also called maximum
likelihood estimation. The probability of occurrence of the data can be
represented as the joint probability of occurrence of each class label.
Assuming that every outcome of the event is independent of the others, the
probability of occurrence of the data can be represented as the following:

P(D) = \prod_{i=1}^{n} P(y^{(i)} \mid x^{(i)})

According to maximum likelihood estimation, maximizing the above equation is
equivalent to minimizing the following negative log-likelihood:

-\log P(D) = -\sum_{i=1}^{n} \log P(y^{(i)} \mid x^{(i)})

In the case of softmax regression, for any pair of true label y vs predicted
label ŷ over Q classes, the loss function can be calculated as the following:

l(y, \hat{y}) = -\sum_{j=1}^{Q} y_j \log \hat{y}_j

While training machine learning models for the classification problems, the
goal remains to minimize the loss function across all pairs of true and predicted
labels. The goal is to minimize the cross-entropy loss.
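
A minimal sketch of the cross-entropy loss described above, using made-up one-hot labels and predicted probabilities:

import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true: one-hot true labels; y_pred: predicted class probabilities
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred))

# A confident, correct prediction gives a low loss
print(cross_entropy(np.array([0, 1, 0]), np.array([0.05, 0.90, 0.05])))  # ~0.11
# A confident, wrong prediction gives a high loss
print(cross_entropy(np.array([0, 1, 0]), np.array([0.90, 0.05, 0.05])))  # ~3.00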

UNIT-2

Introduction

Supervised machine learning is a type of machine learning that learns the relationship
between input and output. The inputs are known as features or ‘X variables’ and
the output is generally referred to as the target or ‘y variable’. The type of data which contains both
the features and the target is known as labeled data. It is the key difference between
supervised and unsupervised machine learning, two prominent types of machine
learning. In this tutorial you will learn:

 What is Supervised Machine Learning


 Supervised vs. Unsupervised Machine Learning
 Semi-Supervised Machine Learning
 Supervised Machine Learning Algorithms:
 Linear Regression
 Decision Tree
 K Nearest Neighbors
 Random Forest
 Naive Bayes
 Supervised Machine Learning Python Code Example

What is Supervised Machine Learning?

Supervised machine learning learns patterns and relationships between input
and output data. It is defined by its use of labeled data. Labeled data is a
dataset that contains many examples of Features and Target. Supervised
learning uses algorithms that learn the relationship between Features and
Target from the dataset. This process is referred to as Training or Fitting.

Discriminative Models?

The discriminative model refers to a class of models used in statistical
classification, mainly for supervised machine learning. These types of
models are also known as conditional models since they learn the
boundaries between classes or labels in a dataset.

Discriminative models focus on modeling the decision boundary between
classes in a classification problem. The goal is to learn a function that maps
inputs to outputs indicating the class label of the input. Maximum
likelihood estimation is often used to estimate the parameters of the
discriminative model, such as the coefficients of a logistic regression model or
the weights of a neural network.
Discriminative models (just as the literal meaning suggests) separate classes
by learning the boundary between them directly, instead of modeling how the
data itself is distributed, and they don’t make any assumptions about the data
points. But these models are not capable of generating new data points.
Therefore, the ultimate objective of discriminative models is to separate one
class from another.

If some outliers are present in the dataset, discriminative models work
better than generative models, i.e., discriminative models are more robust to
outliers. However, one major drawback of these models is
the misclassification problem, i.e., wrongly classifying a data point.



The Mathematics of Discriminative Models
Training discriminative classifiers or discriminant analysis involves estimating
a function f: X -> Y, or the probability P(Y|X):

 Assume some functional form for the probability, such as P(Y|X)
 With the help of the training data, estimate the parameters of P(Y|X)

Examples of Discriminative Models

 Logistic regression
 Support vector machines (SVMs)
 Traditional neural networks
 Nearest neighbor
 Conditional Random Fields (CRFs)
 Decision Trees and Random Forest
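
As a sketch of the discriminative recipe above (assume a functional form for P(Y|X), then estimate its parameters by maximum likelihood), the following uses scikit-learn's logistic regression on synthetic data; the dataset settings are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic labeled data: X holds the features, y the class labels
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Logistic regression assumes a functional form for P(Y|X) and fits its
# coefficients by maximum likelihood estimation
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X[:3]))  # estimated P(Y|X) for the first three points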

What Are Generative Models?

Generative models are considered a class of statistical models that can
generate new data instances. These models are used in unsupervised
machine learning to perform tasks such as:

 Probability and likelihood estimation
 Modeling data points
 Describing the phenomenon in data
 Distinguishing between classes based on these probabilities
Since these models often rely on the Bayes theorem to find the joint
probability, generative models can tackle a more complex task than
analogous discriminative models.

So, the generative approach focuses on the distribution of individual classes
in a dataset, and the learning algorithms tend to model the underlying patterns
or distribution of the data points (e.g., Gaussian). These models use the
concept of joint probability and create instances where a given feature (x) or
input and the desired output or label (y) exist simultaneously.

These models use probability estimates and likelihood to model data points
and differentiate between different class labels present in a dataset. Unlike
discriminative models, these models can also generate new data points.

However, they also have a major drawback: if outliers are present in the
dataset, they affect these types of models to a significant extent.

The Mathematics of Generative Models

Training generative classifiers involves estimating a function f: X -> Y, or
the probability P(Y|X):

 Assume some functional form for the probabilities such as P(Y), P(X|Y)
 With the help of training data, we estimate the parameters of P(X|Y), P(Y)
 Use the Bayes theorem to calculate the posterior probability P(Y |X)

Examples of Generative Models

 Naïve Bayes
 Bayesian networks
 Markov random fields
 Hidden Markov Models (HMMs)
 Latent Dirichlet Allocation (LDA)
 Generative Adversarial Networks (GANs)
 Autoregressive Models
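
For contrast, here is a sketch of the generative recipe (estimate P(Y) and P(X|Y), then apply the Bayes theorem for the posterior) using Gaussian Naive Bayes on the same kind of synthetic data; again, the dataset settings are illustrative:

from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Naive Bayes models P(X|Y) (here, per-class Gaussians) and P(Y), then
# uses the Bayes theorem to compute the posterior P(Y|X)
gnb = GaussianNB().fit(X, y)
print(gnb.class_prior_)          # estimated P(Y)
print(gnb.predict_proba(X[:3]))  # posterior P(Y|X) via the Bayes theorem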

Linear Regression

Linear regression is one of the simplest machine learning algorithms
available. It is used to learn to predict a continuous value (the dependent
variable) based on the features (independent variables) in the training
dataset. The value of the dependent variable, which represents the effect, is
influenced by changes in the values of the independent variables.

If you recall the “line of best fit” from school days, this is exactly what linear
regression is. Predicting a person's weight based on their height is a
straightforward example of this concept.
PROS                                                        CONS

Simple, easy to understand and interpret                    Easy to overfit

Performs exceptionally well for linearly separable data     Assumes linearity between features and target variable

Least Squares Method?


The least squares method is a form of mathematical regression analysis used
to determine the line of best fit for a set of data, providing a visual
demonstration of the relationship between the data points. Each point of data
represents the relationship between a known independent variable and an
unknown dependent variable. This method is commonly used by statisticians
and traders who want to identify trading opportunities and trends.

KEY TAKEAWAYS

 The least squares method is a statistical procedure to find the best fit
for a set of data points.
 The method works by minimizing the sum of the offsets or residuals of
points from the plotted curve.
 Least squares regression is used to predict the behavior of dependent
variables.
 The least squares method provides the overall rationale for the
placement of the line of best fit among the data points being studied.
 Traders and analysts can use the least squares method to identify
trading opportunities and economic or financial trends.
Understanding the Least Squares Method
The least squares method is a form of regression analysis that provides the
overall rationale for the placement of the line of best fit among the data points
being studied. It begins with a set of data points using two variables, which
are plotted on a graph along the x- and y-axis. Traders and analysts can use
this as a tool to pinpoint bullish and bearish trends in the market along with
potential trading opportunities.

The most common application of this method, sometimes referred to as
linear or ordinary least squares, aims to create a straight line that minimizes
the sum of squares of the errors generated by the results of the associated
equations, such as the squared residuals resulting from differences between
the observed values and the values anticipated based on the model.

For instance, an analyst may use the least squares method to generate a line
of best fit that explains the potential relationship between independent and
dependent variables. The line of best fit determined from the least squares
method has an equation that highlights the relationship between the data
points.

If the data shows a linear relationship between two variables, it results in a
least-squares regression line, which minimizes the vertical distance from the
data points to the regression line. The term least squares is used because the
fitted line has the smallest possible sum of squared errors. A
non-linear least-squares problem, on the other hand, has no closed-form solution
and is generally solved by iteration.
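
A small sketch of ordinary least squares in practice, using the height/weight example from earlier (the numbers are made up); np.polyfit finds the slope and intercept that minimize the sum of squared residuals:

import numpy as np

# Hypothetical data: heights in cm (x) and weights in kg (y)
x = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
y = np.array([52.0, 58.0, 65.0, 73.0, 80.0])

# Slope and intercept of the line of best fit
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)

# The quantity that least squares minimizes: the sum of squared residuals
residuals = y - (slope * x + intercept)
print((residuals ** 2).sum())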

Overfitting
Overfitting occurs when our machine learning model tries to cover all the
data points, or more data points than necessary, in the given dataset.
Because of this, the model starts capturing the noise and inaccurate
values present in the dataset, and all these factors reduce the efficiency and
accuracy of the model. The overfitted model has low bias and high
variance.

The chance of overfitting increases the more we train our model: the more
training we do, the higher the likelihood of ending up with an overfitted
model.

Overfitting is the main problem that occurs in supervised learning.

Example: The concept of overfitting can be understood from the below
graph of a linear regression output:
As we can see from the above graph, the model tries to cover all the data
points present in the scatter plot. It may look efficient, but in reality, it is
not. Because the goal of the regression model is to find the best-fit line, and
here we have not got a best fit, the model will generate prediction errors.

How to avoid Overfitting in a Model

Both overfitting and underfitting cause degraded performance of the
machine learning model. But the main problem is overfitting, and there are
some ways by which we can reduce its occurrence in our model (a brief
sketch follows the list):

o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
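
As a brief sketch of two of these remedies (cross-validation and regularization) on synthetic data, compare the cross-validated score of a plain linear model against an L2-regularized one; the dataset and alpha are illustrative assumptions:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Few samples, many features: a setting where overfitting is likely
X, y = make_regression(n_samples=60, n_features=30, noise=10.0, random_state=0)

# Cross-validation gives a more honest estimate of generalization, and
# the Ridge penalty (regularization) usually improves it here
print(cross_val_score(LinearRegression(), X, y, cv=5).mean())
print(cross_val_score(Ridge(alpha=1.0), X, y, cv=5).mean())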

Underfitting
Underfitting occurs when our machine learning model is not able to capture
the underlying trend of the data. To avoid overfitting, the feeding of
training data may be stopped at an early stage, due to which the model
may not learn enough from the training data. As a result, it may fail to find
the best fit of the dominant trend in the data.

In the case of underfitting, the model is not able to learn enough from the
training data, and hence it reduces the accuracy and produces unreliable
predictions.

An underfitted model has high bias and low variance.

Example: We can understand underfitting using the below output of the
linear regression model:
As we can see from the above diagram, the model is unable to capture the data
points present in the plot.

How to avoid underfitting:


o By increasing the training time of the model.
o By increasing the number of features.

Cross Validation in Machine Learning


In machine learning, we cannot simply fit the model on the training data
and claim that it will work accurately on real data. For this, we must
ensure that our model has learned the correct patterns from the data,
and is not picking up too much noise. For this purpose, we use the
cross-validation technique.
Cross validation is a technique used in machine learning to evaluate
the performance of a model on unseen data. It involves dividing the
available data into multiple folds or subsets, using one of these folds as
a validation set, and training the model on the remaining folds. This
process is repeated multiple times, each time using a different fold as
the validation set. Finally, the results from each validation step are
averaged to produce a more robust estimate of the model’s
performance.
The main purpose of cross validation is to prevent overfitting, which
occurs when a model is trained too well on the training data and
performs poorly on new, unseen data. By evaluating the model on
multiple validation sets, cross validation provides a more realistic
estimate of the model’s generalization performance, i.e., its ability to
perform well on new, unseen data.
There are several types of cross validation techniques, including k-fold
cross validation, leave-one-out cross validation, and stratified cross
validation. The choice of technique depends on the size and nature of
the data, as well as the specific requirements of the modeling problem.
In summary, cross validation is an important step in the machine
learning process and helps to ensure that the model selected for
deployment is robust and generalizes well to new data.
Cross-Validation
Cross-validation is a technique in which we train our model using the
subset of the data-set and then evaluate using the complementary
subset of the data-set. The three steps involved in cross-validation are
as follows :
1. Reserve some portion of sample data-set.
2. Using the rest data-set train the model.
3. Test the model using the reserve portion of the data-set.
Methods of Cross Validation
Validation: In this method, we perform training on 50% of the given
data-set and the remaining 50% is used for testing. The major drawback
of this method is that since we train on only 50% of the dataset, it is
possible that the remaining 50% contains important information which we
leave out while training our model, i.e., higher bias.
LOOCV (Leave One Out Cross Validation): In this method, we perform
training on the whole data-set but leave out only one data point, and we
iterate this for each data point. It has advantages as well as
disadvantages. An advantage of this method is that we make use of all
data points, hence it has low bias. The major drawback is that it leads to
higher variance in the testing model, as we are testing against a single
data point; if that data point is an outlier, it can lead to higher variation.
Another drawback is that it takes a lot of execution time, as it iterates as
many times as there are data points.
K-Fold Cross Validation: In this method, we split the data-set into k
subsets (known as folds), then we perform training on k-1 of the subsets
and leave one subset for the evaluation of the trained model. We iterate k
times, with a different subset reserved for testing each time.
Note:
It is often suggested that the value of k should be 10, as a lower value
of k takes us towards a simple validation approach, while a higher value
of k leads to the LOOCV method.
Example: The diagram below shows an example of the training subsets
and evaluation subsets generated in k-fold cross-validation. Here, we
have a total of 25 instances. In the first iteration, we use the first 20
percent of the data for evaluation and the remaining 80 percent for
training ([1-5] testing and [6-25] training), while in the second iteration
we use the second subset of 20 percent for evaluation and the remaining
data for training ([6-10] testing and [1-5 and 11-25] training), and so on.


Total instances: 25
Value of k : 5

Iteration    Training set observations                              Testing set observations
1 [ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
23 24] [0 1 2 3 4]
2 [ 0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 22
23 24] [5 6 7 8 9]
3 [ 0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22
23 24] [10 11 12 13 14]
4 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22
23 24] [15 16 17 18 19]
5 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
18 19] [20 21 22 23 24]
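
The split listing above can be reproduced with scikit-learn's KFold; a minimal sketch:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(25)  # 25 instances, as in the example above
kf = KFold(n_splits=5)

# Each iteration holds out a different 20 percent for testing
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(i, "train:", train_idx, "test:", test_idx)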
Comparison of train/test split to cross-validation
Advantages of train/test split:
1. It runs K times faster than K-fold cross-validation, because K-fold
cross-validation repeats the train/test split K times.
2. Simpler to examine the detailed results of the testing process.
Advantages of cross-validation:
1. More accurate estimate of out-of-sample accuracy.
2. More “efficient” use of data as every observation is used for both
training and testing.
Advantages of Cross Validation:
1. Overcoming Overfitting: Cross validation helps to prevent overfitting
by providing a more robust estimate of the model’s performance on
unseen data.
2. Model Selection: Cross validation can be used to compare different
models and select the one that performs the best on average.
3. Hyperparameter tuning: Cross validation can be used to optimize
the hyperparameters of a model, such as the regularization
parameter, by selecting the values that result in the best
performance on the validation set.
4. Data Efficient: Cross validation allows the use of all the available
data for both training and validation, making it a more data-efficient
method compared to traditional validation techniques.

Disadvantages of Cross Validation:

1. Computationally Expensive: Cross validation can be computationally


expensive, especially when the number of folds is large or when the
model is complex and requires a long time to train.
2. Time-Consuming: Cross validation can be time-consuming,
especially when there are many hyperparameters to tune or when
multiple models need to be compared.
3. Bias-Variance Tradeoff: The choice of the number of folds in cross
validation can impact the bias-variance tradeoff, i.e., too few folds
may result in high bias, while too many folds may result in high
variance.

Lasso Regression:
Lasso regression stands for Least Absolute Shrinkage and Selection Operator.
It adds a penalty term to the cost function: the absolute sum of the
coefficients. As the value of a coefficient increases from 0, this term
penalizes it, causing the model to decrease the coefficient's value in order to
reduce the loss. The difference between ridge and lasso regression is that
lasso tends to shrink coefficients to exactly zero, whereas Ridge never sets
the value of a coefficient to exactly zero.
Limitation of Lasso Regression:
 Lasso sometimes struggles with some types of data. If the number of
predictors (p) is greater than the number of observations (n), Lasso will pick
at most n predictors as non-zero, even if all predictors are relevant (or may
be used in the test set).
 If there are two or more highly collinear variables then LASSO regression
select one of them randomly which is not good for the interpretation of data
Elastic Net:
Sometimes, lasso regression can cause a small bias in the model, where the
prediction is too dependent upon a particular variable. In these cases, Elastic
Net has proved to be better, as it combines the regularization of both Lasso
and Ridge. Its advantage is that it does not easily eliminate highly collinear
coefficients.
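
A short sketch contrasting the two penalties on synthetic data (alpha and l1_ratio are illustrative choices): Lasso zeroes out irrelevant coefficients, while Elastic Net blends the L1 and L2 penalties:

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, ElasticNet

# Only 3 of the 10 features actually carry signal
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# Lasso's L1 penalty drives irrelevant coefficients to exactly zero
print(Lasso(alpha=1.0).fit(X, y).coef_)

# Elastic Net mixes L1 and L2 penalties; l1_ratio controls the mix
print(ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y).coef_)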

Logistic Regression

Logistic Regression is a classification algorithm that uses the Sigmoid function instead of a linear
function to model data.

The Sigmoid curve is shown in the figure below.


We can also use Logistic Regression for multi-class tasks by modeling each class separately. The
Regression's outcome must be a discrete or categorical value (e.g., Yes/No, True/False). The model's
output is a probabilistic value in the range [0,1]. The modeled curve that the logistic function
uses indicates the likelihood of the binary decision. The following equation can mathematically
represent Logistic Regression.

\log\left[\frac{y}{1-y}\right] = b_0 + b_1x_1 + b_2x_2 + \dots + b_nx_n
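
To connect the equation to the probability output, here is a minimal sketch (with made-up coefficients b0 and b1): inverting the log-odds with the Sigmoid function recovers the probability y in [0,1]:

import numpy as np

def sigmoid(z):
    # Maps the linear term b0 + b1*x1 + ... + bn*xn into the range [0, 1]
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -1.0, 2.0                   # hypothetical coefficients
x = np.array([-2.0, 0.0, 0.5, 3.0])
print(sigmoid(b0 + b1 * x))          # probability of the positive class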

Gradient Descent
Gradient Descent is an iterative optimization algorithm that tries to find the
optimum value (Minimum/Maximum) of an objective function. It is one of the
most used optimization techniques in machine learning projects for updating the
parameters of a model in order to minimize a cost function.
The main aim of gradient descent is to find the best parameters of a model
which gives the highest accuracy on training as well as testing datasets. In
gradient descent, The gradient is a vector that points in the direction of the
steepest increase of the function at a specific point. Moving in the opposite
direction of the gradient allows the algorithm to gradually descend towards
lower values of the function, eventually reaching the minimum of the
function.
Steps Required in Gradient Descent Algorithm

 Step 1: First initialize the parameters of the model randomly.
 Step 2: Compute the gradient of the cost function with respect to each
parameter. This involves taking the partial derivative of the cost function with
respect to each parameter.
 Step 3: Update the parameters of the model by taking steps in the opposite
direction of the gradient. Here we choose a hyperparameter called the learning
rate, denoted by alpha, which decides the step size of each update.
 Step 4: Repeat steps 2 and 3 iteratively to get the best parameters for the
defined model.

Pseudocode for Gradient Descent

t ← 0
max_iterations ← 1000
w, b ← initialize randomly

while t < max_iterations do
    t ← t + 1
    w_(t+1) ← w_t − η ∇_w J(w_t, b_t)
    b_(t+1) ← b_t − η ∇_b J(w_t, b_t)
end
Here max_iterations is the number of iterations we want to run to update our
parameters,
w, b are the weight and bias parameters, and
η is the learning rate, also denoted by alpha.
To apply this gradient descent on data using any programming language we
have to make four new functions using which we can update our parameter and
apply it to data to make a prediction. We will see each function one by one and
understand it
1. gradient_descent – In the gradient descent function we will make the
prediction on a dataset and compute the difference between the predicted
and actual target value and accordingly we will update the parameter and
hence it will return the updated parameter.
2. compute_predictions – In this function, we will compute the prediction
using the parameters at each iteration.
3. compute_gradient – In this function we will compute the error which is the
difference between the actual and predicted target value and then compute
the gradient using this error and training data.
4. update_parameters – In this separate function we will update the parameter
using learning rate and gradient that we got from the compute_gradient
function.
# A runnable NumPy version of the four functions described above:
import numpy as np

def compute_predictions(X, theta):
    # Linear model: predictions are the dot product of features and parameters
    return X @ theta

def compute_gradient(X, y, predictions):
    # Gradient of the mean squared error cost with respect to theta
    m = len(y)
    error = predictions - y
    return X.T @ error / m

def update_parameters(theta, gradient, learning_rate):
    # Step in the opposite direction of the gradient
    return theta - learning_rate * gradient

def gradient_descent(X, y, learning_rate, num_iterations):
    theta = np.zeros(X.shape[1])  # initialize parameters
    for _ in range(num_iterations):
        predictions = compute_predictions(X, theta)
        gradient = compute_gradient(X, y, predictions)
        theta = update_parameters(theta, gradient, learning_rate)
    return theta
Mathematics Behind Gradient Descent
In the Machine Learning Regression problem, our model targets to get the best-
fit regression line to predict the value y based on the given input value (x).
While training the model, the model calculates the cost function like Root Mean
Squared error between the predicted value (pred) and true value (y). Our model
targets to minimize this cost function.
To minimize this cost function, the model needs to have the best values of θ1 and
θ2 (for a univariate linear regression problem). Initially, the model selects θ1 and
θ2 values randomly and then iteratively updates these values to minimize
the cost function until it reaches the minimum. By the time the model achieves the
minimum cost function, it will have the best θ1 and θ2 values. Using these
updated values of θ1 and θ2 in the hypothesis equation of the linear model, the
model will predict the output value y.
How do θ1 and θ2 values get updated?
Linear Regression Cost Function:

J(\theta_1, \theta_2) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2

So our model's aim is to minimize this cost function and store the parameters
which make it minimum.
Gradient Descent Algorithm For Linear Regression

Repeat until convergence:

\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}

-> θj : weights of the hypothesis.
-> hθ(x^(i)) : predicted y value for the i-th input.
-> j : feature index number (can be 0, 1, 2, ......, n).
-> α : learning rate of the gradient descent.

How Does Gradient Descent Work

Gradient descent works by moving downward toward the pits or valleys in the
graph to find the minimum value. This is achieved by taking the derivative of the
cost function, as illustrated in the figure below. During each iteration, gradient
descent step-downs the cost function in the direction of the steepest descent. By
adjusting the parameters in this direction, it seeks to reach the minimum of the
cost function and find the best-fit values for the parameters. The size of each
step is determined by parameter α known as Learning Rate.
In the Gradient Descent algorithm, one can infer two points:
 If the slope is +ve: θj = θj – (+ve value). Hence the value of θj decreases.
 If the slope is -ve: θj = θj – (-ve value). Hence the value of θj increases.

How To Choose Learning Rate

The choice of a correct learning rate is very important, as it ensures that Gradient
Descent converges in a reasonable time:
 If we choose α to be very large, Gradient Descent can overshoot the
minimum. It may fail to converge or even diverge.
 If we choose α to be very small, Gradient Descent will take small steps to
reach the local minimum and will take a longer time to converge.

Support Vector Machine

A Support Vector Machine (SVM) is a supervised classification and regression algorithm that
uses the concept of hyperplanes. These hyperplanes can be understood as multi-dimensional
linear decision boundaries that separate groups of differently labeled data points. An example of a
hyperplane is shown below.

An optimal fit of the SVM occurs when the hyperplane is furthest from the training data points of
any of the classes; the larger this distance margin, the lower the classifier's error.

To better understand how the SVM works, consider a group of data points like the one shown in
the diagram. It is a good fit if the hyperplane separates the points in the space, so they are
clustered according to their labels. If not, further iterations of the algorithm are performed.
Kernel Methods
Kernels or kernel methods (also called Kernel functions) are sets of different

types of algorithms that are being used for pattern analysis. They are used

to solve a non-linear problem by using a linear classifier. Kernels Methods

are employed in SVM (Support Vector Machines) which are used in

classification and regression problems. The SVM uses what is called a “Kernel

Trick” where the data is transformed and an optimal boundary is found for

the possible outputs.

The Need for Kernel Methods and their Working
Before we get into the working of the Kernel Methods, it is more important to

understand support vector machines or the SVMs because kernels are

implemented in SVM models. So, Support Vector Machines are

supervised machine learning algorithms that are used in classification

and regression problems such as classifying an apple to class fruit while

classifying a Lion to the class animal.

To demonstrate, below is what support vector machines look like:


Here we can see a hyperplane which is separating the green dots from the blue
ones. A hyperplane is one dimension less than the ambient space. E.g., in the
above figure, we have two dimensions representing the ambient space, but the
line which divides or classifies the space is one dimension less than the
ambient space and is called a hyperplane.

But what if we have input like this:


It is very difficult to solve this classification using a linear classifier as there is

no good linear line that should be able to classify the red and the green dots

as the points are randomly distributed. Here comes the use of kernel

function which takes the points to higher dimensions, solves the problem

over there and returns the output. Think of this in this way, we can see that

the green dots are enclosed in some perimeter area while the red one lies

outside it, likewise, there could be other scenarios where green dots might

be distributed in a trapezoid-shaped area.

So what we do is to convert the two-dimensional plane which was first

classified by one-dimensional hyperplane (“or a straight line”) to the three-

dimensional area and here our classifier i.e. hyperplane will not be a straight

line but a two-dimensional plane which will cut the area.


In order to get a mathematical understanding of kernels, let us look at Lili
Jiang's description of a kernel, which is:

K(x, y) = <f(x), f(y)>, where

K is the kernel function,
x and y are the n-dimensional inputs,
f is the map from n-dimensional to m-dimensional space, and
<x, y> denotes the dot product.

Illustration with the help of an example.


Let us say that we have two points, x= (2, 3, 4) and y= (3, 4, 5)

As we have seen, K(x, y) = < f(x), f(y) >.

Let us first calculate < f(x), f(y) >

f(x)=(x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3)

f(y)=(y1y1, y1y2, y1y3, y2y1, y2y2, y2y3, y3y1, y3y2, y3y3)

so,

f(2, 3, 4)=(4, 6, 8, 6, 9, 12, 8, 12, 16)and

f(3 ,4, 5)=(9, 12, 15, 12, 16, 20, 15, 20, 25)

so the dot product,

f (x). f (y) = f(2,3,4) . f(3,4,5)=

(36 + 72 + 120 + 72 +144 + 240 + 120 + 240 + 400)=


1444

And,

K(x, y) = (2*3 + 3*4 + 4*5) ^2=(6 + 12 + 20)^2=38*38=1444.

Thus, as we find out, f(x)·f(y) and K(x, y) give us the same result, but the
former method required a lot of calculations (because of projecting 3
dimensions into 9 dimensions), while using the kernel it was much easier.
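
The calculation above can be checked in a few lines of Python; f maps each 3-dimensional vector to all 9 pairwise products, while the kernel computes the same value directly:

import numpy as np

x = np.array([2, 3, 4])
y = np.array([3, 4, 5])

# Explicit feature map: all pairwise products (3 dimensions -> 9 dimensions)
f = lambda v: np.outer(v, v).ravel()
print(f(x) @ f(y))   # 1444, via the 9-dimensional dot product

# Kernel trick: the same value without ever leaving 3 dimensions
print((x @ y) ** 2)  # (6 + 12 + 20)^2 = 1444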

Types of Kernel and methods in SVM


Let us see some of the kernel function or the types that are being used in

SVM:

1. Linear Kernel
Let us say that we have two vectors named x1 and x2; then the linear
kernel is defined by the dot product of these two vectors:

K(x1, x2) = x1 · x2

2. Polynomial Kernel
A polynomial kernel is defined by the following equation:

K(x1, x2) = (x1 · x2 + 1)^d,

where d is the degree of the polynomial and x1 and x2 are vectors.

3. Gaussian Kernel
The Gaussian (RBF) kernel is given by K(x1, x2) = exp(−‖x1 − x2‖² / 2σ²).

4. Exponential Kernel
The exponential kernel is closely related to the Gaussian kernel, with the
square of the norm dropped: K(x1, x2) = exp(−‖x1 − x2‖ / 2σ²).

5. Laplacian Kernel
This type of kernel is less prone to changes and is essentially equal to the
previously discussed exponential kernel; the equation of the Laplacian kernel
is given as K(x1, x2) = exp(−‖x1 − x2‖ / σ).

6. Hyperbolic or Sigmoid Kernel

This kernel is used in the neural network areas of machine learning. The
activation function for the sigmoid kernel is the bipolar sigmoid function.
This kernel is widely used in support vector machines.

7. Anova radial basis kernel


This kernel is known to perform very well in multidimensional regression

problems just like the Gaussian and Laplacian kernels. This also comes under

the category of radial basis kernel.

There are a lot more types of kernel methods; we have discussed the most
commonly used ones. The type of problem purely decides which kernel
function should be used.

Instance-based learning
The Machine Learning systems which are categorized as instance-
based learning are the systems that learn the training examples by
heart and then generalize to new instances based on some similarity
measure. It is called instance-based because it builds the hypotheses
from the training instances. It is also known as memory-based
learning or lazy learning (because processing is delayed until a new
instance must be classified). The time complexity of this algorithm
depends upon the size of the training data. Each time a new query is
encountered, its previously stored data is examined and a target
function value is assigned to the new instance.
The worst-case time complexity of this algorithm is O(n), where n is
the number of training instances. For example, if we were to create a
spam filter with an instance-based learning algorithm, instead of just
flagging emails that are already marked as spam emails, our spam
filter would be programmed to also flag emails that are very similar to
them. This requires a measure of resemblance between two emails. A
similarity measure between two emails could be the same sender, the
repetitive use of the same keywords, or something else.
Advantages:
1. Instead of estimating for the entire instance set, local
approximations can be made to the target function.
2. This algorithm can adapt easily to new data, which is collected
as we go.
Disadvantages:
1. Classification costs are high
2. Large amount of memory required to store the data, and each query
involves starting the identification of a local model from scratch.
Some of the instance-based learning algorithms are :
1. K Nearest Neighbor (KNN)
2. Self-Organizing Map (SOM)
3. Learning Vector Quantization (LVQ)
4. Locally Weighted Learning (LWL)
5. Case-Based Reasoning

K Nearest Neighbors

K-Nearest Neighbors is a statistical method that evaluates the proximity of


one data point to another data point in order to decide whether or not the
two data points can be grouped together. The proximity of the data points
represents the degree to which they are comparable to one another.

For instance, suppose we had a graph with two distinct groups of data points
that were located in close proximity to one another and named Group A and
Group B, respectively. Each of these groups of data points would be
represented by a point on the graph. When we add a new data point, the
group of that instance will depend on which group the new point is closer to.

PROS                                 CONS

Makes no assumption about the data   Takes a long time to make predictions, since computation is deferred to query time

Intuitive and simple                 KNN works well with a small number of features, but as the number of features grows it struggles to predict accurately
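
A minimal working sketch of KNN with scikit-learn on the classic Iris dataset (k = 5 is an arbitrary choice here):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each test point is assigned the majority class among its 5 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))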

Tree based methods:

Decision Tree

Decision trees are tree-based decision models with a root node followed by successive internal
and leaf nodes. The leaf nodes are placeholders for the classification label, and the
branches show the outcomes of the decision. The paths from the tree's root to the leaves
represent the classifier's rules. Each tree and sub-tree models a single decision and enumerates all
the possible decisions to choose the best one. A decision tree is optimal if it represents most
of the data with the least number of levels.
Decision trees are helpful for classification but can be extended for Regression using different
algorithms. These trees are computationally efficient, and many tree-based optimizations have
been created over the years to make them perform even faster.

An example of such a tree is shown

below.

ID3 algorithm:

ID3 stands for Iterative Dichotomiser 3. It is a classification
algorithm that follows a greedy approach of building a decision
tree by selecting the best attribute, the one that yields maximum Information
Gain (IG) or minimum Entropy (H).
In this section, we will use the ID3 algorithm to build a decision tree
based on weather data and illustrate how we can use this procedure
to make a decision on an action (like whether to play outside) based on the
current data using the previously collected data.

What is a Decision Tree?


A Supervised Machine Learning Algorithm, used to build classification and
regression models in the form of a tree structure.

A decision tree is a tree where each -


 Node - a feature(attribute)
 Branch - a decision(rule)
 Leaf - an outcome(categorical or continuous)
There are many algorithms to build decision trees, here we are going to
discuss ID3 algorithm with an example.

What is an ID3 Algorithm?


ID3 stands for Iterative Dichotomiser 3
It is a classification algorithm that follows a greedy approach by selecting a
best attribute that yields maximum Information Gain(IG) or minimum
Entropy(H).

What is Entropy and Information gain?


Entropy is a measure of the amount of uncertainty in the
dataset S. The mathematical representation of entropy is shown here:

H(S) = \sum_{c \in C} -p(c) \log_2 p(c)

Where,

 S - The current dataset for which entropy is being


calculated(changes every iteration of the ID3 algorithm).
 C - Set of classes in S {example - C ={yes, no}}
 p(c) - The proportion of the number of elements in class c to the
number of elements in set S.
In ID3, entropy is calculated for each remaining attribute. The attribute with
the smallest entropy is used to split the set S on that particular iteration.

Entropy = 0 implies it is of pure class, that means all are of same category.

Information Gain IG(A) tells us how much uncertainty in S
was reduced after splitting set S on attribute A. The mathematical
representation of information gain is shown here:

IG(S, A) = H(S) - \sum_{t \in T} p(t) H(t)
Where,

 H(S) - Entropy of set S.


 T - The subsets created from splitting set S by attribute A, such that
S = \bigcup_{t \in T} t
 p(t) - The proportion of the number of elements in t to the number of
elements in set S.
 H(t) - Entropy of subset t.
In ID3, information gain can be calculated (instead of entropy) for each
remaining attribute. The attribute with the largest information gain is used to
split the set S on that particular iteration.

What are the steps in ID3


algorithm?
The steps in ID3 algorithm are as follows:
1. Calculate entropy for dataset.
2. For each attribute/feature.
2.1. Calculate entropy for all its categorical values.
2.2. Calculate information gain for the feature.
3. Find the feature with maximum information gain.
4. Repeat it until we get the desired tree.
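
Steps 1-3 can be sketched in Python; the tiny weather table below is a hypothetical stand-in for the weather data mentioned above:

import math
from collections import Counter

def entropy(labels):
    # H(S) = sum over classes c of -p(c) * log2(p(c))
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(attribute, labels):
    # IG(S, A) = H(S) - sum over subsets t of p(t) * H(t)
    total = len(labels)
    subsets = {}
    for a, l in zip(attribute, labels):
        subsets.setdefault(a, []).append(l)
    return entropy(labels) - sum((len(t) / total) * entropy(t)
                                 for t in subsets.values())

# Hypothetical weather data: the outlook attribute vs. the play/no-play target
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
print(entropy(play))                    # 1.0: the full set is maximally uncertain
print(information_gain(outlook, play))  # ~0.67: splitting on outlook reduces uncertainty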
Characteristics of ID3 Algorithm are as follows:

1. ID3 uses a greedy approach; that is why it does not guarantee an
optimal solution and can get stuck in local optima.
2. ID3 can overfit to the training data (to avoid overfitting, smaller
decision trees should be preferred over larger ones).
3. This algorithm usually produces small trees, but it does not always
produce the smallest possible tree.
4. ID3 is harder to use on continuous data (if the values of any given
attribute is continuous, then there are many more places to split the
data on this attribute, and searching for the best value to split by
can be time consuming).
Ensemble Methods, what are they? An ensemble method is a
machine learning technique that combines several base
models in order to produce one optimal predictive model. To
better understand this definition, let's take a step back to the
ultimate goal of machine learning and model building. This is
going to make more sense as I dive into specific examples
and why Ensemble Methods are used.

I will largely utilize Decision Trees to outline the definition


and practicality of Ensemble Methods (however it is
important to note that Ensemble Methods do not only pertain
to Decision Trees).

A Decision Tree determines the predictive value based on


series of questions and conditions. For instance, this simple
Decision Tree determining on whether an individual should
play outside or not. The tree takes several weather factors
into account, and given each factor either makes a decision
or asks another question. In this example, every time it is
overcast, we will play outside. However, if it is raining, we
must ask if it is windy or not. If windy, we will not play. But
given no wind, tie those shoelaces tight because we're going
outside to play.

Decision Trees can also solve quantitative problems with the
same format. In the tree to the left, we want to know whether
or not to invest in a commercial real estate property. Is it an
office building? A warehouse? An apartment building? Good
economic conditions? Poor economic conditions? How much
will an investment return? These questions are answered and
solved using this decision tree.

When making Decision Trees, there are several factors we


must take into consideration: On what features do we make
our decisions on? What is the threshold for classifying each
question into a yes or no answer? In the first Decision Tree,
what if we wanted to ask ourselves if we had friends to play
with or not. If we have friends, we will play every time. If not,
we might continue to ask ourselves questions about the
weather. By adding an additional question, we hope to
better define the Yes and No classes.
This is where Ensemble Methods come in handy! Rather than
just relying on one Decision Tree and hoping we made the
right decision at each split, Ensemble Methods allow us to
take a sample of Decision Trees into account, calculate which
features to use or questions to ask at each split, and make a
final predictor based on the aggregated results of the
sampled Decision Trees.

Types of Ensemble Methods


1. BAGGing, or Bootstrap AGGregating. BAGGing gets
its name because it combines Bootstrapping
and Aggregation to form one ensemble model. Given a
sample of data, multiple bootstrapped subsamples are
pulled. A Decision Tree is formed on each of the
bootstrapped subsamples. After each subsample
Decision Tree has been formed, an algorithm is used
to aggregate over the Decision Trees to form the most
efficient predictor. The image below will help explain:
Given a Dataset, bootstrapped subsamples are pulled. A Decision Tree is formed on each
bootstrapped sample. The results of each tree are aggregated to yield the strongest, most accurate
predictor.

2. Random Forest Models. Random Forest Models can be


thought of as BAGGing, with a slight tweak. When deciding
where to split and how to make decisions, BAGGed Decision
Trees have the full disposal of features to choose from.
Therefore, although the bootstrapped samples may be
slightly different, the data is largely going to break off at the
same features throughout each model. In contrast, Random
Forest models decide where to split based on a random
selection of features. Rather than splitting on similar features
at each node throughout, Random Forest models implement
a level of differentiation because each tree will split based on
different features. This level of differentiation provides a
greater ensemble to aggregate over, ergo producing a more
accurate predictor. Refer to the image for a better
understanding.

Similar to BAGGing, bootstrapped subsamples are pulled


from a larger dataset. A decision tree is formed on each
subsample. HOWEVER, the decision tree is split on different
features (in this diagram the features are represented by
shapes).
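
A compact sketch of both ideas with scikit-learn (synthetic data; the estimator counts are arbitrary choices): bagged trees may split on any feature, while the random forest also randomizes the features considered at each split:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# BAGGing: each tree sees a bootstrapped subsample, all features available
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
# Random Forest: bagging plus a random subset of features at each split
rf = RandomForestClassifier(n_estimators=50, random_state=0)

print(cross_val_score(bag, X, y, cv=5).mean())
print(cross_val_score(rf, X, y, cv=5).mean())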

CART (Classification And Regression Tree) in Machine Learning
CART (Classification And Regression Tree) is a variation of the
decision tree algorithm. It can handle both classification and
regression tasks. Scikit-Learn uses the Classification And Regression
Tree (CART) algorithm to train Decision Trees (also called “growing”
trees). CART was first produced by Leo Breiman, Jerome Friedman,
Richard Olshen, and Charles Stone in 1984.
CART Algorithm
CART is a predictive algorithm used in Machine learning and it explains
how the target variable’s values can be predicted based on other
matters. It is a decision tree where each fork is split into a predictor
variable and each node has a prediction for the target variable at the
end.
In the decision tree, nodes are split into sub-nodes on the basis of a
threshold value of an attribute. The root node is taken as the training
set and is split into two by considering the best attribute and threshold
value. Further, the subsets are also split using the same logic. This
continues till the last pure sub-set is found in the tree or the maximum
possible number of leaves in that growing tree is reached.
The CART algorithm works via the following process:
 The best split point of each input is obtained.
 Based on the best split points of each input in Step 1, the new
“best” split point is identified.
 Split the chosen input according to the “best” split point.
 Continue splitting until a stopping rule is satisfied or no further
desirable splitting is available.
CART algorithm uses Gini Impurity to split the dataset into a decision
tree .It does that by searching for the best homogeneity for the sub
nodes, with the help of the Gini index criterion.
Gini index/Gini impurity
The Gini index is a metric for the classification tasks in CART. It stores
the sum of squared probabilities of each class. It computes the degree
of probability of a specific variable that is wrongly being classified
when chosen randomly and a variation of the Gini coefficient. It works
on categorical variables, provides outcomes either “successful” or
“failure” and hence conducts binary splitting only.
The degree of the Gini index varies from 0 to 1:
 A value of 0 depicts that all the elements belong to a certain class, or
only one class exists there (a pure node).
 A value of 1 signifies that the elements are randomly distributed
across various classes.
 A value of 0.5 denotes that the elements are uniformly distributed
over two classes.
Mathematically, we can write Gini Impurity as follows:

Gini = 1 - \sum_{i} p_i^2

where p_i is the probability of an object being classified to a particular
class.
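
A two-line sketch of the formula with illustrative class proportions:

def gini_impurity(probs):
    # Gini = 1 - sum(p_i^2) over the class probabilities at a node
    return 1.0 - sum(p ** 2 for p in probs)

print(gini_impurity([1.0, 0.0]))  # 0.0: a pure node, only one class present
print(gini_impurity([0.5, 0.5]))  # 0.5: elements uniformly spread over two classes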
Classification tree
A classification tree is an algorithm where the target variable is
categorical. The algorithm is then used to identify the “Class” within
which the target variable is most likely to fall. Classification trees are
used when the dataset needs to be split into classes that belong to the
response variable(like yes or no)
Regression tree
A Regression tree is an algorithm where the target variable is
continuous and the tree is used to predict its value. Regression trees
are used when the response variable is continuous. For example, if the
response variable is the temperature of the day.
Pseudo-code of the CART algorithm
d = 0, endtree = 0
Node(0) = 1, Node(1) = 0, Node(2) = 0
while endtree < 1
    if Node(2^d - 1) + Node(2^d) + .... + Node(2^(d+1) - 2) = 2 - 2^(d+1)
        endtree = 1
    else
        do i = 2^d - 1, 2^d, .... , 2^(d+1) - 2
            if Node(i) > -1
                Split tree
            else
                Node(2i + 1) = -1
                Node(2i + 2) = -1
            end if
        end do
    end if
    d = d + 1
end while
CART model representation
CART models are formed by picking input variables and evaluating split
points on those variables until an appropriate tree is produced.
Steps to create a Decision Tree using the CART algorithm:
 Greedy algorithm: In this, the input space is divided using the
greedy method, which is known as recursive binary splitting. This
is a numerical method within which all of the values are lined up and
several split points are tried and assessed using a cost
function.
 Stopping Criterion: As it works its way down the tree with the
training data, the recursive binary splitting method described above
must know when to stop splitting. The most frequent halting method
is to utilize a minimum amount of training data allocated to every
leaf node. If the count is smaller than the specified threshold, the
split is rejected and also the node is considered the last leaf node.
 Tree pruning: Decision tree’s complexity is defined as the number
of splits in the tree. Trees with fewer branches are recommended as
they are simple to grasp and less prone to cluster the data. Working
through each leaf node in the tree and evaluating the effect of
deleting it using a hold-out test set is the quickest and simplest
pruning approach.
 Data preparation for the CART: No special data preparation is
required for the CART algorithm.
Advantages of CART
 Results are simplistic.
 Classification and regression trees are Nonparametric and
Nonlinear.
 Classification and regression trees implicitly perform feature
selection.
 Outliers have no meaningful effect on CART.
 It requires minimal supervision and produces easy-to-understand
models.
Limitations of CART
 Overfitting.
 High variance.
 Low bias.
 The tree structure may be unstable.
Applications of the CART algorithm
 For quick Data insights.
 In Blood Donors Classification.
 For environmental and ecological data.
 In the financial sectors.

Random Forest

Random Forest models use a forest of Decision Trees to make better decisions by combining
each tree's decisions: after aggregation, the most popular decision across the trees is
chosen. This technique of aggregating multiple results from similar processes is
called Ensembling.
The second component of the Random Forest pertains to another technique called Bagging.
Bagging differs from Ensembling because, in Bagging, the data is different for every model,
while in Ensembling, the different models are run on the same data.

In Bagging, a random sample with replacement is chosen multiple times to create a data sample.
These data samples are then used to train the model independently. After training all these
models, the majority vote is taken to find a better data estimate.

Random forests combine the concepts of Bagging and Ensembling to decide the best feature
splits and select subsets of the same. This algorithm is better than a single Decision Tree as it
reduces bias and the net variance, generating better predictions.

Bagging and Ensembling might seem like they help model the joint probability distribution, but
that is not the case. Understanding the difference between Generative and Discriminative models
can clear this confusion.

Classification Algorithm
The Classification algorithm is a Supervised Learning technique that is used
to identify the category of new observations on the basis of training data. In
Classification, a program learns from the given dataset or observations and
then classifies new observations into a number of classes or groups, such
as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can
be called targets/labels or categories.

Unlike regression, the output variable of Classification is a category, not a
value, such as "Green or Blue", "fruit or animal", etc. Since the Classification
algorithm is a Supervised learning technique, hence it takes labeled input
data, which means it contains input with the corresponding output.

In a classification algorithm, a discrete output function (y) is mapped to an
input variable (x):

y = f(x), where y = categorical output

The best example of an ML classification algorithm is Email Spam


Detector.

The main goal of the Classification algorithm is to identify the category of a


given dataset, and these algorithms are mainly used to predict the output for
the categorical data.

Classification algorithms can be better understood using the below diagram.


In the below diagram, there are two classes, class A and Class B. These
classes have features that are similar to each other and dissimilar to other
classes.

The algorithm which implements the classification on a dataset is known as a


classifier. There are two types of Classifications:

o Binary Classifier: If the classification problem has only two possible


outcomes, then it is called as Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG,
etc.
o Multi-class Classifier: If a classification problem has more than two
outcomes, then it is called as Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.

Learners in Classification Problems:


In the classification problems, there are two types of learners:

1. Lazy Learners: A lazy learner first stores the training dataset and waits until
it receives the test dataset. In the lazy learner's case, classification is done on the
basis of the most related data stored in the training dataset. It takes less time
in training but more time for predictions.
Example: K-NN algorithm, Case-based reasoning
2. Eager Learners: Eager learners develop a classification model based on a
training dataset before receiving a test dataset. Opposite to lazy learners,
eager learners take more time in learning, and less time in
prediction. Example: Decision Trees, Naïve Bayes, ANN.

Types of ML Classification Algorithms:


Classification Algorithms can be mainly divided into the following two categories:

o Linear Models
    o Logistic Regression
    o Support Vector Machines
o Non-linear Models
    o K-Nearest Neighbours
    o Kernel SVM
    o Naïve Bayes
    o Decision Tree Classification
    o Random Forest Classification

Evaluating a Classification model:


Once our model is completed, it is necessary to evaluate its performance,
whether it is a Classification or a Regression model. For evaluating a
Classification model, we have the following ways:

1. Log Loss or Cross-Entropy Loss:

o It is used for evaluating the performance of a classifier, whose output is a


probability value between the 0 and 1.
o For a good binary Classification model, the value of log loss should be near to
0.
o The value of log loss increases if the predicted value deviates from the actual
value.
o The lower log loss represents the higher accuracy of the model.
o For Binary classification, cross-entropy can be calculated as:

−(y log(p) + (1 − y) log(1 − p))

where y = actual output, p = predicted output.

2. Confusion Matrix:

o The confusion matrix provides a matrix/table as output that describes the performance of the model.
o It is also known as the error matrix.
o The matrix summarizes the prediction results, giving the total numbers of correct and incorrect predictions. It looks like the table below:

                      Actual Positive    Actual Negative
Predicted Positive    True Positive      False Positive
Predicted Negative    False Negative     True Negative
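
A minimal sketch using scikit-learn (note that scikit-learn places actual classes on the rows and predicted classes on the columns, i.e., the transpose of the table above; the labels are illustrative):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(confusion_matrix(y_true, y_pred))
# [[1 1]    row = actual 0: 1 true negative, 1 false positive
#  [1 3]]   row = actual 1: 1 false negative, 3 true positives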

3. AUC-ROC curve:

o ROC stands for Receiver Operating Characteristic curve, and AUC stands for Area Under the Curve.
o It is a graph that shows the performance of the classification model at different classification thresholds.
o The AUC-ROC curve can also be used to visualize the performance of a multi-class classification model.
o The ROC curve is plotted with TPR (True Positive Rate) on the Y-axis and FPR (False Positive Rate) on the X-axis.
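
A minimal sketch, assuming scikit-learn (the scores and labels below are illustrative):

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities for the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_scores)  # points of the ROC curve
print("AUC:", roc_auc_score(y_true, y_scores))      # 0.75 here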

Use cases of Classification Algorithms


Classification algorithms can be used in different places. Below are some
popular use cases of Classification Algorithms:

o Email Spam Detection
o Speech Recognition
o Identification of Cancer Tumor Cells
o Drugs Classification
o Biometric Identification, etc.

UNIT-3

Unsupervised learning:
In unsupervised learning, the model is provided with a set of unlabelled data, which it is required to analyze in order to find patterns. Examples are dimensionality reduction and clustering.

Clustering in Machine Learning


Clustering Algorithms
Clustering algorithms can be divided based on their underlying models. Many clustering algorithms have been published, but only a few are commonly used. The choice of clustering algorithm depends on the kind of data we are using: some algorithms require us to specify the number of clusters in the dataset, whereas others work by finding the minimum distance between observations of the dataset.

Here we are discussing mainly popular Clustering algorithms that are widely
used in machine learning:

1. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It partitions the dataset by dividing the samples into clusters of roughly equal variance. The number of clusters must be specified in advance. It is fast, requiring relatively few computations, with linear complexity O(n). (A minimal usage sketch follows this list.)
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in
the smooth density of data points. It is an example of a centroid-based
model, that works on updating the candidates for centroid to be the center of
the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of
Applications with Noise. It is an example of a density-based model similar
to the mean-shift, but with some remarkable advantages. In this algorithm,
the areas of high density are separated by the areas of low density. Because
of this, the clusters can be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an alternative to the k-means algorithm, or for cases where k-means can fail. In GMM, it is assumed that the data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical
algorithm performs the bottom-up hierarchical clustering. In this, each data
point is treated as a single cluster at the outset and then successively
merged. The cluster hierarchy can be represented as a tree-structure.
6. Affinity Propagation: It differs from other clustering algorithms in that it does not require the number of clusters to be specified. In this algorithm, pairs of data points exchange messages until convergence. Its O(N²T) time complexity is the main drawback of this algorithm.
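
A minimal k-means sketch, assuming scikit-learn (the toy points are illustrative):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # k must be given up front
print(km.labels_)           # cluster index assigned to each sample
print(km.cluster_centers_)  # the two centroids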

Applications of Clustering
Below are some commonly known applications of clustering technique in
Machine Learning:

o In Identification of Cancer Cells: Clustering algorithms are widely used for the identification of cancerous cells. They divide the cancerous and non-cancerous data sets into different groups.
o In Search Engines: Search engines also work on the clustering technique.
The search result appears based on the closest object to the search query. It
does it by grouping similar data objects in one group that is far from the
other dissimilar objects. The accurate result of a query depends on the
quality of the clustering algorithm used.
o Customer Segmentation: It is used in market research to segment the
customers based on their choice and preferences.
o In Biology: It is used in the biology stream to classify different species of
plants and animals using the image recognition technique.
o In Land Use: The clustering technique is used to identify areas of similar land use in a GIS database, which is very useful for determining the purpose for which a particular piece of land is most suitable.

Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It
is also known as the centroid-based method. The most common example
of partitioning clustering is the K-Means Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where K defines the number of pre-defined groups. Cluster centers are created in such a way that the distance between the data points of a cluster and their own centroid is minimal compared with the distance to other cluster centroids.

Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitioning clustering, as it does not require pre-specifying the number of clusters. In this technique, the dataset is organized into clusters forming a tree-like structure, also called a dendrogram. Any number of clusters can then be selected by cutting the tree at the appropriate level. The most common example of this method is the Agglomerative Hierarchical algorithm.
Cluster validity:

The term cluster validation designates the procedure of evaluating the goodness of clustering algorithm results. This is important to avoid finding patterns in random data, as well as in situations where you want to compare two clustering algorithms.
Generally, clustering validation statistics can be categorized into 3 classes

1. Internal cluster validation, which uses the internal information of the clustering process to
evaluate the goodness of a clustering structure without reference to external information. It
can be also used for estimating the number of clusters and the appropriate clustering
algorithm without any external data.
2. External cluster validation, which consists in comparing the results of a cluster analysis to
an externally known result, such as externally provided class labels. It measures the extent to
which cluster labels match externally supplied class labels. Since we know the “true” cluster
number in advance, this approach is mainly used for selecting the right clustering algorithm
for a specific data set.
3. Relative cluster validation, which evaluates the clustering structure by varying different
parameter values for the same algorithm (e.g.,: varying the number of clusters k). It’s
generally used for determining the optimal number of clusters.
Internal measures for cluster validation

In this section, we describe the most widely used clustering validation indices. Recall that the goal of partitioning clustering algorithms is to split the data set into clusters of objects such that:
 the objects in the same cluster are as similar as possible,
 and the objects in different clusters are highly distinct.
That is, we want the average distance within a cluster to be as small as possible, and the average distance between clusters to be as large as possible.

Internal validation measures often reflect the compactness, the connectedness, and the separation of the cluster partitions.
1. Compactness or cluster cohesion: Measures how close the objects within the same cluster are. A lower within-cluster variation is an indicator of good compactness (i.e., good clustering). The different indices for evaluating the compactness of clusters are based on distance measures, such as the cluster-wise average/median distances between observations.

2. Separation: Measures how well-separated a cluster is from other clusters. The indices used as separation measures include:
 distances between cluster centers,
 the pairwise minimum distances between objects in different clusters.

3. Connectivity: Corresponds to the extent to which items are placed in the same cluster as their nearest neighbors in the data space. Connectivity has a value between 0 and infinity and should be minimized.
Generally, most of the indices used for internal clustering validation combine compactness and separation measures as follows:

Index=(α×Separation)/(β×Compactness)

Where α and β are weights.


In this section, we’ll describe the two commonly used indices for assessing the goodness of
clustering: the silhouette width and the Dunn index. These internal measure can be used also
to determine the optimal number of clusters in the data.
Silhouette coefficient

The silhouette analysis measures how well an observation is clustered and it estimates
the average distance between clusters. The silhouette plot displays a measure of how close
each point in one cluster is to points in the neighboring clusters.
For each observation i, the silhouette width S_i is calculated as follows:
1. For each observation i, calculate the average dissimilarity a_i between i and all other points of the cluster to which i belongs.
2. For all other clusters C to which i does not belong, calculate the average dissimilarity d(i, C) of i to all observations of C. The smallest of these is defined as b_i = min_C d(i, C). The value of b_i can be seen as the dissimilarity between i and its "neighbor" cluster, i.e., the nearest one to which it does not belong.
3. Finally, the silhouette width of observation i is defined by the formula: S_i = (b_i − a_i) / max(a_i, b_i)
Silhouette width can be interpreted as follows:
 Observations with a large S_i (close to 1) are very well clustered.
 A small S_i (around 0) means that the observation lies between two clusters.
 Observations with a negative S_i are probably placed in the wrong cluster.
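
A minimal sketch of computing silhouette widths, assuming scikit-learn:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("mean silhouette width:", silhouette_score(X, labels))  # near 1 is good
print("first per-observation widths:", silhouette_samples(X, labels)[:5])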
Dunn index

The Dunn index is another internal clustering validation measure, which can be computed as follows:
1. For each cluster, compute the distance between each of the objects in the cluster and the objects in the other clusters.
2. Use the minimum of these pairwise distances as the inter-cluster separation (min.separation).
3. For each cluster, compute the distances between the objects in the same cluster.
4. Use the maximal intra-cluster distance (i.e., the maximum diameter) as the intra-cluster compactness.
5. Calculate the Dunn index (D) as follows:
D = min.separation / max.diameter

If the data set contains compact and well-separated clusters, the diameter of the clusters is
expected to be small and the distance between the clusters is expected to be large. Thus,
Dunn index should be maximized.
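
The Dunn index is not built into scikit-learn, so the sketch below computes it directly from the definition above (assuming NumPy and SciPy; the function name dunn_index is ours):

import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    clusters = [X[labels == k] for k in np.unique(labels)]
    # min.separation: smallest distance between points of different clusters
    min_sep = min(cdist(a, b).min()
                  for i, a in enumerate(clusters)
                  for b in clusters[i + 1:])
    # max.diameter: largest pairwise distance within any single cluster
    max_diam = max(pdist(c).max() for c in clusters if len(c) > 1)
    return min_sep / max_diam

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])
print(dunn_index(X, labels))  # large value: compact, well-separated clusters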

External measures for clustering validation

The aim is to compare the identified clusters (by k-means, pam or hierarchical clustering) to
an external reference.
It’s possible to quantify the agreement between partitioning clusters and external reference
using either the corrected Rand index and Meila’s variation index VI, which are implemented
in the R function cluster.stats()[fpc package].
The corrected Rand index varies from -1 (no agreement) to 1 (perfect agreement).
External clustering validation, can be used to select suitable clustering algorithm for a given
data set.
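
A Python counterpart of the corrected (adjusted) Rand index is available in scikit-learn (a minimal sketch; the labelings are illustrative):

from sklearn.metrics import adjusted_rand_score

true_labels    = [0, 0, 1, 1, 2, 2]
cluster_labels = [1, 1, 0, 0, 2, 2]  # same grouping, different label names
print(adjusted_rand_score(true_labels, cluster_labels))  # 1.0 -> perfect agreement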

Dimensionality Reduction
The number of input features, variables, or columns present in a given
dataset is known as dimensionality, and the process to reduce these features
is called dimensionality reduction.

In various cases, a dataset contains a huge number of input features, which makes the predictive modeling task more complicated. Because it is very difficult to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction techniques need to be used in such cases.

A dimensionality reduction technique can be defined as "a way of converting a higher-dimensional dataset into a lower-dimensional dataset while ensuring that it provides similar information." These techniques are widely used in machine learning for obtaining a better-fitting predictive model while solving classification and regression problems.

It is commonly used in the fields that deal with high-dimensional data, such
as speech recognition, signal processing, bioinformatics, etc. It can
also be used for data visualization, noise reduction, cluster analysis,
etc.
The Curse of Dimensionality
Handling high-dimensional data is very difficult in practice, a problem commonly known as the curse of dimensionality. As the dimensionality of the input dataset increases, any machine learning algorithm and model becomes more complex. Moreover, as the number of features grows, the number of samples needed to cover the feature space grows with it, and the chance of overfitting increases. A machine learning model trained on high-dimensional data therefore tends to overfit and gives poor performance.

Hence, it is often required to reduce the number of features, which can be done with dimensionality reduction.

Benefits of applying Dimensionality Reduction


Some benefits of applying dimensionality reduction technique to the given
dataset are given below:

o By reducing the dimensions of the features, the space required to store the
dataset also gets reduced.
o Less computation/training time is required for reduced dimensions of features.
o Reduced dimensions of features of the dataset help in visualizing the data
quickly.
o It removes the redundant features (if present) by taking care of
multicollinearity.

Disadvantages of dimensionality Reduction


There are also some disadvantages of applying the dimensionality reduction,
which are given below:

o Some data may be lost due to dimensionality reduction.


o In the PCA dimensionality reduction technique, the number of principal components to retain is sometimes unknown in advance.

Approaches of Dimension Reduction


There are two ways to apply the dimension reduction technique, which are
given below:

Feature Selection
Feature selection is the process of selecting the subset of the relevant
features and leaving out the irrelevant features present in a dataset to build
a model of high accuracy. In other words, it is a way of selecting the optimal
features from the input dataset.

Three methods are used for the feature selection:

1. Filters Methods

In this method, the dataset is filtered, and a subset that contains only the
relevant features is taken. Some common techniques of filters method are:

o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.

2. Wrappers Methods
The wrapper method has the same goal as the filter method, but it uses a machine learning model for its evaluation. In this method, some features are fed to the ML model and the performance is evaluated. The performance decides whether to add or remove those features to increase the accuracy of the model. This method is more accurate than the filtering method but more complex to work with. Some common techniques of wrapper methods are:

o Forward Selection
o Backward Selection
o Bi-directional Elimination

3. Embedded Methods: Embedded methods evaluate the importance of each feature across the different training iterations of the machine learning model itself. Some common techniques of embedded methods are (a sketch follows this list):

o LASSO
o Elastic Net
o Ridge Regression, etc.
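
A minimal sketch of the embedded idea with LASSO, assuming scikit-learn and NumPy (the synthetic data is illustrative):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)  # only 2 features matter

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)  # coefficients of the irrelevant features are driven to (near) zero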

Feature Extraction:
Feature extraction is the process of transforming a space with many dimensions into a space with fewer dimensions. This approach is useful when we want to retain the whole of the information while using fewer resources to process it.

Some common feature extraction techniques are:

a. Principal Component Analysis


b. Linear Discriminant Analysis
c. Kernel PCA
d. Quadratic Discriminant Analysis

Common techniques of Dimensionality Reduction


a. Principal Component Analysis
b. Backward Elimination
c. Forward Selection
d. Score comparison
e. Missing Value Ratio
f. Low Variance Filter
g. High Correlation Filter
h. Random Forest
i. Factor Analysis
j. Auto-Encoder

Principal Component Analysis (PCA)


Principal Component Analysis is a statistical process that converts the
observations of correlated features into a set of linearly uncorrelated
features with the help of orthogonal transformation. These new transformed
features are called the Principal Components. It is one of the popular tools
that is used for exploratory data analysis and predictive modeling.

PCA works by considering the variance of each attribute, because high variance indicates a good separation between classes; it reduces the dimensionality while keeping the directions of highest variance. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing power allocation in various communication channels. (A minimal usage sketch follows.)
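
A minimal PCA sketch, assuming scikit-learn and its bundled iris dataset:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)           # keep the two directions of highest variance
X_reduced = pca.fit_transform(X)    # 4 features -> 2 principal components

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component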

Backward Feature Elimination


The backward feature elimination technique is mainly used while developing linear regression or logistic regression models. The following steps are performed in this technique to reduce dimensionality (i.e., for feature selection):

o First, all n variables of the given dataset are used to train the model.
o The performance of the model is checked.
o We then remove one feature at a time, training the model n times on n-1 features, and compute the performance of the model each time.
o We find the variable whose removal made the smallest (or no) change in the performance of the model and drop it; after that, we are left with n-1 features.
o Repeat the complete process until no further feature can be dropped.

In this technique, by selecting the optimum performance of the model and the maximum tolerable error rate, we can define the optimal number of features required for the machine learning algorithm. (A sketch follows.)
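
A minimal sketch, assuming scikit-learn, whose SequentialFeatureSelector with direction="backward" automates exactly this drop-one-at-a-time procedure:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Drop one feature at a time, keeping the subset that hurts performance least.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="backward",
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the retained features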

Recommendation systems
Product recommendation is a popular application of machine learning that
aims to personalize the customer shopping experience. By analyzing
customer behavior, preferences, and purchase history, a recommendation
engine can suggest products more likely to interest a particular customer.

The task of proposing a product or products to a consumer based on their purchasing history is known as "product recommendation" in machine learning. A product recommender system is a machine learning model that suggests products, content, or services to a specific consumer; such a system can, for example, be built using data from Amazon's product co-purchasing network.

Different product recommendation algorithms can be used to generate personalized product recommendations. One popular approach is collaborative filtering, which makes recommendations based on the behavior and preferences of similar users. For example, if two customers have purchased similar products in the past, the algorithm may suggest similar products to both customers. (A minimal sketch of the idea follows.)
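
A minimal sketch of the idea with user-user cosine similarity, assuming NumPy (the ratings matrix is illustrative; a real system would handle missing ratings more carefully):

import numpy as np

# Rows = users, columns = items; 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_sim(ratings[0], ratings[1]))  # high: users 0 and 1 rated similar items
print(cosine_sim(ratings[0], ratings[2]))  # low: user 2 liked different items
# Recommend to user 0 the items that the most similar user (user 1) rated
# highly but that user 0 has not rated yet.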

Another approach is content-based filtering, which makes recommendations based on the products' attributes. For example, if a customer has purchased a particular clothing brand, the algorithm may suggest other products from the same brand.

A more advanced approach is hybrid recommendation, combining the strengths of collaborative and content-based filtering. The hybrid approach considers both the behavior and preferences of similar users and the attributes of the products themselves, which can result in more accurate and relevant recommendations.

A recommendation engine must first be trained on a dataset of customer behavior and product information to generate personalized product recommendations. This dataset can include purchase history, browsing history, and customer ratings and reviews.

Once the recommendation engine has been trained, it can generate recommendations for individual customers. The recommendations can be presented in various ways, such as a ranked list of products or personalized product suggestions shown while browsing.

Using product recommendations in the e-commerce industry is becoming increasingly popular as a way to increase sales and customer satisfaction. A recommendation engine can increase the chances that a customer will make a purchase by suggesting products that are more likely to be of interest to that particular customer.
One of the key benefits of product recommendation is that it can help
increase a customer's average order value (AOV). A recommendation engine
can increase the number of items that a customer purchases during a single
transaction by suggesting additional products that are likely to be of interest
to a customer.

In addition to increasing sales and AOV, product recommendations can help improve the customer experience. By suggesting products that are more likely to interest a particular customer, a recommendation engine can save customers time and effort when browsing for products.

Moreover, it can also help to increase customer loyalty and retention. By suggesting relevant products, a recommendation engine can help build a stronger relationship with the customer and increase the chances that the customer will return to the website in the future.

Recommender systems are used in various contexts, including movies, music, news, books, research articles, search queries, social tagging, and items in general, and they have grown in popularity in recent years. The bulk of today's e-commerce sites, including eBay, Amazon, Alibaba, etc., employ proprietary recommendation algorithms to better match customers with the goods they are likely to like. These algorithms are mostly used in the digital space.

Another important aspect of product recommendation is the ability to handle the cold-start problem. A cold-start problem occurs when a new customer visits a website and the recommendation engine does not yet have enough information about the customer to make personalized recommendations.

A hybrid recommender system uses several different recommendation methods to produce its output. Suggestion accuracy is typically greater in hybrid recommender systems than in purely collaborative or content-based systems, because the cross-user behavioral knowledge exploited by collaborative filtering complements the knowledge of item attributes and individual preferences captured by content-based systems.

Both sources of knowledge work together to improve suggestions. Exploring novel approaches that integrate content data and user-activity data into a single algorithm is especially intriguing, given this growth in available knowledge.

One way to handle the cold-start problem is to use a hybrid approach that
combines content-based filtering and demographic information. For example,
suppose a new customer is browsing for men's clothing. In that case, the
recommendation engine can suggest products based on the most popular
men's clothing items and the customer's age and location.
Types of recommendation systems
There are several types of recommendation systems in machine learning,
including:

Content-based filtering: Recommends items based on their similarity to items the user has previously liked.

Collaborative filtering: Recommends items based on the preferences of similar users.

Hybrid: Combines both content-based and collaborative filtering to make recommendations.

Hybrid with memory-based and model-based: Memory-based recommendation makes recommendations based on the similarity between items and the users' past behavior, whereas model-based recommendation uses machine learning algorithms to model user behavior and make recommendations.

Hybrid with demographic and user-based: Demographic-based recommendation makes recommendations based on user demographic information, and user-based recommendation makes recommendations based on the similarity of users.

Hybrid with demographic and item-based: Demographic-based recommendation makes recommendations based on user demographic information, and item-based recommendation makes recommendations based on the similarity of items.


EM algorithm
The Expectation-Maximization (EM) algorithm is defined as the combination
of various unsupervised machine learning algorithms, which is used to
determine the local maximum likelihood estimates (MLE) or maximum
a posteriori estimates (MAP) for unobservable variables in statistical
models. Further, it is a technique to find maximum likelihood estimation
when the latent variables are present. It is also referred to as the latent
variable model.

A latent variable model consists of both observable and unobservable variables, where observable variables can be measured directly while unobservable variables are inferred from the observed ones. These unobservable variables are known as latent variables.

Key Points:

o It is known as a latent variable model and is used to determine MLE and MAP parameter estimates for latent variables.
o It is used to predict values of parameters in instances where data is missing or unobservable, and this is repeated until the values converge.

EM Algorithm
The EM algorithm is the combination of various unsupervised ML algorithms,
such as the k-means clustering algorithm. Being an iterative approach, it
consists of two modes. In the first mode, we estimate the missing or latent
variables. Hence it is referred to as the Expectation/estimation step (E-
step). Further, the other mode is used to optimize the parameters of the
models so that it can explain the data more clearly. The second mode is
known as the maximization-step or M-step.
o Expectation step (E - step): It involves the estimation (guess) of all
missing values in the dataset so that after completing this step, there should
not be any missing value.
o Maximization step (M - step): This step involves the use of estimated data
in the E-step and updating the parameters.
o Repeat E-step and M-step until the convergence of the values occurs.

The primary goal of the EM algorithm is to use the available observed data of
the dataset to estimate the missing data of the latent variables and then use
that data to update the values of the parameters in the M-step.

What is Convergence in the EM algorithm?


Convergence is defined as the specific situation in probability based
on intuition, e.g., if there are two random variables that have very less
difference in their probability, then they are known as converged. In other
words, whenever the values of given variables are matched with each other,
it is called convergence.

Steps in EM Algorithm
The EM algorithm is completed mainly in 4 steps, which include
Initialization Step, Expectation Step, Maximization Step, and
convergence Step. These steps are explained as follows:
o 1st Step: The very first step is to initialize the parameter values. The system is provided with incomplete observed data, with the assumption that the data comes from a specific model.
o 2nd Step: This step is the Expectation or E-step, which is used to estimate or guess the values of the missing or incomplete data using the observed data. The E-step primarily updates the variables.
o 3rd Step: This step is the Maximization or M-step, where we use the complete data obtained from the 2nd step to update the parameter values. The M-step primarily updates the hypothesis.
o 4th Step: The last step is to check whether the values of the latent variables are converging. If yes, stop the process; else, repeat from step 2 until convergence occurs. (A small sketch follows this list.)
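
As a concrete illustration, the sketch below fits a two-component Gaussian mixture with scikit-learn's GaussianMixture, which runs the E- and M-steps internally (an assumption; the document does not name a library):

import numpy as np
from sklearn.mixture import GaussianMixture

# Two overlapping 1-D Gaussians with unknown means: the classic EM setting.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 300)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_.ravel())  # recovered component means, near 0 and 5
print(gmm.converged_)      # True once the E/M loop has converged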

Reinforcement Learning
o Reinforcement Learning is a feedback-based Machine learning technique in
which an agent learns to behave in an environment by performing the actions
and seeing the results of actions. For each good action, the agent gets
positive feedback, and for each bad action, the agent gets negative feedback
or penalty.
o In reinforcement learning, the agent learns automatically using feedback, without any labeled data, unlike supervised learning.
o Since there is no labeled data, the agent is bound to learn from its experience only.
o RL solves a specific type of problem where decision making is sequential, and
the goal is long-term, such as game-playing, robotics, etc.
o The agent interacts with the environment and explores it by itself. The
primary goal of an agent in reinforcement learning is to improve the
performance by getting the maximum positive rewards.
o The agent learns through a process of trial and error, and based on its experience, it learns to perform the task in a better way. Hence, we can say that "Reinforcement learning is a type of machine learning method where an intelligent agent (computer program) interacts with the environment and learns to act within it." How a robotic dog learns the movement of its arms is an example of reinforcement learning.
o It is a core part of artificial intelligence, and AI agents often work on the concept of reinforcement learning. Here we do not need to pre-program the agent, as it learns from its own experience without any human intervention.
o Example: Suppose there is an AI agent present within a maze environment, and its goal is to find the diamond. The agent interacts with the environment by performing some actions; based on those actions, the state of the agent changes, and it also receives a reward or penalty as feedback.
o The agent continues doing these three things (take an action, change state or remain in the same state, and get feedback), and by doing so, it learns and explores the environment.
o The agent learns which actions lead to positive feedback or rewards and which actions lead to negative feedback or penalties. As a positive reward, the agent gets a positive point, and as a penalty, it gets a negative point.
Elements of Reinforcement Learning
There are four main elements of Reinforcement Learning, which are given
below:

1. Policy
2. Reward Signal
3. Value Function
4. Model of the environment

1) Policy: A policy defines the way an agent behaves at a given time. It maps the perceived states of the environment to the actions to be taken in those states. The policy is the core element of RL, as it alone can define the behavior of the agent. In some cases it may be a simple function or a lookup table, whereas in other cases it may involve general computation such as a search process. It can be deterministic or stochastic:

For a deterministic policy: a = π(s)
For a stochastic policy: π(a | s) = P[A_t = a | S_t = s]
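
A minimal sketch of the two kinds of policy in plain Python with NumPy (the state and action names are hypothetical):

import numpy as np

# Deterministic policy: a lookup from state to action, a = pi(s).
deterministic_pi = {"s0": "left", "s1": "right"}

# Stochastic policy: a distribution over actions, pi(a|s) = P[A_t = a | S_t = s].
stochastic_pi = {"s0": {"left": 0.8, "right": 0.2}}

def sample_action(pi, s, rng=np.random.default_rng(0)):
    actions, probs = zip(*pi[s].items())
    return rng.choice(list(actions), p=list(probs))

print(deterministic_pi["s0"])              # always "left"
print(sample_action(stochastic_pi, "s0"))  # "left" about 80% of the time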
2) Reward Signal: The goal of reinforcement learning is defined by the
reward signal. At each state, the environment sends an immediate signal to
the learning agent, and this signal is known as a reward signal. These
rewards are given according to the good and bad actions taken by the agent. The agent's main objective is to maximize the total reward it receives for good actions. The reward signal can change the policy; for example, if an action selected by the agent leads to a low reward, the policy may change to select other actions in the future.

3) Value Function: The value function gives information about how good
the situation and action are and how much reward an agent can expect. A
reward indicates the immediate signal for each good and bad action,
whereas a value function specifies the good state and action for the
future. The value function depends on the reward as, without reward, there
could be no value. The goal of estimating values is to achieve more rewards.

4) Model: The last element of reinforcement learning is the model, which mimics the behavior of the environment. With the help of the model, one can make inferences about how the environment will behave; for example, given a state and an action, the model can predict the next state and reward.

The model is used for planning, which means it provides a way to take a
course of action by considering all future situations before actually
experiencing those situations. The approaches for solving the RL
problems with the help of the model are termed as the model-based
approach. Comparatively, an approach without using a model is called
a model-free approach.

Model-Based Machine Learning


Hundreds of learning algorithms have been developed in the field of machine learning. Scientists typically select from among these algorithms to address specific problems, and their options are frequently restricted by their familiarity with those algorithms. In this classical/traditional machine learning framework, scientists are forced to make assumptions in order to employ an existing algorithm.

 Model-based machine learning (MBML) is a technique that tries to generate a custom solution for each new problem.
MBML's purpose is to offer a single development framework that facilitates the building of a diverse variety of custom models. This paradigm evolved as a result of a significant confluence of three main ideas:
 Factor graphs
 Bayesian perspective
 Probabilistic programming
The essential principle is that all assumptions about the problem domain are made explicit in the form of a model. A model, in this sense, is just a collection of assumptions stated in a graphical manner.

Factor Graphs

The usage of PGMs (Probabilistic Graphical Models), particularly factor graphs, is the pillar of MBML. A PGM is a graph-based diagrammatic representation of the joint probability distribution across all random variables in a model.

Factor graphs are a form of PGM in which round nodes represent random variables, square nodes represent probability distributions (factors), and edges express the conditional relationships between nodes. They offer a broad framework for modeling the joint distribution of a set of random variables.

In factor graphs, we treat unknown parameters as random variables and discover their probability distributions throughout the network using Bayesian inference techniques. Inference/learning is just the product of factors over a subset of the graph's variables, which makes it simple to develop local message-passing algorithms.

Bayesian Methods

The first essential concept enabling this new machine learning architecture is Bayesian inference/learning. Latent/hidden parameters are represented in MBML as random variables with probability distributions, which provides a consistent and principled approach to quantifying uncertainty in model parameters. When the observed variables in the model are fixed to their observed values, Bayes' theorem is used to update the previously assumed probability distributions.
In contrast, the classical ML framework assigns model parameters point values derived by maximizing an objective function. Bayesian inference on big models with millions of variables is accomplished in the same way in principle, but in a more complicated manner, because exact application of Bayes' theorem is intractable on huge datasets. The rise in the processing capacity of computers over the last decade has enabled the research and innovation of algorithms that can scale to enormous data sets.

Probabilistic Programming

Probabilistic programming (PP) is a breakthrough in computer science in which programming languages are designed to compute with uncertainty in addition to logic. Such languages can handle random variables, constraints on variables, and inference packages. You can now express a model-based formulation of your problem concisely in a few lines of code using a PP language; an inference engine is then invoked to produce inference procedures that solve the problem automatically.

Model-Based ML Developmental Stages


Model-based ML development consists of three stages:

 Describe the model: Using factor graphs, describe the process that generated the data.
 Condition on observed data: Set the observed variables equal to their known values.
 Perform backward reasoning: Update the prior distributions over the latent constructs or parameters; that is, estimate the Bayesian probability distributions of the latent constructs given the observed variables.

Temporal difference learning


Temporal Difference Learning is an unsupervised learning technique that is
very commonly used in reinforcement learning for the purpose of predicting
the total reward expected over the future. They can, however, be used to
predict other quantities as well. It is essentially a way to learn how to predict a
quantity that is dependent on the future values of a given signal. It is a method
that is used to compute the long-term utility of a pattern of behaviour from a
series of intermediate rewards.
Essentially, Temporal Difference Learning (TD Learning) focuses on
predicting a variable's future value in a sequence of states. Temporal
difference learning was a major breakthrough in solving the problem of reward
prediction. You could say that it employs a mathematical trick that allows it to
replace complicated reasoning with a simple learning procedure that can be
used to generate the very same results.
The trick is that rather than attempting to calculate the total future reward,
temporal difference learning just attempts to predict the combination of
immediate reward and its own reward prediction at the next moment in time.
Now when the next moment comes and brings fresh information with it, the
new prediction is compared with the expected prediction. If these two
predictions are different from each other, the Temporal Difference Learning
algorithm will calculate how different the predictions are from each other and
make use of this temporal difference to adjust the old prediction toward the
new prediction.
The temporal difference algorithm always aims to bring the expected
prediction and the new prediction together, thus matching expectations with
reality and gradually increasing the accuracy of the entire chain of prediction.
Temporal Difference Learning aims to predict a combination of the immediate
reward and its own reward prediction at the next moment in time.
In TD learning, the training signal for a prediction is a future prediction. The method is a combination of the Monte Carlo (MC) method and the Dynamic Programming (DP) method: Monte Carlo methods adjust their estimates only after the final outcome is known, whereas temporal difference methods adjust predictions to match later, more accurate predictions about the future, well before the final outcome is known. This is essentially a form of bootstrapping.
Temporal difference learning in machine learning got its name from the way it
uses changes, or differences, in predictions over successive time steps for the
purpose of driving the learning process.
The prediction at any particular time step gets updated to bring it nearer to the
prediction of the same quantity at the next time step.
What are the parameters used in temporal difference learning?

 Alpha (α): the learning rate.
It shows how much our estimates should be adjusted based on the error. This rate varies between 0 and 1.
 Gamma (γ): the discount rate.
This indicates how much future rewards are valued. A larger discount rate signifies that future rewards are valued to a greater extent. The discount rate also varies between 0 and 1.
 Epsilon (ε): the ratio reflecting exploration vs. exploitation.
The agent explores new options with probability ε and stays with the current best option with probability 1 − ε. A larger ε signifies that more exploration is carried out during training.

How is temporal difference learning used in neuroscience?
Around the late 1980s and the early 1990s, neuroscientists were trying to
understand the manner in which dopamine neurons behave. These dopamine
neurons are clustered in the mid-brain, but they send projections to several
areas of the brain, potentially even broadcasting some globally relevant
messages. It was obvious that the firing of these neurons was related to rewards in some way, but their responses also depended on sensory input and changed as the animals gained more experience in a particular task.
Luckily, some researchers were familiar with recent developments in both neuroscience and artificial intelligence. They noticed that the responses of some dopamine neurons represented reward prediction errors: their firing signified the points when the animal received a greater or lesser reward than it was trained to expect.
The firing rate of the dopamine cells did not increase when the animal received the predicted reward, but it fell below normal activation levels when the reward was less than expected.
This very closely mimics the way in which the error function in temporal
difference is used for reinforcement learning.
These researchers therefore proposed that the brain uses a temporal difference algorithm: a reward prediction error is calculated, broadcast to the brain through the dopamine signal, and employed to drive learning.
Since then, the reward prediction error theory has been tested and validated in thousands of experiments, and it has become one of the most successful quantitative theories in neuroscience.
The relationship between the temporal difference model and potential
neurological function has generated research that attempts to use temporal
difference to explain several aspects of behavioural research. Temporal
difference learning in machine learning has also been utilized to study and
understand conditions like schizophrenia or the consequences of
pharmacological manipulations of dopamine on learning.

Benefits of temporal difference learning

The advantages of temporal difference learning in machine learning are:

 TD learning methods are able to learn in each step, online or offline.


 These methods are capable of learning from incomplete sequences,
which means that they can also be used in continuous problems.
 Temporal difference learning can function in non-terminating
environments.
 TD learning has less variance than the Monte Carlo method, because it depends on a single random action, transition, and reward.
 It tends to be more efficient than the Monte Carlo method.
 Temporal Difference Learning exploits the Markov property, which
makes it more effective in Markov environments.

Disadvantages of temporal difference learning

There are two main disadvantages:

 It has greater sensitivity towards the initial value.


 It is a biased estimation.

What is the temporal difference error?


The TD error arises in various forms throughout reinforcement learning. The quantity

δ_t = r_{t+1} + γ·V(s_{t+1}) − V(s_t)

is commonly called the TD error. It is the difference between the current estimate V(s_t) and the target formed from the actual reward received in transitioning from s_t to s_{t+1} plus the discounted value estimate of the next state. The TD error at each time step is the error in the estimate made at that time. Because the TD error at step t relies on the next state and next reward, it is not available until step t + 1. When we update the value function with the TD error, this is called a backup. The TD error is related to the Bellman equation.
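
A minimal TD(0) sketch in plain Python (the states, reward, and parameter values are illustrative):

alpha, gamma = 0.1, 0.9          # learning rate and discount rate, as above
V = {"s0": 0.0, "s1": 0.0}       # table of state-value estimates

s, r, s_next = "s0", 1.0, "s1"   # one observed transition
td_error = r + gamma * V[s_next] - V[s]  # delta_t = r_{t+1} + gamma*V(s_{t+1}) - V(s_t)
V[s] += alpha * td_error                 # the "backup": move V(s) toward the target
print(V["s0"])                           # 0.1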

Different algorithms in temporal difference learning

There are predominantly three categories of TD algorithms, which are as follows:
1. TD(1) Algorithm
2. TD(0) Algorithm
3. TD(λ) Algorithm

UNIT-4

Probabilistic Models in Machine Learning
Machine learning algorithms today rely heavily on probabilistic models,
which take into consideration the uncertainty inherent in real-world
data. These models make predictions based on probability
distributions, rather than absolute values, allowing for a more nuanced
and accurate understanding of complex systems. One common
approach is Bayesian inference, where prior knowledge is combined
with observed data to make predictions. Another approach
is maximum likelihood estimation, which seeks to find the model that
best fits observational data.
What are Probabilistic Models?
Probabilistic models are an essential component of machine learning,
which aims to learn patterns from data and make predictions on new,
unseen data. They are statistical models that capture the inherent
uncertainty in data and incorporate it into their predictions.
Probabilistic models are used in various applications such as image
and speech recognition, natural language processing, and
recommendation systems. In recent years, significant progress has
been made in developing probabilistic models that can handle large
datasets efficiently.
Categories Of Probabilistic Models
These models can be classified into the following categories:
 Generative models
 Discriminative models.
 Graphical models
Generative models:
Generative models aim to model the joint distribution of the input and
output variables. These models generate new data based on the
probability distribution of the original dataset. Generative models are
powerful because they can generate new data that resembles the
training data. They can be used for tasks such as image and speech
synthesis, language translation, and text generation.
Discriminative models
The discriminative model aims to model the conditional distribution of
the output variable given the input variable. They learn a decision
boundary that separates the different classes of the output variable.
Discriminative models are useful when the focus is on making accurate
predictions rather than generating new data. They can be used for
tasks such as image recognition, speech recognition, and sentiment
analysis.
Graphical models
These models use graphical representations to show the conditional
dependence between variables. They are commonly used for tasks
such as image recognition, natural language processing, and causal
inference.
Naive Bayes Algorithm in Probabilistic Models
The Naive Bayes algorithm is a widely used approach in probabilistic
models, demonstrating remarkable efficiency and effectiveness in
solving classification problems. By leveraging the power of the Bayes
theorem and making simplifying assumptions about feature
independence, the algorithm calculates the probability of the target
class given the feature set. This method has found diverse applications
across various industries, ranging from spam filtering to medical
diagnosis. Despite its simplicity, the Naive Bayes algorithm has proven
to be highly robust, providing rapid results in a multitude of real-world
problems.
Naive Bayes is a probabilistic algorithm that is used for classification
problems. It is based on the Bayes theorem of probability and assumes
that the features are conditionally independent of each other given the
class. The Naive Bayes Algorithm is used to calculate the probability of
a given sample belonging to a particular class. This is done by
calculating the posterior probability of each class given the sample and
then selecting the class with the highest posterior probability as the
predicted class.
The algorithm works as follows:
1. Collect a labeled dataset of samples, where each sample has a set of features and a class label.
2. For each feature in the dataset, calculate the conditional probability of the feature given each class. This is done by counting the number of times the feature occurs in samples of the class and dividing by the total number of samples in the class.
3. Calculate the prior probability of each class by counting the number of samples in the class and dividing by the total number of samples in the dataset.
4. Given a new sample with a set of features, calculate the posterior probability of each class using Bayes' theorem with the conditional probabilities and prior probabilities calculated in steps 2 and 3.
5. Select the class with the highest posterior probability as the predicted class for the new sample. (A minimal sketch follows this list.)
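
A minimal sketch using scikit-learn's Gaussian Naive Bayes (the tiny dataset is illustrative):

import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.1], [1.2, 1.9], [7.8, 8.0], [8.1, 7.7]])
y = np.array([0, 0, 1, 1])

nb = GaussianNB().fit(X, y)            # estimates priors and per-class likelihoods
print(nb.predict([[1.1, 2.0]]))        # -> class 0
print(nb.predict_proba([[1.1, 2.0]]))  # posterior probability of each class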
Probabilistic Models in Deep Learning
Deep learning, a subset of machine learning, also relies on probabilistic
models. Probabilistic models are used to optimize complex models with
many parameters, such as neural networks. By incorporating
uncertainty into the model training process, deep learning algorithms
can provide higher accuracy and generalization capabilities. One
popular technique is variational inference, which allows for efficient
estimation of posterior distributions.
Importance of Probabilistic Models
 Probabilistic models play a crucial role in the field of machine
learning, providing a framework for understanding the underlying
patterns and complexities in massive datasets.
 Probabilistic models provide a natural way to reason about the
likelihood of different outcomes and can help us understand the
underlying structure of the data.
 Probabilistic models help enable researchers and practitioners to
make informed decisions when faced with uncertainty.
 Probabilistic models allow us to perform Bayesian inference, which
is a powerful method for updating our beliefs about a hypothesis
based on new data. This can be particularly useful in situations
where we need to make decisions under uncertainty.
Advantages Of Probabilistic Models
 Probabilistic models are an increasingly popular method in many
fields, including artificial intelligence, finance, and healthcare.
 The main advantage of these models is their ability to take into
account uncertainty and variability in data. This allows for more
accurate predictions and decision-making, particularly in complex
and unpredictable situations.
 Probabilistic models can also provide insights into how different
factors influence outcomes and can help identify patterns and
relationships within data.
Disadvantages Of Probabilistic Models
There are also some disadvantages to using probabilistic models.
 One of the disadvantages is the potential for overfitting, where the
model is too specific to the training data and doesn’t perform well
on new data.
 Not all data fits well into a probabilistic framework, which can limit
the usefulness of these models in certain applications.
 Another challenge is that probabilistic models can be
computationally intensive and require significant resources to
develop and implement.

Maximum Likelihood in Machine Learning

Introduction
Maximum likelihood is an approach commonly used for density estimation problems, in which a likelihood function is defined to get the probabilities of the distributed data. It is imperative to study and understand the concept of maximum likelihood, as it is one of the primary and core concepts essential for learning other advanced machine learning and deep learning techniques and algorithms.
In this article, we will discuss the likelihood function, the core idea behind that, and how
it works with code examples. This will help one to understand the concept better and
apply the same when needed.
Let us dive into the likelihood first to understand the maximum likelihood estimation.

What is the Likelihood?


In machine learning, the likelihood is a measure of how well observed data is explained by a model: in simple words, as the name suggests, the likelihood is a function that tells us how likely a specific data point is under the assumed data distribution.
For example, suppose there are two data points in the dataset, and the likelihood of the first data point is greater than that of the second. In that case, the first data point is assumed to provide more accurate information to the final model, and hence to be more informative and precise.
After this discussion, a natural question may appear in your mind: if the working of the likelihood function is the same as the probability function, then what is the difference?

Difference Between Probability and Likelihood


Although the working and intuition of both probability and likelihood appear to be the same, there is a slight difference. The likelihood is a function that tells us how valuable a particular data point is and how much it contributes to the final algorithm given the data distribution, i.e., how likely the data is under the model. Probability, in simple words, is a term that describes the chance of some event happening given other circumstances or conditions, often as conditional probability.
Also, the sum of all the probabilities associated with a particular problem is one and cannot exceed it, whereas the likelihood can be greater than one.

What is Maximum Likelihood Estimation?


After discussing the intuition of the likelihood function, it is clear that a higher likelihood is desired for every model in order to obtain an accurate model with accurate results. The term maximum likelihood therefore means that we are maximizing the likelihood function; this is called the maximization of the likelihood function.
Let us try to understand this with an example.
Suppose we have a classification dataset in which the independent column is the marks students achieved in a particular exam, and the target or dependent column is categorical, with Yes and No attributes representing whether students were placed in the campus placements or not.
Now, if we solve this problem with maximum likelihood estimation, the method first calculates the probability of every data point under each possible value of the target variable. It then plots all the data points in a two-dimensional plot and tries to find the line that best fits the dataset so as to divide it into two parts. The best-fit line is achieved after some epochs, and once achieved, it is used to classify a new data point by simply plotting it on the graph.

Maximum Likelihood: The Base


The maximum likelihood estimation is a base of some machine learning and deep
learning approaches used for classification problems. One example is logistic
regression, where the algorithm is used to classify the data point using the best-fit line
on the graph. The same approach is known as the perceptron trick regarding deep
learning algorithms.
In such a plot, all the data observations are placed in a two-dimensional diagram where the X-axis represents the independent column (the training data) and the Y-axis represents the target variable. A line is drawn to separate the positive and negative observations: according to the algorithm, observations that fall above the line are considered positive, and data points below the line are regarded as negative.

Maximum Likelihood Estimation: Code Example


We can quickly implement the maximum likelihood estimation technique using logistic regression on any classification dataset. The snippet below is made self-contained with a small synthetic dataset (the original assumed predefined X_train, y_train, and df_pred):

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Toy data: exam marks (X) vs. a binary placement outcome (y)
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 1))
y = (X[:, 0] + rng.normal(0, 15, size=200) > 50).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Logistic regression fits its weights by maximizing the likelihood of the labels
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)

# Plot the predictions and the fitted sigmoid curve
df_pred = pd.DataFrame({"X": X_test[:, 0], "lr_pred": lr_pred})
sns.regplot(x="X", y="lr_pred", data=df_pred, logistic=True, ci=None)
The above code will fit the logistic regression for the given dataset and generate the
line plot for the data representing the distribution of the data and the best fit according
to the algorithm.

Key Takeaways
 Maximum Likelihood is a function that describes the data points and their
likeliness to the model for best fitting.
 Maximum likelihood is different from purely probabilistic methods, which work on the principle of calculating probabilities; in contrast, the likelihood method tries to maximize the likelihood of the data observations according to the data distribution.
 Maximum likelihood is an approach used for solving the problems like density
distribution and is a base for some algorithms like logistic regression.
 The approach is very similar and is predominantly known as the perceptron trick
in terms of deep learning methods.

Apriori Algorithm in Machine Learning


The Apriori algorithm uses frequent itemsets to generate association rules, and it is designed to work on databases that contain transactions. With the help of these association rules, it determines how strongly or how weakly two objects are connected. The algorithm uses a breadth-first search and a hash tree to calculate itemset associations efficiently; it is an iterative process for finding the frequent itemsets in a large dataset.

This algorithm was given by R. Agrawal and R. Srikant in 1994. It is mainly used for market basket analysis and helps to find products that can be bought together. It can also be used in the healthcare field, for example, to find drug reactions for patients.
What is a Frequent Itemset?

Frequent itemsets are those itemsets whose support is greater than the
threshold value, i.e. the user-specified minimum support. The Apriori property
also holds: if {A, B} is a frequent itemset, then A and B must individually be
frequent itemsets as well.

For example, suppose there are two transactions, A = {1, 2, 3, 4, 5} and
B = {2, 3, 7}; across these two transactions, 2 and 3 are the frequent items.
Steps for the Apriori Algorithm

Below are the steps for the Apriori algorithm (a code sketch follows the list):
Step 1: Determine the support of the itemsets in the transactional database,
and select the minimum support and confidence.

Step 2: Take all the itemsets in the transactions with a support value higher
than the minimum (selected) support value.

Step 3: Find all the rules over these subsets that have a confidence value
higher than the threshold (minimum) confidence.

Step 4: Sort the rules in decreasing order of lift.
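As a minimal sketch of these steps (an assumption: the mlxtend library is installed, and the five transactions below are made up for illustration):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical transactions for illustration
transactions = [
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk", "bread"],
    ["milk", "butter"],
    ["bread", "butter"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Steps 1-2: itemsets whose support exceeds the chosen minimum support
frequent_itemsets = apriori(df, min_support=0.4, use_colnames=True)

# Step 3: rules whose confidence exceeds the minimum confidence
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)

# Step 4: sort the rules in decreasing order of lift
print(rules.sort_values("lift", ascending=False))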
Bayesian Belief Networks
A Bayesian Belief Network is a graphical representation of the probabilistic
relationships among the random variables in a particular set. It is a classifier
built on conditional independence assumptions among attributes. Because the
network factorizes the joint probability, each probability in a Bayesian Belief
Network is derived from a condition, P(attribute | parent), i.e. the probability
of an attribute given its parent attribute.
(Note: a classifier assigns the data in a collection to desired categories.)
 Consider this example:
 We have an alarm ‘A’ (a node), say installed in the house of a person
‘gfg’, which rings upon two events, burglary ‘B’ and fire ‘F’; these are the
parent nodes of the alarm node. The alarm node is in turn the parent of
two person nodes, ‘P1’ and ‘P2’.
 Upon a burglary or a fire, ‘P1’ and ‘P2’ call the person ‘gfg’,
respectively. But there are a few caveats in this case: sometimes ‘P1’ may
forget to call ‘gfg’ even after hearing the alarm, as he has a tendency to
forget things quickly. Similarly, ‘P2’ sometimes fails to call ‘gfg’, as he can
only hear the alarm from a certain distance.
Q) Find the probability that ‘P1’ is true (P1 has called ‘gfg’) and ‘P2’ is true
(P2 has called ‘gfg’) when the alarm ‘A’ rang, but no burglary ‘B’ and no
fire ‘F’ occurred.
=> P (P1, P2, A, ~B, ~F), where P1, P2 and A are ‘true’ events and
‘~B’ and ‘~F’ are ‘false’ events.
[Note: the values below are not calculated or computed; they are observed values.]
Burglary ‘B’ –
 P (B=T) = 0.001 (‘B’ is true, i.e. a burglary has occurred)
 P (B=F) = 0.999 (‘B’ is false, i.e. no burglary has occurred)
Fire ‘F’ –
 P (F=T) = 0.002 (‘F’ is true, i.e. a fire has occurred)
 P (F=F) = 0.998 (‘F’ is false, i.e. no fire has occurred)
Alarm ‘A’ –

B F P (A=T) P (A=F)
T T 0.95 0.05
T F 0.94 0.06
F T 0.29 0.71
F F 0.001 0.999
 The alarm ‘A’ node can be ‘true’ or ‘false’ (i.e. it may or may not have
rung). It has two parent nodes, burglary ‘B’ and fire ‘F’, which can each be
‘true’ or ‘false’ (i.e. may or may not have occurred) depending on the
conditions.
Person ‘P1’ –

A P (P1=T) P (P1=F)
T 0.95 0.05
F 0.05 0.95

 The person ‘P1’ node can be ‘true’ or ‘false’ (i.e. he may or may not have
called ‘gfg’). It has one parent node, the alarm ‘A’, which can be ‘true’ or
‘false’ (i.e. it may or may not have rung, upon burglary ‘B’ or fire ‘F’).
Person ‘P2’ –

A P (P2=T) P (P2=F)
T 0.80 0.20
F 0.01 0.99

 The person ‘P2’ node can be ‘true’ or ‘false’ (i.e. he may or may not have
called ‘gfg’). It has one parent node, the alarm ‘A’, which can be ‘true’ or
‘false’ (i.e. it may or may not have rung, upon burglary ‘B’ or fire ‘F’).
Solution: using the observed probability values above.
With respect to the question, P(P1, P2, A, ~B, ~F), we need the probability of
‘P1’, which we find with regard to its parent node, the alarm ‘A’. Similarly,
the probability of ‘P2’ is found with regard to its parent node ‘A’. We find
the probability of the alarm node ‘A’ with regard to ‘~B’ and ‘~F’, since
burglary ‘B’ and fire ‘F’ are the parent nodes of ‘A’.
From the observed values, we can deduce:
P (P1, P2, A, ~B, ~F)
= P (P1 | A) * P (P2 | A) * P (A | ~B, ~F) * P (~B) * P (~F)
= 0.95 * 0.80 * 0.001 * 0.999 * 0.998
≈ 0.00076
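The same product can be checked in a few lines of Python, using the values tabulated above:

# Joint probability P(P1, P2, A, ~B, ~F), multiplied along the network structure
p_P1_given_A = 0.95      # P(P1=T | A=T)
p_P2_given_A = 0.80      # P(P2=T | A=T)
p_A_given_nB_nF = 0.001  # P(A=T | B=F, F=F)
p_not_B = 0.999          # P(B=F)
p_not_F = 0.998          # P(F=F)

joint = p_P1_given_A * p_P2_given_A * p_A_given_nB_nF * p_not_B * p_not_F
print(round(joint, 5))   # -> 0.00076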
Probabilistic modeling
Probabilistic modeling is a statistical technique used to account for the impact of
random events or actions when predicting the potential occurrence of future outcomes.
Probabilistic Bayesian Networks

Inference
A Bayesian Network (BN) is used to estimate the probability that a
hypothesis is true based on evidence.

Bayesian network inference involves:
o Deducing unobserved variables
o Parameter learning
o Structure learning
Let's discuss them one by one:
1. Deducing Unobserved Variables

With the help of this network, we can develop a comprehensive model that
delineates the relationships between the variables and use it to answer
probabilistic queries about them, updating our knowledge of the state of a
subset of variables when other variables (the evidence) are observed.
Computing the posterior distribution of variables given evidence is called
probabilistic inference. The posterior gives a universal sufficient statistic
for detection applications: when one wants to choose values for a subset of
variables, it allows minimizing some expected loss function, for instance the
probability of decision error. A BN is thus a mechanism for applying Bayes'
theorem to complex problems.
Popular inference methods are:
1.1 Variable Elimination

Variable elimination eliminates the non-observed, non-query variables one
by one by distributing the sum over the product.
1.2 Clique Tree Propagation

Clique tree propagation caches the computation so that many variables can
be queried at one time and new evidence can be propagated quickly.
1.3 Recursive Conditioning

Recursive conditioning allows a tradeoff between space and time; with
sufficient space available, it is equivalent to the variable elimination method.
2. Parameter Learning
To fully specify the BN, and thus represent the joint probability distribution,
it is necessary to specify for each node X the probability distribution of X
conditional on its parents. The distribution of X can take many forms;
discrete or Gaussian distributions simplify calculations. Sometimes only
constraints on the distribution are known; to determine a single distribution
one can then use the principle of maximum entropy, choosing the
distribution with the greatest entropy given the constraints.

Often the conditional distributions include parameters that are unknown and
must be estimated from data, sometimes using the maximum likelihood
approach. Direct maximization of the likelihood is often complex when there
are unobserved variables. The expectation-maximization (EM) algorithm
addresses this by alternating between computing expected values of the
unobserved variables and maximizing the likelihood, assuming the previous
expectations are correct; under mild conditions this process converges on
maximum likelihood values of the parameters.

A more fully Bayesian approach treats parameters as additional unobserved
variables: we use the BN to compute a posterior distribution conditional on
the observed data and then integrate out the parameters. This approach can
be costly and can lead to large-dimensional models, so in practice classical
parameter-setting approaches are more common.
3. Structure Learning
In the simplest case, a BN is specified by an expert and is then used to
perform inference. In other applications, the task of defining the network is
too complex for humans, and the network structure and the parameters of
the local distributions must be learned from data.

Automatically learning the graph structure of a BN is a challenge pursued
within machine learning. The basic idea goes back to a recovery algorithm
developed by Rebane and Pearl (1987), and rests on the distinction between
the three possible triplets allowed in a Directed Acyclic Graph (DAG):
o X → Y → Z
o X ← Y → Z
o X → Y ← Z
In types 1 and 2, X and Z are independent given Y, so these two types
represent the same dependencies and are indistinguishable. Type 3, however,
can be uniquely identified, since X and Z are marginally independent and all
other pairs are dependent. So, while the skeletons (the graphs stripped of
arrows) of these three triplets are identical, the direction of the arrows is
partially identifiable. The same distinction applies when X and Z have common
parents, except that one must first condition on those parents. Algorithms
have been developed to first determine the skeleton of the underlying graph
and then orient all arrows whose directionality is dictated by the observed
conditional independencies.
An alternative method of structural learning uses optimization-based search.
It requires a scoring function and a search strategy; a common scoring
function is the posterior probability of the structure given the training data.
The time requirement of an exhaustive search returning a structure that
maximizes the score is super-exponential in the number of variables, so a
local search strategy makes incremental changes aimed at improving the
overall score. A global search algorithm like Markov chain Monte Carlo can
avoid getting trapped in local minima.
Another method consists of focusing on the sub-class of decomposable
models, for which the MLE has a closed form.

We can also augment a BN with nodes and edges using rule-based machine
learning techniques: Inductive Logic Programming can be used to mine rules
and create new nodes. Statistical Relational Learning (SRL) uses a scoring
function, guided by the BN structure, to direct the structural search and
augment the network. A common SRL scoring function is the area under the
ROC curve.
Structure Learning Algorithms
You can learn the structure and parameters of BNs through structure
learning algorithms, which support both discrete and continuous data sets.

Below are the main types of structure learning algorithms:
i. Constraint-based Structure Learning Algorithms
Examples are Grow-Shrink (GS), Incremental Association Markov Blanket
(IAMB), Fast Incremental Association (Fast-IAMB), and Interleaved
Incremental Association (Inter-IAMB).

ii. Score-based Structure Learning Algorithms
Examples are Hill Climbing (HC) and Tabu Search.
iii. Hybrid Structure Learning Algorithms
Examples are Max-Min Hill Climbing (MMHC) and General 2-Phase
Restricted Maximization (RSMAX2).
iv. Local Discovery Algorithms
Examples are Chow-Liu, ARACNE, Max-Min Parents and Children (MMPC),
and Semi-Interleaved HITON-PC.
v. Bayesian Network Classifiers
Examples are Naïve Bayes and Tree-Augmented Naïve Bayes (TAN).
Fraud Detection – A Naive Bayes Case Study

The advancements in machine learning have resulted in a massive boost in
automation, and one such area is fraud detection. With the help of machine
learning algorithms like Naive Bayes, it has become much easier for
companies to detect fraud at an early stage and to spot various
irregularities in transactions.
In fraud detection, companies monitor and analyze user activity to detect
any unusual or malicious pattern. As internet usage has increased, the
growth in online transactions has brought a significant increase in the
number of frauds.
With the help of data science, industries can apply machine learning and
predictive modeling to develop tools for recognizing unusual patterns in the
fraud-detection ecosystem. Naive Bayes is one of the important algorithms
used for fraud detection in industry; a small sketch follows.
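As a minimal illustrative sketch (the two features, transaction amount and hour of day, and all numbers below are synthetic assumptions), a Gaussian Naive Bayes classifier can be trained to flag suspicious transactions:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Legitimate transactions: smaller amounts, daytime hours (synthetic)
legit = np.column_stack([rng.normal(50, 20, 500), rng.normal(14, 3, 500)])
# Fraudulent transactions: larger amounts, late-night hours (synthetic)
fraud = np.column_stack([rng.normal(300, 100, 50), rng.normal(3, 2, 50)])

X = np.vstack([legit, fraud])
y = np.array([0] * 500 + [1] * 50)  # 0 = legitimate, 1 = fraud

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = GaussianNB().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))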
Probability Density Estimation & Maximum Likelihood Estimation
Probability Density: Assume a random variable x that has a
probability distribution p(x). The relationship between the outcomes of
a random variable and its probability is referred to as the probability
density.
The problem is that we don’t always know the full probability
distribution for a random variable. This is because we only use a small
subset of observations to derive the outcome. This problem is referred
to as Probability Density Estimation as we use only a random
sample of observations to find the general density of the whole sample
space.
Probability Density Function (PDF)
A PDF is a function that tells the probability of the random variable
from a sub-sample space falling within a particular range of values and
not just one value. It tells the likelihood of the range of values in the
random variable sub-space being the same as that of the whole
sample.
By definition, if X is any continuous random variable, then the function
f(x) is called a probability density function if:

P(a ≤ X ≤ b) = ∫[a, b] f(x) dx

where,
a -> lower limit
b -> upper limit
X -> continuous random variable
f(x) -> probability density function
Steps Involved:
Step 1 - Create a histogram for the random set of observations to
understand the density of the random sample.

Step 2 - Create the probability density function and fit it on the random
sample. Observe how it fits the histogram plot.

Step 3 - Now iterate steps 1 and 2 in the following manner:
3.1 - Calculate the distribution parameters.
3.2 - Calculate the PDF for the random sample distribution.
3.3 - Observe the resulting PDF against the data.
3.4 - Transform the data until it best fits the distribution.

Most of the histograms of the different random samples, after fitting,
should match the histogram plot of the whole population.
Density Estimation: It is the process of finding out the density of the
whole population by examining a random sample of data from that
population. One of the best ways to achieve a density estimate is by
using a histogram plot.
Parametric Density Estimation
A normal distribution has two parameters: the mean and the standard
deviation. We calculate the sample mean and standard deviation of the
random sample taken from a population to estimate the density of the
random sample. The approach is termed 'parametric' because the relation
between the observations and their probabilities changes with the values of
these two parameters.
Now, it is important to understand that the mean and standard deviation of
this random sample are not going to be the same as those of the whole
population, due to its small size. A sample plot for parametric density
estimation is shown below.
(Figure: PDF fitted over a histogram plot with one peak value.)
Nonparametric Density Estimation

In some cases, the PDF may not fit the random sample because the sample
does not follow a normal distribution (i.e. instead of one peak there are
multiple peaks in the graph). Here, instead of using distribution parameters
like the mean and standard deviation, a particular algorithm is used to
estimate the probability distribution. Thus, it is known as 'nonparametric
density estimation'.
One of the most common nonparametric approaches is Kernel Density
Estimation (KDE). In this, the objective is to calculate the unknown density
fh(x) using the equation given below:

fh(x) = (1/(n*h)) * Σ(i=1..n) K((x - xi)/h) = (1/n) * Σ(i=1..n) Kh(x - xi)

where,
K -> kernel (a non-negative function)
h -> bandwidth (smoothing parameter, h > 0)
Kh -> scaled kernel, Kh(x) = K(x/h)/h
fh(x) -> density (to calculate)
n -> number of samples in the random sample
A sample plot for nonparametric density estimation:
(Figure: PDF plot over a sample histogram plot based on KDE.)
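As a minimal sketch of KDE using scipy's gaussian_kde (the bimodal sample below is synthetic, and the bandwidth is left to scipy's default rule):

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Two peaks, so a single normal PDF would fit this sample poorly
sample = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 300)])

kde = gaussian_kde(sample)   # bandwidth h chosen automatically (Scott's rule)
xs = np.linspace(-5, 7, 5)
print(kde(xs))               # estimated density fh(x) at each point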
Problems with Probability Distribution Estimation

Probability distribution estimation relies on finding the best PDF and
determining its parameters accurately, but the random data sample that we
consider is very small. Hence, it becomes very difficult to determine which
parameters and which probability density function to use. To tackle this
problem, Maximum Likelihood Estimation is used.
Maximum Likelihood Estimation
It is a method of determining the parameters (mean, standard deviation,
etc.) of normally distributed random sample data, or, more generally, a
method of finding the best-fitting PDF over the random sample data. This is
done by maximizing the likelihood function so that the fitted PDF best
matches the random sample. Another way to look at it: MLE gives the mean
and the standard deviation under which the observed random sample is
most probable.
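As a small sketch, scipy's norm.fit returns the maximum likelihood estimates of the mean and standard deviation for a sample (the sample below is synthetic):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
sample = rng.normal(loc=10.0, scale=2.0, size=1000)  # true mean 10, std 2

mu_hat, sigma_hat = norm.fit(sample)  # parameters that maximize the likelihood
print(mu_hat, sigma_hat)              # close to 10 and 2 for a large sample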
Sequence Modelling
Sequence Modelling is the ability of a computer program to
model, interpret, make predictions about or generate any
type of sequential data, such as audio, text etc. For example,
a computer program that can take a piece of text in English
and translate it to French is an example of a Sequence
Modelling program (because the type of data being dealt
with is text, which is sequential in nature). An AI algorithm
called the Recurrent Neural Network (RNN) is a specialised form
of the classic Artificial Neural Network (the Multi-Layer
Perceptron) that is used to solve sequence modelling problems.
Recurrent Neural Networks are Artificial Neural Networks that
have loops in them: the activation of each neuron or cell depends
not only on its current input but also on its previous activation
values.

The architecture of an RNN is also inspired by the human
brain. As we read an essay, we are able to interpret the
sentence we are currently reading better because of
the information we gained from the previous sentences of
the information we gained from previous sentences of
the essay. Similarly, we can understand the conclusion
of a novel only if we have read the beginning and
middle of the novel. The same logic follows for audio as
well. On a basic level, interpreting a certain part of a
sequence requires information gained from the
previous parts of the sequence. Thus, in a human
brain, information that persists in our memory while
interpreting sequential data is vital in understanding
each part of the sequence. Similarly, RNNs also try to
incorporate this capacity of memory by updating
something called the “state” of its cells each time we
move from one part of a sequence to another. The state
of a cell is basically the total information gained by it
so far by reading the sequence. So, the current state or
knowledge of a cell in an RNN is not only dependent on
the current word or sentence it is reading, but is also
dependent on all the other words or sentences it has
read before the current one. Thus the name Recurrent
Neural Network. (Classic ANNs do not have this
mechanism of memory. An ANN neuron’s current state
depends only on the current input as it discards
information about the previous inputs to the cell)
 The first image referenced above illustrates a recurrent
neuron or cell: a simple neuron that has a loop. It takes
some input x and gives some output h.
 This neuron can be thought of as multiple copies of
the same unit or cell chained together. This is
illustrated by the second image, which shows an
"unrolled" form of the recurrent neuron. Each copy or
unit passes a message (some information) to the next
copy.
In Recurrent Neural Networks, there is a concept of time
steps. This means that the recurrent cells or units take
inputs from a sequence one by one. Each step at which the
cell picks up an input is called a time step. For example, if we
have a sequence of words that form a sentence, such as “It’s
a sunny day.”, our recurrent cell will take the
word “It’s” as its input at the first time step. Now it stores
information about the word “It’s” in its memory and updates
its state. Next, it takes the word “a” as its second input at
the second time step. Now it incorporates information about
the word “a” into its memory and updates its state once
again. It repeats the process until the last word. Therefore,
the cell state at the 1st time step depends only on the 1st
input, the cell state at the 2nd time step depends on the 1st
and 2nd inputs, the cell state at the third time step depends
on the 1st, 2nd and 3rd inputs and so on. In this way the cell
continuously updates its memory as time passes (similar to a
human brain).
Referring to the previous paragraph and the images above, we
can say that x1, x2, x3 and so on are the inputs to the
recurrent cell at the 1st, 2nd, 3rd and subsequent time steps.
At each time step, the recurrent cell updates its state based on
the current input, gives an output vector h, and then moves on
to the next time step. This is demonstrated in the "unrolled"
RNN diagram above.
Therefore, we need two separate weight matrices at each time
step to calculate the current state of the recurrent cell: one
matrix W and another matrix U. Matrix W is multiplied by the
current input, matrix U is multiplied by the previous state of
the cell (at the previous time step), and the two products are
added. A bias vector b can be added to the sum. Then the whole
sum is passed through an activation function like ReLU, Tanh
or Sigmoid to form the new updated state of the cell (the
activation function introduces non-linearity into the network
so that it can fit more complex functions). So the update
formula can be written as:

h_t = f(W · x_t + U · h_(t-1) + b)

where h_t is the cell state at time step t, x_t is the cell input
at time step t, and f is the activation function.
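A minimal numpy sketch of this update rule (the dimensions and the choice of tanh are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(3)
input_size, hidden_size = 4, 3

W = rng.normal(size=(hidden_size, input_size))   # multiplies the current input
U = rng.normal(size=(hidden_size, hidden_size))  # multiplies the previous state
b = np.zeros(hidden_size)                        # bias vector

h = np.zeros(hidden_size)                        # initial cell state
sequence = rng.normal(size=(5, input_size))      # 5 time steps of input vectors

for x_t in sequence:                  # one input per time step
    h = np.tanh(W @ x_t + U @ h + b)  # state depends on current input and past state
    print(h)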
The RNN

Many such recurrent neurons stacked one on top of the other
(possibly with some densely connected layers at the end) form
a Deep Recurrent Neural Network, or DRNN.
 In a Deep Recurrent Neural Network, the outputs of the
lower layers are fed as inputs to the upper layers at each
time step. For example, the output of the lowest layer at
time step t-1 is fed as input to the middle layer at the same
time step t-1. With multiple recurrent units stacked one on
top of the other, a DRNN can learn more complex patterns
in sequential data.

The outputs from one recurrent unit at each time step can be
fed as input to the next unit at the same time step. This
forms a deep sequential model that can model a larger range
of more complex sequences than a single recurrent unit.
Long Term Dependencies

Recurrent Neural Networks face the problem of long-term
dependencies very often. On many occasions in sequence
modelling problems, we need information from long ago to
make predictions about the next term(s) in a sequence. For
example, suppose we want to find the next word in the
sentence "I grew up in Spain and I am very familiar with the
traditions and customs of …". To predict the next word (which
seems to be "Spain"), we need information about the word
"Spain", which is just the 5th word in the sentence, while the
word we need to predict is the 17th. This is a large time gap,
and RNNs are prone to losing information given to them many
time steps back; in practice, RNNs are unable to capture these
long-term dependencies.
Long Short Term Memory Networks

A special type of RNN called an LSTM network was created to
solve the problem of long-term dependencies. The constituent
cells of an LSTM network each have their own system of gates
that decides what information, and how much of it, from the
sequence (text or audio) is stored in the cell's state and how
much is discarded at each time step. These gates regulate the
state of the cell more effectively and help the cell retain
information it gained long ago. The systems of gates are
parametrized by weight matrices and bias vectors, and these
parameters are trained using the backpropagation algorithm.
I would suggest the colah blog for a more in-depth
understanding of how LSTMs work.
Sequence Modelling Applications

Some applications of sequence modelling are:
 Image captioning (with the help of computer vision):
generating captions for images (captions are sequential data).
 Video frame prediction (with the help of computer vision):
predicting the subsequent frames of a video given the previous
ones (the frames in a video are sequential in nature).
 Classifying songs (audio) by genre as jazz, rock, pop, etc.
(audio is sequential in nature).
 Composing music (music is sequential in nature).

Markov model
What is a Markov model?
A Markov model is a stochastic method for randomly changing systems that
possess the Markov property. This means that, at any given time, the next
state is only dependent on the current state and is independent of anything in
the past. Two commonly applied types of Markov model are used when the
system being represented is autonomous -- that is, when the system isn't
influenced by an external agent. These are as follows:
1. Markov chains. These are the simplest type of Markov model and are
used to represent systems where all states are observable. Markov chains
show all possible states, and between states, they show the transition rate,
which is the probability of moving from one state to another per unit of
time. Applications of this type of model include prediction of market
crashes, speech recognition and search engine algorithms.

2. Hidden Markov models. These are used to represent systems with some
unobservable states. In addition to showing states and transition rates,
hidden Markov models also represent observations and observation
likelihoods for each state. Hidden Markov models are used for a range of
applications, including thermodynamics, finance and pattern recognition.

Another two commonly applied types of Markov model are used when the
system being represented is controlled -- that is, when the system is
influenced by a decision-making agent. These are as follows:

1. Markov decision processes. These are used to model decision-making in


discrete, stochastic, sequential environments. In these processes, an
agent makes decisions based on reliable information. These models are
applied to problems in artificial intelligence (AI), economics and behavioral
sciences.

2. Partially observable Markov decision processes. These are used in


cases like Markov decision processes but with the assumption that the
agent doesn't always have reliable information. Applications of these
models include robotics, where it isn't always possible to know the location.
Another application is machine maintenance, where reliable information on
machine parts can't be obtained because it's too costly to shut down the
machine to get the information.
How is Markov analysis applied?
Markov analysis is a probabilistic technique that uses Markov models to
predict the future behavior of some variable based on the current state.
Markov analysis is used in many domains, including the following:

 Markov chains are used for several business applications, including


predicting customer brand switching for marketing, predicting how long
people will remain in their jobs for human resources, predicting time to
failure of a machine in manufacturing, and forecasting the future price of a
stock in finance.

 Markov analysis is also used in natural language processing (NLP) and in


machine learning. For NLP, a Markov chain can be used to generate a
sequence of words that form a complete sentence, or a hidden Markov
model can be used for named-entity recognition and tagging parts of
speech. For machine learning, Markov decision processes are used to
represent reward in reinforcement learning.

 A recent example of the use of Markov analysis in healthcare was in


Kuwait. A continuous-time Markov chain model was used to determine the
optimal timing and duration of a full COVID-19 lockdown in the country,
minimizing both new infections and hospitalizations. The model suggested
that a 90-day lockdown beginning 10 days before the epidemic peak was
optimal.
How are Markov models represented?
The simplest Markov model is a Markov chain, which can be expressed in
equations, as a transition matrix or as a graph. A transition matrix is used to
indicate the probability of moving from each state to each other state.
Generally, the current states are listed in rows, and the next states are
represented as columns. Each cell then contains the probability of moving
from the current state to the next state. For any given row, all the cell values
must then add up to one.
A graph consists of circles, each of which represents a state, and directional
arrows to indicate possible transitions between states. The directional arrows
are labeled with the transition probability. The transition probabilities on the
directional arrows coming out of any given circle must add up to one.

Other Markov models are based on the chain representations but with added
information, such as observations and observation likelihoods.

The transition matrix below represents shifting gears in a car with a manual
transmission. Six states are possible, and a transition from any given state to
any other state depends only on the current state -- that is, where the car
goes from second gear isn't influenced by where it was before second gear.
Such a transition matrix might be built from empirical observations that show,
for example, that the most probable transitions from first gear are to second or
neutral.

This transition matrix represents shifting gears in a car with a manual transmission and the
six states that are possible.

The image below represents the toss of a coin. Two states are possible:
heads and tails. The transition from heads to heads or from heads to tails is
equally probable (.5) and is independent of all preceding coin tosses. The
circles represent the two possible states, heads and tails, and the arrows show
the possible states the system could transition to in the next step. The number
.5 represents the probability of that transition occurring.
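A short Python sketch simulating this two-state chain (note that each row of the transition matrix sums to one, as required):

import numpy as np

states = ["heads", "tails"]
P = np.array([[0.5, 0.5],    # from heads: P(heads), P(tails)
              [0.5, 0.5]])   # from tails: P(heads), P(tails)

rng = np.random.default_rng(4)
state = 0                    # start at heads
for _ in range(10):
    state = rng.choice(2, p=P[state])  # next state depends only on the current state
    print(states[state])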
History of the Markov chain
Markov chains are named after their creator, Andrey Andreyevich Markov, a
Russian mathematician who founded a new branch of probability theory
around stochastic processes in the early 1900s. Markov was greatly
influenced by his teacher and mentor, Pafnuty Chebyshev, whose work also
broke new ground in probability theory.
Hidden Markov Models
A Hidden Markov Model (HMM) is a probabilistic model that consists
of a sequence of hidden states, each of which generates an
observation. The hidden states are usually not directly observable, and the
goal of an HMM is to estimate the sequence of hidden states based on a
sequence of observations. An HMM is defined by the following components:

o A set of N hidden states, S = {s1, s2, ..., sN}.
o A set of M observations, O = {o1, o2, ..., oM}.
o An initial state probability distribution, π = {π1, π2, ..., πN}, which specifies
the probability of starting in each hidden state.
o A transition probability matrix, A = [aij], which defines the probability of
moving from one hidden state to another.
o An emission probability matrix, B = [bjk], which defines the probability of
emitting an observation from a given hidden state.

The basic idea behind an HMM is that the hidden states generate the
observations, and the observed data is used to estimate the hidden state
sequence, often via the forward-backward algorithm.
Applications of Hidden Markov Models

Now, we will explore some of the key applications of HMMs, including speech
recognition, natural language processing, bioinformatics, and finance.

o Speech Recognition
One of the most well-known applications of HMMs is speech recognition. In
this field, HMMs are used to model the different sounds and phones that
makeup speech. The hidden states, in this case, correspond to the different
sounds or phones, and the observations are the acoustic signals that are
generated by the speech. The goal is to estimate the hidden state sequence,
which corresponds to the transcription of the speech, based on the observed
acoustic signals. HMMs are particularly well-suited for speech recognition
because they can effectively capture the underlying structure of the speech,
even when the data is noisy or incomplete. In speech recognition systems,
the HMMs are usually trained on large datasets of speech signals, and the
estimated parameters of the HMMs are used to transcribe speech in real
time.
o Natural Language Processing
Another important application of HMMs is natural language processing. In this
field, HMMs are used for tasks such as part-of-speech tagging, named
entity recognition, and text classification. In these applications, the
hidden states are typically associated with the underlying grammar or
structure of the text, while the observations are the words in the text. The
goal is to estimate the hidden state sequence, which corresponds to the
structure or meaning of the text, based on the observed words. HMMs are
useful in natural language processing because they can effectively capture
the underlying structure of the text, even when the data is noisy or
ambiguous. In natural language processing systems, the HMMs are usually
trained on large datasets of text, and the estimated parameters of the HMMs
are used to perform various NLP tasks, such as text classification, part-of-
speech tagging, and named entity recognition.
o Bioinformatics
HMMs are also widely used in bioinformatics, where they are used to model
sequences of DNA, RNA, and proteins. The hidden states, in this case,
correspond to the different types of residues, while the observations are the
sequences of residues. The goal is to estimate the hidden state sequence,
which corresponds to the underlying structure of the molecule, based on the
observed sequences of residues. HMMs are useful in bioinformatics because
they can effectively capture the underlying structure of the molecule, even
when the data is noisy or incomplete. In bioinformatics systems, the HMMs
are usually trained on large datasets of molecular sequences, and the
estimated parameters of the HMMs are used to predict the structure or
function of new molecular sequences.
o Finance
Finally, HMMs have also been used in finance, where they are used to model
stock prices, interest rates, and currency exchange rates. In these
applications, the hidden states correspond to different economic states, such
as bull and bear markets, while the observations are the stock prices, interest
rates, or exchange rates. The goal is to estimate the hidden state sequence,
which corresponds to the underlying economic state, based on the observed
prices, rates, or exchange rates. HMMs are useful in finance because they
can effectively capture the underlying economic state, even when the data is
noisy or incomplete. In finance systems, the HMMs are usually trained on
large datasets of financial data, and the estimated parameters of the HMMs
are used to make predictions about future market trends or to develop
investment strategies.
Limitations of Hidden Markov Models

Now, we will explore some of the key limitations of HMMs and discuss how
they can impact the accuracy and performance of HMM-based systems.

o Limited Modeling Capabilities


One of the key limitations of HMMs is that they are relatively limited in their
modelling capabilities. HMMs are designed to model sequences of data,
where the underlying structure of the data is represented by a set of hidden
states. However, the structure of the data can be quite complex, and the
simple structure of HMMs may not be enough to accurately capture all the
details. For example, in speech recognition, the complex relationship
between the speech sounds and the corresponding acoustic signals may not
be fully captured by the simple structure of an HMM.
o Overfitting
Another limitation of HMMs is that they can be prone to overfitting, especially
when the number of hidden states is large or the amount of training data is
limited. Overfitting occurs when the model fits the training data too well and
is unable to generalize to new data. This can lead to poor performance when
the model is applied to real-world data and can result in high error rates. To
avoid overfitting, it is important to carefully choose the number of hidden
states and to use appropriate regularization techniques.
o Lack of Robustness
HMMs are also limited in their robustness to noise and variability in the data.
For example, in speech recognition, the acoustic signals generated by speech
can be subject to a variety of distortions and noise, which can make it
difficult for the HMM to accurately estimate the underlying structure of the
data. In some cases, these distortions and noise can cause the HMM to make
incorrect decisions, which can result in poor performance. To address these
limitations, it is often necessary to use additional processing and filtering
techniques, such as noise reduction and normalization, to pre-process the
data before it is fed into the HMM.
o Computational Complexity
Finally, HMMs can also be limited by their computational complexity,
especially when dealing with large amounts of data or when using complex
models. The computational complexity of HMMs is due to the need to
estimate the parameters of the model and to compute the likelihood of the
data given the model. This can be time-consuming and computationally
expensive, especially for large models or for data that is sampled at a high
frequency. To address this limitation, it is often necessary to use parallel
computing techniques or to use approximations that reduce the
computational complexity of the model.

UNIT-5

Neural networks are artificial systems that were inspired by biological neural
networks. These systems learn to perform tasks by being exposed to various
datasets and examples without any task-specific rules. The idea is that the
system generates identifying characteristics from the data they have been
passed without being programmed with a pre-programmed understanding of
these datasets. Neural networks are based on computational models for
threshold logic. Threshold logic is a combination of algorithms and
mathematics. Neural networks are based either on the study of the brain or on
the application of neural networks to artificial intelligence. The work has led to
improvements in finite automata theory. The components of a typical neural
network are neurons, connections (known as synapses), weights, biases, a
propagation function, and a learning rule. Each neuron receives inputs from
predecessor neurons and has an activation, a threshold, an activation function
f, and an output function. Connections consist of weights and biases, which
govern how one neuron transfers output to another, while the propagation
function computes each neuron's input as the weighted sum of the outputs of
its predecessor neurons.
The learning of a neural network basically refers to the adjustment of its free
parameters, i.e. the weights and biases. There are basically three stages in
the learning process:
1. The neural network is stimulated by a new environment.
2. The free parameters of the neural network are changed as a result of this
stimulation.
3. The neural network then responds in a new way to the environment because
of the changes in its free parameters.
Types of Neural Networks

There are seven common types of neural networks:
 Multilayer Perceptron (MLP): A type of feedforward neural network with three
or more layers, including an input layer, one or more hidden layers, and an
output layer. It uses nonlinear activation functions.
 Convolutional Neural Network (CNN): A neural network that is designed to
process input data that has a grid-like structure, such as an image. It uses
convolutional layers and pooling layers to extract features from the input
data.
 Recursive Neural Network (RNN): A neural network that can operate on
input sequences of variable length, such as text. It uses weights to make
structured predictions.
 Recurrent Neural Network (RNN): A type of neural network that makes
connections between the neurons in a directed cycle, allowing it to process
sequential data.
 Long Short-Term Memory (LSTM): A type of RNN that is designed to
overcome the vanishing gradient problem in training RNNs. It uses memory
cells and gates to selectively read, write, and erase information.
 Sequence-to-Sequence (Seq2Seq): A type of neural network that uses two
RNNs to map input sequences to output sequences, such as translating one
language to another.
 Shallow Neural Network: A neural network with only one hidden layer, often
used for simpler tasks or as a building block for larger networks.

Perceptron in Machine Learning


In Machine Learning and Artificial Intelligence, Perceptron is the most
commonly used term for all folks. It is the primary step to learn Machine
Learning and Deep Learning technologies, which consists of a set of weights,
input values or scores, and a threshold. Perceptron is a building block of
an Artificial Neural Network. In the mid-20th century (1957), Mr.
Frank Rosenblatt invented the Perceptron for performing certain
calculations to detect capabilities in input data or business intelligence.
Perceptron is a linear Machine Learning algorithm used for supervised
learning for various binary classifiers. This algorithm enables neurons to
learn elements and processes them one by one during preparation. In this
tutorial, "Perceptron in Machine Learning," we will discuss in-depth
knowledge of Perceptron and its basic functions in brief. Let's start with the
basic introduction of Perceptron.

What is the Perceptron model in Machine Learning?

Perceptron is a machine learning algorithm for the supervised learning of
various binary classification tasks. Further, a Perceptron can also be
understood as an artificial neuron, or neural network unit, that helps to
detect certain input data computations in business intelligence.

Perceptron model is also treated as one of the best and simplest types of
Artificial Neural networks. However, it is a supervised learning algorithm of
binary classifiers. Hence, we can consider it as a single-layer neural network
with four main parameters, i.e., input values, weights and Bias, net
sum, and an activation function.

What is a Binary Classifier in Machine Learning?

In machine learning, a binary classifier is a function that decides whether
input data, represented as a vector of numbers, belongs to one of two
specific classes.

Binary classifiers can be considered linear classifiers. In simple words, a
binary classifier is a classification algorithm that predicts using a linear
predictor function, in terms of weights and feature vectors.

Basic Components of Perceptron


Mr. Frank Rosenblatt invented the perceptron model as a binary classifier
which contains three main components. These are as follows:

o Input Nodes or Input Layer:


This is the primary component of Perceptron which accepts the initial data
into the system for further processing. Each input node contains a real
numerical value.

o Weight and Bias:

The weight parameter represents the strength of the connection between
units, and it is another important Perceptron component: the weight is
directly proportional to the strength of the associated input neuron in
deciding the output. Further, the bias can be considered as the intercept
term in a linear equation.

o Activation Function:

These are the final and important components that help to determine
whether the neuron will fire or not. Activation Function can be considered
primarily as a step function.

Types of Activation functions:

o Sign function
o Step function, and
o Sigmoid function

The data scientist uses the activation function to take a subjective decision
based on various problem statements and forms the desired outputs.
Activation function may differ (e.g., Sign, Step, and Sigmoid) in perceptron
models by checking whether the learning process is slow or has vanishing or
exploding gradients.
How does Perceptron work?
In Machine Learning, Perceptron is considered as a single-layer neural
network that consists of four main parameters named input values (Input
nodes), weights and Bias, net sum, and an activation function. The
perceptron model begins with the multiplication of all input values and their
weights, then adds these values together to create the weighted sum. Then
this weighted sum is applied to the activation function 'f' to obtain the
desired output. This activation function is also known as the step
function and is represented by 'f'.

This step function or Activation function plays a vital role in ensuring that
output is mapped between required values (0,1) or (-1,1). It is important to
note that the weight of input is indicative of the strength of a node. Similarly,
an input's bias value gives the ability to shift the activation function curve up
or down.

Perceptron model works in two important steps as follows:

Step-1

In the first step first, multiply all input values with corresponding weight
values and then add them to determine the weighted sum. Mathematically,
we can calculate the weighted sum as follows:

∑ wi*xi = x1*w1 + x2*w2 + … + xn*wn

Add a special term called bias 'b' to this weighted sum to improve the
model's performance.

∑wi*xi + b

Step-2

In the second step, an activation function is applied to the above-mentioned
weighted sum, which gives us an output either in binary form or as a
continuous value, as follows:

Y = f(∑wi*xi + b)
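Putting the two steps together in a small numpy sketch (the weights, bias and input values below are illustrative):

import numpy as np

def perceptron(x, w, b):
    weighted_sum = np.dot(w, x) + b      # Step 1: sum(wi * xi) + b
    return 1 if weighted_sum > 0 else 0  # Step 2: step activation function f

x = np.array([1.0, 0.5])   # input values
w = np.array([0.4, -0.2])  # weights
b = -0.1                   # bias

print(perceptron(x, w, b)) # 0.4 - 0.1 - 0.1 = 0.2 > 0, so the output is 1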

Types of Perceptron Models


Based on the layers, Perceptron models are divided into two types. These are
as follows:

1. Single-layer Perceptron Model


2. Multi-layer Perceptron model

Single Layer Perceptron Model:

This is one of the simplest types of artificial neural networks (ANNs). A
single-layered perceptron model consists of a feed-forward network and also
includes a threshold transfer function inside the model. The main objective
of the single-layer perceptron model is to analyze linearly separable objects
with binary outcomes.

In a single-layer perceptron model, the algorithm has no recorded data, so
it begins with randomly allocated initial weight parameters. It then sums up
all the weighted inputs; if the total sum exceeds a pre-determined value, the
model is activated and shows the output value as +1.

If the outcome matches the pre-determined (threshold) value, the
performance of the model is stated as satisfied, and the weights are not
changed. However, this model has a few discrepancies, triggered when
multiple weighted input values are fed into it. Hence, to obtain the desired
output and minimize errors, some changes to the input weights may be
necessary.

"Single-layer perceptrons can learn only linearly separable patterns."

Multi-Layered Perceptron Model:
Like a single-layer perceptron model, a multi-layer perceptron model has
the same basic structure but with a greater number of hidden layers.

The multi-layer perceptron model is also associated with the backpropagation
algorithm, which executes in two stages as follows:

o Forward Stage: activations propagate from the input layer through the
hidden layers and terminate at the output layer.
o Backward Stage: weight and bias values are modified as per the model's
requirement; the error between the actual and the desired output is
propagated backward from the output layer toward the input layer.

Hence, a multi-layered perceptron model can be considered as multiple
artificial neural network layers in which the activation function does not
remain linear, unlike in a single-layer perceptron model. Instead of linear,
the activation function can be sigmoid, TanH, ReLU, etc., for deployment.

A multi-layer perceptron model has greater processing power and can
process linear and non-linear patterns. Further, it can also implement logic
gates such as AND, OR, XOR, NAND, NOT, XNOR, and NOR.

Advantages of Multi-Layer Perceptron:

o A multi-layered perceptron model can be used to solve complex non-linear


problems.
o It works well with both small and large input data.
o It helps us to obtain quick predictions after the training.
o It helps to obtain the same accuracy ratio with large as well as small data.

Disadvantages of Multi-Layer Perceptron:

o In Multi-layer perceptron, computations are difficult and time-consuming.


o In multi-layer Perceptron, it is difficult to predict how much the dependent
variable affects each independent variable.
o The model functioning depends on the quality of the training.
Perceptron Function
The perceptron function f(x) is obtained by multiplying the input x with the
learned weight coefficient w.

Mathematically, we can express it as follows:

f(x) = 1 if w·x + b > 0
f(x) = 0 otherwise

o 'w' represents the real-valued weights vector
o 'b' represents the bias
o 'x' represents the vector of input values.

Characteristics of Perceptron
The perceptron model has the following characteristics.

1. Perceptron is a machine learning algorithm for supervised learning of binary


classifiers.
2. In Perceptron, the weight coefficient is automatically learned.
3. Initially, weights are multiplied with input features, and the decision is made
whether the neuron is fired or not.
4. The activation function applies a step rule to check whether the weight
function is greater than zero.
5. The linear decision boundary is drawn, enabling the distinction between the
two linearly separable classes +1 and -1.
6. If the added sum of all input values is more than the threshold value, it must
have an output signal; otherwise, no output will be shown.

Limitations of Perceptron Model


A perceptron model has limitations as follows:

o The output of a perceptron can only be a binary number (0 or 1) due to the


hard limit transfer function.
o Perceptron can only be used to classify the linearly separable sets of input
vectors. If input vectors are non-linear, it is not easy to classify them
properly.

Future of Perceptron
The future of the Perceptron model is bright and significant, as it helps to
interpret data by building intuitive patterns and applying them in the
future. Machine learning is a rapidly growing technology of Artificial
Intelligence that is continuously evolving; hence perceptron technology will
continue to support and facilitate analytical behavior in machines, which
will in turn add to the efficiency of computers.

The perceptron model is continuously becoming more advanced and working


efficiently on complex problems with the help of artificial neurons.

Feed Forward Process in Deep Neural Network

We now know how the combination of lines with different weights and
biases can result in non-linear models. How does a neural network know
what weight and bias values to have in each layer? It is no different from
how we did it for the single perceptron model.

We still make use of the gradient descent optimization algorithm, which
acts to minimize the error of our model by iteratively moving in the
direction of steepest descent, the direction that updates the parameters of
our model while ensuring minimal error. It updates the weights of every
model in every single layer. We will talk more about optimization algorithms
and backpropagation later.

It is important to recognize the subsequent training of our neural network.
Recognition is done by dividing our data samples through some decision
boundary.

"The process of receiving an input to produce some kind of output to make
some kind of prediction is known as Feed Forward." The feed-forward neural
network is the core of many other important neural networks, such as the
convolutional neural network.

In a feed-forward neural network, there are no feedback loops or
connections in the network; there is simply an input layer, a hidden layer,
and an output layer.
There can be multiple hidden layers, depending on what kind of data you
are dealing with. The number of hidden layers is known as the depth of the
neural network, and a deeper neural network can learn more complex
functions. The input layer first provides the neural network with data, and
the output layer then makes predictions on that data based on a series of
functions. ReLU is the most commonly used activation function in deep
neural networks.

To gain a solid understanding of the feed-forward process, let's see this
mathematically.

1) The first input is fed to the network, represented as a matrix x1, x2,
and 1, where 1 is the bias value.

2) Each input is multiplied by a weight with respect to the first and second
model to obtain its probability of being in the positive region in each
model. So we multiply our inputs by a matrix of weights using matrix
multiplication.

3) After that, we take the sigmoid of our scores, which gives us the
probability of the point being in the positive region in both models.

4) We multiply the probabilities obtained from the previous step with the
second set of weights. We always include a bias of one whenever taking a
combination of inputs.

Finally, to obtain the probability of the point being in the positive region of
this combined model, we take the sigmoid once more, thus producing our
final output in the feed-forward process.

Let's take the neural network we had previously, with the two linear models
in the hidden layer combining to form the non-linear model in the output
layer. We will use our non-linear model to produce an output that describes
the probability of the point being in the positive region. The point is
represented by (2, 2); along with the bias, we represent the input as (2, 2, 1).

Recall the first linear model in the hidden layer: to obtain its linear
combination, the inputs are multiplied by -4 and -1 and the bias value is
multiplied by twelve, giving -4*x1 - x2 + 12.

In the second model, the weights of the inputs are -1/5 and 1, and the bias
is multiplied by three, giving -x1/5 + x2 + 3 as the linear combination of
that same point.
Now, to obtain the probability that the point is in the positive region
relative to both models, we apply the sigmoid to both scores.

The second layer contains the weights which dictate the combination of the
linear models in the first layer into the non-linear model in the second
layer. The weights are 1.5 and 1, with a bias value of 0.5.

Now we multiply our probabilities from the first layer with the second set
of weights, add the bias, and take the sigmoid of the final score.

That is the complete math behind the feed-forward process, where the
inputs traverse the entire depth of the neural network. In this example
there is only one hidden layer, but whether there is one hidden layer or
twenty, the computational process is the same for all hidden layers.
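The same example can be checked in a few lines of numpy, using the weights stated above and the input point (2, 2):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([2.0, 2.0])

# Hidden layer: the two linear models
z1 = -4.0 * x[0] - 1.0 * x[1] + 12.0  # -8 - 2 + 12 = 2
z2 = -0.2 * x[0] + 1.0 * x[1] + 3.0   # -0.4 + 2 + 3 = 4.6
h = sigmoid(np.array([z1, z2]))       # probability from each linear model

# Output layer: combine the two models into a non-linear model
z_out = 1.5 * h[0] + 1.0 * h[1] + 0.5
print(sigmoid(z_out))                 # final probability, about 0.94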
Backpropagation Process in Deep Neural
Network
Backpropagation is one of the important concepts of a neural network. Our
task is to classify our data as well as possible. For this, we have to update
the weights and biases, but how can we do that in a deep neural network?
In the linear regression model, we use gradient descent to optimize the
parameters; similarly, here we also use the gradient descent algorithm, via
backpropagation.

For a single training example, the backpropagation algorithm calculates the
gradient of the error function. Backpropagation can be written as a
function of the neural network. Backpropagation algorithms are a set of
methods used to efficiently train artificial neural networks following a
gradient descent approach that exploits the chain rule.

The main features of backpropagation are that it is an iterative, recursive,
and efficient method for calculating the updated weights to improve the
network until it is able to perform the task for which it is being trained.
Backpropagation requires the derivatives of the activation function to be
known at network design time.

Now, how is the error function used in backpropagation, and how does
backpropagation work? Let's start with an example and do the math to
understand exactly how the weights are updated using backpropagation.
Input values
X1=0.05
X2=0.10

Initial weight
W1=0.15 w5=0.40
W2=0.20 w6=0.45
W3=0.25 w7=0.50
W4=0.30 w8=0.55

Bias Values
b1=0.35 b2=0.60

Target Values
T1=0.01
T2=0.99

Now, we first calculate the values of H1 and H2 by a forward pass.

Forward Pass
To find the value of H1, we first multiply the input values by the weights and add the bias:

H1=x1×w1+x2×w2+b1
H1=0.05×0.15+0.10×0.20+0.35
H1=0.3775

To calculate the final result of H1, we apply the sigmoid function:

out_H1 = 1/(1+e^-H1) = 1/(1+e^-0.3775) = 0.593269992

We will calculate the value of H2 in the same way as H1:
H2=x1×w3+x2×w4+b1
H2=0.05×0.25+0.10×0.30+0.35
H2=0.3925

To calculate the final result of H2, we apply the sigmoid function:

out_H2 = 1/(1+e^-H2) = 1/(1+e^-0.3925) = 0.596884378

Now, we calculate the values of y1 and y2 in the same way as we calculated
H1 and H2.

To find the value of y1, we first multiply the input values, i.e., the
outcomes of H1 and H2, by the corresponding weights:

y1=H1×w5+H2×w6+b2
y1=0.593269992×0.40+0.596884378×0.45+0.60
y1=1.10590597

To calculate the final result of y1, we apply the sigmoid function:

out_y1 = 1/(1+e^-y1) = 1/(1+e^-1.10590597) = 0.75136507
We will calculate the value of y2 in the same way as y1

y2=H1×w7+H2×w8+b2
y2=0.593269992×0.50+0.596884378×0.55+0.60
y2=1.2249214
To calculate the final result of y2, we apply the sigmoid function:

out_y2 = 1/(1+e^-y2) = 1/(1+e^-1.2249214) = 0.772928465

Our target values are 0.01 and 0.99. Our y1 and y2 values do not match the
target values T1 and T2.

Now, we will find the total error, which is simply the sum of the squared
differences between the outputs and the target outputs. The total error is
calculated as

E_total = Σ 1/2 (target - output)^2

So, the total error is

E_total = 1/2 (0.01 - 0.75136507)^2 + 1/2 (0.99 - 0.772928465)^2
        = 0.274811083 + 0.023560026 = 0.298371109
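For readers who want to verify these numbers, here is a minimal NumPy sketch (our own code and variable names) that reproduces the forward pass and the total error:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
T1, T2 = 0.01, 0.99

# Hidden layer
out_h1 = sigmoid(x1 * w1 + x2 * w2 + b1)  # 0.593269992
out_h2 = sigmoid(x1 * w3 + x2 * w4 + b1)  # 0.596884378

# Output layer
out_y1 = sigmoid(out_h1 * w5 + out_h2 * w6 + b2)  # 0.75136507
out_y2 = sigmoid(out_h1 * w7 + out_h2 * w8 + b2)  # 0.772928465

# Total squared error
E_total = 0.5 * (T1 - out_y1) ** 2 + 0.5 * (T2 - out_y2) ** 2
print(E_total)  # 0.298371109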

Activation Functions and Loss Functions
When one starts to develop their own neural networks, it is
easy to get overwhelmed by the wide variety of options
available for each parameter of the model. Which activation
function should be used for each hidden layer? Which activation
function for the output layer? When should one use Binary Cross
Entropy vs. Categorical Cross Entropy?

Such questions will keep coming up until we have a firm
understanding of what each option does, its pros and cons,
and when one should use it. That is exactly the purpose of
this blog. We will go through the key features of popular
Activation Functions and Loss Functions and understand when
one should use which. In case you need a refresher on how
neural networks work or what an activation or loss function
is, please refer to this blog. So without any further delay,
let's dive in!

Activation Functions
The activation function of a neuron defines its output given
its inputs. We will be talking about 4 popular activation
functions; a short code sketch of all four follows the list:

1. Sigmoid Function:

Description: Takes a real-valued number and scales it
between 0 and 1. Large negative numbers become 0 and
large positive numbers become 1.
Formula: 1 / (1 + e^-x)
Range: (0,1)
Pros: As its range is between 0 and 1, it is ideal for
situations where we need to predict the probability of an
event as an output.
Cons: The gradient values are significant in the range -3 to 3
but become much closer to zero beyond this range, which
almost kills the impact of the neuron on the final output.
Also, sigmoid outputs are not zero-centered (the output is
centered around 0.5), which leads to undesirable zig-zagging
dynamics in the gradient updates for the weights.
Plot:

2. Tanh Function:

Description: Similar to sigmoid, but takes a real-valued
number and scales it between -1 and 1. It is better than
sigmoid as it is centered around 0, which leads to better
convergence.
Formula: (e^x - e^-x) / (e^x + e^-x)
Range: (-1,1)
Pros: The derivatives of tanh are larger than the
derivatives of the sigmoid, which helps us minimize the cost
function faster.
Cons: Similar to sigmoid, the gradient values become close
to zero for a wide range of values (this is known as the
vanishing gradient problem). Thus, the network refuses to
learn or keeps learning at a very small rate.
Plot:

3. Softmax Function:

Description: The Softmax function can be imagined as a
combination of multiple sigmoids; it returns the probability
of a datapoint belonging to each individual class in a
multiclass classification problem.
Formula: softmax(xi) = e^xi / Σj e^xj
Range: (0,1), sum of outputs = 1
Pros: Can handle multiple classes and gives the probability of
belonging to each class.
Cons: Should not be used in hidden layers, as we want the
neurons to be independent. If we apply it, then they will be
linearly dependent.

Plot: Not Applicable

4. ReLU Function:

Description: The rectified linear activation function, or ReLU
for short, is a piecewise linear function that outputs the
input directly if it is positive; otherwise, it outputs zero.
This is the default behavior, but modifying the default
parameters allows us to use non-zero thresholds and to use a
non-zero multiple of the input for values below the threshold
(called Leaky ReLU).
Formula: max(0, x)
Range: (0, inf)
Pros: Although ReLU looks and acts like a linear function, it
is a nonlinear function, allowing complex relationships to be
learned, and it allows learning through all the hidden layers
in a deep network by having large derivatives.
Cons: It should not be used in the final output layer for
either classification or regression tasks.
Plot:
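Here is a minimal NumPy sketch of all four activation functions, written directly from the formulas above (our own illustration):

import numpy as np

def sigmoid(x):
    # 1 / (1 + e^-x), range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # (e^x - e^-x) / (e^x + e^-x), range (-1, 1)
    return np.tanh(x)

def softmax(x):
    # e^xi / sum_j(e^xj); subtracting max(x) first is a standard
    # numerical-stability trick and does not change the result
    e = np.exp(x - np.max(x))
    return e / e.sum()

def relu(x):
    # max(0, x), range (0, inf)
    return np.maximum(0.0, x)

scores = np.array([-2.0, 0.0, 3.0])
print(sigmoid(scores))  # [0.119 0.5   0.953]
print(tanh(scores))     # [-0.964 0.    0.995]
print(softmax(scores))  # [0.006 0.047 0.946], sums to 1
print(relu(scores))     # [0. 0. 3.]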

Loss Functions
The other key aspect of setting up the neural network
infrastructure is selecting the right loss functions. With
neural networks, we seek to minimize the error (the difference
between the actual and predicted value), which is calculated
by the loss function. We will be discussing 3 popular loss
functions; a short code sketch of all three follows the list:

1. Mean Squared Error, L2 Loss


Description: MSE loss is used for regression tasks. As the
name suggests, this loss is calculated by taking the mean of
the squared differences between the actual (target) and
predicted values.
Formula: MSE = (1/m) Σ (y - ŷ)²

Range: (0, inf)
Pros: Preferred loss function if the distribution of the target
variable is Gaussian, as it has good derivatives and helps the
model converge quickly.
Cons: Is not robust to outliers in the data (unlike loss
functions like Mean Absolute Error) and penalizes large errors
quadratically (unlike loss functions like Mean Squared
Logarithmic Error Loss).

2. Binary Cross Entropy

Description: BCE loss is the default loss function used for
binary classification tasks. It requires one output node to
classify the data into two classes, and the range of the
output is (0-1), i.e., it should use the sigmoid activation
function.
Formula: BCE = -(1/m) Σ [y·log(ŷ) + (1 - y)·log(1 - ŷ)]

where y is the actual label, ŷ is the classifier's predicted
probability for the positive class, and m is the number of
records.
Range: (0, inf)
Pros: The continuous nature of the loss function helps the
training process converge well.
Cons: Can only be used with the sigmoid activation function.
Other loss functions like Hinge or Squared Hinge Loss can
work with the tanh activation function.

3. Categorical Cross Entropy

Description: It is the default loss function when we have a
multi-class classification task. It requires the same number
of output nodes as there are classes, with the final layer
going through a softmax activation so that each output node
has a probability value between (0-1).
Formula: CCE = -(1/m) Σ Σj yj·log(pj)
where y is the actual label and p is the classifier's predicted
probability distribution for predicting class j.
Range: (0, inf)
Pros: Similar to Binary Cross Entropy, the continuous nature
of the loss function helps the training process converge well.
Cons: May require a one-hot encoded vector with many zero
values if there are many classes, requiring significant memory
(Sparse Categorical Crossentropy should be used in this case).
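Here is a minimal NumPy sketch of the three losses, written directly from the formulas above (our own illustration; it assumes the predictions are already valid probabilities where required):

import numpy as np

def mse(y_true, y_pred):
    # mean of squared differences between target and prediction
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred):
    # y_true in {0, 1}; y_pred are sigmoid outputs in (0, 1)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred):
    # y_true is one-hot (m x classes); y_pred are softmax outputs
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

print(mse(np.array([1.0, 2.0]), np.array([1.5, 1.0])))         # 0.625
print(binary_cross_entropy(np.array([1, 0]),
                           np.array([0.9, 0.2])))              # ~0.164
print(categorical_cross_entropy(np.array([[0, 1, 0]]),
                                np.array([[0.1, 0.8, 0.1]])))  # ~0.223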
There are a few other activation functions, like softplus and
ELU, as well as other loss functions, like hinge and Huber,
available in deep learning packages/libraries. I would
definitely encourage interested readers to go through these
other functions and try them out (especially if the ones
discussed here do not yield the right results).
Limitations of Machine Learning
Machine learning, a method that enables computers to learn from data and make
predictions or judgments without being explicitly programmed, has grown in
popularity in artificial intelligence (AI). Machine learning has its limitations,
just like any other technology, and these must be considered before using it in
practical situations. The main limitations of machine learning that every data
scientist, researcher, and engineer should be aware of are covered in this article.

1. Lack of Transparency and Interpretability


One of machine learning's main drawbacks is its lack of transparency and
interpretability. As they don't reveal how a judgment was made or how it came to
be, machine learning algorithms are frequently called "black boxes." This makes
it challenging to comprehend how a certain model reached its conclusion and can
be problematic when explanations are required. For instance, in healthcare,
understanding the reasoning behind a particular diagnosis is difficult without
transparency and interpretability. This lack of transparency and interpretability
is a critical drawback of machine learning algorithms that can have substantial
ramifications in practical applications.
Transparency and interpretability can be increased by providing a more thorough
description of the decision-making process through explanations. Natural language
explanations and decision trees are two examples of available explanation formats.
Natural language explanations offer a human-readable description of the
decision-making process, making it simpler for non-experts to comprehend. A visual
representation of the decision-making process, such as a decision tree, can
likewise increase transparency and interpretability.

2. Bias and Discrimination


The possibility of bias and discrimination is a significant flaw in machine
learning. Machine learning systems are trained on large datasets, which may
contain biases. If these biases are not addressed, the machine learning system
may reinforce them, producing biased results.
The algorithms used in facial recognition are one instance of bias in machine
learning. According to research, facial recognition software performs worse on
people with darker skin tones, which causes false positive and false negative
rates to be higher for those groups. This bias may have significant consequences,
particularly in law enforcement and security applications, where false positives
may result in unjustified arrests or other undesirable results.
Finally, it is critical to understand that biases and discrimination in machine learning
algorithms frequently emerge from larger social and cultural biases. To address these
biases, there has to be a larger push for inclusion and diversity in the design and use of
machine learning algorithms.

3. Overfitting and Underfitting


Machine learning algorithms frequently have two limitations: overfitting and
underfitting. Overfitting is a condition where a machine learning model performs
poorly on new, unknown data because it is too complex and has been fit too
closely to the training data. On the other side, underfitting happens when a
machine learning model is overly simplistic and unable to recognize the
underlying patterns in the data, resulting in subpar performance on both the
training data and fresh data.
Regularization, cross-validation, and ensemble approaches are examples of
techniques that can be used to alleviate overfitting and underfitting. When a model is
regularised, a penalty term is added to the loss function to prevent the model from
growing too complex. Cross-validation includes splitting the data into training and
validation sets so that the model's performance can be assessed and its
hyperparameters can be adjusted. To enhance performance, ensemble approaches
combine several models.
Overfitting and underfitting are frequent problems while developing predictive
models using machine learning. When a model is overtrained and excessively
sophisticated relative to a small dataset, overfitting occurs, which results in
good performance on training data but poor generalization to new data.
Conversely, underfitting occurs when a model is not complex enough to adequately
represent the underlying relationships in the data, resulting in subpar
performance on both training and test data. Using regularisation methods like L1
and L2 regularisation is one way to prevent overfitting. During regularisation,
the objective function receives a penalty term that restricts the magnitude of
the model's parameters. Another method is early stopping, in which training is
halted when the model's performance on a validation set stops improving.
Cross-validation is a common method for assessing a machine learning model's
performance and fine-tuning its hyperparameters. The dataset is divided into
folds, and the model is trained and tested on each fold. This helps prevent
overfitting and gives a more precise estimate of the model's performance.
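As a small illustration of these remedies, the sketch below uses scikit-learn (assuming it is installed) to fit an L2-regularised linear model and score it with 5-fold cross-validation; the data and every number in it are synthetic and purely illustrative:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data: 100 samples, 5 features, noisy linear target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=100)

# Ridge adds an L2 penalty (controlled by alpha) to the loss,
# discouraging large weights and hence overly complex fits
model = Ridge(alpha=1.0)

# 5-fold cross-validation: train on 4 folds, validate on the 5th, rotate
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean())  # average validation R^2 across the folds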

4. Limited Data Availability


A major challenge for machine learning is limited data availability. Machine
learning algorithms need a lot of data to learn and produce precise predictions.
However, in many fields there may be little data available, or access to it may
be restricted. Due to privacy considerations, it can be difficult to obtain
medical data, while data from sporadic events, such as natural catastrophes, may
be limited in scope.
Researchers are looking into novel techniques for creating synthetic data that may be
used to supplement small datasets to address this constraint. To expand the amount of
data accessible for training machine learning algorithms, efforts are also being made to
enhance data sharing and collaboration across enterprises.
Limited data availability is a major obstacle to machine learning. Addressing
this restriction will require a concerted effort across industries and disciplines
to improve data collection, sharing, and augmentation, in order to ensure that
machine learning algorithms can continue to be helpful in a variety of applications.

5. Computational Resources
Machine learning algorithms can be computationally expensive and may require
significant resources to train successfully. This can be a major barrier,
particularly for individuals or smaller companies without access to
high-performance computing resources. Distributed and cloud computing can be used
to get around this restriction; however, the project's cost might go up.
For huge datasets and complex models, machine learning approaches can be
computationally expensive. The scalability and feasibility of machine learning
algorithms may be hampered by the need for significant processing resources. The
availability of computational resources like processor speed, memory, and storage is
another limitation on machine learning.
Using cloud computing is one way to overcome the computational resource barrier.
Cloud computing platforms like Amazon Web Services (AWS) and Microsoft Azure
offer on-demand access to computing resources, letting users scale their use up
or down according to their demands. This can greatly decrease the cost and
difficulty of maintaining computational resources.
To lower the computing demands, optimizing the data preprocessing pipelines and
machine learning algorithms is crucial. This may entail the use of more effective
algorithms, a decrease in the data's dimensionality, and the removal of pointless or
redundant information.

6. Lack of Causality
Predictions based on correlations in the data are frequently made using machine
learning algorithms. Machine learning algorithms may not shed light on the underlying
causal links in the data because correlation does not always imply causation. This may
reduce our capacity for precise prediction when causality is crucial.
The absence of causation is one of machine learning's main drawbacks. The main
purpose of machine learning algorithms is to find patterns and correlations in data;
however, they cannot establish causal links between different variables. In other words,
machine learning models can forecast future events based on seen data, but they
cannot explain why such events occur.
The absence of causality is a major drawback of using machine learning models to
make judgments. For instance, if a machine learning model is used to forecast the
likelihood that a consumer will buy a product, it may find factors like age,
income, and gender that are correlated with buying behavior. The model, however,
is unable to determine whether these variables are the cause of the buying
behavior or whether there are further underlying causes.
To get over this restriction, machine learning may need to be integrated with other
methodologies like experimental design. Researchers can identify causal relationships
by manipulating variables and observing how those changes impact a result using an
experimental design. However, compared to traditional machine learning techniques,
this approach may require more time and resources.
Machine learning can be a useful tool for predicting outcomes from observable data,
but it's crucial to be aware of its limitations when making decisions based on these
predictions. The lack of causation is a basic flaw in machine learning systems. To
establish causation, it could be necessary to use methods other than machine learning.

7. Ethical Considerations
Machine learning models can have major social, ethical, and legal repercussions when
used to make judgments that affect people's lives. Machine learning models, for
instance, may have a differential effect on groups of individuals when used to make
employment or lending choices. Privacy, security, and data ownership must also be
addressed when adopting machine learning models.
The ethical issue of bias and discrimination is a major one. If the training data is biased
or the algorithms are not created in a fair and inclusive manner, biases and
discrimination in society may be perpetuated and even amplified by machine learning
algorithms.
Another important ethical factor is privacy. Machine learning algorithms can collect and
process large amounts of personal data, which raises questions about how that data is
utilized and safeguarded.
Accountability and transparency are also crucial ethical factors. It is essential to ensure
that machine learning algorithms are visible and understandable and that systems are
in place to hold the creators and users of these algorithms responsible for their actions.
Finally, there are ethical issues around how machine learning will affect society. More
sophisticated machine learning algorithms may have far-reaching social, economic, and
political repercussions that require careful analysis and regulation.
Deep Learning
Deep learning is the branch of machine learning which is based on artificial
neural network architecture. An artificial neural network or ANN uses layers of
interconnected nodes called neurons that work together to process and learn
from the input data.
In a fully connected Deep neural network, there is an input layer and one or
more hidden layers connected one after the other. Each neuron receives input
from the previous layer neurons or the input layer. The output of one neuron
becomes the input to other neurons in the next layer of the network, and this
process continues until the final layer produces the output of the network. The
layers of the neural network transform the input data through a series of
nonlinear transformations, allowing the network to learn complex
representations of the input data.

Today, deep learning has become one of the most popular and visible areas of
machine learning, due to its success in a variety of applications, such as
computer vision, natural language processing, and reinforcement learning.
Deep learning can be used for supervised, unsupervised, as well as
reinforcement machine learning, and it uses a variety of ways to process data
in each setting.
 Supervised Machine Learning: Supervised machine learning is
the machine learning technique in which the neural network learns to make
predictions or classify data based on labeled datasets. Here we input the
input features along with the target variables. The neural network learns to
make predictions based on the cost or error that comes from the difference
between the predicted and the actual target; this process is known as
backpropagation. Deep learning algorithms like Convolutional neural networks
and Recurrent neural networks are used for many supervised tasks like image
classification and recognition, sentiment analysis, language translation, etc.
 Unsupervised Machine Learning: Unsupervised machine learning is
the machine learning technique in which the neural network learns to
discover patterns or to cluster the dataset based on unlabeled datasets.
Here there are no target variables; the machine has to determine the hidden
patterns or relationships within the datasets on its own. Deep learning
algorithms like autoencoders and generative models are used for
unsupervised tasks like clustering, dimensionality reduction, and anomaly
detection.
 Reinforcement Machine Learning: Reinforcement machine learning is
the machine learning technique in which an agent learns to make decisions
in an environment to maximize a reward signal. The agent interacts with the
environment by taking actions and observing the resulting rewards. Deep
learning can be used to learn policies, or sets of actions, that maximize the
cumulative reward over time. Deep reinforcement learning algorithms like
Deep Q Networks and Deep Deterministic Policy Gradient (DDPG) are used
for reinforcement tasks like robotics, game playing, etc.
Convolution Neural Network
A Convolutional Neural Network (CNN) is a type of Deep Learning
neural network architecture commonly used in Computer Vision.
Computer vision is a field of Artificial Intelligence that enables a
computer to understand and interpret the image or visual data.
When it comes to Machine Learning, Artificial Neural Networks perform
really well. Neural Networks are used on various kinds of data, like images,
audio, and text. Different types of Neural Networks are used for
different purposes: for example, for predicting the sequence of words
we use Recurrent Neural Networks (more precisely, an LSTM);
similarly, for image classification we use Convolutional Neural Networks.
In this blog, we are going to build a basic building block for CNN.
In a regular Neural Network there are three types of layers:
1. Input Layers: It’s the layer in which we give input to our model.
The number of neurons in this layer is equal to the total number of
features in our data (number of pixels in the case of an image).
2. Hidden Layer: The input from the input layer is then fed into the
hidden layer. There can be many hidden layers, depending upon our
model and data size. Each hidden layer can have a different number
of neurons, which is generally greater than the number of features.
The output of each layer is computed by matrix multiplication of the
output of the previous layer with the learnable weights of that layer,
followed by the addition of learnable biases and an activation
function, which makes the network nonlinear.
3. Output Layer: The output from the hidden layer is then fed into a
logistic function like sigmoid or softmax which converts the output
of each class into the probability score of each class.
The data is fed into the model, and the output obtained from each layer in
the above steps is called feedforward. We then calculate the error using an
error function; some common error functions are cross-entropy, squared loss
error, etc. The error function measures how well the network is performing.
After that, we backpropagate into the model by calculating the derivatives.
This step is called Backpropagation, and it is basically used to minimize
the loss.
Convolution Neural Network
Convolutional Neural Network (CNN) is an extended version of artificial
neural networks (ANN), predominantly used to extract features from
grid-like matrix datasets, for example visual datasets like images or
videos, where data patterns play an extensive role.

CNN architecture

A Convolutional Neural Network consists of multiple layers, like the input
layer, Convolutional layer, Pooling layer, and fully connected layers.

Simple CNN architecture

The Convolutional layer applies filters to the input image to extract
features, the Pooling layer downsamples the image to reduce
computation, and the fully connected layer makes the final prediction.
The network learns the optimal filters through backpropagation and
gradient descent.
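As a minimal sketch, assuming TensorFlow/Keras is available, such a stack could be written as follows (the input shape, filter counts, and class count are our own illustrative choices, not part of the original text):

from tensorflow.keras import layers, models

# Convolution -> pooling -> fully connected, as described above
model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),                      # e.g. 32x32 RGB images
    layers.Conv2D(16, kernel_size=3, activation="relu"),  # feature extraction
    layers.MaxPooling2D(pool_size=2),                     # downsampling
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),               # prediction over 10 classes
])

model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()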
How Convolutional Layers work

Convolutional Neural Networks, or convnets, are neural networks that share
their parameters. Imagine you have an image. It can be represented as
a cuboid having a length and width (the dimensions of the image) and a
height (i.e., the channels, as images generally have red, green, and blue
channels).

Now imagine taking a small patch of this image and running a small
neural network, called a filter or kernel, on it, with, say, K outputs,
and representing them vertically. Now slide that neural network across the
whole image; as a result, we will get another image with a different
width, height, and depth. Instead of just the R, G, and B channels, we
now have more channels, but smaller width and height. This operation is
called Convolution. If the patch size were the same as that of the image,
it would be a regular neural network. Because of this small patch, we
have fewer weights.

Image source: Deep Learning Udacity

Now let’s talk about a bit of mathematics that is involved in the whole
convolution process.
 Convolution layers consist of a set of learnable filters (or kernels)
having small widths and heights and the same depth as that of input
volume (3 if the input layer is image input).
 For example, if we have to run convolution on an image with
dimensions 34x34x3, the possible size of the filters can be a×a×3,
where 'a' can be anything like 3, 5, or 7, but smaller than the
image dimension.
 During the forward pass, we slide each filter across the whole input
volume step by step where each step is called stride (which can
have a value of 2, 3, or even 4 for high-dimensional images) and
compute the dot product between the kernel weights and patch
from input volume.
 As we slide our filters, we'll get a 2-D output for each filter; we'll
stack them together, and as a result we'll get an output volume having a
depth equal to the number of filters. The network will learn all the
filters (see the sketch after this list).
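Here is a minimal NumPy sketch of this sliding-window dot product for a single filter, with stride 1 and no padding (our own illustration):

import numpy as np

def convolve2d(image, kernel, stride=1):
    # image: H x W x C, kernel: a x a x C -> output: a 2-D feature map
    a = kernel.shape[0]
    out_h = (image.shape[0] - a) // stride + 1
    out_w = (image.shape[1] - a) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+a, j*stride:j*stride+a, :]
            out[i, j] = np.sum(patch * kernel)  # dot product with the filter
    return out

image = np.random.rand(34, 34, 3)  # the 34x34x3 example above
kernel = np.random.rand(5, 5, 3)   # one 5x5x3 filter
print(convolve2d(image, kernel).shape)  # (30, 30); stacking K filters gives 30x30xK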
Recurrent Neural Network (RNN)
Recurrent Neural Network (RNN) is a type of Neural Network where the output
from the previous step is fed as input to the current step. In traditional neural
networks, all the inputs and outputs are independent of each other, but in cases
when it is required to predict the next word of a sentence, the previous words
are required and hence there is a need to remember the previous words. Thus
RNN came into existence, which solved this issue with the help of a Hidden
Layer. The main and most important feature of RNN is its Hidden state, which
remembers some information about a sequence. The state is also referred to
as Memory State since it remembers the previous input to the network. It uses
the same parameters for each input as it performs the same task on all the
inputs or hidden layers to produce the output. This reduces the complexity of
parameters, unlike other neural networks.

Architecture Of Recurrent Neural Network


RNNs have the same input and output architecture as any other deep neural
architecture. However, differences arise in the way information flows from input
to output. Unlike deep neural networks, where we have a different weight matrix
for each dense layer, in an RNN the weights across the network remain the same.
The network calculates the hidden state hi for every input xi using the
following formulas:

hi = σ(U·xi + W·hi-1 + B)
yi = O(V·hi + C)

Hence yi = f(xi, hi-1, W, U, V, B, C)

Here S is the state matrix, which has element si as the state of the network at
timestep i.
The parameters in the network are W, U, V, B, C, which are shared across
timesteps.

How RNN works

The Recurrent Neural Network consists of multiple fixed activation function
units, one for each time step. Each unit has an internal state, which is called the
hidden state of the unit. This hidden state signifies the past knowledge that the
network currently holds at a given time step. This hidden state is updated at
every time step to signify the change in the knowledge of the network about the
past. The hidden state is updated using the following recurrence relation:-
The formula for calculating the current state:

ht = f(ht-1, xt)

where:
ht -> current state
ht-1 -> previous state
xt -> input state

Formula for applying the activation function (tanh):

ht = tanh(whh·ht-1 + wxh·xt)

where:
whh -> weight at the recurrent neuron
wxh -> weight at the input neuron

The formula for calculating the output:

yt = why·ht

where:
yt -> output
why -> weight at the output layer
These parameters are updated using Backpropagation. However, since an RNN
works on sequential data, we use an updated form of backpropagation known as
Backpropagation Through Time.
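A minimal NumPy sketch of this recurrence, with one set of weights shared across every time step (our own illustration; the sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 4, 8, 3

# One set of weights, shared across all time steps
w_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
w_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
w_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden -> output

h = np.zeros(hidden_size)              # h0: constant starting state
xs = rng.normal(size=(5, input_size))  # a sequence of 5 input vectors

for x_t in xs:
    h = np.tanh(w_hh @ h + w_xh @ x_t)  # ht = tanh(whh*ht-1 + wxh*xt)
y = w_hy @ h                            # yt = why*ht
print(y.shape)  # (3,)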
Backpropagation Through Time (BPTT)
In an RNN, the network is ordered: each variable is computed one at a time, in
a specified order, e.g., first h1, then h2, then h3, and so on. Hence, we apply
backpropagation through all these hidden time states sequentially.


L(θ) (the loss function) depends on h3;
h3 in turn depends on h2 and W;
h2 in turn depends on h1 and W;
h1 in turn depends on h0 and W,
where h0 is a constant starting state.

For simplicity, we will apply backpropagation to only one term of this
equation.

We already know how to compute the part of this derivative that behaves like a
simple deep neural network's backpropagation. However, we will see how to apply
backpropagation to the term ∂h3/∂W.

As we know, h3 = σ(W·h2 + b).

In such an ordered network, we can't compute ∂h3/∂W by simply treating h2 as a
constant, because h2 also depends on W. The total derivative has two parts:

1. Explicit: treating all other inputs as constant


2. Implicit: summing over all indirect paths from h3 to W
Let us see how to do this. This algorithm is called Backpropagation Through
Time (BPTT), as we backpropagate over all previous time steps.

Training through RNN

1. A single time step of the input is provided to the network.
2. Then we calculate its current state using the current input and the previous
state.
3. The current ht becomes ht-1 for the next time step.
4. One can go as many time steps as the problem demands and join the
information from all the previous states.
5. Once all the time steps are completed, the final current state is used to
calculate the output.
6. The output is then compared to the actual output, i.e., the target output,
and the error is generated.
7. The error is then back-propagated through the network to update the weights,
and hence the network (RNN) is trained using Backpropagation Through Time
(a code sketch of this loop follows the list).
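The steps above map almost line-for-line onto code. The sketch below (our own illustration, reusing the recurrence from the earlier snippet) shows the forward pass through time and where the error is formed; the BPTT gradient computation itself is left to an autodiff framework in practice:

import numpy as np

rng = np.random.default_rng(1)
input_size, hidden_size = 4, 8

w_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
w_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
w_hy = rng.normal(scale=0.1, size=(1, hidden_size))

xs = rng.normal(size=(6, input_size))  # steps 1-4: one time step at a time
target = 0.5

h = np.zeros(hidden_size)
states = [h]
for x_t in xs:
    h = np.tanh(w_hh @ h + w_xh @ x_t)  # step 2: state from current input + previous state
    states.append(h)                    # step 3: this ht becomes ht-1 at the next step

y = (w_hy @ states[-1])[0]       # step 5: final state produces the output
error = 0.5 * (target - y) ** 2  # step 6: compare with the target output
print(error)
# step 7: the error is backpropagated through every stored state in `states`
# (BPTT); in practice an autodiff library computes these gradients for us.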

Advantages of Recurrent Neural Network

1. An RNN remembers every piece of information through time. This is useful in
time-series prediction because of its ability to remember previous inputs as
well; the architecture designed for such long memory is called Long Short-Term
Memory.
2. Recurrent neural networks are even used with convolutional layers to extend
the effective pixel neighborhood.
Disadvantages of Recurrent Neural Network
1. Gradient vanishing and exploding problems.
2. Training an RNN is a very difficult task.
3. It cannot process very long sequences if using tanh or relu as an activation
function.
Applications of Recurrent Neural Network
1. Language Modelling and Generating Text
2. Speech Recognition
3. Machine Translation
4. Image Recognition, Face detection
5. Time series Forecasting
Types Of RNN
There are four types of RNNs based on the number of inputs and outputs in the
network.
1. One to One
2. One to Many
3. Many to One
4. Many to Many

One to One

This type of RNN behaves the same as any simple neural network; it is also
known as a Vanilla Neural Network. In this neural network, there is only one
input and one output.
One to One RNN

One To Many

In this type of RNN, there is one input and many outputs associated with it.
One of the most used examples of this network is image captioning, where,
given an image, we predict a sentence having multiple words.

One To Many RNN

Many to One

In this type of network, many inputs are fed to the network at several states
of the network, generating only one output. This type of network is used for
problems like sentiment analysis, where we give multiple words as input and
predict only the sentiment of the sentence as output.
Many to One RNN

Many to Many

In this type of neural network, there are multiple inputs and multiple outputs
corresponding to a problem. One example of this problem is language
translation. In language translation, we provide multiple words from one
language as input and predict multiple words from the second language as
output.

Many to Many RNN


Variation Of Recurrent Neural Network (RNN)
To overcome problems like vanishing and exploding gradients, several new
advanced versions of RNNs have been developed; some of these are:
1. Bidirectional Neural Network (BiNN)
2. Long Short-Term Memory (LSTM)
Bidirectional Neural Network (BiNN)
A BiNN is a variation of a Recurrent Neural Network in which the input
information flows in both directions, and the outputs of both directions are
combined to produce the final output. A BiNN is useful in situations where the
context of the input is important, such as NLP tasks and time-series analysis
problems.
Long Short-Term Memory (LSTM)
Long Short-Term Memory works on the read-write-and-forget principle: given the
input information, the network reads and writes the most useful information
from the data and forgets the information that is not important for predicting
the output. To do this, three new gates are introduced into the RNN. In this
way, only the selected information is passed through the network.

Use cases:

In which cases is a recurrent neural network suitable?

Recurrent neural networks (RNNs) are the state-of-the-art algorithm for
sequential data and are used by Apple's Siri and Google's voice search. The RNN
is the first algorithm that remembers its input, due to an internal memory,
which makes it perfectly suited for machine learning problems that involve
sequential data.

What is a real-life example of a recurrent neural network?

Recurrent Neural Networks are used in several domains. For instance, in
Natural Language Processing (NLP), they have been used to generate handwritten
text and perform machine translation and speech recognition.
