
R20 ML Notes

The document outlines the course structure and objectives for a Machine Learning elective at Siddhartha Institute of Engineering & Technology. It covers various learning models including supervised, unsupervised, and reinforcement learning, along with dimensionality reduction techniques. Additionally, it highlights the importance and applications of machine learning in real-world scenarios such as image recognition, speech recognition, and self-driving cars.

Uploaded by

Bhukya Rajakumar

SIDDHARTH INSTITUTE OF ENGINEERING & TECHNOLOGY:: PUTTUR

(AUTONOMOUS)

(20CS0535) MACHINE LEARNING


III CSE -II SEM
SIDDARTHA INSTITUTE OF SCIENCE AND TECHNOLOGY:: PUTTUR

(AUTONOMOUS) L T P C
3 - - 3
III B.Tech. – II Sem.

(20CS0535) MACHINE LEARNING


(Professional Elective Course-II)

COURSE OBJECTIVES

The objectives of this course:


1. To investigate various Supervised Learning models of machine learning
2. To investigate various Unsupervised Learning models of machine learning
3. To investigate various Reinforcement Learning models of machine learning
4. To expose students to the Dimensionality Reduction

COURSE OUTCOMES (COs)


On successful completion of this course, the student will be able to
1. Understand the basics of Machine Learning
2. Apply the various supervised learning algorithms to classification and regression
problems
3. Analyze the various unsupervised learning techniques like k-means, EM algorithm and to
apply for real world problems
4. Understand the concepts of Clustering Techniques.
5. Identify the need of Parametric methods and Dimensionality Reduction Techniques in
machine learning.
6. Infer the theoretical and practical concepts of Reinforcement Learning
UNIT-I

INTRODUCTION: What is machine learning? - Examples of machine learning applications - Types of
machine learning - Model selection and generalization - Guidelines for Machine Learning Experiments

UNIT-II

SUPERVISED LEARNING: Classification, Decision Trees - Univariate Tree - Multivariate Tree -
Pruning, Bayesian Decision Theory, Parametric Methods - Maximum Likelihood Estimation - Evaluating
an Estimator: Bias and Variance - The Bayes' Estimator, Linear Discrimination - Gradient Descent -
Logistic Discrimination - Discrimination by Regression, Multilayer Perceptron - Perceptron - Multilayer
Perceptrons - Back Propagation Algorithm

UNIT-III

UNSUPERVISED LEARNING: Clustering - Introduction - Mixture Densities - k-Means Clustering -
Expectation-Maximization Algorithm - Mixtures of Latent Variable Models - Supervised Learning after
Clustering - Hierarchical Clustering
UNIT-IV

NONPARAMETRIC METHODS: Nonparametric Density Estimation - k-Nearest Neighbor
Estimator - Nonparametric Classification - Condensed Nearest Neighbor

DIMENSIONALITY REDUCTION: Subset Selection - Principal Components Analysis - Factor
Analysis - Multidimensional Scaling - Linear Discriminant Analysis.

UNIT-V

REINFORCEMENT LEARNING: Introduction - Single State Case: K-Armed Bandit - Elements of
Reinforcement Learning - Model-Based Learning - Temporal Difference Learning - Generalization -
Partially Observable States

TEXT BOOKS

1. Ethem Alpaydin, Introduction to Machine Learning, MIT Press, Second Edition, 2010.

REFERENCES

1. Tom M Mitchell, Machine Learning, First Edition, McGraw Hill Education, 2013.

2. Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press.
UNIT-I

INTRODUCTION: What is machine learning? - Examples of machine learning applications - Types of
machine learning - Model selection and generalization - Guidelines for Machine Learning Experiments

Machine learning is a fast-growing technology that enables computers to learn automatically from past
data. It uses various algorithms to build mathematical models and make predictions from historical
data or information. It is currently used for tasks such as image recognition, speech recognition,
email filtering, Facebook auto-tagging, recommender systems, and many more.

What is Machine Learning?

Machine learning is a subset of artificial intelligence that is mainly concerned with the
development of algorithms that allow a computer to learn from data and past experience on
its own. The term machine learning was first introduced by Arthur Samuel in 1959. It can be
summarized as follows:

Machine learning enables a machine to automatically learn from data, improve performance from
experiences, and predict things without being explicitly programmed.
With the help of sample historical data, known as training data, machine learning algorithms
build a mathematical model that makes predictions or decisions without being explicitly
programmed for the task. Machine learning brings computer science and statistics together to create
predictive models. In general, the more relevant, good-quality data we provide, the better the
model tends to perform.

A machine has the ability to learn if it can improve its performance by gaining more data.

Machine learning is a subfield of artificial intelligence that involves training computers to learn from
data without being explicitly programmed. In other words, machine learning algorithms use statistical
techniques to find patterns in data and use these patterns to make predictions or take actions.

How does Machine Learning work?

A machine learning system learns from historical data, builds a prediction model, and, whenever
it receives new data, predicts the output for it. The accuracy of the predicted output depends on the
amount and quality of the data: a large, representative dataset generally helps to build a model that
predicts the output more accurately.

Suppose we have a complex problem that requires predictions. Instead of writing code for it by hand,
we feed the data to a generic algorithm, and the algorithm builds the logic from the data and
predicts the output. Machine learning has changed the way we think about such problems. In outline,
a machine learning algorithm works as follows: past data is used to train a model, and the trained
model then produces outputs for new data.
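The idea of learning from historical data and then predicting for new data can be sketched in a few lines of Python. This is only a minimal illustration, assuming a straight-line model fitted by ordinary least squares; the data (hours studied versus exam score) and the function names fit_line and predict are made up for this example.

```python
# Minimal sketch: "learn" a line y = a*x + b from historical data,
# then predict the output for an input the model has not seen.
# The data and function names are hypothetical, for illustration only.

def fit_line(xs, ys):
    """Return slope a and intercept b that minimise the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def predict(model, x):
    """Apply the learned line to a new input."""
    a, b = model
    return a * x + b

# Historical (training) data: hours studied -> exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

model = fit_line(hours, scores)      # learning step
new_prediction = predict(model, 6)   # prediction for new data
print(model, new_prediction)
```

No rule such as "each extra hour adds two marks" is written by hand here; the slope and intercept are estimated from the data.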

Features of Machine Learning:


o Machine learning uses data to detect various patterns in a given dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is closely related to data mining, as it also deals with large amounts of
data.

Need for Machine Learning

The need for machine learning is increasing day by day. The reason is that it can perform tasks
that are too complex for a person to implement directly. As humans, we have limitations: we cannot
process huge amounts of data manually, so we need computer systems, and this is where machine
learning makes things easier for us.
We can train machine learning algorithms by providing them with large amounts of data and letting them
explore the data, construct models, and predict the required output automatically. The performance
of a machine learning algorithm depends on the amount of data, and it can be measured with a cost
function. With the help of machine learning, we can save both time and money.
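As an illustration of measuring performance with a cost function, the sketch below uses mean squared error, one common choice for regression. The notes do not name a specific cost function, so this choice and the numbers are assumptions made for the example.

```python
# Minimal sketch of a cost function: mean squared error (MSE).
# Lower cost means the predictions are closer to the true values.
# All values below are made up for illustration.

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

actual = [3.0, 5.0, 7.0, 9.0]

predictions_a = [2.5, 5.5, 7.0, 9.5]    # hypothetical model A
predictions_b = [1.0, 4.0, 9.0, 12.0]   # hypothetical model B

cost_a = mean_squared_error(actual, predictions_a)
cost_b = mean_squared_error(actual, predictions_b)

# The model with the lower cost fits this data better.
better = "A" if cost_a < cost_b else "B"
print(cost_a, cost_b, better)
```

Comparing costs on the same data is one simple way to decide which of two models performs better.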

The importance of machine learning can be easily understood from its use cases. Currently, machine
learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestion
by Facebook, and so on. Top companies such as Netflix and Amazon have built machine learning
models that use vast amounts of data to analyze user interest and recommend products
accordingly.

Following are some key points which show the importance of Machine Learning:

o Rapid growth in the production of data
o Solving complex problems that are difficult for a human
o Decision making in various sectors, including finance
o Finding hidden patterns and extracting useful information from data

History of Machine Learning

Some 40-50 years ago, machine learning was science fiction, but today it is part of
our daily life. Machine learning makes our day-to-day life easier, from self-driving cars to Amazon's
virtual assistant "Alexa". However, the idea behind machine learning is old and has a long history.
Some milestones in the history of machine learning are given below:

The early history of Machine Learning (Pre-1940):


o 1834: Charles Babbage, often called the father of the computer, conceived a device that could be
programmed with punch cards. The machine was never built, but modern computers
rely on its logical structure.
o 1936: Alan Turing proposed a theory of how a machine can determine and execute a set
of instructions.

The era of stored program computers:


o 1943: A human neural network was modeled with an electrical circuit. In 1950,
scientists started applying this idea and analyzed how human neurons might work.
o 1945: "ENIAC", the first electronic general-purpose computer, was completed. After that, stored
program computers such as EDSAC in 1949 and EDVAC in 1951 were built.

Computing machinery and intelligence:


o 1950: Alan Turing published a seminal paper, "Computing Machinery and
Intelligence," on the topic of artificial intelligence. In this paper, he asked, "Can machines
think?"

Machine intelligence in Games:


o 1952: Arthur Samuel, a pioneer of machine learning, created a program that enabled
an IBM computer to play checkers. The more it played, the better it performed.
o 1959: The term "Machine Learning" was coined by Arthur Samuel.

The first "AI" winter:


o The period from 1974 to 1980 was a tough time for AI and ML researchers, and it
came to be called the AI winter.
o In this period, machine translation failed to meet expectations and interest in AI
declined, which led to reduced government funding for research.

Machine Learning from theory to reality


o 1959: The first neural network was applied to a real-world problem: removing echoes
over phone lines using an adaptive filter.
o 1985: Terry Sejnowski and Charles Rosenberg invented the neural network NETtalk,
which was able to teach itself how to correctly pronounce 20,000 words in one week.
o 1997: IBM's Deep Blue computer won a chess match against the reigning world champion
Garry Kasparov, becoming the first computer to beat a world chess champion in a match.

Machine Learning in the 21st century


o 2006: Computer scientist Geoffrey Hinton popularized the name "deep learning" for neural net
research, and it has since become one of the most prominent technologies.
o 2012: Google created a deep neural network that learned to recognize images of
humans and cats in YouTube videos.
o 2014: The chatbot "Eugene Goostman" was reported to have passed a Turing test. It was the first
chatbot to convince 33% of the human judges that it was not a machine.
o 2014: DeepFace, a deep neural network created by Facebook, was claimed to
recognize a person with the same precision as a human.
o 2016: AlphaGo beat Lee Sedol, then described as the world's number two player, at the game of Go.
In 2017 it beat Ke Jie, the number one player.
o 2017: Alphabet's Jigsaw team built an intelligent system that learned to detect
online trolling. It read millions of comments from different websites to learn to stop
online trolling.

Machine Learning at present:

Machine learning research has now advanced greatly, and it is present everywhere around
us: in self-driving cars, Amazon Alexa, chatbots, recommender systems, and many more. It
includes supervised, unsupervised, and reinforcement learning, with techniques such as
clustering, classification, decision trees, and SVM algorithms.

Modern machine learning models can be used for making various predictions, including weather
prediction, disease prediction, stock market analysis, etc.

Prerequisites

Before learning machine learning, you should have basic knowledge of the following so that you can
easily understand the concepts:

o Fundamental knowledge of probability and linear algebra.
o The ability to code in a programming language, especially Python.
o Knowledge of calculus, especially derivatives of single-variable and multivariate functions.


Examples of Machine Learning Applications

Machine learning is a buzzword in today's technology, and it is growing rapidly. We
use machine learning in our daily life, often without knowing it, in services such as Google Maps,
Google Assistant, and Alexa. Below are some of the most prominent real-world applications of machine learning:
1. Image Recognition:

Image recognition is one of the most common applications of machine learning. It is used to identify
objects, persons, places, and so on in digital images. A popular use case of image recognition and face
detection is automatic friend tagging suggestion:

Facebook provides an automatic friend tagging suggestion feature. Whenever we upload a photo with our
Facebook friends, we automatically get a tagging suggestion with names, and the technology behind
this is machine learning's face detection and recognition algorithm.

It is based on the Facebook project named "DeepFace," which is responsible for face recognition and
person identification in the picture.

2. Speech Recognition

While using Google, we get the option "Search by voice." This comes under speech recognition, which is
a popular application of machine learning.

Speech recognition is the process of converting voice instructions into text; it is also known as "speech
to text" or "computer speech recognition." At present, machine learning algorithms are widely used
in speech recognition applications. Google Assistant, Siri, Cortana, and Alexa all
use speech recognition technology to follow voice instructions.

3. Traffic prediction:

If we want to visit a new place, we take the help of Google Maps, which shows us the
shortest route and predicts the traffic conditions.

It predicts whether traffic is clear, slow-moving, or heavily congested
using two sources of information:

o Real-time location of vehicles from the Google Maps app and sensors
o Average time taken on past days at the same time of day

Everyone who uses Google Maps helps to make the app better. It takes information from the
user and sends it back to its database to improve performance.
4. Product recommendations:

Machine learning is widely used by e-commerce and entertainment companies such as
Amazon and Netflix for recommending products to the user. Whenever we search for a product
on Amazon, we start seeing advertisements for the same product while browsing the internet in the
same browser, and this is because of machine learning.

Google infers user interest using various machine learning algorithms and suggests products
according to customer interest.

Similarly, when we use Netflix, we find recommendations for series, movies, etc.,
and this is also done with the help of machine learning.

5. Self-driving cars:

One of the most exciting applications of machine learning is self-driving cars, where machine learning
plays a significant role. Tesla, a well-known car manufacturer, is working on
self-driving cars. It is reported to use unsupervised learning methods to train the car models to detect
people and objects while driving, although detection models of this kind are commonly trained on
labelled images as well.

6. Email Spam and Malware Filtering:

Whenever we receive a new email, it is filtered automatically as important, normal, or spam. Important
mail arrives in our inbox marked with the important symbol, and spam goes to the spam box;
the technology behind this is machine learning. Below are some spam filters used by Gmail:

o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters

Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve Bayes
classifier are used for email spam filtering and malware detection.
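To make the Naïve Bayes classifier mentioned above more concrete, here is a minimal sketch of a word-count spam filter. It is not how Gmail or any real mail service works; the four training messages, the labels "spam" and "ham", and the function names are assumptions made for illustration, and add-one (Laplace) smoothing is used to avoid zero probabilities.

```python
import math
from collections import Counter

# Minimal sketch of a Naive Bayes text classifier for spam filtering.
# Training messages and names are hypothetical, for illustration only.

def train_naive_bayes(messages):
    """messages: list of (text, label) pairs with label 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in messages:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocab

def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total)   # log prior
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing for the word likelihood.
            p = (word_counts[label][word] + 1) / (n_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    ("win money now", "spam"),
    ("free prize win", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
]
spam_model = train_naive_bayes(training)
print(classify(spam_model, "win free money"))
print(classify(spam_model, "meeting tomorrow at noon"))
```

A real filter would be trained on many thousands of messages and would use richer features than single words.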

7. Virtual Personal Assistant:

We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As the name
suggests, they help us find information using voice instructions. These assistants can help us
in various ways just through voice instructions, such as playing music, calling someone, opening an email,
or scheduling an appointment.

Machine learning algorithms are an important part of these virtual assistants.

The assistants record our voice instructions, send them to a server in the cloud, decode them using ML
algorithms, and act accordingly.
8. Online Fraud Detection:

Machine learning makes online transactions safer by detecting fraudulent transactions.
Whenever we perform an online transaction, fraud can take place in various ways,
such as fake accounts, fake IDs, and theft of money in the middle of a transaction. To
detect this, a feed-forward neural network can check whether a transaction is genuine or
fraudulent.

For each genuine transaction, the output is converted into hash values, and these values become
the input for the next round. Genuine transactions follow a specific pattern, which changes
for a fraudulent transaction; the network detects this change, making online transactions more secure.

9. Stock Market trading:

Machine learning is widely used in stock market trading. In the stock market, there is always a risk of
ups and downs in share prices, so long short-term memory (LSTM) neural networks are used
to predict stock market trends.

10. Medical Diagnosis:

In medical science, machine learning is used for diagnosing diseases. With it, medical technology is
advancing quickly and can build 3D models that predict the position of lesions in the
brain. This helps in finding brain tumors and other brain-related diseases.

11. Automatic Language Translation:

Nowadays, if we visit a new place and do not know the language, it is not a problem,
because machine learning can convert text into a language we know. Google's
GNMT (Google Neural Machine Translation) provides this feature: it is a neural machine translation system
that translates text into our familiar language, and this is called automatic translation.

The technology behind automatic translation is a sequence-to-sequence learning algorithm, which
can be combined with image recognition to translate text from one language to another.

Types of Machine Learning

At a broad level, machine learning can be classified into three types:

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning


Supervised Machine Learning

Supervised learning is the type of machine learning in which machines are trained using well-"labelled"
training data, and on the basis of that data, machines predict the output. Labelled data means that the input
data is already tagged with the correct output.

In supervised learning, the training data provided to the machine works as a supervisor that teaches
the machine to predict the output correctly. It applies the same concept as a student learning under the
supervision of a teacher.

Supervised learning is a process of providing input data as well as correct output data to the machine
learning model. The aim of a supervised learning algorithm is to find a mapping function to map the
input variable(x) with the output variable(y).

In the real-world, supervised learning can be used for Risk Assessment, Image classification, Fraud
Detection, spam filtering, etc.

How Supervised Learning Works?

In supervised learning, models are trained using a labelled dataset, from which the model learns about each type
of data. Once the training process is completed, the model is tested on test data (a held-out portion of
the labelled data that was not used for training), and then it predicts the output.
The working of supervised learning can be understood through the following example:

Suppose we have a dataset of different types of shapes, which includes squares, rectangles, triangles, and
hexagons. The first step is to train the model for each shape.

o If the given shape has four sides, and all the sides are equal, then it will be labelled as
a square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the
shape.

The machine has already been trained on all types of shapes, and when it finds a new shape, it classifies the
shape on the basis of its number of sides and predicts the output.
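The shape example can be sketched as a very small supervised learner. The sketch below assumes each shape is described by two features, the number of sides and whether all sides are equal, and simply remembers the most common label seen for each feature combination. The feature encoding, the data, and the function names are assumptions for illustration; real classifiers generalise far beyond a lookup table.

```python
from collections import Counter, defaultdict

# Minimal sketch: learn shape labels from labelled examples.
# Features are (number_of_sides, all_sides_equal); data is made up.

def train(examples):
    """examples: list of ((num_sides, all_equal), label) pairs."""
    votes = defaultdict(Counter)
    for features, label in examples:
        votes[features][label] += 1
    # Keep the most common label seen for each feature combination.
    return {f: c.most_common(1)[0][0] for f, c in votes.items()}

def predict(table, features):
    """Look up the learned label; unseen combinations are 'unknown'."""
    return table.get(features, "unknown")

labelled_shapes = [
    ((4, True), "square"),
    ((4, False), "rectangle"),
    ((3, True), "triangle"),
    ((3, False), "triangle"),
    ((6, True), "hexagon"),
    ((4, True), "square"),
]
shape_model = train(labelled_shapes)

print(predict(shape_model, (4, True)))    # a shape with four equal sides
print(predict(shape_model, (5, True)))    # a combination never seen in training
```

The second call shows a limitation noted later in these notes: the model cannot label a kind of shape that never appeared in the training data.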

Steps Involved in Supervised Learning:


o First, determine the type of training dataset.
o Collect the labelled training data.
o Split the dataset into a training set, a validation set, and a test set.
o Determine the input features of the training dataset, which should carry enough information
for the model to predict the output accurately.
o Choose a suitable algorithm for the model, such as a support vector machine or a decision tree.
o Execute the algorithm on the training set. A validation set, held out from the data used for
fitting, is often needed to tune control parameters.
o Evaluate the accuracy of the model on the test set. If the model predicts the correct
output for most test examples, the model is accurate.
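The splitting and evaluation steps above can be sketched as follows. The 60/20/20 proportions, the fixed random seed, and the function names are assumptions chosen for the example, not values prescribed by the notes.

```python
import random

# Minimal sketch: split a dataset into training, validation and test
# sets, and measure accuracy. Proportions and names are illustrative.

def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the rows and cut them into three non-overlapping parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train_frac)
    n_val = round(len(rows) * val_frac)
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]
    return train, val, test

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

data = list(range(10))               # stand-in for 10 labelled examples
train_set, val_set, test_set = split_dataset(data)
print(len(train_set), len(val_set), len(test_set))

# Accuracy of some hypothetical predictions on four test examples.
test_accuracy = accuracy(["a", "b", "a", "b"], ["a", "b", "b", "b"])
print(test_accuracy)
```

Keeping the test set separate from the data used for fitting and tuning gives a fairer estimate of how the model will behave on new data.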

Types of supervised Machine learning Algorithms:

Supervised learning can be further divided into two types of problems:

1. Regression

Regression algorithms are used when there is a relationship between the input variable and a continuous
output variable. They are used to predict continuous quantities, as in weather forecasting and market
trends. Below are some popular regression algorithms which come under supervised learning:

o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression

2. Classification

Classification algorithms are used when the output variable is categorical, which means it takes one of
two or more classes, such as Yes-No, Male-Female, True-False, etc. A typical application is
spam filtering. Some popular classification algorithms are:

o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
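As one concrete instance from the list above, here is a minimal sketch of logistic regression for a two-class problem, trained with plain gradient descent. The data (hours studied versus pass/fail), the learning rate, and the number of epochs are assumptions chosen for illustration only.

```python
import math

# Minimal sketch: logistic regression on one feature, fitted by
# gradient descent. Data and hyperparameters are made up.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.5, epochs=5000):
    """Fit weight w and bias b by minimising the average log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y   # prediction error for this example
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def predict_class(params, x):
    """Return class 1 if the predicted probability is at least 0.5."""
    w, b = params
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
passed = [0, 0, 0, 0, 1, 1, 1, 1]      # categorical output: fail / pass

params = train_logistic(hours, passed)
print(predict_class(params, 1.0), predict_class(params, 4.0))
```

The model outputs a probability between 0 and 1, and thresholding it at 0.5 turns that probability into one of the two classes.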

Advantages of Supervised learning:


o With the help of supervised learning, the model can predict the output on the basis of prior
experiences.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning model helps us to solve various real-world problems such as fraud
detection, spam filtering, etc.

Disadvantages of supervised learning:


o Supervised learning models may not be suitable for very complex tasks.
o Supervised learning cannot reliably predict the correct output if the test data differs
substantially from the training data.
o Training can require a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.

Unsupervised Machine Learning

In the previous topic, we covered supervised machine learning, in which models are trained using labeled
data. But there are many cases in which we do not have labeled
data and need to find the hidden patterns in the given dataset. To handle such cases in
machine learning, we need unsupervised learning techniques.

What is Unsupervised Learning?

Unsupervised learning is a machine learning technique in which models are not supervised using a
labelled training dataset. Instead, the models themselves find hidden patterns and insights in the given
data. It can be compared to the learning that takes place in the human brain while learning new things.
It can be defined as:
Unsupervised learning is a type of machine learning in which models are trained using an unlabeled
dataset and are allowed to act on that data without any supervision.

Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike
supervised learning, we have the input data but no corresponding output data. The goal of unsupervised
learning is to find the underlying structure of the dataset, group the data according to similarities,
and represent the dataset in a compressed format.

Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of
different types of cats and dogs. The algorithm is never given labels for the dataset, which means it
has no prior idea about the features of the dataset. The task of the unsupervised learning algorithm
is to identify the image features on its own. It will perform this task by
clustering the image dataset into groups according to the similarities between images.

Why use Unsupervised Learning?

Below are some main reasons which describe the importance of Unsupervised Learning:

o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is similar to how a human learns to think from their own experiences,
which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes it
more widely applicable.
o In the real world, we do not always have input data with the corresponding output, so to solve such
cases, we need unsupervised learning.

Working of Unsupervised Learning

The working of unsupervised learning can be understood as follows:


Here, we take unlabeled input data, which means it is not categorized and the corresponding
outputs are not given. This unlabeled input data is fed to the machine learning model in order
to train it. First, the model interprets the raw data to find hidden patterns, and then
a suitable algorithm, such as k-means clustering or hierarchical clustering, is applied.

Once applied, the algorithm divides the data objects into groups according to
the similarities and differences between the objects.
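The grouping step can be illustrated with a minimal k-means sketch. The six 2-D points and the two starting centroids below are made up for the example, and the starting centroids are chosen by hand to keep the result reproducible; in practice they are usually chosen at random.

```python
# Minimal sketch of k-means clustering on 2-D points.
# Points and initial centroids are hypothetical, for illustration only.

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and updating centroids."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((sum(p[0] for p in cluster) / len(cluster),
                                      sum(p[1] for p in cluster) / len(cluster)))
            else:
                new_centroids.append(old)   # keep an empty cluster's centroid
        if new_centroids == centroids:
            break                           # assignments have stabilised
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

No labels are used anywhere: the two groups emerge only from the distances between the points.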

Types of Unsupervised Learning Algorithm:

The unsupervised learning algorithm can be further categorized into two types of problems:

o Clustering: Clustering is a method of grouping objects into clusters such that objects with
the most similarities remain in one group and have few or no similarities with the objects of another
group. Cluster analysis finds the commonalities between the data objects and categorizes them
according to the presence or absence of those commonalities.
o Association: Association rule learning is an unsupervised learning method used for finding
relationships between variables in a large database. It determines the sets of items that occur
together in the dataset. Association rules make marketing strategy more effective: for example, people
who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical
example of association rule learning is Market Basket Analysis.
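The bread and butter example can be made concrete with the two standard measures used in association rule mining, support and confidence. The five shopping baskets below are made up for illustration, and the function names are assumptions for this sketch.

```python
# Minimal sketch: support and confidence for the rule bread -> butter.
# The baskets are hypothetical, for illustration only.

def support(transactions, items):
    """Fraction of transactions that contain all the given items."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "butter", "milk"},
]

bread_support = support(baskets, {"bread"})                 # 4 of 5 baskets
rule_confidence = confidence(baskets, {"bread"}, {"butter"})  # 3 of those 4
print(bread_support, rule_confidence)
```

Here bread appears in 4 of the 5 baskets, and 3 of those 4 also contain butter, so the rule "bread implies butter" has a confidence of about 0.75.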
Unsupervised Learning algorithms:
Below is a list of some popular unsupervised learning algorithms:
o K-means clustering
o KNN (k-nearest neighbors), when used without labels, e.g. for density estimation
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition

Advantages of Unsupervised Learning


o Unsupervised learning is used for more complex tasks than supervised learning
because, in unsupervised learning, we do not have labeled input data.
o Unsupervised learning is often preferable because unlabeled data is easier to obtain than labeled
data.

Disadvantages of Unsupervised Learning


o Unsupervised learning is intrinsically more difficult than supervised learning because it has no
corresponding output to learn from.
o The result of an unsupervised learning algorithm might be less accurate because the input data is not
labeled and the algorithm does not know the exact output in advance.

Difference between Supervised and Unsupervised Learning

Supervised and unsupervised learning are two techniques of machine learning. They are used in
different scenarios and with different datasets. The main differences between them are given below:

Supervised Learning | Unsupervised Learning
Algorithms are trained using labeled data. | Algorithms are trained using unlabeled data.
The model takes direct feedback to check whether it is predicting the correct output. | The model does not take any feedback.
The model predicts the output. | The model finds the hidden patterns in data.
Input data is provided to the model along with the output. | Only input data is provided to the model.
The goal is to train the model so that it can predict the output when it is given new data. | The goal is to find the hidden patterns and useful insights from the unknown dataset.
Needs supervision to train the model. | Does not need any supervision to train the model.
Can be categorized into Classification and Regression problems. | Can be classified into Clustering and Association problems.
Used where we know the input as well as the corresponding outputs. | Used where we have only input data and no corresponding output data.
Generally produces a more accurate result. | May give a less accurate result compared to supervised learning.
Not close to true Artificial Intelligence, as we first train the model on labelled data, and only then can it predict the correct output. | Closer to true Artificial Intelligence, as it learns in a way similar to how a child learns daily routine things from experience.
Includes algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision tree, Bayesian Logic, etc. | Includes algorithms such as Clustering, KNN, and Apriori algorithm.

Reinforcement learning:

Reinforcement learning is an area of machine learning. It is about taking suitable actions to
maximize reward in a particular situation. It is employed by various software and machines to find
the best possible behavior or path to take in a specific situation. Reinforcement learning
differs from supervised learning in that, in supervised learning, the training data comes with the answer
key, so the model is trained with the correct answers, whereas in reinforcement learning
there is no answer key: the reinforcement agent decides what to do to perform the given task. In the
absence of a training dataset, it is bound to learn from its experience.
Example: The problem is as follows: We have an agent and a reward, with many hurdles in
between. The agent is supposed to find the best possible path to reach the reward. The following
image illustrates the problem.

The image shows a robot, a diamond, and fire. The goal of the robot is to get the reward, which
is the diamond, and to avoid the hurdles, which are the fire. The robot learns by trying the possible paths
and then choosing the path that gives it the reward with the fewest hurdles. Each right step
gives the robot a reward and each wrong step subtracts from the robot's reward. The total reward
is calculated when it reaches the final reward, that is, the diamond.
Main points in Reinforcement learning:

• Input: The input should be an initial state from which the model will start.
• Output: There are many possible outputs, as there are a variety of solutions to a particular
problem.
• Training: The training is based upon the input. The model returns a state, and the user
decides whether to reward or punish the model based on its output.
• The model continues to learn.
• The best solution is decided based on the maximum reward.
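The reward-driven loop summarized above can be sketched with tabular Q-learning, one common reinforcement learning algorithm (the notes do not name a specific algorithm, so this choice is an assumption). The six-cell corridor, the reward values, and the learning settings below are all made up for illustration: the fire sits at one end, the diamond at the other, and every ordinary step costs a little.

```python
import random

ACTIONS = (-1, +1)          # move left or move right
FIRE, DIAMOND = 0, 5        # terminal cells of a hypothetical 6-cell corridor


def step(state, action):
    """Return (next_state, reward, done) for one move."""
    nxt = state + action
    if nxt == DIAMOND:
        return nxt, 10.0, True      # reaching the diamond is rewarded
    if nxt == FIRE:
        return nxt, -10.0, True     # stepping into the fire is punished
    return nxt, -1.0, False         # every other step has a small cost


def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a table of action values Q(state, action) from experience."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(6) for a in ACTIONS}
    for _ in range(episodes):
        state = rng.randint(1, 4)               # start in a non-terminal cell
        for _ in range(100):                    # cap on episode length
            if rng.random() < epsilon:          # explore a random action
                action = rng.choice(ACTIONS)
            else:                               # exploit the current estimates
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            # Q-learning update: move the estimate toward reward + discounted future value
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = nxt
            if done:
                break
    return q


q_table = train()
# The learned policy picks, in each cell, the action with the highest estimated value.
policy = {s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(1, 5)}
print(policy)
```

In this toy setting the agent is never told the correct move; it only sees rewards and punishments, which is the difference from supervised learning described earlier.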

Difference between Reinforcement learning and Supervised learning:


Reinforcement learning | Supervised learning

Reinforcement learning is all about making decisions sequentially. In simple words, we can say that the output depends on the state of the current input and the next input depends on the output of the previous input. | In supervised learning, the decision is made on the initial input or the input given at the start.

In reinforcement learning, decisions are dependent, so we give labels to sequences of dependent decisions. | In supervised learning, the decisions are independent of each other, so labels are given to each decision.

Example: Chess game | Example: Object recognition

Types of Reinforcement: There are two types of Reinforcement:

1. Positive –
Positive reinforcement occurs when an event that follows a particular behavior
increases the strength and the frequency of that behavior. In other words, it has a positive effect
on behavior.
Advantages of positive reinforcement:
• Maximizes performance
• Sustains change for a long period of time
Disadvantage:
• Too much reinforcement can lead to an overload of states, which can diminish the
results
2. Negative –
Negative reinforcement is defined as the strengthening of a behavior because a negative condition is
stopped or avoided.
Advantages of negative reinforcement:
• Increases behavior
• Provides defiance to a minimum standard of performance
Disadvantage:
• It only provides enough to meet the minimum behavior
Various practical applications of Reinforcement Learning:

• RL can be used in robotics for industrial automation.
• RL can be used in machine learning and data processing.
• RL can be used to create training systems that provide custom instruction and materials
according to the requirements of students.
• RL can be used in large environments in the following situations:
   o A model of the environment is known, but an analytic solution is not available.
   o Only a simulation model of the environment is given (the subject of simulation-based
   optimization).
Model Selection and Generalization

Model selection refers to the process of selecting the best model from a set of candidate models based
on their performance on a given task. This process typically involves splitting the available data into
training and validation sets, using the training set to train each candidate model, and then evaluating
their performance on the validation set. The model with the best performance on the validation set is
selected as the final model.
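As a rough illustration of this procedure, the sketch below trains two candidate models on a training split and keeps the one with the lower error on a validation split. The data, the two candidates (a constant predictor and a straight line), and the use of mean squared error are assumptions chosen to keep the example small.

```python
def mse(model, data):
    """Mean squared error of a prediction function on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


def fit_constant(train):
    """Candidate 1: always predict the mean of the training targets."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_line(train):
    """Candidate 2: least-squares straight line y = slope * x + intercept."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    slope = (sum((x - mx) * (y - my) for x, y in train)
             / sum((x - mx) ** 2 for x, _ in train))
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


# Made-up data that roughly follows y = 2x + 1
data = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8), (4, 9.1), (5, 11.0), (6, 12.9), (7, 15.2)]
train, valid = data[::2], data[1::2]        # alternate points: training / validation

candidates = {"constant": fit_constant, "line": fit_line}
scores = {name: mse(fit(train), valid) for name, fit in candidates.items()}
best = min(scores, key=scores.get)          # lowest validation error wins
print(scores, best)
```

A separate test set, untouched during this comparison, would still be needed to estimate how the chosen model generalizes.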

Fig:Model Selection and Generalization

Generalization refers to the ability of a model to perform well on new, unseen data. When a model is
trained on a dataset, it may overfit the training data by memorizing specific patterns in the data that
are not representative of the underlying distribution. This can lead to poor performance on new data.
To ensure good generalization, it is important to evaluate a model's performance on a separate test set
that was not used during model selection or training.

To improve generalization, techniques such as regularization, early stopping, and data augmentation
can be used. Regularization involves adding a penalty term to the loss function to discourage complex
models that are prone to overfitting. Early stopping involves monitoring the validation error during
training and stopping the training process when the error begins to increase. Data augmentation
involves generating new training examples by applying transformations to existing examples, which
can increase the size and diversity of the training set and help prevent overfitting.
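Early stopping, as described above, can be sketched as a small helper that scans the validation error after each epoch and stops once it has failed to improve for a few epochs. The `patience` parameter and the error values are illustrative assumptions; real training loops would compute the validation error from the model at each epoch.

```python
def early_stopping_epoch(val_errors, patience=2):
    """Return the index of the epoch whose model would be kept."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error      # new best: remember it
        elif epoch - best_epoch >= patience:
            break                                      # no improvement for `patience` epochs
    return best_epoch


val_errors = [0.90, 0.62, 0.48, 0.41, 0.43, 0.47, 0.55]   # made-up validation curve
print(early_stopping_epoch(val_errors))
```

Here the validation error starts rising after the fourth epoch, so training would stop shortly afterwards and the model from that epoch would be kept.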

Overall, model selection and generalization are crucial aspects of machine learning that help ensure
that models are accurate and reliable, and can be applied successfully to new data.
Fig: Model Selection

GUIDELINES FOR MACHINE LEARNING EXPERIMENTS

Machine Learning Steps


The task of imparting intelligence to machines may seem daunting, but it becomes manageable
once it is broken down into 7 major steps:

1. Collecting Data:

As you know, machines initially learn from the data that you give them. It is of the utmost importance
to collect reliable data so that your machine learning model can find the correct patterns. The quality of
the data that you feed to the machine will determine how accurate your model is. If you have incorrect
or outdated data, you will have wrong outcomes or predictions which are not relevant.

Make sure you use data from a reliable source, as it will directly affect the outcome of your model. Good
data is relevant, contains very few missing and repeated values, and has a good representation of the
various subcategories/classes present.
2. Preparing the Data:

After you have your data, you have to prepare it. You can do this by:

• Putting together all the data you have and randomizing it. This helps make sure that data is evenly
distributed, and the ordering does not affect the learning process.

• Cleaning the data: removing unwanted rows and columns, handling missing values, removing
duplicate values, converting data types, etc. You might even have to restructure the dataset and change the rows
and columns or the index of rows and columns.

• Visualizing the data to understand how it is structured and to understand the relationships between
the various variables and classes present.

• Splitting the cleaned data into two sets - a training set and a testing set. The training set is the set
your model learns from. A testing set is used to check the accuracy of your model after training.
Figure 3: Cleaning and Visualizing Data
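The preparation steps listed above (cleaning, randomizing, and splitting) can be sketched in a few lines of plain Python. The row format, the 25% test fraction, and the fixed seed are assumptions for illustration; with tabular data a library such as pandas would normally be used instead.

```python
import random


def prepare(rows, test_fraction=0.25, seed=42):
    """Clean, shuffle, and split a list of (feature, label) rows."""
    # Drop rows with missing values, then drop exact duplicates (order preserved).
    complete = [r for r in rows if None not in r]
    unique = list(dict.fromkeys(complete))
    # Shuffle so that the original ordering cannot influence learning.
    random.Random(seed).shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    return unique[n_test:], unique[:n_test]      # (training set, testing set)


raw = [(1.0, "a"), (2.0, "b"), (2.0, "b"), (None, "a"),
       (3.0, "a"), (4.0, "b"), (5.0, "a")]
train_set, test_set = prepare(raw)
print(len(train_set), len(test_set))
```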

3. Choosing a Model:

A machine learning model determines the output you get after running a machine learning algorithm
on the collected data. It is important to choose a model which is relevant to the task at hand. Over the
years, scientists and engineers developed various models suited for different tasks like speech
recognition, image recognition, prediction, etc. Apart from this, you also have to see if your model is
suited for numerical or categorical data and choose accordingly.

Figure 4: Choosing a model


4. Training the Model:

Training is the most important step in machine learning. In training, you pass the prepared data to your
machine learning model to find patterns and make predictions. It results in the model learning from the
data so that it can accomplish the task set. Over time, with training, the model gets better at predicting.

Figure 5: Training a model

5. Evaluating the Model:

After training your model, you have to check how it's performing. This is done by testing the
performance of the model on previously unseen data. The unseen data used is the testing set that you
split your data into earlier. If testing were done on the same data that was used for training, you would not
get an accurate measure, as the model has already seen that data and finds the same patterns in it as it
did before. This will give you disproportionately high accuracy.

When the model is evaluated on testing data, you get a more reliable measure of how it will perform and of its speed.
Figure 6: Evaluating a model
6. Parameter Tuning:

Once you have created and evaluated your model, see if its accuracy can be improved in any way. This
is done by tuning the parameters of your model. The parameters tuned here are the settings that
the programmer generally chooses before training (often called hyperparameters), as opposed to the values the model learns from the data. At a particular value of such a parameter, the accuracy will be at its
maximum. Parameter tuning refers to finding these values.

Figure 7: Parameter Tuning
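One simple way to carry out such tuning is a grid search: try a handful of candidate values and keep the one with the lowest validation error. In the sketch below the setting being tuned is the penalty strength `lam` of a one-feature ridge regression through the origin; the model, the data, and the grid are assumptions made for illustration.

```python
def fit_ridge(train, lam):
    """Least-squares slope through the origin with an L2 penalty of strength lam."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + lam)


def mse(w, data):
    """Mean squared error of the prediction w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


train = [(1, 2.9), (2, 4.6)]          # tiny, noisy training set (made up)
valid = [(3, 6.0), (4, 8.1)]          # held-out validation set (made up)

grid = [0.0, 0.1, 1.0, 10.0]          # candidate values chosen by the programmer
results = {lam: mse(fit_ridge(train, lam), valid) for lam in grid}
best_lam = min(results, key=results.get)
print(best_lam, results[best_lam])
```

The penalty also illustrates regularization: with no penalty the slope fits the noisy training points too closely, while a very large penalty shrinks it too far.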

7. Making Predictions
In the end, you can use your model on unseen data to make predictions accurately.
How to Implement Machine Learning Steps in Python?
You will now see how to implement a machine learning model using Python.
In this example, data collected is from an insurance company, which tells you the variables that come
into play when an insurance amount is set. Using this, you will have to predict the insurance amount
for a person. This data was collected from Kaggle.com, which has many reliable datasets.
You need to start by importing any necessary modules, as shown.

Figure 8: Importing necessary modules

Following this, you will import the data.

Figure 9: Importing data


Figure 10: Insurance dataset

Now, clean your data by removing duplicate values, and transforming columns into numerical values
to make them easier to work with.

Figure 11: Cleaning Data

The final dataset becomes as shown.


Figure 12: Cleaned dataset

Now, split your dataset into training and testing sets.

Figure 13: Splitting the dataset

As you need to predict a numerical value based on some parameters, you will have to use Linear
Regression. The model needs to learn from your training set. This is done by using the '.fit' method.

Figure 14: Choosing and training your model

Now, predict your testing dataset and find how accurate your predictions are.
Figure 15: Predicting using your model

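1.0 is the highest score you can get. (For a regression model such as this one, the score is typically R², the coefficient of determination, rather than a classification accuracy.) Now, get your parameters.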

Figure 16: Model Parameters

The above picture shows the learned model parameters (the coefficients), which indicate how each variable in your dataset affects the prediction.
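The code for this walkthrough survives only as figure captions, so the following is a stand-in sketch rather than the original program. It follows the same steps (split, fit, predict, score, inspect parameters) using a one-feature linear regression written in plain Python. The age and charge values are invented, and the real example would more likely use pandas and scikit-learn on the Kaggle insurance data.

```python
def fit_simple_linear(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def r2_score(ys, preds):
    """Coefficient of determination: 1.0 means perfect predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical (age, insurance charge) data
ages = [19, 25, 31, 38, 44, 52, 58, 63]
charges = [2100, 2900, 3500, 4400, 5000, 6100, 6600, 7300]

# Split: first six rows for training, last two for testing
train_x, test_x = ages[:6], ages[6:]
train_y, test_y = charges[:6], charges[6:]

slope, intercept = fit_simple_linear(train_x, train_y)     # "fit" step
predictions = [slope * x + intercept for x in test_x]      # "predict" step
score = r2_score(test_y, predictions)                      # "score" step
print(round(score, 3), round(slope, 2), round(intercept, 2))
```

The printed slope and intercept play the role of the model parameters: the slope says how much the predicted charge changes per year of age in this made-up data.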
AI & ML Differences

AI is a bigger concept to create intelligent machines that can simulate human thinking capability and
behavior, whereas, machine learning is an application or subset of AI that allows machines to learn
from data without being programmed explicitly.

Below are some main differences between AI and machine learning along with the overview of Artificial
intelligence and machine learning

Artificial Intelligence

Artificial intelligence is a field of computer science that builds computer systems which can mimic
human intelligence. The term is composed of two words, "Artificial" and "intelligence", which together mean "a
human-made thinking power." Hence we can define it as,

Artificial intelligence is a technology using which we can create intelligent systems that can simulate
human intelligence.

An artificial intelligence system does not need to be pre-programmed for every situation; instead, it uses
algorithms that can work with their own intelligence. It involves machine learning algorithms such as
reinforcement learning algorithms and deep learning neural networks. AI is being used in multiple places,
such as Siri, Google's AlphaGo, AI in chess playing, etc.

Based on capabilities, AI can be classified into three types:


o Weak AI
o General AI
o Strong AI

Currently, we are working with weak AI and general AI. The future of AI is Strong AI, for which it is
said that it will be more intelligent than humans.

Machine learning
Machine learning is about extracting knowledge from the data. It can be defined as,
Machine learning is a subfield of artificial intelligence, which enables machines to learn from past
data or experiences without being explicitly programmed.

Artificial Intelligence | Machine learning

Artificial intelligence is a technology which enables a machine to simulate human behavior. | Machine learning is a subset of AI which allows a machine to automatically learn from past data without being explicitly programmed.

The goal of AI is to make a smart computer system like humans to solve complex problems. | The goal of ML is to allow machines to learn from data so that they can give accurate output.

In AI, we make intelligent systems to perform any task like a human. | In ML, we teach machines with data to perform a particular task and give an accurate result.

Machine learning and deep learning are the two main subsets of AI. | Deep learning is the main subset of machine learning.

AI has a very wide range of scope. | Machine learning has a limited scope.

AI is working to create an intelligent system which can perform various complex tasks. | Machine learning is working to create machines that can perform only those specific tasks for which they are trained.

AI system is concerned about maximizing the chances of success. | Machine learning is mainly concerned about accuracy and patterns.

The main applications of AI are Siri, customer support using chatbots, expert systems, online game playing, intelligent humanoid robots, etc. | The main applications of machine learning are online recommender systems, Google search algorithms, Facebook auto friend tagging suggestions, etc.

On the basis of capabilities, AI can be divided into three types, which are Weak AI, General AI, and Strong AI. | Machine learning can also be divided into mainly three types, which are Supervised learning, Unsupervised learning, and Reinforcement learning.

It includes learning, reasoning, and self-correction. | It includes learning and self-correction when introduced to new data.

AI completely deals with structured, semi-structured, and unstructured data. | Machine learning deals with structured and semi-structured data.
UNIT-II

SUPERVISED LEARNING: Classification, Decision Trees – Univariate Tree – Multivariate Tree –
Pruning, Bayesian Decision Theory, Parametric Methods – Maximum Likelihood Estimation –
Evaluating an Estimator: Bias and Variance – The Bayes' Estimator, Linear Discrimination – Gradient
Descent – Logistic Discrimination – Discrimination by Regression, Multilayer Perceptron – Perceptron –
Multilayer Perceptrons – Back Propagation Algorithm
Supervised learning is a type of machine learning where an algorithm learns from labelled data to
make predictions or classifications on new, unseen data. In classification problems, the goal is to
assign a label or category to an input data point based on its features.

Fig. Broad classification of machine learning

CLASSIFICATION:

The process of classification in supervised learning involves the following steps:

• Data Collection: Collecting labeled data to train the classification model is the first step. The
labeled data consists of input data points and their corresponding output labels.

• Data Preprocessing: The collected data is preprocessed to remove any noise or outliers and to
convert it into a suitable format for the machine learning algorithm.
• Feature Extraction: Features that are relevant to the problem are extracted from the input data
points. Feature extraction involves selecting the most important and informative features for the
classification task.

• Model Selection: Choosing a suitable classification algorithm is an important step. There are
various algorithms available for classification, such as logistic regression, decision trees, k-
nearest neighbors, support vector machines, and neural networks.
• Model Training: The selected model is trained on the labelled data, and the algorithm learns to
predict the correct output label for each input data point.
• Model Evaluation: The performance of the trained model is evaluated on a test dataset that
was not used during training. The evaluation metrics may include accuracy, precision, recall, and
F1-score.
• Model Deployment: The final step is to deploy the trained model to make predictions on new,
unseen data.

Overall, classification in supervised learning is a valuable technique for many applications, including
image and speech recognition, fraud detection, and spam filtering.

CLASSIFICATION TECHNIQUES

Fig. Classification techniques in machine learning

There are four different types of classification tasks in Machine Learning:

• Binary Classification

• Multi-Class Classification

• Multi-Label Classification

• Imbalanced Classification
Binary Classification

Those classification jobs with only two class labels are referred to as binary classification.

Examples include:

• Prediction of conversion (buy or not).
• Churn forecast (churn or not).
• Detection of spam email (spam or not).

Binary classification problems typically involve two classes, one representing the normal state and the
other representing the abnormal state.

For instance, the normal state is "not spam," while the abnormal state is "spam." Another illustration
is a task involving a medical test, which has a normal state of "cancer not identified" and an
abnormal state of "cancer detected."

Class label 0 is given to the class in the normal state, whereas class label 1 is given to the class in the
abnormal condition.

A model that forecasts a Bernoulli probability distribution for each case is frequently used to represent
a binary classification task.

The discrete probability distribution known as the Bernoulli distribution deals with the situation where
an event has a binary result of either 0 or 1. In terms of classification, this indicates that the model
forecasts the likelihood that an example would fall within class 1, or the abnormal state.
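To make the Bernoulli view concrete, the sketch below uses a one-feature logistic model that outputs the probability of class 1 (the abnormal state); the probability of class 0 is simply one minus that value. The weight, bias, and 0.5 decision threshold are made-up values for illustration, not parameters from any trained model.

```python
import math


def predict_proba(x, weight=1.5, bias=-3.0):
    """P(class = 1 | x) from a one-feature logistic model (illustrative weights)."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))


def predict_label(x, threshold=0.5):
    """Turn the predicted probability into a class label, 0 or 1."""
    return 1 if predict_proba(x) >= threshold else 0


for x in (0.0, 2.0, 4.0):
    p1 = predict_proba(x)
    # p1 is P(class 1); 1 - p1 is P(class 0): together a Bernoulli distribution
    print(x, round(p1, 3), round(1 - p1, 3), predict_label(x))
```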

The following are well-known binary classification algorithms:

• Logistic Regression

• Support Vector Machines

• Naive Bayes

• Decision Trees

Some algorithms, such as Support Vector Machines and Logistic Regression, were created expressly for
binary classification and do not by default support more than two classes.

Multi-Class Classification

Classification tasks with more than two class labels are referred to as multi-class classification.

Examples include:
• Face classification.
• Classifying plant species.
• Optical character recognition.

The multi-class classification does not have the idea of normal and abnormal outcomes, in contrast to
binary classification. Instead, instances are grouped into one of several well-known classes.

In some cases, the number of class labels could be rather high. In a facial recognition system, for
instance, a model might predict that a shot belongs to one of thousands or tens of thousands of faces.

Text translation models and other problems involving word prediction could be categorized as a
particular case of multi-class classification. Each word in the sequence of words to be predicted requires
a multi-class classification, where the vocabulary size determines the number of possible classes that
may be predicted and may range from tens of thousands to hundreds of thousands of words.
Multiclass classification tasks are frequently modeled using a model that forecasts a Multinoulli
probability distribution for each example.

An event that has a categorical outcome, such as K in 1, 2, 3,..., K, is covered by the Multinoulli
distribution, which is a discrete probability distribution. In terms of classification, this implies that the
model forecasts the likelihood that a given example will belong to a certain class label.
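A common way to obtain such a categorical (Multinoulli) distribution is to pass one raw score per class through the softmax function, which turns the scores into probabilities that sum to 1. The class names and scores below are invented; softmax itself is a standard technique, although the notes do not name it.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


classes = ["cat", "dog", "bird"]                 # hypothetical class labels
probs = softmax([2.0, 1.0, 0.1])                 # hypothetical model scores
predicted = classes[probs.index(max(probs))]     # most probable class
print([round(p, 3) for p in probs], predicted)
```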

For multi-class classification, many binary classification techniques are applicable.

The following well-known algorithms can be used for multi-class classification:

• Gradient Boosting
• Decision Trees
• k-Nearest Neighbors
• Random Forest
• Naive Bayes
Multi-class problems can also be solved using algorithms created for binary classification.
To do this, the problem is split into several binary classification problems using one of two strategies:

One-vs-Rest: Fit a single binary classification model for each class versus all other classes.

One-vs-One: Fit a single binary classification model for each pair of classes.

The following binary classification algorithms can apply these multi-class classification techniques:

• Support Vector Machine

• Logistic Regression
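The difference between the two strategies shows up in how many binary models have to be trained. The short sketch below only enumerates the binary sub-problems for a hypothetical four-class task; it does not train any models.

```python
from itertools import combinations

classes = ["A", "B", "C", "D"]                    # hypothetical class labels

# One-vs-Rest: one binary problem per class (that class against all the others)
ovr_tasks = [(c, "rest") for c in classes]

# One-vs-One: one binary problem per unordered pair of classes
ovo_tasks = list(combinations(classes, 2))

print(len(ovr_tasks), ovr_tasks)
print(len(ovo_tasks), ovo_tasks)
```

For K classes this gives K models for one-vs-rest and K(K-1)/2 models for one-vs-one, so one-vs-one grows much faster as the number of classes increases.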

Multi-Label Classification

Multi-label classification problems are those that feature two or more class labels and allow for the
prediction of one or more class labels for each example.

Think about the photo classification example. Here a model can predict the existence of many known
things in a photo, such as “person”, “apple”, "bicycle," etc. A particular photo may have multiple
objects in the scene.

In multi-label classification, we have several labels that are the outputs for a given prediction. When
making predictions, a given input may belong to more than one label. For example, when predicting
a given movie category, it may belong to horror, romance, adventure, action, or all simultaneously.
This greatly contrasts with multi-class classification and binary classification, which anticipate a single
class label for each occurrence.

Multi-label classification problems are frequently modelled using a model that forecasts many outcomes,
with each outcome being forecast as a Bernoulli probability distribution. In essence, this approach
predicts several binary classifications for each example.
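One way to picture "several binary classifications per example" is a multi-hot vector: one 0/1 entry per label. The sketch below encodes a set of movie genres this way and decodes per-label probabilities back into a label set. The genre list, the probabilities, and the 0.5 threshold are illustrative assumptions.

```python
GENRES = ["horror", "romance", "adventure", "action"]   # hypothetical label set


def to_multi_hot(labels):
    """Encode a set of labels as one 0/1 entry per known genre."""
    return [1 if genre in labels else 0 for genre in GENRES]


def decode(probabilities, threshold=0.5):
    """Keep every genre whose own predicted probability reaches the threshold."""
    return [genre for genre, p in zip(GENRES, probabilities) if p >= threshold]


print(to_multi_hot({"horror", "action"}))
print(decode([0.9, 0.2, 0.7, 0.4]))
```

Unlike the multi-class case, the per-label probabilities do not have to sum to 1, because each label is decided independently.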

Classification algorithms used for multi-class or binary classification cannot be applied directly to
multi-label classification. The so-called multi-label versions of the algorithms, which are specialized versions of
the conventional classification algorithms, include:

• Multi-label Gradient Boosting


• Multi-label Random Forests
• Multi-label Decision Trees

Another strategy is to use a separate classification algorithm to predict each class label.

Imbalanced Classification

The term "imbalanced classification" describes classification jobs where the distribution of examples
within each class is not equal.

In general, imbalanced classification tasks are binary classification tasks in which the majority of the
training examples belong to the normal class and a minority belong to the abnormal class.

Examples include:

• Clinical diagnostic procedures

• Detection of outliers

• Fraud investigation

These problems are modeled as binary classification tasks, although they may require specialized techniques.

Examples of Imbalanced Classification


• Fraud Detection.
• Claim Prediction.
• Default Prediction.
• Churn Prediction.
• Spam Detection.
• Anomaly Detection.
• Outlier Detection.
• Intrusion Detection
For example, in credit card fraud detection, most transactions are legitimate, and only a small fraction
are fraudulent. In spam detection, it is the other way around: most emails sent around the globe today
are spam.

Specialized strategies can be employed to alter the sample composition of the training dataset by
oversampling the minority class or undersampling the majority class.

Examples include:

• SMOTE Oversampling

• Random Undersampling
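The two resampling ideas can be sketched in plain Python as follows. Note that this shows plain random oversampling (duplicating minority examples), which is simpler than SMOTE; SMOTE instead creates new synthetic minority points by interpolating between existing ones. The 95/5 class split and the record format are made up.

```python
import random


def random_oversample(majority, minority, seed=0):
    """Duplicate randomly chosen minority examples until the classes are balanced."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


def random_undersample(majority, minority, seed=0):
    """Keep only as many randomly chosen majority examples as there are minority ones."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + minority


majority = [("legit", i) for i in range(95)]     # hypothetical normal transactions
minority = [("fraud", i) for i in range(5)]      # hypothetical fraudulent transactions

balanced_up = random_oversample(majority, minority)
balanced_down = random_undersample(majority, minority)
print(len(balanced_up), len(balanced_down))
```

Resampling should be applied to the training set only, so that the test set keeps the real class proportions.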

It is possible to utilize specialized modelling techniques, like the cost-sensitive machine learning
algorithms, that give the minority class more consideration when fitting the model to the training
dataset.

Examples comprise:

• Cost-sensitive Support Vector Machines

• Cost-sensitive Decision Trees

• Cost-sensitive Logistic Regression

Since reporting the classification accuracy may be deceptive, alternate performance indicators may be
necessary.

Examples comprise -

• F-Measure

• Recall

• Precision
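The sketch below shows why accuracy alone can mislead on imbalanced data. On a made-up set of 100 examples with only 5 positives, a model that always predicts the normal class reaches 95% accuracy while detecting nothing, whereas precision, recall, and the F-measure expose the difference. The two sets of predictions are invented for illustration.

```python
def metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = abnormal class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}


y_true = [1] * 5 + [0] * 95                       # 5 abnormal, 95 normal examples
always_normal = [0] * 100                         # a model that never raises an alarm
detector = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 89   # finds 4 of 5, with 6 false alarms

print(metrics(y_true, always_normal))
print(metrics(y_true, detector))
```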

There are many classification techniques in supervised learning, each with its own strengths and
weaknesses. Here are some of the most popular techniques:

• Logistic Regression: A simple and widely used classification algorithm that works well with
linearly separable data. It models the relationship between the input features and the output
label using a logistic function.

• Decision Trees: A tree-based algorithm that recursively partitions the feature space into smaller
subsets based on the input features, creating a tree-like structure. Each internal node of the tree
represents a decision based on a specific feature, and the leaf nodes represent the
predicted output labels.
• Random Forest: A popular ensemble method that combines multiple decision trees, each
trained on a random subset of the input features and training samples. It reduces overfitting
and improves accuracy and generalization.

• Support Vector Machines (SVM): A powerful algorithm that finds the hyperplane that
maximally separates the different classes in the feature space. SVM can handle non-linearly
separable data using kernel functions.
• K-Nearest Neighbors (K-NN): A non-parametric algorithm that classifies new data points by
finding the k nearest training examples and using the majority vote of their output labels.

• Naive Bayes: A probabilistic algorithm that models the relationship between the input features
and the output labels using Bayes' theorem. It assumes that the input features are independent of each other given the class,
which makes it computationally efficient and scalable.

• Artificial Neural Networks (ANN): A complex and powerful algorithm that simulates the
behaviour of biological neurons and learns complex representations of the input features. It can
handle non-linearly separable data and has achieved state-of-the-art performance in many
classification tasks.
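As one concrete instance from the list above, the k-nearest-neighbours rule (find the k closest training examples and take a majority vote) fits in a few lines. The two-dimensional points, the labels "A" and "B", and the choice of k = 3 with Euclidean distance are assumptions for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labelled points: (features, label)
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
         ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B")]

print(knn_predict(train, (1.2, 1.4)))
print(knn_predict(train, (6.4, 6.6)))
```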

These are just a few examples of the many classification techniques available in supervised learning.
The choice of technique depends on the problem at hand, the size and complexity of the data, and the
desired performance metrics.

DECISION TREE CLASSIFICATION

o Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two types of nodes, which are the Decision Node and the Leaf Node. Decision
nodes are used to make a decision and have multiple branches, whereas Leaf nodes are the
outputs of those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands
on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits the
tree into subtrees.
o The below diagram explains the general structure of a decision tree:
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.

Why use Decision Trees?

There are various algorithms in Machine learning, so choosing the best algorithm for the given dataset
and problem is the main point to remember while creating a machine learning model. Below are the two
reasons for using the Decision tree:

o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.

Decision Tree Terminologies

• Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing the unwanted branches from the tree.
• Parent/Child node: A node that is split into sub-nodes is called the parent node of those sub-nodes, and
the sub-nodes are called its child nodes. The root node is the topmost parent node.
How does the Decision Tree algorithm Work?

In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node
of the tree. This algorithm compares the values of root attribute with the record (real dataset) attribute
and, based on the comparison, follows the branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves
further. It continues the process until it reaches a leaf node of the tree. The complete process can be
better understood using the below algorithm:

o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where you cannot further classify the nodes;
the final node is then called a leaf node.

Example: Suppose there is a candidate who has a job offer and wants to decide whether to accept
the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by
ASM). The root node splits further into the next decision node (distance from the office) and one leaf
node based on the corresponding labels. The next decision node further splits into one decision node
(Cab facility) and one leaf node. Finally, that decision node splits into two leaf nodes (Accepted offer
and Declined offer). Consider the below diagram:
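The job-offer tree described above can be written as a nested structure and walked from the root to a leaf. The text does not say which answer leads to which branch, so the layout below (every "no" answer leading to a declined offer) and the field names `salary_ok`, `office_nearby`, and `cab_facility` are assumptions made for illustration.

```python
# Each decision node is a dict with a question and a "yes"/"no" branch; leaves are strings.
TREE = {
    "question": "salary_ok",
    "no": "Declined offer",
    "yes": {
        "question": "office_nearby",
        "no": "Declined offer",
        "yes": {
            "question": "cab_facility",
            "no": "Declined offer",
            "yes": "Accepted offer",
        },
    },
}


def classify(node, record):
    """Follow the branches from the root until a leaf (a plain string) is reached."""
    while isinstance(node, dict):
        answer = "yes" if record[node["question"]] else "no"
        node = node[answer]
    return node


offer = {"salary_ok": True, "office_nearby": True, "cab_facility": False}
print(classify(TREE, offer))
```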

Attribute Selection Measures


While implementing a Decision tree, the main issue that arises is how to select the best attribute for the
root node and for the sub-nodes. To solve this problem, there is a technique called the
Attribute Selection Measure, or ASM. Using this measure, we can easily select the best attribute
for the nodes of the tree. There are two popular techniques for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a
dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using the
below formula:

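$$\text{Information Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)$$

Here the sum is the weighted average of the entropy of each subset $S_v$, where $S_v$ contains the samples of $S$ for which attribute $A$ takes the value $v$.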


Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness
in data. Entropy can be calculated as:

$$\text{Entropy}(S) = -P(\text{yes})\log_2 P(\text{yes}) - P(\text{no})\log_2 P(\text{no})$$

Where,

o S = the set of all samples
o P(yes) = probability of yes
o P(no) = probability of no
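Entropy and information gain as defined above can be computed directly from label counts. The tiny dataset below is invented so that one attribute ("outlook") separates the classes perfectly and the other ("windy") carries no information; the attribute and label names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


rows = [
    {"outlook": "sunny", "windy": False, "play": "yes"},
    {"outlook": "sunny", "windy": True, "play": "yes"},
    {"outlook": "rainy", "windy": False, "play": "no"},
    {"outlook": "rainy", "windy": True, "play": "no"},
]

print(information_gain(rows, "outlook", "play"))
print(information_gain(rows, "windy", "play"))
```

A tree builder would split on "outlook" first here, since it has the higher information gain.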

2. Gini Index:

o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o The CART algorithm creates only binary splits, and it uses the Gini index to choose them.
o Gini index can be calculated using the below formula:

Gini Index = 1 - ∑j (Pj)^2
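
As a small sketch of the formula above, the Gini index of a node can be computed from its class proportions. The helper names `gini_index` and `weighted_gini` are illustrative assumptions; the second one shows how a candidate binary split might be scored, with lower values being better.

```python
from collections import Counter

def gini_index(labels):
    """1 minus the sum of squared class proportions in a node."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Gini index of a binary split: size-weighted average of the two children."""
    n = len(left) + len(right)
    return len(left) / n * gini_index(left) + len(right) / n * gini_index(right)

mixed = ["yes", "yes", "no", "no"]   # maximally impure two-class node
pure = ["yes", "yes", "yes", "yes"]  # pure node

print(gini_index(mixed))
print(gini_index(pure))
print(weighted_gini(["yes", "yes"], ["no", "no"]))
```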

Pruning: Getting an Optimal Decision tree


Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.

A too-large tree increases the risk of overfitting, and a small tree may not capture all the important
features of the dataset. Therefore, a technique that decreases the size of the learning tree without reducing
accuracy is known as Pruning. There are mainly two types of tree pruning techniques used:

o Cost Complexity Pruning
o Reduced Error Pruning

Advantages of the Decision Tree

o It is simple to understand as it follows the same process that a human follows while making
any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree

o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be mitigated using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.

UNIVARIATE TREE AND MULTIVARIATE TREE

There are three different analysis techniques that exist. These are –

• Univariate analysis
• Bivariate analysis
• Multivariate analysis

The selection of the data analysis technique depends on the number of variables, the types of data and
the focus of the statistical inquiry. The following section describes the three different levels of data analysis.

A univariate tree is a decision tree that considers only one input variable (i.e., one feature) at
each decision node. The tree splits the data into two subsets based on a threshold value for the chosen
feature. The process continues recursively until a stopping criterion is met, such as reaching a maximum
tree depth or a minimum number of data points in a leaf node. Univariate trees are simple
and computationally efficient, but they may not capture complex interactions between different
input variables.
Here is one example of Univariate analysis-

In a survey of a class room, the researcher may be looking to count the number of boys and girls. In this
instance, the data would simply reflect the number, i.e. a single variable and its quantity as per the below
table. The key objective of Univariate analysis is to simply describe the data to find patterns within
the data. This is to be done by looking into the mean, median, mode, dispersion, variance, range,
standard deviation etc.
Univariate analysis is conducted through several ways which are mostly descriptive in nature
• Frequency Distribution Tables
• Histograms
• Frequency Polygons
• Pie Charts
• Bar Charts
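
The descriptive measures listed above can be computed with Python's standard `statistics` module. The seven heights below are hypothetical values chosen only for illustration; they are not taken from the notes.

```python
import statistics
from collections import Counter

# Hypothetical heights (in cm) of seven students: a single variable.
heights_cm = [164, 167, 170, 170, 173, 176, 180]

mean = statistics.mean(heights_cm)
median = statistics.median(heights_cm)
mode = statistics.mode(heights_cm)
value_range = max(heights_cm) - min(heights_cm)
variance = statistics.pvariance(heights_cm)   # population variance
std_dev = statistics.pstdev(heights_cm)       # population standard deviation

# A frequency distribution table is simply a count per distinct value.
frequency_table = Counter(heights_cm)

print(mean, median, mode, value_range, variance, std_dev)
print(sorted(frequency_table.items()))
```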

Bivariate analysis
Bivariate analysis is slightly more analytical than Univariate analysis. When the data set contains two
variables and researchers aim to compare the two, Bivariate analysis
is the right type of analysis technique.
Here is one simple example of bivariate analysis –
In a survey of a classroom, the researcher may be looking to analyze the ratio of students who scored
above 85% corresponding to their genders. In this case, there are two variables – gender = X
(independent variable) and result = Y (dependent variable). A Bivariate analysis will measure the
correlations between the two variables.
Bivariate analysis is conducted using –
• Correlation coefficients
• Regression analysis
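
As a sketch of the first technique, the Pearson correlation coefficient between two variables can be computed directly from its definition. The temperature and sales figures are hypothetical numbers invented for the example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

# Hypothetical bivariate data: temperature (independent) and ice cream sales (dependent).
temperature_c = [20, 25, 30, 35, 40]
ice_cream_sales = [200, 260, 310, 390, 440]

r = pearson_r(temperature_c, ice_cream_sales)
print("correlation:", r)
```

A value of r close to +1 indicates that sales rise with temperature; a value near 0 would indicate no linear relationship.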

A multivariate tree, on the other hand, considers multiple input variables (i.e., multiple features) at
each decision node. Instead of choosing a single feature to split the data, the tree selects a subset of
features that best separates the data into different classes. This subset can be determined using various
methods, such as information gain or Gini impurity. The tree then recursively splits the data using the
selected features, and the process continues until a stopping criterion is met. Multivariate trees are
more complex and computationally intensive than univariate trees, but they can capture
complex interactions between input variables and improve the accuracy of the model.

In practice, the choice of a univariate or multivariate tree depends on the characteristics of the data and
the complexity of the problem. For simple problems with few input variables, a univariate tree may be
sufficient and faster to train. For more complex problems with many input variables, a multivariate
tree may be necessary to capture the interactions between features and achieve higher accuracy.

In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan
that is used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm, and is
typically used in the machine learning and natural language processing domains.

Multivariate analysis

Multivariate analysis is a more complex form of statistical analysis and is used when there are
more than two variables in the data set.
Here is an example of multivariate analysis –
A doctor has collected data on cholesterol, blood pressure, and weight. She also collected data on
the eating habits of the subjects (e.g., how many ounces of red meat, fish, dairy products, and chocolate
are consumed per week). She wants to investigate the relationship between the three measures of health
and eating habits.

In this instance, a multivariate analysis would be required to understand the relationship of each
variable with each other.

Commonly used multivariate analysis techniques include –

• Factor Analysis
• Cluster Analysis
• Variance Analysis
• Discriminant Analysis
• Multidimensional Scaling
• Principal Component Analysis
• Redundancy Analysis

Examples of Univariate and Multivariate

1. Univariate data –

This type of data consists of only one variable. The analysis of univariate data is thus the simplest form
of analysis since the information deals with only one quantity that changes. It does not deal with
causes or relationships, and the main purpose of the analysis is to describe the data and find patterns that
exist within it. An example of univariate data is height.

Suppose that the heights of seven students of a class are recorded (figure 1). There is only one variable,
height, and it is not dealing with any cause or relationship. The patterns found in this
type of data can be described using central tendency measures (mean, median and
mode), dispersion or spread of data (range, minimum, maximum, quartiles, variance and standard
deviation) and by using frequency distribution tables, histograms, pie charts, frequency polygons and bar
charts.

2. Bivariate data
This type of data involves two different variables. The analysis of this type of data deals with
causes and relationships, and the analysis is done to find out the relationship between the two variables.
An example of bivariate data is temperature and ice cream sales in the summer season.
Suppose the temperature and ice cream sales are the two variables of a bivariate data set (figure 2). Here, the
relationship is visible from the table: temperature and sales are positively related,
because as the temperature increases, the sales also increase. Thus bivariate data
analysis involves comparisons, relationships, causes and explanations. These variables are often plotted
on the X and Y axes of a graph for better understanding of the data, and one of these variables is independent
while the other is dependent.

3. Multivariate data
When the data involves three or more variables, it is categorized as multivariate. As an example of this
type of data, suppose an advertiser wants to compare the popularity of four advertisements on a
website. Their click rates could be measured for both men and women, and the relationships between
variables can then be examined. It is similar to bivariate but contains more than one dependent variable.
The ways to perform analysis on this data depend on the goals to be achieved. Some of the techniques
are regression analysis, path analysis, factor analysis and multivariate analysis of variance
(MANOVA).
Many different tools, techniques and methods can be used to conduct the analysis, including
software libraries, visualization tools and statistical testing methods. The table below
compares Univariate, Bivariate and Multivariate analysis.

| Univariate | Bivariate | Multivariate |
|---|---|---|
| It summarizes only a single variable at a time. | It summarizes only two variables. | It summarizes more than two variables. |
| It does not deal with causes and relationships. | It does deal with causes and relationships, and analysis is done. | It deals with relationships among several variables, and analysis is done. |
| It does not contain any dependent variable. | It contains only one dependent variable. | It is similar to bivariate but it contains more than two variables. |
| The main purpose is to describe. | The main purpose is to explain. | The main purpose is to study the relationship among them. |
| An example of univariate data is height. | An example of bivariate data is temperature and ice cream sales in summer vacation. | Example: suppose an advertiser wants to compare the popularity of four advertisements on a website. Their click rates could be measured for both men and women, and the relationships between variables can be examined. |

PRUNING

Pruning is a technique used in supervised learning to prevent overfitting of a model. Overfitting occurs
when a model learns the training data too well and performs poorly on new, unseen data. Pruning
involves removing some parts of the model that are not essential to its performance, with the aim of
reducing its complexity and improving its generalization ability.

There are two main types of pruning techniques: pre-pruning and post-pruning.

• Pre-pruning: Pre-pruning involves stopping the growth of a decision tree before it becomes
too complex and overfits the training data. This can be done by setting a maximum tree depth,
a minimum number of data points in a leaf node, or a threshold for the information gain at each
decision node. Pre-pruning is simple and computationally efficient, but it may not capture
complex relationships in the data.
• Post-pruning: Post-pruning involves growing a decision tree to its maximum depth and then
removing the unnecessary branches that do not improve the model's performance on the
validation data. This can be done by calculating a measure of impurity reduction or error rate
reduction for each subtree and removing the subtree that does not meet a certain criterion. Post-
pruning is more computationally intensive than pre-pruning, but it can capture complex
relationships and improve the accuracy of the model.
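
A minimal sketch of pre-pruning, assuming a single numeric feature and using the Gini index to score splits. The function names and the toy data are assumptions for illustration. Post-pruning is not shown here; only the two stopping rules mentioned above (maximum depth and minimum number of points per leaf) are implemented.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(points, depth=0, max_depth=2, min_samples_leaf=2):
    """points: list of (x, label) pairs.
    Returns a nested dict for a decision node or a class label for a leaf."""
    labels = [label for _, label in points]
    majority = Counter(labels).most_common(1)[0][0]
    # Pre-pruning rule 1: stop when the node is pure or the depth limit is reached.
    if len(set(labels)) == 1 or depth >= max_depth:
        return majority
    best = None
    xs = sorted({x for x, _ in points})
    for low, high in zip(xs, xs[1:]):
        threshold = (low + high) / 2
        left = [p for p in points if p[0] <= threshold]
        right = [p for p in points if p[0] > threshold]
        # Pre-pruning rule 2: reject splits that would create too-small leaves.
        if len(left) < min_samples_leaf or len(right) < min_samples_leaf:
            continue
        score = (len(left) * gini([l for _, l in left])
                 + len(right) * gini([l for _, l in right])) / len(points)
        if best is None or score < best[0]:
            best = (score, threshold, left, right)
    if best is None:
        return majority
    _, threshold, left, right = best
    return {"threshold": threshold,
            "left": build_tree(left, depth + 1, max_depth, min_samples_leaf),
            "right": build_tree(right, depth + 1, max_depth, min_samples_leaf)}

def tree_depth(tree):
    if not isinstance(tree, dict):
        return 0
    return 1 + max(tree_depth(tree["left"]), tree_depth(tree["right"]))

# Hypothetical one-feature data set whose classes alternate in blocks.
data = [(1, "a"), (2, "a"), (3, "b"), (4, "b"), (5, "a"), (6, "a"), (7, "b"), (8, "b")]

shallow = build_tree(data, max_depth=1)
deep = build_tree(data, max_depth=3)
print(tree_depth(shallow), tree_depth(deep))
```

Lowering `max_depth` or raising `min_samples_leaf` produces a smaller tree, at the cost of possibly missing structure such as the alternating blocks in this toy data.
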
Pruning can be applied to other supervised learning algorithms as well, such as neural networks and
support vector machines. In neural networks, pruning can involve removing some of the neurons or
connections that are not essential to the model's performance. In support vector machines, pruning can
involve removing some of the support vectors or adjusting the regularization parameter to control
the model's complexity.
Overall, pruning is an important technique in supervised learning to prevent overfitting and improve
the generalization ability of a model.

BAYESIAN DECISION THEORY

Bayesian Decision Theory is a framework for decision making in the presence of uncertainty. It provides
a way to make decisions by explicitly considering probabilities and the consequences of different actions.
The goal of Bayesian decision theory is to choose the action that maximizes the expected utility, which
is a measure of the desirability of different outcomes.

The basic idea of Bayesian decision theory is to model the problem as a probabilistic graphical model,
which captures the relationships between different variables and their probabilities. The model includes
a set of actions, a set of possible outcomes, and a set of features or observations that provide information
about the state of the system.
To make a decision, Bayesian decision theory calculates the expected utility of each action, which is the
sum of the utilities of all possible outcomes weighted by their probabilities. The utility function
represents the preferences of the decision maker and assigns a value to each outcome based on its
desirability.
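
The expected-utility calculation described above can be sketched as follows. The actions, outcome probabilities, and utility values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def expected_utility(action, probabilities, utility):
    """Sum of the utilities of all outcomes, weighted by their probabilities."""
    return sum(p * utility[(action, outcome)]
               for outcome, p in probabilities.items())

# Hypothetical decision problem: whether to carry an umbrella.
probabilities = {"rain": 0.3, "dry": 0.7}
utility = {
    ("umbrella", "rain"): 8, ("umbrella", "dry"): 6,
    ("no_umbrella", "rain"): 0, ("no_umbrella", "dry"): 10,
}
actions = ["umbrella", "no_umbrella"]

scores = {a: expected_utility(a, probabilities, utility) for a in actions}
best_action = max(scores, key=scores.get)
print(scores, best_action)
```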

Bayesian decision theory also incorporates prior beliefs about the probabilities and outcomes, which can
be updated based on new information using Bayes' theorem. This allows the decision maker to adapt to
changing circumstances and update their beliefs as new data becomes available.

Bayesian decision theory has applications in many areas of science and engineering, including
economics, finance, engineering, and artificial intelligence. It provides a principled way to make
decisions based on probabilities and expected utilities, which can help to improve the quality and
consistency of decision making.

Bayesian Decision Theory (i.e. the Bayesian Decision Rule) predicts the outcome not only
based on previous observations, but also by taking into account the current situation. The rule
describes the most reasonable action to take based on an observation.

The formula for Bayesian (Bayes) decision theory is given below:

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

The elements of the theory are:

• P(Ci): Prior probability. This accounts for how often the class Ci occurs
independently of any conditions (i.e. regardless of the input X).
• P(X|Ci): Likelihood. This is how often the conditions X are observed when the
outcome is Ci.
• P(X): Evidence. How often the conditions X occur.
• P(Ci|X): Posterior. The probability that the outcome Ci occurs given some conditions X.
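
A short sketch of how these four elements combine into a decision. The class names and the prior and likelihood values are hypothetical; the rule shown simply picks the class with the largest posterior.

```python
def posteriors(priors, likelihoods):
    """priors[c] = P(Ci); likelihoods[c] = P(X | Ci) for one observed X.
    Returns P(Ci | X) for every class."""
    evidence = sum(priors[c] * likelihoods[c] for c in priors)  # P(X)
    return {c: priors[c] * likelihoods[c] / evidence for c in priors}

# Hypothetical two-class problem.
priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {"spam": 0.8, "ham": 0.1}   # P(observed words | class)

post = posteriors(priors, likelihoods)
decision = max(post, key=post.get)
print(post, decision)
```

Even though "ham" has the larger prior, the observation is much more likely under "spam", so the posterior favors "spam".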

PARAMETRIC METHODS IN SUPERVISED LEARNING

The basic idea is that there is a set of fixed parameters that determine a probability model.
Parametric methods are often those for which we know that the population is approximately normal, or
we can approximate using a normal distribution after we invoke the central limit theorem.

Parametric statistics are based on assumptions about the distribution of the population from which
the sample was taken. Nonparametric statistics do not rely on such distributional assumptions; that is,
the data can be collected from a sample that does not follow a specific distribution.

Parametric methods are a type of supervised learning algorithm that assumes the data follows a
particular distribution or functional form. The goal of a parametric method is to estimate the
parameters of this distribution or function based on the training data, and then use these parameters to
make predictions on new data.

Some common examples of parametric methods in supervised learning include linear regression, logistic
regression, and Naive Bayes. These methods make certain assumptions about the underlying distribution
or function, such as linearity in the case of linear regression, and then estimate the parameters that best
fit the data using maximum likelihood or other statistical techniques.

The main advantage of parametric methods is that they are often computationally efficient and can work
well with small to moderate-sized datasets. They also provide interpretable models that can help to
explain the relationships between the input variables and the output variable.

However, the main disadvantage of parametric methods is that they can be limited by the assumptions
made about the underlying distribution or function. If these assumptions are not valid, the model may
not accurately capture the true relationship between the input and output variables, and may lead to poor
predictions. In addition, parametric methods may not be able to capture complex nonlinear relationships
in the data.

Overall, parametric methods are a useful tool in supervised learning, particularly for simple and well-
understood problems where the assumptions made about the underlying distribution or function are
valid. However, for more complex problems or situations where the assumptions do not hold, other
methods such as non-parametric or semi-parametric methods may be more appropriate.

Some more examples of parametric machine learning algorithms include:

• Logistic Regression
• Linear Discriminant Analysis
• Perceptron
• Naive Bayes
• Simple Neural Networks
Benefits of Parametric Machine Learning Algorithms:

Simpler: These methods are easier to understand and interpret results.


Speed: Parametric models are very fast to learn from data.
Less Data: They do not require as much training data and can work well even if the fit to the data is
not perfect.

Limitations of Parametric Machine Learning Algorithms:

Constrained: By choosing a functional form these methods are highly constrained to the specified
form.
Limited Complexity: The methods are more suited to simpler problems.
Poor Fit: In practice the methods are unlikely to match the underlying mapping function.

MAXIMUM LIKELIHOOD ESTIMATION

Maximum Likelihood Estimation (MLE) is a method used in supervised learning to estimate the
parameters of a model that best explain the observed data. The goal of MLE is to find the set of parameter
values that maximize the likelihood of observing the data given the model.

In supervised learning, MLE is commonly used to estimate the parameters of a probabilistic model, such
as a Gaussian distribution or a logistic regression model. The likelihood function is defined as the
probability of observing the training data given the model parameters. The maximum likelihood estimate
of the parameters is the set of values that maximize the likelihood function.

The MLE process involves several steps:

• Choose a probability distribution or functional form that describes the relationship between the
input variables and the output variable.
• Define the likelihood function as the probability of observing the training data given the model
parameters.
• Take the logarithm of the likelihood function to simplify the calculations and convert the
product of probabilities to a sum of logarithms.
• Maximize the logarithm of the likelihood function with respect to the parameters using
optimization techniques such as gradient descent or Newton's method.
• Once the maximum likelihood estimate of the parameters is obtained, the model can be used to
make predictions on new data.
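
The steps above can be sketched for the simplest case, a Gaussian model. This example assumes the data are independent draws from one normal distribution, for which the log-likelihood has a closed-form maximizer, so no gradient descent is needed. The data values are hypothetical.

```python
import math

def gaussian_log_likelihood(data, mu, sigma2):
    """Log of the product of Gaussian densities, written as a sum of logs."""
    n = len(data)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in data) / (2 * sigma2))

def gaussian_mle(data):
    """Closed-form maximizers of the Gaussian log-likelihood."""
    n = len(data)
    mu = sum(data) / n
    sigma2 = sum((x - mu) ** 2 for x in data) / n   # divides by n, not n - 1
    return mu, sigma2

data = [2.1, 2.5, 3.6, 4.0, 4.8]   # hypothetical observations
mu_hat, sigma2_hat = gaussian_mle(data)

best = gaussian_log_likelihood(data, mu_hat, sigma2_hat)
print("mu:", mu_hat, "sigma^2:", sigma2_hat, "log-likelihood:", best)
# Any other parameter values give a lower (or equal) log-likelihood.
print(gaussian_log_likelihood(data, mu_hat + 0.5, sigma2_hat) <= best)
```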

MLE is a powerful technique in supervised learning because it provides a way to estimate the parameters
of a model that best fit the observed data. However, it assumes that the model is correctly specified and
that the training data is representative of the population. If these assumptions do not hold, the MLE
estimate may be biased or have high variance. Therefore, it is important to carefully validate the model
and the data before using MLE for parameter estimation.

https://drive.google.com/file/d/1iWpQCYLJisBe8IXeWUmBVS2ehOKEUejW/view?usp=drive_link

BIAS AND VARIANCE OF ESTIMATOR

Evaluating the bias and variance of an estimator is an important step in supervised learning to
assess the performance of the model and identify any issues that need to be addressed. Bias refers to
the systematic error of the model, while variance refers to the variability of the model's predictions.

In the context of supervised learning, bias and variance can be evaluated using a technique called
cross-validation. Cross-validation involves partitioning the data into training and validation sets, and
then fitting the model to the training set and evaluating its performance on the validation set. This process
is repeated several times, each time using a different partition of the data, to obtain an estimate of the
model's bias and variance.

The bias of the estimator can be estimated by comparing the predictions of the model to the true values
in the validation set. If the model consistently underestimates or overestimates the true values, it has a
bias. The bias can be reduced by using a more flexible model or by adding informative features.
The variance of the estimator can be estimated by comparing the predictions of the model across different
partitions of the data. If the model's predictions vary widely across different partitions, it has high
variance. High variance can be reduced by using a simpler model, by regularizing the model to prevent
overfitting, or by increasing the size of the training set.
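
To make the two terms concrete, here is a small simulation sketch. It does not use cross-validation; instead it illustrates the statistical definition of bias and variance by repeatedly drawing samples from a known distribution and comparing two estimators of the variance. The sample size, number of trials, and seed are arbitrary assumptions.

```python
import random

def biased_variance(sample):
    """Maximum likelihood variance estimate: divides by n."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / n

def unbiased_variance(sample):
    """Sample variance with Bessel's correction: divides by n - 1."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / (n - 1)

def simulate(estimator, true_sigma=2.0, n=5, trials=20000, seed=0):
    """Return (bias, variance) of an estimator of sigma^2 over repeated samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        sample = [rng.gauss(0.0, true_sigma) for _ in range(n)]
        estimates.append(estimator(sample))
    mean_estimate = sum(estimates) / trials
    variance = sum((e - mean_estimate) ** 2 for e in estimates) / trials
    return mean_estimate - true_sigma ** 2, variance

bias_b, var_b = simulate(biased_variance)
bias_u, var_u = simulate(unbiased_variance)
print("divide by n:     bias", bias_b, "variance", var_b)
print("divide by n - 1: bias", bias_u, "variance", var_u)
```

The estimator that divides by n is systematically too small (negative bias) but varies less from sample to sample, which is a simple instance of the bias-variance trade-off.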

In addition to evaluating bias and variance, other metrics can also be used to assess the
performance of a model in supervised learning, such as accuracy, precision, recall, F1 score, and
area under the receiver operating characteristic (ROC) curve. These metrics can help to provide a
more comprehensive evaluation of the model's performance and identify areas for improvement.

Bayes Estimator

The Bayes' estimator is a method used in supervised learning to estimate the unknown parameter of a
statistical model based on Bayes' theorem. It is also known as the posterior mean or the conditional
expectation.

Bayes' theorem states that the posterior probability of a parameter given the data is proportional to the
likelihood of the data given the parameter and the prior probability of the parameter. Mathematically,
it can be written as:

P(θ|X) ∝ P(X|θ) P(θ)

where θ is the unknown parameter, X is the observed data, P(θ|X) is the posterior probability of θ
given X, P(X|θ) is the likelihood of X given θ, and P(θ) is the prior probability of θ.

The Bayes' estimator is the expected value of the parameter given the observed data, and can be
computed using the posterior distribution of the parameter. Mathematically, it can be written as:

θ^B = E[θ|X] = ∫ θ P(θ|X) dθ

where θ^B is the Bayes' estimator of θ, and the integral is taken over the entire parameter space.
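
As an illustrative sketch, consider estimating the success probability θ of a coin from k successes in n trials. Assuming a Beta(a, b) prior (an assumption made for this example, not something stated in the notes), the posterior mean has a well-known closed form, (a + k) / (a + b + n). The code also evaluates the integral above numerically on a grid to show that the two agree.

```python
def posterior_mean_closed_form(successes, trials, a=2.0, b=2.0):
    """Posterior mean of theta under a Beta(a, b) prior and Bernoulli data."""
    return (a + successes) / (a + b + trials)

def posterior_mean_numeric(successes, trials, a=2.0, b=2.0, grid=10000):
    """Approximate E[theta | X] = integral of theta * P(theta | X) d(theta)
    with the midpoint rule on a uniform grid over (0, 1)."""
    numerator = 0.0
    normalizer = 0.0
    for i in range(grid):
        theta = (i + 0.5) / grid
        # Unnormalized posterior: likelihood times prior density.
        weight = (theta ** (successes + a - 1)
                  * (1 - theta) ** (trials - successes + b - 1))
        numerator += theta * weight
        normalizer += weight
    return numerator / normalizer

k, n = 7, 10                      # hypothetical data: 7 successes in 10 trials
closed = posterior_mean_closed_form(k, n)
numeric = posterior_mean_numeric(k, n)
mle = k / n
print("Bayes estimate:", closed, "numeric:", numeric, "MLE:", mle)
```

With this prior the Bayes estimate is pulled from the maximum likelihood value 0.7 toward the prior mean 0.5.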

The Bayes' estimator is often used in supervised learning when the prior distribution of the parameter
is known or can be assumed, and when the likelihood function is well-defined. It provides a way to
incorporate prior knowledge about the parameter into the estimation process, and can lead to more robust
and accurate estimates.

However, the Bayes' estimator requires the specification of a prior distribution, which may not always
be easy to determine. The choice of the prior can also affect the resulting estimate, and different priors
can lead to different estimates. Therefore, it is important to carefully choose the prior based on prior
knowledge or expert opinion, or to use non-informative priors that have minimal influence on the estimate.

Linear Discriminant

Linear Discriminant Analysis (LDA) is a statistical method used in supervised learning to find a linear
combination of features that best separates two or more classes. It is commonly used in classification
problems, where the goal is to predict the class of a new observation based on its features.
LDA works by modeling the distribution of the features for each class, and then finding a linear boundary
that maximally separates the classes. Specifically, LDA seeks to find a linear discriminant function that
maximizes the between-class variance and minimizes the within-class variance.

To find the discriminant function, LDA first computes the mean and covariance matrix for each class.
It then computes a weighted average of the class covariance matrices, where the weights are proportional
to the number of observations in each class. This weighted average is the pooled within-class
covariance matrix of the data.

Next, LDA computes the between-class scatter matrix from the class means and solves the eigenvalue
problem for the product of the inverse within-class covariance matrix and the between-class scatter
matrix. The eigenvectors represent the directions along which the classes are best separated relative
to the within-class spread, and the eigenvalues measure the separation achieved along each
eigenvector. LDA then selects the eigenvectors corresponding to the largest
eigenvalues, and uses them to form a linear discriminant function.

The linear discriminant function can be used to project new observations onto a lower-dimensional
space, where they can be classified based on their position relative to the linear boundary. Alternatively,
it can be used to assign a class probability to each observation, based on the distance between the
observation and the linear boundary.
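
For two classes, the discriminant direction reduces to Fisher's formula w = S_W^(-1) (m1 - m0), where S_W is the within-class scatter matrix and m0, m1 are the class means. The following sketch works this out for two-dimensional points in plain Python. The data points are hypothetical, and the threshold at the midpoint of the projected means assumes equal class priors.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

def scatter(rows, mean):
    """2x2 scatter matrix: sum of outer products of deviations from the mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class0, class1):
    m0, m1 = mean_vector(class0), mean_vector(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    # Within-class scatter: sum of the two class scatter matrices.
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Threshold: projection of the midpoint between the class means.
    midpoint = [(m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2]
    threshold = w[0] * midpoint[0] + w[1] * midpoint[1]
    return w, threshold

def classify(point, w, threshold):
    return 1 if w[0] * point[0] + w[1] * point[1] > threshold else 0

class0 = [(1, 2), (2, 3), (3, 3), (2, 1)]   # hypothetical training points
class1 = [(6, 5), (7, 8), (8, 6), (7, 7)]

w, threshold = fisher_direction(class0, class1)
print("direction:", w, "threshold:", threshold)
print(classify((2, 2), w, threshold), classify((7, 6), w, threshold))
```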

LDA is a powerful and widely used method in supervised learning, and has been shown to perform well
on a variety of classification problems. However, it assumes that the data is normally distributed and
that the classes have equal covariance matrices, which may not always be true in practice.

Gradient Descent

Gradient Descent is a popular optimization algorithm used in supervised learning to minimize the cost
function of a model. In supervised learning, the goal is to learn a model that can make accurate
predictions on new, unseen data. To achieve this, we need to optimize the parameters of the model to
minimize the difference between the predicted output and the actual output.

The cost function is a measure of the difference between the predicted output and the actual output,
and is typically defined as the mean squared error, cross-entropy, or another appropriate metric
depending on the problem. The goal of Gradient Descent is to find the set of parameters that minimizes
the cost function.

The basic idea behind Gradient Descent is to iteratively update the parameters in the direction of the
negative gradient of the cost function. The negative gradient of the cost function tells us the direction
of steepest descent, or the direction in which the cost function decreases the most. By taking small
steps in this direction, we can iteratively approach the optimal set of parameters.
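
The update rule described above can be sketched for a line y = a0 + a1x fitted with the mean squared error. This is a batch version that uses all training examples in every step; the learning rate, number of steps, and data are assumptions chosen so that the example converges.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = a0 + a1 * x by minimizing the mean squared error."""
    a0, a1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [a0 + a1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_a0 = 2 / n * sum(errors)
        grad_a1 = 2 / n * sum(e * x for e, x in zip(errors, xs))
        # Step in the direction of the negative gradient.
        a0 -= learning_rate * grad_a0
        a1 -= learning_rate * grad_a1
    return a0, a1

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # generated from y = 1 + 2x

a0, a1 = gradient_descent(xs, ys)
print("intercept:", a0, "slope:", a1)
```

A learning rate that is too large makes the updates diverge, while one that is too small makes convergence slow, which is why the choice of learning rate matters in practice.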

There are several variants of Gradient Descent, including Batch Gradient Descent, Stochastic Gradient
Descent, and Mini-batch Gradient Descent. Batch Gradient Descent computes the gradient of the cost
function with respect to all the training examples at once and updates the parameters accordingly.
Stochastic Gradient Descent updates the parameters based on the gradient of the cost function with
respect to a single training example at a time. Mini-batch Gradient Descent is a compromise between
these two approaches, and updates the parameters based on the gradient of the cost function with respect
to a small batch of training examples at a time.
Gradient Descent is a powerful and widely used optimization algorithm in supervised learning, and is
used to train many popular models, including linear regression, logistic regression, and neural networks.
However, it can be sensitive to the choice of learning rate and can get stuck in local optima. To mitigate
these issues, various extensions and modifications to Gradient Descent have been proposed, including
adaptive learning rate methods and momentum-based methods.

Logistic Discrimination

Logistic Regression is a popular classification algorithm in supervised learning that is used to model the
probability of a binary response variable based on one or more predictor variables. It is a type of
discriminative model that seeks to learn a decision boundary that separates the two classes.

The decision boundary is represented by a linear function of the predictor variables, where the output
of the function is passed through the logistic function to obtain the predicted probability. The logistic
function, also known as the sigmoid function, maps any real-valued input to a value between 0 and 1,
which can be interpreted as the probability of belonging to the positive class.

The logistic regression model is trained using maximum likelihood estimation, where the goal is to find
the set of parameters that maximize the likelihood of the observed data. The likelihood function is
defined as the product of the conditional probabilities of the response variable given the predictor
variables, where the probabilities are modeled using the logistic function.
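
A minimal sketch of this training procedure with one predictor variable. It maximizes the likelihood by running gradient descent on the average negative log-likelihood; the data (hours studied versus pass/fail), learning rate, and step count are hypothetical choices for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by gradient descent
    on the average negative log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        preds = [sigmoid(b0 + b1 * x) for x in xs]
        grad_b0 = sum(p - y for p, y in zip(preds, ys)) / n
        grad_b1 = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1

# Hypothetical data: hours studied and whether the student passed (1) or failed (0).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(hours, passed)
print("P(pass | 1 hour):", sigmoid(b0 + b1 * 1))
print("P(pass | 8 hours):", sigmoid(b0 + b1 * 8))
```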

The logistic regression model can be extended to handle multiclass classification problems by using a
one-vs-rest approach, where a separate logistic regression model is trained for each class against the
remaining classes. Alternatively, a multinomial logistic regression model, also known as softmax
regression, can be used to directly model the probabilities of each class.

Logistic Regression Equation:

The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:

o We know the equation of the straight line can be written as:

y = a0 + a1x1 + a2x2 + ... + anxn

o In Logistic Regression y can be between 0 and 1 only, so for this let's divide y
by (1-y):

y / (1 - y), which is 0 for y = 0 and infinity for y = 1

o But we need a range between -[infinity] and +[infinity], so we take the logarithm of the expression,
and it becomes:

log[y / (1 - y)] = a0 + a1x1 + a2x2 + ... + anxn

The above equation is the final equation for Logistic Regression.

Type of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered
types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".

Logistic Regression is a simple yet powerful algorithm that is widely used in supervised learning
for its interpretability and ease of use. However, it assumes that the decision boundary
is linear in the predictor variables, which may not always be true in practice. Various extensions
to logistic regression have been proposed to handle non-linear relationships, including polynomial
feature expansion and kernel logistic regression.

Discrimination by Regression(Linear Regression)

Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a dependent variable (y) and one or
more independent variables (x), hence the name linear regression. Because the relationship is
linear, the model describes how the value of the dependent variable changes according
to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship between the
variables. Consider the below image:

Mathematically, we can represent a linear regression as:

y= a0+a1x+ ε

Here,

y = Dependent Variable (Target Variable)
x = Independent Variable (Predictor Variable)
a0 = Intercept of the line (gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor applied to each input value)
ε = Random error

The values for x and y variables are training datasets for Linear Regression model representation.
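
For a single independent variable, the coefficients a0 and a1 that minimize the squared error have a closed form. The sketch below applies it to hypothetical data (years of experience versus salary in thousands); the function name is an assumption for the example.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares estimates of a0 (intercept) and a1 (slope) in y = a0 + a1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    a0 = mean_y - a1 * mean_x
    return a0, a1

years = [1, 2, 3, 4, 5]          # hypothetical training data
salary = [30, 35, 41, 44, 50]

a0, a1 = fit_simple_linear_regression(years, salary)
predicted = a0 + a1 * 6          # prediction for a new input x = 6
print("a0:", a0, "a1:", a1, "prediction:", predicted)
```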

Types of Linear Regression

Linear regression can be further divided into two types of the algorithm:

Simple Linear Regression:

If a single independent variable is used to predict the value of a numerical dependent variable, then
such a Linear Regression algorithm is called Simple Linear Regression.

Multiple Linear regression:


If more than one independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Multiple Linear Regression.

Back Propagation in Neural Network: Machine Learning Algorithm

What is an Artificial Neural Network?

A neural network is a group of connected I/O units in which each connection has an associated weight.
It helps you to build predictive models from large databases. The model is inspired by the human
nervous system. It is used for tasks such as image understanding, human learning, and computer
speech.

What is Backpropagation?
Backpropagation is the essence of neural network training. It is the method of fine-tuning the weights of
a neural network based on the error rate obtained in the previous epoch (i.e., iteration). Proper
tuning of the weights allows you to reduce error rates and make the model reliable by increasing its
generalization.
Backpropagation in neural network is a short form for “backward propagation of errors.” It is a standard
method of training artificial neural networks. This method helps calculate the gradient of a loss function
with respect to all the weights in the network.

How Backpropagation Algorithm Works

The backpropagation algorithm in a neural network computes the gradient of the loss function with
respect to each weight by the chain rule. It computes the gradient efficiently one layer at a time, unlike
a naive direct computation. It computes the gradient, but it does not define how the gradient is used. It
generalizes the computation in the delta rule.

• Inputs X arrive through the preconnected path.
• The input is modeled using real weights W. The weights are usually randomly selected.
• Calculate the output for every neuron from the input layer, to the hidden layers, to the output
layer.
• Calculate the error in the outputs.
• Travel back from the output layer to the hidden layer to adjust the weights such that the error is
decreased.
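
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the network size (2 inputs, 2 hidden neurons, 1 output), the sigmoid activation, the squared-error loss, the learning rate, and all function names are choices made here, not taken from the notes.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, W2, b2):
    """Forward pass: input -> hidden layer -> single output neuron."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y


def loss(y, t):
    """Squared error between output y and target t."""
    return 0.5 * (y - t) ** 2


def backward(x, t, W1, b1, W2, b2):
    """Gradients of the loss w.r.t. every weight, via the chain rule."""
    h, y = forward(x, W1, b1, W2, b2)
    # Error signal at the output neuron: dL/dy * dy/dz.
    delta_out = (y - t) * y * (1.0 - y)
    gW2 = [delta_out * hi for hi in h]
    gb2 = delta_out
    # Propagate the error signal back to each hidden neuron.
    delta_h = [delta_out * w * hi * (1.0 - hi) for w, hi in zip(W2, h)]
    gW1 = [[d * xi for xi in x] for d in delta_h]
    gb1 = list(delta_h)
    return gW1, gb1, gW2, gb2


# Randomly selected initial weights (fixed seed for repeatability).
random.seed(0)
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
b1 = [0.0, 0.0]
W2 = [random.uniform(-1, 1) for _ in range(2)]
b2 = 0.0
x, t = [1.0, 0.0], 1.0   # one hypothetical training example
lr = 0.5                 # learning rate (assumed value)

loss_before = loss(forward(x, W1, b1, W2, b2)[1], t)
gW1, gb1, gW2, gb2 = backward(x, t, W1, b1, W2, b2)
# Adjust every weight against its gradient so that the error decreases.
W1 = [[w - lr * g for w, g in zip(w_row, g_row)] for w_row, g_row in zip(W1, gW1)]
b1 = [b - lr * g for b, g in zip(b1, gb1)]
W2 = [w - lr * g for w, g in zip(W2, gW2)]
b2 = b2 - lr * gb2
loss_after = loss(forward(x, W1, b1, W2, b2)[1], t)
```

Backpropagation itself is only the backward function; the weight update that follows it is plain gradient descent, which is one of several ways the gradient can be used.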
Why We Need Backpropagation?
Most prominent advantages of Backpropagation are:

-Backpropagation is fast, simple and easy to program
-It has no parameters to tune apart from the number of inputs
-It is a flexible method as it does not require prior knowledge about the network
-It is a standard method that generally works well
-It does not need any special mention of the features of the function to be learned.

What is a Feed Forward Network?


A feedforward neural network is an artificial neural network where the nodes never form a cycle. This
kind of neural network has an input layer, hidden layers, and an output layer. It is the first and simplest
type of artificial neural network.

Types of Backpropagation Networks


Two Types of Backpropagation Networks are:

-Static Back-propagation
-Recurrent Backpropagation
Static back-propagation:
It is one kind of backpropagation network which produces a mapping of a static input for static output.
It is useful to solve static classification issues like optical character recognition.

Recurrent Backpropagation:
In recurrent backpropagation, the activations are fed forward until they reach a fixed value. After that,
the error is computed and propagated backward.

MULTILAYER PERCEPTRON
A multilayer perceptron (MLP) is a type of artificial neural network commonly used in supervised
learning for classification and regression tasks. It consists of multiple layers of interconnected nodes,
also known as neurons, that are organized into input, hidden, and output layers.

The input layer receives the input data, which is then passed through the hidden layers to the output
layer, where the final prediction is made. Each neuron in the hidden layers is connected to every
neuron in the previous layer, and each connection is associated with a weight, which determines the
strength of the connection.

During training, the weights of the MLP are updated using backpropagation, which involves computing
the gradient of the loss function with respect to the weights and adjusting the weights in the direction
opposite to the gradient (gradient descent) to minimize the loss. The loss function typically measures
the difference between the predicted output and the true output, and the goal of training is to find the
set of weights that minimizes the loss on the training data.

MLPs are powerful models that can learn complex nonlinear relationships between the input and output
variables. However, they are prone to overfitting, where the model learns to memorize the training data
instead of generalizing to new data. Regularization techniques, such as L1 and L2 regularization,
dropout, and early stopping, can be used to prevent overfitting and improve the generalization
performance of the model.

MLPs have been successfully applied to a wide range of applications, including image classification,
speech recognition, natural language processing, and financial modeling.

UNIT-III

UNSUPERVISED LEARNING: Clustering - Introduction - Mixture Densities - k-Means Clustering -
Expectation-Maximization Algorithm - Mixtures of Latent Variable Models - Supervised Learning after
Clustering - Hierarchical Clustering

Clustering

Clustering is a type of unsupervised learning method. An unsupervised learning method is a method in
which we draw inferences from datasets consisting of input data without labelled responses. Generally,
it is used as a process to find meaningful structure, explanatory underlying processes, generative
features, and groupings inherent in a set of examples.

Clustering is the task of dividing the population or data points into a number of groups such that data
points in the same groups are more similar to other data points in the same group and dissimilar to the
data points in other groups. It is basically a collection of objects on the basis of similarity and
dissimilarity between them.

For example, the data points in the graph below clustered together can be classified into one single group.
We can distinguish the clusters, and we can identify that there are 3 clusters in the below picture.

Clustering Methods:

Density-Based Methods: These methods consider the clusters as the dense region having some
similarities and differences from the lower dense region of the space. These methods have good accuracy
and the ability to merge two clusters.
Example: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), etc.
Hierarchical Based Methods: The clusters formed in this method form a tree-type structure based on
the hierarchy. New clusters are formed using the previously formed one.
It is divided into two categories-
• Agglomerative (bottom-up approach)
• Divisive (top-down approach)
Partitioning Methods: These methods partition the objects into k clusters, and each partition forms
one cluster. They optimize an objective criterion, such as a similarity function in which distance is
the major parameter.
Example: K-means, CLARANS (Clustering Large Applications based upon Randomized Search), etc.

Grid-based Methods: In this method, the data space is formulated into a finite number of cells that
form a grid-like structure. All the clustering operations done on these grids are fast and independent of
the number of data objects
Example: STING (Statistical Information Grid), wave cluster, CLIQUE (CLustering In Quest), etc.

Clustering Algorithms
The choice of clustering algorithm depends on the kind of data that we are using. For example, some
algorithms need the number of clusters in the given dataset to be guessed in advance, whereas others
work by finding the minimum distance between the observations of the dataset.
The clustering algorithms most widely used in machine learning are:

1. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It
classifies the dataset by dividing the samples into different clusters of equal variances. The
number of clusters must be specified in this algorithm. It is fast with fewer computations
required, with the linear complexity of O(n).
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the smooth density of
data points. It is an example of a centroid-based model, that works on updating the candidates
for centroid to be the center of the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications with Noise.
It is an example of a density-based model similar to the mean-shift, but with some remarkable
advantages. In this algorithm, the areas of high density are separated by the areas of low density.
Because of this, the clusters can be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an alternative
to the k-means algorithm, or for those cases where k-means fails. In a GMM, it is
assumed that the data points are generated from a mixture of Gaussian distributions.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical algorithm performs the
bottom-up hierarchical clustering. In this, each data point is treated as a single cluster at the outset
and then successively merged. The cluster hierarchy can be represented as a tree- structure.
6. Affinity Propagation: It is different from other clustering algorithms as it does not require
specifying the number of clusters. In this algorithm, messages are exchanged between pairs of
data points until convergence. Its O(N²T) time complexity, where N is the number of data points
and T is the number of iterations, is the main drawback of this algorithm.

Applications of Clustering
Below are some commonly known applications of clustering technique in Machine Learning:
o In Identification of Cancer Cells: The clustering algorithms are widely used for the identification
of cancerous cells. It divides the cancerous and non-cancerous data sets into different groups.
o In Search Engines: Search engines also work on the clustering technique. The search result
appears based on the closest object to the search query. It does it by grouping similar data
objects in one group that is far from the other dissimilar objects. The accurate result of a query
depends on the quality of the clustering algorithm used.
o Customer Segmentation: It is used in market research to segment the customers based on their
choice and preferences.
o In Biology: It is used in the biology stream to classify different species of plants and animals
using the image recognition technique.
o In Land Use: The clustering technique is used to identify areas of similar land use in the
GIS database. This can be very useful for finding the purpose for which a particular piece of
land is most suitable.
Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to
many thousands of dimensions. Such high-dimensional spaces of data are often encountered in areas
such as medicine, where DNA microarray technology can produce many measurements at once, and
the clustering of text documents, where, if a word-frequency vector is used, the number of dimensions
equals the size of the vocabulary.
Four problems need to be overcome for clustering in high-dimensional data:
• Multiple dimensions are hard to think in, impossible to visualize, and, due to the exponential
growth of the number of possible values with each dimension, complete enumeration of all
subspaces becomes intractable with increasing dimensionality. This problem is known as the
curse of dimensionality.
• The concept of distance becomes less precise as the number of dimensions grows, since the
distance between any two points in a given dataset converges. In particular, the distinction
between the nearest and the farthest point becomes meaningless.
• A cluster is intended to group objects that are related, based on observations of their attribute
values. However, given a large number of attributes, some of the attributes will usually not be
meaningful for a given cluster.
For example, in newborn screening a cluster of samples might identify new-borns that share
similar blood values, which might lead to insights about the relevance of certain blood values for
a disease. But for different diseases, different blood values might form a cluster, and other values
might be uncorrelated. This is known as the local feature relevance problem: different clusters
might be found in different subspaces, so a global filtering of attributes is not sufficient.
• Given a large number of attributes, it is likely that some attributes are correlated. Hence, clusters
might exist in arbitrarily oriented affine subspaces.

Mixture densities
In unsupervised learning, the calculation of the mixture density involves estimating the parameters of a
mixture model from the observed data. The mixture density represents the probability density function
(PDF) of the observed data, which is a combination of multiple component densities.
Here's a general overview of how the mixture density is calculated in unsupervised learning:
1. Choose the Mixture Model: Select the type of mixture model that best suits the data distribution.
Common choices include Gaussian Mixture Models (GMMs) or other types of mixture models like
Dirichlet Process Mixtures.
2. Specify the Number of Components: Determine the number of components (clusters) in the mixture
model. This can be done based on prior knowledge or using techniques such as model selection
criteria (e.g., AIC, BIC) or cross-validation.
3. Initialize the Model Parameters: Initialize the parameters of the mixture model, including the mixing
proportions and the parameters of each component distribution (e.g., mean and covariance for
Gaussian components).
4. E-step (Expectation Step): Given the current parameter estimates, calculate the posterior
probabilities, or responsibilities, of each component for each data point. This step is computed
using Bayes' theorem, i.e., the posterior probability of the latent variables given the observed data.
5. M-step (Maximization Step): Update the model parameters based on the responsibilities obtained in
the E-step. This typically involves maximizing the expected complete-data log-likelihood.
6. Iterative Optimization: Iterate between the E-step and M-step until convergence. The convergence
criterion can be based on the change in log-likelihood or a predetermined number of iterations.
7. Compute the Mixture Density: Once the mixture model parameters have converged, the mixture
density can be computed by combining the densities of each component, weighted by the
corresponding mixing proportions.
8. Utilize the Mixture Density: The calculated mixture density can be used for various purposes, such
as clustering, density estimation, anomaly detection, or generating new samples from the learned
distribution.
It's important to note that the specific algorithms and techniques used for the estimation and calculation
of the mixture density may vary depending on the chosen mixture model and the inference method
employed (e.g., EM algorithm, variational inference, etc.).
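
To make the E-step and M-step concrete, here is a small Python sketch of EM for a one-dimensional Gaussian mixture. It is an illustration under assumptions, not a definitive implementation: the function names, the toy data, the initial parameter values, the fixed number of iterations, and the small variance floor are all choices made here.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, mus, variances, weights, n_iter=30):
    """Run EM for a 1-D Gaussian mixture; returns parameters and log-likelihoods."""
    mus, variances, weights = list(mus), list(variances), list(weights)
    k, n = len(mus), len(data)
    log_liks = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp, ll = [], 0.0
        for x in data:
            joint = [weights[j] * normal_pdf(x, mus[j], variances[j]) for j in range(k)]
            total = sum(joint)          # mixture density at x
            ll += math.log(total)
            resp.append([p / total for p in joint])
        log_liks.append(ll)
        # M-step: re-estimate parameters from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)   # floor avoids a collapsed component
    return mus, variances, weights, log_liks


# Hypothetical data: two well-separated groups around 1 and 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
mus, variances, weights, log_liks = em_gmm_1d(
    data, mus=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
```

The list log_liks records the log-likelihood at every iteration, so a convergence test based on its change could replace the fixed iteration count.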

Gaussian Mixture Models (GMMs) assume that there are a certain number of Gaussian distributions,
and each of these distributions represents a cluster. Hence, a Gaussian Mixture Model tends to group the
data points belonging to a single distribution together.
Gaussian Mixture Models are probabilistic models and use the soft clustering approach for distributing
the points in different clusters.
Here, we have three clusters that are denoted by three colors: Blue, Green, and Cyan. Let's take the
data point highlighted in red. The probability of this point being a part of the blue cluster is 1, while the
probability of it being a part of the green or cyan clusters is 0.

Now, consider another point – somewhere in between the blue and cyan (highlighted in the below
figure). The probability that this point is a part of cluster green is 0, right? And the probability that
this belongs to blue and cyan is 0.2 and 0.8 respectively.
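In a one-dimensional space, the probability density function of a Gaussian distribution is given by:

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

Where μ is the mean and σ² is the variance.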


But this would only be true for a single variable. In the case of two variables, instead of a 2D bell-
shaped curve, we will have a 3D bell curve as shown below:

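The probability density function would be given by:

$$f(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{2\pi\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{T} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$

Where x is the input vector, μ is the 2D mean vector, and Σ is the 2×2 covariance matrix. The covariance
would now define the shape of this curve. We can generalize the same for d dimensions by replacing the
factor 2π with (2π)^(d/2).
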
K-Means Clustering-
The K-means clustering algorithm computes centroids and repeats until the centroids stop changing.
The number of clusters is assumed to be known in advance. It is also known as a flat clustering algorithm.
The letter 'K' in K-means denotes the number of clusters the method finds in the data.

The working of the K-Means algorithm is explained in the below steps:


Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as the initial centroids. (They need not be points from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K
clusters.
Step-4: Compute the mean of the points in each cluster and place the new centroid of that cluster there.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid.
Step-6: If any reassignment occurs, go to Step-4; otherwise go to FINISH.
Step-7: The model is ready.

The elbow method is a graphical way of finding the optimal 'K' in K-means clustering. It
works by computing the WCSS (Within-Cluster Sum of Squares), i.e., the sum of the squared distances
between the points in a cluster and the cluster centroid. The elbow graph shows WCSS values (on the
y-axis) corresponding to the different values of K (on the x-axis). When we see an elbow shape in the
graph, we pick the K-value where the elbow is formed. We call this point the elbow point. Beyond the
elbow point, increasing the value of 'K' does not lead to a significant reduction in WCSS.

Example:
Cluster the following eight points (with (x, y) representing locations) into three clusters:
A1 (2, 10), A2 (2, 5), A3 (8, 4), A4 (5, 8), A5 (7, 5), A6 (6, 4), A7 (1, 2), A8 (4, 9)

Initial cluster centres are: A1 (2, 10), A4 (5, 8) and A7 (1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as the Manhattan
distance:

ρ(a, b) = |x2 - x1| + |y2 - y1|

Given Points   Distance from center     Distance from center    Distance from center    Point belongs
               (2, 10) of Cluster-01    (5, 8) of Cluster-02    (1, 2) of Cluster-03    to Cluster
A1(2, 10)      0                        5                       9                       C1
A2(2, 5)       5                        6                       4                       C3
A3(8, 4)       12                       7                       9                       C2
A4(5, 8)       5                        0                       10                      C2
A5(7, 5)       10                       5                       9                       C2
A6(6, 4)       10                       5                       7                       C2
A7(1, 2)       9                        10                      0                       C3
A8(4, 9)       3                        2                       10                      C2

Cluster-01:
First cluster contains points-
• A1(2, 10)
Cluster-02:
Second cluster contains points-
• A3(8, 4)
• A4(5, 8)
• A5(7, 5)
• A6(6, 4)
• A8(4, 9)
Cluster-03:
Third cluster contains points-
• A2(2, 5)
• A7(1, 2)
Now,
• We re-compute the cluster centers.
• The new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01:
We have only one point A1 (2, 10) in Cluster-01.
• So, cluster center remains the same.
For Cluster-02:
Center of Cluster-02
= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)
= (6, 6)
For Cluster-03:
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)

This is completion of Iteration-01.

Iteration-02:

Given Points   Distance from center     Distance from center    Distance from center     Point belongs
               (2, 10) of Cluster-01    (6, 6) of Cluster-02    (1.5, 3.5) of Cluster-03 to Cluster
A1(2, 10)      0                        8                       7                        C1
A2(2, 5)       5                        5                       2                        C3
A3(8, 4)       12                       4                       7                        C2
A4(5, 8)       5                        3                       8                        C2
A5(7, 5)       10                       2                       7                        C2
A6(6, 4)       10                       2                       5                        C2
A7(1, 2)       9                        9                       2                        C3
A8(4, 9)       3                        5                       8                        C1
Repeat the procedure until the cluster assignments no longer change.
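
The worked example can be reproduced with a short Python sketch. It follows the example as given, using the Manhattan distance for assignment and the mean for the center update; the function names and the rule that an empty cluster keeps its old center are assumptions added here.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def assign(points, centers):
    """Index of the closest center for every point (ties go to the lower index)."""
    return [min(range(len(centers)), key=lambda j: manhattan(p, centers[j]))
            for p in points]


def update(points, labels, centers):
    """Mean of the points in each cluster; an empty cluster keeps its old center."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centers.append((sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members)))
        else:
            new_centers.append(old)
    return new_centers


def kmeans(points, centers, max_iter=100):
    """Alternate assignment and update until the labels stop changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
        centers = update(points, labels, centers)
    return labels, centers


# A1..A8 and the initial centers A1, A4, A7 from the example.
points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
initial = [(2, 10), (5, 8), (1, 2)]

labels_1 = assign(points, initial)              # Iteration-01 assignment
centers_1 = update(points, labels_1, initial)   # (2, 10), (6, 6), (1.5, 3.5)
labels_2 = assign(points, centers_1)            # Iteration-02 assignment
final_labels, final_centers = kmeans(points, initial)
```

Labels are zero-based in the code, so label 0 corresponds to Cluster-01, label 1 to Cluster-02, and label 2 to Cluster-03.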

Expectation-Maximization Algorithm

The Expectation-Maximization (EM) algorithm uses the observed data of the dataset to estimate the
missing values of the latent variables, and then uses those estimates to update the values of the
parameters in the maximization step.
Let us understand the EM algorithm in a detailed manner:
• Initialization Step: Initialize the parameters with a set of initial values, then give the incomplete
observed data to the system, assuming that the observed data comes from a specific model, i.e.,
a probability distribution.
• Expectation Step: Use the observed data and the current parameter values to estimate or guess
the values of the missing or incomplete data, i.e., update the latent variables.
• Maximization Step: Use the complete data generated in the "Expectation" step
to update the values of the parameters, i.e., update the hypothesis.
• Convergence Check: Check whether the values are converging. If yes, stop; otherwise repeat
the "Expectation" and "Maximization" steps until convergence occurs.
Applications of EM Algorithm

The latent variable model has several real-life applications in Machine learning:
• Used to calculate the Gaussian density of a function.
• Helpful for filling in the missing data in a sample.
• It finds plenty of use in different domains such as Natural Language Processing
(NLP), Computer Vision, etc.
• Used in image reconstruction in the field of Medicine and Structural Engineering.
• Used for estimating the parameters of the Hidden Markov Model (HMM) and also for some
other mixed models like Gaussian Mixture Models, etc.
• Used for finding the values of latent variables.

Hierarchical clustering is another unsupervised learning algorithm that is used to group together the
unlabelled data points having similar characteristics.

The closest distance between the two clusters is crucial for the hierarchical clustering. There are
various ways to calculate the distance between two clusters, and these ways decide the rule for
clustering. These measures are called Linkage methods. Some of the popular linkage methods are
given below:

1. Single Linkage: It is the Shortest Distance between the closest points of the clusters. Consider
the below image:

2. Complete Linkage: It is the farthest distance between two points of two different clusters. It
is one of the popular linkage methods as it forms tighter clusters than single linkage.
3. Average Linkage: It is the linkage method in which the distances between every pair of data points,
one from each cluster, are added up and then divided by the number of pairs to give the average
distance between two clusters. It is also one of the most popular linkage methods.
4. Centroid Linkage: It is the linkage method in which the distance between the centroids of the
clusters is calculated.

Hierarchical clustering algorithms fall into the following two categories.

Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms, each data point
is first treated as a single cluster, and pairs of clusters are then successively merged, or agglomerated
(bottom-up approach). The hierarchy of the clusters is represented as a dendrogram or tree structure.

Divisive hierarchical algorithms − On the other hand, in divisive hierarchical algorithms, all the data
points are treated as one big cluster, and the process of clustering involves dividing (top-down approach)
the one big cluster into various small clusters.

A dendrogram, which is a tree-like structure, is used to represent hierarchical clustering. Individual
objects are represented by leaf nodes, merged clusters by internal nodes, and the cluster containing all
objects by the root node. A representation of a dendrogram is shown in this figure:
Agglomerative Algorithm: Single Link
Single-nearest distance or single linkage is the agglomerative method that uses the distance between
the closest members of the two clusters. We will now solve a problem to understand it better:
Question: Find the clusters using the single link technique. Use Euclidean distance and draw the
dendrogram.

Sample No.   X      Y
P1           0.40   0.53
P2           0.22   0.38
P3           0.35   0.32
P4           0.26   0.19
P5           0.08   0.41
P6           0.45   0.30

Step 1: Compute the distance matrix. The Euclidean distance between two points Pi = (xi, yi) and
Pj = (xj, yj) is:

$$d(P_i, P_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$

We have to find the Euclidean distance between every pair of points. For example, the distance
between P1 and P2 is:

$$d(P_1, P_2) = \sqrt{(0.40 - 0.22)^2 + (0.53 - 0.38)^2} = \sqrt{0.0549} \approx 0.23$$

[Figure: distance matrix]

Similarly, find the Euclidean distance for every other pair of points. Every entry on the diagonal of the
distance matrix is 0, because the distance from a point to itself is zero. The matrix is also symmetric:
the distances above and below the diagonal are the same, e.g. d(P2, P5) is equal to d(P5, P2).
So we only need to compute the part of the matrix below the diagonal.

[Figure: updated (lower-triangular) distance matrix]

Step 2: Find the minimum element in the distance matrix and merge the two clusters it belongs to.
Here the minimum value is 0.10, so we combine P3 and P6 (0.10 appears in the P6 row and P3 column).
Next, update the distance matrix. Under single linkage, the distance from the merged cluster to any
other point is the minimum of the distances from its members to that point:
min ((P3,P6), P1) = min ((P3,P1), (P6,P1)) = min (0.22,0.24) = 0.22
min ((P3,P6), P2) = min ((P3,P2), (P6,P2)) = min (0.14,0.24) = 0.14
min ((P3,P6), P4) = min ((P3,P4), (P6,P4)) = min (0.13,0.22) = 0.13
min ((P3,P6), P5) = min ((P3,P5), (P6,P5)) = min (0.28,0.39) = 0.28

Now we repeat the same process: find the minimum element in the distance matrix and merge the
corresponding clusters. The minimum value is 0.13, so we combine (P3, P6) and P4. The distance
matrix is then updated as follows:
min (((P3,P6), P4), P1) = min (((P3,P6), P1), (P4,P1)) = min (0.22,0.37) = 0.22
min (((P3,P6), P4), P2) = min (((P3,P6), P2), (P4,P2)) = min (0.14,0.19) = 0.14
min (((P3,P6), P4), P5) = min (((P3,P6), P5), (P4,P5)) = min (0.28,0.23) = 0.23

Again repeating the same process: The minimum value is 0.14 and hence we combine P2 and P5.
Now, form the cluster of elements corresponding to the minimum value and update the distance matrix.
To update the distance matrix:
min ((P2,P5), P1) = min ((P2,P1), (P5,P1)) = min (0.23, 0.34) = 0.23
min ((P2,P5), (P3,P6,P4)) = min (((P3,P6,P4), P2), ((P3,P6,P4), P5)) = min (0.14, 0.23) = 0.14

Again repeating the same process: The minimum value is 0.14 and hence we combine P2,P5 and
P3,P6,P4. Now, form cluster of elements corresponding to minimum value and update the distance
matrix. To update the distance matrix:
min ((P2,P5,P3,P6,P4), P1) = min (((P2,P5), P1), ((P3,P6,P4), P1)) = min (0.23, 0.22) = 0.22
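
The merge-and-update loop used above can be sketched in Python as follows. To keep the arithmetic easy to check by hand, the sketch runs on five hypothetical 2-D points rather than on the table above; the function name single_link and the returned merge-history format are also assumptions made for illustration.

```python
import math


def single_link(points):
    """Agglomerative clustering with single linkage.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where clusters are lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]   # merge b into a
        del clusters[b]
    return merges


# Hypothetical points: two tight pairs and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 0.5), (10.0, 0.0)]
merges = single_link(points)
```

Each entry of merges corresponds to one horizontal link of the dendrogram, drawn at the height given by the merge distance.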

Mixtures of Latent Variable Models

A latent variable model is a statistical model that relates a set of observable variables (also called
manifest variables or indicators) to a set of latent variables.

                      Manifest variables
Latent variables      Continuous                Categorical
Continuous            Factor analysis           Item response theory
Categorical           Latent profile analysis   Latent class analysis

In machine learning, mixture models are a class of latent variable models that are used to represent
complex distributions by combining simpler component distributions. Latent variable models involve
unobserved variables (latent variables) that are used to capture hidden patterns or structure in the data.
Let's consider an example of a mixture of Gaussian distributions, which is one of the most commonly
used types of mixture models. In this case, the observed data is assumed to come from a combination
of several Gaussian distributions.
Model Representation: Latent Variables: We introduce a set of latent variables, often called "mixture
indicators" or "cluster assignments," denoted as z. Each latent variable z corresponds to a specific
component of the mixture.
Parameters: We have a set of parameters for the mixture model, including the mixing proportions π
and the parameters (mean and covariance) of each Gaussian component.
Data Generation: Sample Cluster: For each data point, we first sample a latent variable z from a
categorical distribution according to the mixing proportions π. This determines the component from
which the data point will be generated.
Generate Data: Given the selected component, we sample the data point x from the corresponding
Gaussian distribution.
Model Inference: Given observed data points x, the goal is to infer the latent variables z and the
model parameters.
Inference can be done using various techniques such as Expectation-Maximization (EM) algorithm,
variational inference, or Markov chain Monte Carlo (MCMC) methods.
Model Learning: The model parameters, including the mixing proportions π and the Gaussian
parameters, are learned from the observed data using the chosen inference algorithm.
The learning process involves iteratively updating the model parameters until convergence, maximizing
the likelihood or posterior probability of the observed data.
Model Utilization: Once the model is learned, it can be used for various tasks such as clustering, density
estimation, and anomaly detection.
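
The data generation process described above (first sample the latent variable z, then sample x from the chosen Gaussian) can be sketched in Python. The function name sample_mixture, the one-dimensional components, and the particular mixing proportions, means, and standard deviations are hypothetical values chosen for illustration.

```python
import random


def sample_mixture(weights, means, stds, n, seed=0):
    """Draw n (z, x) pairs from a 1-D Gaussian mixture."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Sample the latent component z from a categorical distribution
        # with probabilities given by the mixing proportions.
        z = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        # Given z, sample the observation from that Gaussian component.
        x = rng.gauss(means[z], stds[z])
        samples.append((z, x))
    return samples


weights = [0.3, 0.7]      # mixing proportions (pi)
means = [0.0, 10.0]
stds = [1.0, 1.0]
samples = sample_mixture(weights, means, stds, 5000)

# Fraction of points generated by the first component; close to 0.3.
fraction_first = sum(1 for z, _ in samples if z == 0) / len(samples)
```

In real data only x is observed; z is hidden, which is why inference methods such as EM are needed to recover it.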

There are several metrics for calculating the distance between two points in clustering techniques:
Distance Metrics
• Euclidean Distance
• Manhattan Distance
• Minkowski Distance
• Hamming Distance
• Cosine Distance
Euclidean Distance
Euclidean Distance represents the shortest distance between two vectors. It is the square root of the sum
of squares of differences between corresponding elements.

Manhattan Distance
Manhattan distance between two points in two dimensions is the sum of absolute differences of their
Cartesian coordinates. Manhattan distance is also called with different names such as rectilinear
distance, L1 distance, L1 norm, snake distance, city block distance, etc.

Minkowski Distance
Minkowski distance can be considered as a generalized form of both the Euclidean distance and the
Manhattan distance.
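The Minkowski distance of order p (where p is an integer, p ≥ 1) between two points
X = (x1, x2, …, xn) and Y = (y1, y2, …, yn) is given by:

$$D(X, Y) = \left(\sum_{i=1}^{n} |x_i - y_i|^{p}\right)^{1/p}$$

Setting p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.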

Hamming Distance
It is named after Richard Hamming. The hamming distance between two strings of equal length is the
number of positions at which the corresponding symbols are different. The strings can be letters, bits, or
decimal digits, etc.

Cosine Distance and Cosine Similarity


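The cosine of the angle θ between two non-zero vectors is derived from the Euclidean dot product formula:

$$A \cdot B = \|A\|\,\|B\| \cos\theta$$

Given two vectors A and B, the cosine similarity, cos(θ), is represented using the dot product and
magnitudes as below:

$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

The cosine distance is 1 − cos(θ).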
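
The metrics in this section can be written as short Python functions. This is a plain sketch for illustration; the function names and the sample inputs are assumptions, and the vectors are represented as ordinary tuples.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def minkowski(a, b, p):
    """Minkowski distance of order p; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(c1 != c2 for c1, c2 in zip(s, t))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical inputs.
p, q = (0.0, 0.0), (3.0, 4.0)
d_euclid = euclidean(p, q)                 # straight-line distance
d_manhattan = manhattan(p, q)              # sum of coordinate differences
d_hamming = hamming("karolin", "kathrin")  # differing character positions
cos_sim = cosine_similarity((1.0, 2.0), (2.0, 4.0))  # parallel vectors
cos_dist = 1.0 - cos_sim
```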
UNIT-IV

NONPARAMETRIC METHODS - Nonparametric Density Estimation - k-Nearest Neighbor Estimator -
Nonparametric Classification - Condensed Nearest Neighbor
DIMENSIONALITY REDUCTION - Subset Selection - Principal Components Analysis - Factor
Analysis - Multidimensional Scaling - Linear Discriminant Analysis

Nonparametric Density Estimation:


Nonparametric methods rest on the assumption that similar inputs have similar outputs. They are also
called instance-based or memory-based learning algorithms. There are three nonparametric density
estimation methods:
1. Histogram Estimator
2. Kernel Density Estimator (KDE)
3. KNN estimator (K – Nearest Neighbor Estimator)

Histogram Estimator
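It is the oldest and the most popular method used to estimate the density, where the input space is divided
into equal-sized intervals called bins. Given the training set $X = \{x^t\}_{t=1}^{N}$, an origin $x_0$ and
the bin width $h$, the bins are the intervals $[x_0 + mh, x_0 + (m+1)h)$ for integer $m$, and the histogram
density estimator function is:

$\hat{p}(x) = \frac{\#\{x^t \text{ in the same bin as } x\}}{N h}$

Histogram estimator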

The density of a sample depends on the number of training samples present in its bin. In
constructing the histogram of densities we choose both the origin and the bin width; the position of the
origin affects the estimate near the bin boundaries.
Kernel Density Estimator (KDE)
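The kernel estimator is used to obtain a smooth estimate of the probability density function (pdf).
Instead of counting samples in hard bins, every training sample contributes a weight that decays smoothly
with its distance from x. The kernel is nothing but this weight function. The Gaussian kernel is the most
popular kernel:

$K(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right)$

Gaussian kernel

The kernel estimator is also called Parzen windows:

$\hat{p}(x) = \frac{1}{N h} \sum_{t=1}^{N} K\left(\frac{x - x^t}{h}\right)$

Kernel density estimator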

As |x − x^t| increases, the training sample is farther away from the given sample and the kernel value
decreases. Hence the contribution of a farther sample is smaller than that of the nearest training
samples. Besides the Gaussian kernel there are many others: Rectangular (Uniform), Triangular,
Biweight, Cosine, etc.

K – Nearest Neighbor Estimator (KNN Estimator)


Unlike the previous methods, which fix the bin width h, in this estimation we fix the number of nearest
neighbors k. The density at a sample depends on the value of k and the distance of the kth nearest
neighbor from the sample. This is close to the kernel estimation method, with a window width that
adapts to the data. For one-dimensional data the K-NN density estimate is given below, where $d_k(x)$ is
the Euclidean distance from the sample to its kth nearest neighbor.

$\hat{p}(x) = \frac{k}{2 N d_k(x)}$

KNN Estimator

Let us have an example data sample and estimate the density at a point using nonparametric density
estimation functions.
Note: Points marked with 'x' are the given data samples. Unlike the above estimation methods, we
do not fix the bin size/width; instead, this density estimation method is based on the k value. With a small
k the estimate is spiky, with high density values close to the samples, and the estimate becomes smoother
and flatter as the value of k increases.
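The three estimators can be compared on a small one-dimensional sample. The sketch below follows the formulas described above (bin count divided by N·h, a sum of Gaussian kernels divided by N·h, and k / (2·N·d_k(x))); the sample values and function names are made up for illustration.

```python
import math

def histogram_estimate(x, samples, origin, h):
    """Count the samples that fall in the same bin as x, divided by N * h."""
    bin_index = math.floor((x - origin) / h)
    count = sum(1 for s in samples if math.floor((s - origin) / h) == bin_index)
    return count / (len(samples) * h)

def gaussian_kernel(u):
    """Standard Gaussian weight: largest at u = 0, decaying with |u|."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_estimate(x, samples, h):
    """Parzen-window estimate: every sample contributes a smooth weight."""
    return sum(gaussian_kernel((x - s) / h) for s in samples) / (len(samples) * h)

def knn_estimate(x, samples, k):
    """k-NN estimate for 1-D data: k / (2 * N * distance to the k-th neighbor).

    Assumes x does not coincide with k or more samples (d_k would be zero).
    """
    d_k = sorted(abs(x - s) for s in samples)[k - 1]
    return k / (2.0 * len(samples) * d_k)

# Hypothetical 1-D training sample.
samples = [1.0, 1.5, 2.0, 2.5, 4.0, 5.0, 5.5, 7.0]

print(histogram_estimate(1.2, samples, origin=0.0, h=2.0))  # 0.125
print(kernel_estimate(2.0, samples, h=1.0))
print(knn_estimate(3.0, samples, k=2))                      # 0.125
```

Changing `h` or `k` and re-running shows the smoothing effect discussed above: larger values give flatter estimates.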

Nonparametric classification:
Non-parametric classification is a type of machine learning algorithm that does not make explicit
assumptions about the functional form or distribution of the underlying data. Instead of estimating
parameters, non-parametric methods directly learn the patterns and relationships from the data. These
methods are particularly useful when the underlying data distribution is complex or unknown.

Here are some popular non-parametric classification algorithms:

k-Nearest Neighbors (k-NN): This algorithm classifies new data points based on the class labels of their
nearest neighbors in the training set. The value of k determines the number of neighbors considered.

Decision Trees: Decision trees recursively split the feature space based on different features to form a
hierarchical structure. Each internal node represents a feature, and each leaf node represents a class label.

Random Forests: Random forests are an ensemble method that combines multiple decision trees. Each
tree is trained on a random subset of the data and features, and the final prediction is obtained by
aggregating the predictions of individual trees.

Support Vector Machines (SVM): SVMs map the data into a higher-dimensional space and find the
optimal hyperplane that maximally separates the classes. The decision boundary is determined by a
subset of the training samples called support vectors.

Neural Networks: Neural networks are parametric models with a fixed set of weights, but very large
networks are sometimes loosely described as non-parametric because of their ability to learn complex
functions without explicit assumptions about the data distribution.

Naive Bayes: Naive Bayes classifiers are probabilistic models that assume independence between
features given the class. In their usual form (for example, with Gaussian class-conditional densities) they
are parametric; they become non-parametric only when the class-conditional densities are themselves
estimated non-parametrically, for example with kernel density estimates.
Non-parametric classification algorithms are generally flexible and can capture complex patterns in the
data. However, they may be computationally intensive and require more training data compared to
parametric methods. It is important to note that the choice of algorithm depends on the specific problem
and the characteristics of the dataset.
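As an illustration of the simplest of these methods, here is a minimal k-NN classifier sketch. It stores the training set and classifies a query by majority vote among its k nearest neighbors; the toy points and the labels "circle" and "square" are invented for the example.

```python
import math
from collections import Counter

def knn_classify(query, training_points, training_labels, k):
    """Return the majority label among the k training points nearest to query."""
    # Sort the indices of the training points by Euclidean distance to the query.
    order = sorted(range(len(training_points)),
                   key=lambda i: math.dist(query, training_points[i]))
    votes = Counter(training_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups.
training_points = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6)]
training_labels = ["circle", "circle", "circle", "square", "square", "square"]

print(knn_classify((1.5, 1.5), training_points, training_labels, k=3))  # circle
print(knn_classify((6.5, 6.5), training_points, training_labels, k=3))  # square
```

There is no training step: all the work happens at query time, which is why such methods are called instance-based or memory-based. An odd k avoids tied votes in two-class problems.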

Condensed nearest neighbor:


Condensed nearest neighbor (CNN, the Hart algorithm) is an algorithm designed to reduce the data set
for k-NN classification. It selects a set of prototypes U from the training data, such that 1-NN with U
can classify the examples almost as accurately as 1-NN does with the whole data set.
CNN is also used as an undersampling technique for imbalanced data: it removes redundant majority
class instances in such a manner that little information is lost and the dataset becomes more balanced.
The aim of CNN is to produce a subset of the dataset that can correctly classify all the data points in the
original dataset.
Simply removing majority class data points from the dataset at random would lead to information loss
and hence a deterioration of the algorithm's output.
Working of CNN
Let us understand the working of CNN.
Suppose that we have a dataset D, given by

$D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$

where xᵢ is a data point and yᵢ is its original classification


Step 1: Choose 'k': CNN works with a value of 'k'. Different values of 'k' generally give different
final results. Let us assume k = 3 for our case.
Step 2: Start the iteration by randomly choosing any k (here 3) points to keep in the store S.
For simplicity, consider that we have squares and circles and we select 3 of these points as
the initial points in the store S.
(PS: The diagram is just an illustration. Clearly, it is very much balanced. A dataset is usually called
imbalanced when the ratio of majority class to minority class instances is 70:30 or more extreme.)
Step 3: We check whether the store S is 'Training Set Consistent' or not. If it is, we stop; else we add a
point to the store so as to improve it and make it training set consistent.

Training Set Consistency: The store S is said to be training set consistent if running KNN with only the
points in the store as the reference set gives the same classification for every point as running KNN with
the entire dataset,
i.e.

$g_S(a_i) = g_D(a_i)$ for every data point $a_i$ in D

where gₓ(aᵢ) represents the classification of aᵢ with respect to dataset X


Now, we check if our store S with the 3 selected points is consistent or not.
Consider the dark blue point.
According to the store, it must be blue (since out of the 3 points in the store, 2 are blue), i.e.
3-NN with S as the dataset classifies the dark blue point as blue.

However, according to the complete dataset, that point must be red, because out of its 3 nearest
neighbors, 2 are red, i.e.
3-NN on the complete dataset classifies the dark blue point as red.

Clearly, we have a training set inconsistency.

Step 4: We select a random point from the dataset to add to the store such that the inconsistency in the
classification of the dark blue point is resolved, keeping the prediction on the full dataset as the gold
standard, i.e. select a random point xᵢ from the dataset such that on adding xᵢ we have

$g_{S \cup \{x_i\}}(a) = g_D(a)$, where a is the dark blue point

So we add the dark red point (the red square outlined with a blue dotted square) to the store S.

Step 5: Repeat till the Store S is training set consistent.
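The idea can be sketched in code. Note that this is the classic 1-NN form of Hart's condensing, which differs slightly from the k = 3 walkthrough above: the store starts with a single point, and any training point that the store misclassifies is added, until a full pass adds nothing. The toy data and function names are invented for illustration.

```python
import math

def nearest_label(query, store):
    """Label of the stored (point, label) pair closest to query (1-NN)."""
    return min(store, key=lambda item: math.dist(query, item[0]))[1]

def condensed_nearest_neighbor(points, labels):
    """Grow the store until every training point is classified correctly by it."""
    store = [(points[0], labels[0])]          # start with one arbitrary point
    changed = True
    while changed:                            # repeat until a pass adds nothing
        changed = False
        for point, label in zip(points, labels):
            misclassified = nearest_label(point, store) != label
            if misclassified and (point, label) not in store:
                store.append((point, label))
                changed = True
    return store

# Hypothetical toy data: two compact, well-separated classes.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
labels = ["circle"] * 4 + ["square"] * 4

store = condensed_nearest_neighbor(points, labels)
print(store)   # [((1, 1), 'circle'), ((8, 8), 'square')]
```

Here two prototypes are enough to classify all eight points correctly. With overlapping classes the store keeps more points, mostly those near the class boundary.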

Factor Analysis in Machine Learning:

1. Reduces a large number of variables into a smaller number of factors.
2. Puts maximum common variance into a common score.
3. Associates multiple observed variables with a latent variable.
4. Has the same number of factors and variables, where each factor contains a certain amount of
overall variance.

DIMENSIONALITY REDUCTION:
Dimensionality reduction technique can be defined as, "It is a way of converting the higher dimensions
dataset into lesser dimensions dataset ensuring that it provides similar information." These techniques
are widely used in machine learning for obtaining a better fit predictive model while solving the
classification and regression problems.

It is commonly used in the fields that deal with high-dimensional data, such as speech recognition, signal
processing, bioinformatics, etc. It can also be used for data visualization, noise reduction, cluster analysis,
etc.

A feature is an attribute that has an impact on a problem or is useful for the problem, and
choosing the important features for the model is known as feature selection.

Feature Selection Techniques


There are mainly two types of Feature Selection techniques, which are:

Supervised Feature Selection technique

Supervised Feature selection techniques consider the target variable and can be used for the labelled
dataset.

Unsupervised Feature Selection technique

Unsupervised Feature selection techniques ignore the target variable and can be used for the unlabelled
dataset.

Feature Selection Techniques in Machine Learning

There are mainly three techniques under supervised feature selection: wrapper methods, filter methods,
and embedded methods.

As most data scientists know, high dimensionality is a curse; the problem lies not only in the number of
dimensions but also in the quality of each feature. Dimensionality reduction is a set of techniques that
try to transform the input space into a space with fewer dimensions while keeping the meaning and value
of the features. This section goes through a greedy algorithm (greedy in the sense that it is not
guaranteed to find the optimal answer) that generates a selection of features that tries to minimize the
model's error.


Feature selection vs Feature extraction

Dimensionality reduction algorithms can be classified as feature selection methods or feature extraction
methods. Feature selection methods are interested in reducing the number of initial features to the ones
that give us the most information. On the other hand, feature extraction methods are interested in
finding a new set of features, different from the initial ones, and with fewer dimensions.

Subset selection

Subset selection is a feature selection algorithm that comes in two variants, forward selection and
backward selection. Both methods look for the subset of the initial features with the fewest dimensions
that contributes most to accuracy. A naive approach would be to try all the 2^n possible subsets, but
this becomes infeasible as the number of dimensions n grows. Instead, we add or remove features one at
a time, guided by a heuristic function (the error function). The performance of subset selection depends
highly on the model we choose and on the search strategy.

Forward selection
In forward selection we start with an empty set of features. For each feature that is not yet in the set, we
train the model with that feature added and test its performance; we then keep the feature that gives the
lowest error. We continue adding features in this way until the error is low enough, the error stops
improving, or we have selected a given proportion of the total features.


Backward selection

Backward selection works in the same way as forward selection, but instead of starting with an empty set
and adding features one by one, we start with the full set and remove features one by one. At each step
we remove the feature whose removal gives the lowest error.
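Forward selection can be sketched as follows. The notes do not fix a model or an error function, so this example assumes a 1-nearest-neighbor classifier scored by leave-one-out error; the dataset is invented so that only the first feature is informative.

```python
import math

def loo_error(rows, labels, features):
    """Leave-one-out error of a 1-NN classifier that uses only the given feature indices."""
    mistakes = 0
    for i, row in enumerate(rows):
        def distance(j):
            return math.sqrt(sum((row[f] - rows[j][f]) ** 2 for f in features))
        nearest = min((j for j in range(len(rows)) if j != i), key=distance)
        mistakes += labels[nearest] != labels[i]
    return mistakes / len(rows)

def forward_selection(rows, labels):
    """Greedily add the feature that lowers the error most; stop when nothing improves."""
    remaining = list(range(len(rows[0])))
    selected, best_error = [], float("inf")
    while remaining:
        error, feature = min((loo_error(rows, labels, selected + [f]), f)
                             for f in remaining)
        if error >= best_error:        # no candidate improves the error: stop
            break
        selected.append(feature)
        remaining.remove(feature)
        best_error = error
    return selected, best_error

# Hypothetical data: feature 0 separates the classes, feature 1 is misleading,
# feature 2 is constant.
rows = [(0.1, 5.0, 3.0), (0.2, 1.0, 3.0), (0.3, 4.0, 3.0), (0.4, 2.0, 3.0),
        (1.1, 5.1, 3.0), (1.2, 1.1, 3.0), (1.3, 4.1, 3.0), (1.4, 2.1, 3.0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

selected, best_error = forward_selection(rows, labels)
print(selected, best_error)   # [0] 0.0
```

Backward selection would use the same error function but start from all features and drop, at each step, the feature whose removal gives the lowest error.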

Principal Component Analysis:


Principal Component Analysis is an unsupervised learning algorithm that is used for dimensionality
reduction in machine learning. It is a statistical process that converts the observations of correlated
features into a set of linearly uncorrelated features with the help of an orthogonal transformation. These
new transformed features are called the Principal Components. It is one of the popular tools that is
used for exploratory data analysis and predictive modeling. It draws out the strong patterns in the given
dataset by keeping the directions of largest variance and discarding the directions of smallest variance.

PCA generally tries to find the lower-dimensional surface to project the high-dimensional data.

PCA works by considering the variance along each direction: directions with high variance carry most of
the information in the data, so the low-variance directions can be dropped to reduce the dimensionality.
Some real-world applications of PCA are image processing, movie recommendation systems, and
optimizing the power allocation in various communication channels. It is a feature extraction technique,
so it keeps the most important components and drops the least important ones.

The PCA algorithm is based on some mathematical concepts such as:

o Variance and Covariance
o Eigenvalues and Eigenvectors

Some common terms used in PCA algorithm:

o Dimensionality: It is the number of features or variables present in the given dataset. More
easily, it is the number of columns present in the dataset.
o Correlation: It signifies how strongly two variables are related to each other, i.e. how one variable
changes when the other changes. The correlation value ranges from -1 to +1. Here, -1 occurs if the
variables have a perfect negative linear relationship, and +1 indicates a perfect positive linear
relationship.
o Orthogonal: It defines that variables are not correlated to each other, and hence the correlation
between the pair of variables is zero.
o Eigenvectors: Given a square matrix M and a non-zero vector v, v is an eigenvector of M if Mv is
a scalar multiple of v, i.e. Mv = λv. The scalar λ is the corresponding eigenvalue.
o Covariance Matrix: A matrix containing the covariance between the pair of variables is called
the Covariance Matrix.

Principal Components in PCA

As described above, the transformed new features or the output of PCA are the Principal Components.
The number of these PCs are either equal to or less than the original features present in the dataset. Some
properties of these principal components are given below:

o The principal component must be the linear combination of the original features.
o These components are orthogonal, i.e., the correlation between a pair of variables is zero.
o The importance of each component decreases when going from 1 to n: the 1st PC has the
most importance, and the nth PC has the least importance.

Steps for PCA algorithm

1. Getting the dataset.


Firstly, we need to take the input dataset and divide it into two subparts X and Y, where X is
the training set, and Y is the validation set.
2. Representing data into a structure.

Now we will represent our dataset into a structure. Such as we will represent the two-
dimensional matrix of independent variable X. Here each row corresponds to the data items, and
the column corresponds to the Features. The number of columns is the dimensions of the dataset.
3. Standardizing the data

In this step, we will standardize our dataset. Without standardization, the features with
high variance dominate the features with lower variance. If the importance of features is
independent of the variance of the feature, then we subtract the column mean from each data
item and divide it by the standard deviation of the column. Here we will name the resulting
matrix Z.

4. Calculating the covariance of Z.

To calculate the covariance of Z, we take the matrix Z and transpose it. After transposing, we
multiply it by Z and divide by the number of data items minus one, i.e. $\frac{1}{n-1} Z^T Z$. The
output matrix is the covariance matrix of Z.
5. Calculating the eigenvalues and eigenvectors.
Now we need to calculate the eigenvalues and eigenvectors of the resulting covariance matrix.
The eigenvectors of the covariance matrix are the directions of the axes with high information,
and the corresponding eigenvalues give the amount of variance along each of these directions.
6. Sorting the eigenvectors.

In this step, we take all the eigenvalues and sort them in decreasing order, which means
from largest to smallest, and simultaneously sort the eigenvectors accordingly in the matrix P of
eigenvectors. The resultant matrix will be named P*.
7. Calculating the new features or principal components.

Here we calculate the new features. To do this, we multiply Z by the P* matrix: Z* = Z P*. In
the resultant matrix Z*, each observation is a linear combination of the original features, and
the columns of Z* are uncorrelated with each other.
8. Removing less important or unimportant features from the new dataset.
With the new feature set in hand, we decide what to keep and what to remove: we keep only
the relevant or important features in the new dataset and remove the unimportant ones.
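The steps can be traced on a small two-feature dataset. The sketch below standardizes the data, builds the covariance matrix, and finds the first principal component by power iteration, a simple stand-in for a full eigendecomposition that returns only the largest eigenvalue and its eigenvector. The data values and function names are illustrative assumptions.

```python
import math

def standardize(rows):
    """Subtract each column mean and divide by the column's sample standard deviation."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def covariance(z):
    """Covariance matrix Z^T Z / (n - 1) of already centered data."""
    n, d = len(z), len(z[0])
    return [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(d)]
            for a in range(d)]

def top_eigenpair(matrix, iterations=200):
    """Largest eigenvalue and its eigenvector by power iteration.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    d = len(matrix)
    v = [1.0] + [0.0] * (d - 1)
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
    return sum(v[a] * mv[a] for a in range(d)), v

# Hypothetical data: two positively correlated features.
rows = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

z = standardize(rows)
c = covariance(z)
eigenvalue, pc1 = top_eigenpair(c)
scores = [sum(r[j] * pc1[j] for j in range(len(pc1))) for r in z]  # 1-D projection
print(eigenvalue, pc1)
```

For two standardized features the first component always points along (1, 1)/√2 when the correlation is positive, and its eigenvalue equals the variance of the projected scores.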

Applications of Principal Component Analysis


o PCA is mainly used as the dimensionality reduction technique in various AI applications such
as computer vision, image compression, etc.

o It can also be used for finding hidden patterns if data has high dimensions. Some fields where
PCA is used are Finance, data mining, Psychology, etc.

MULTIDIMENSIONAL SCALING:
Multidimensional scaling is a technique for the graphical or visual representation of a set of objects,
starting from a matrix of distances or dissimilarities between them. Here the term object refers to
anything, for example, jackets, perfumes, cars, bikes, etc. With the help of multidimensional scaling, we
can visualize the similarity between the objects.
From the distance or dissimilarity values, we obtain a representation in which similar objects lie close to
each other. The smaller the distance or dissimilarity between two objects, the more similar they are, and
the bigger the distance, the less similar the objects are.
The word dimension here refers to the attribute of a dataset. If there are two attributes in a dataset or
matrix, then we will take a two-dimensional representation of the data, but this cannot be the case in
every dataset.
You might use more dimensions to represent more attributes, but this makes the outcome complex to
represent visually and hard to comprehend.
It is best to use at most three dimensions because beyond that the result can no longer be visualized
directly, although mathematically any number of dimensions can be used.
The term scaling represents the measurement of the object. It is like a scale of two numbers in which
one is higher and the other is lower that we can use to measure the preference or perception of the object
for a user.
For example, a scale from 1 to 5 represents a person's liking of street food.
Techniques of Multidimensional Scaling
There are multiple techniques available in multidimensional scaling that you can use. The choice of
technique depends on the input data you use for multidimensional scaling.
Metric Multidimensional scaling
Metric Multidimensional Scaling can be considered a technique for visualizing data: you input a distance
matrix with the distances between a set number of data points, and the technique produces a graph
displaying those observations.
Example
We have a matrix of distances between different cities. Let's name the city from A to E for simplicity.
The distance is in KM.
City    A     B     C     D     E
A       0   222   240   131   190
B     222     0   230    97    89
C     240   230     0   306   311
D     131    97   306     0    55
E     190    89   311    55     0
From the matrix, we can read off the distance from one city to another: from A to B it is 222 km,
from A to C it is 240 km, and so on. A value of 0 is the distance from a city to itself.

Plotting the coordinates that metric multidimensional scaling recovers from the given matrix gives a
graph of the cities, and if we add the directions north, south, east, and west, it can be read as a map.
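A possible way to compute such a plot is classical (metric) multidimensional scaling: square the distances, double-center the matrix, and use the two leading eigenvectors, scaled by the square roots of their eigenvalues, as coordinates. The sketch below applies this to the city matrix above; power iteration with deflation is used here as a simple standard-library substitute for a full eigendecomposition, and the function names are illustrative.

```python
import math

cities = ["A", "B", "C", "D", "E"]
distances = [[0, 222, 240, 131, 190],
             [222, 0, 230, 97, 89],
             [240, 230, 0, 306, 311],
             [131, 97, 306, 0, 55],
             [190, 89, 311, 55, 0]]

def double_center(d):
    """B = -1/2 * J * D^2 * J, where J centers rows and columns."""
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]      # equals column means (symmetric)
    total_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]

def top_eigenpair(matrix, iterations=500):
    """Dominant eigenvalue and eigenvector by power iteration."""
    n = len(matrix)
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * mv[i] for i in range(n)), v

def classical_mds(d, dims=2):
    """Coordinates whose pairwise distances approximate the entries of d."""
    b = double_center(d)
    n = len(b)
    coords = [[] for _ in range(n)]
    for _ in range(dims):
        value, vector = top_eigenpair(b)
        scale = math.sqrt(max(value, 0.0))       # guard against a negative eigenvalue
        for i in range(n):
            coords[i].append(scale * vector[i])
        # Deflate: remove the component just found before looking for the next one.
        b = [[b[i][j] - value * vector[i] * vector[j] for j in range(n)]
             for i in range(n)]
    return coords

coords = classical_mds(distances)
for name, (x, y) in zip(cities, coords):
    print(name, round(x, 1), round(y, 1))
```

The recovered map is determined only up to rotation and reflection, which is why compass directions have to be added by hand.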

Non-metric Multidimensional scaling


In non-metric multidimensional scaling, we use ordinal data.
Ordinal data is a categorical statistical data type in which the variables have natural, ordered categories
but the distances between the categories are unknown. Only the rank order of the dissimilarities is used,
but the output is still a configuration of points with ordinary (metric) distances between them.

Individual Differences Scaling


This is the third method or technique we can use to implement multidimensional scaling. In individual
differences scaling, we use the data based on personal human perception. This makes the individual
difference scaling method different from the above two methods.
The individual differences scaling method represents a more realistic model for implementing
multidimensional scaling. In this method, we do not use a single dissimilarity matrix. There can be
multiple input matrices, varying from person to person; that is why it is closer to reality.

Multidimensional Analysis of Preference


This is the fourth method in multidimensional scaling. Multidimensional analysis of preferences is the
same as the individual difference scaling, but the data we will use to implement multidimensional scaling
is different. We will use rating data in this technique, and as in the individual differences scaling method,
we can have multiple rating data matrices.
Math behind Multidimensional Scaling
Let's discuss the math behind multidimensional scaling. We have mentioned the distance in the
introduction of multidimensional scaling, but how do we calculate this distance to generate the output?
Well, you can do this with the help of the euclidean distance formula.

Euclidean distance
The Euclidean distance measures the distance between two vectors with real values. When computing
the distance between two rows of data with numerical values, such as floating point or integer values,
you are most likely to use the Euclidean distance:

$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

The smaller the Euclidean distance between two objects on the graph, the more similar the
objects are.

Linear discriminant analysis:


Linear Discriminant analysis is one of the most popular dimensionality reduction techniques used for
supervised classification problems in machine learning. It is also considered a pre-processing step for
modeling differences in ML and applications of pattern classification.

Whenever there is a requirement to separate two or more classes having multiple features efficiently, the
Linear Discriminant Analysis model is considered the most common technique to solve such
classification problems. For example, suppose we have two classes with multiple features and need to
separate them efficiently. If we classify them using a single feature, the classes may overlap.
To overcome the overlapping issue in the classification process, we have to keep increasing the number
of features used.

Example:

Let's assume we have to classify two different classes having two sets of data points in a 2-dimensional
plane.

It may be impossible to draw a straight line in the 2-D plane that separates these data points
efficiently. Using Linear Discriminant Analysis, however, we can reduce the 2-D plane to a 1-D axis on
which the separability between the classes is maximized.

How Linear Discriminant Analysis (LDA) works?

Linear Discriminant Analysis is used as a dimensionality reduction technique in machine learning, using
which we can easily transform a 2-D or 3-D graph into a 1-dimensional axis.

Let's consider an example where we have two classes in a 2-D plane having an X-Y axis, and we need
to classify them efficiently. As in the above example, LDA finds a straight line that serves as a new axis:
it uses the X and Y axes to create the new axis and projects the data onto it. The new axis is chosen so
that the distance between the class means is as large as possible while the variation within each class is
as small as possible.

Hence, we can maximize the separation between these classes and reduce the 2-D plane into 1-D.
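For two classes in a 2-D plane, the new axis can be computed directly with Fisher's formulation of LDA, in which the direction is w = S_W⁻¹ (m₁ − m₀), with m₀ and m₁ the class means and S_W the within-class scatter matrix. The sketch below assumes this two-class, two-feature case; the data points are invented so that neither the X nor the Y coordinate alone separates the classes.

```python
def mean(points):
    """Mean vector of a list of 2-D points."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, m):
    """2x2 scatter matrix: sum of outer products of deviations from the mean."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    return [[sxx, sxy], [sxy, syy]]

def fisher_direction(class0, class1):
    """Direction w = S_W^{-1} (m1 - m0) for two classes in two dimensions."""
    m0, m1 = mean(class0), mean(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]   # assumed non-zero
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def project(points, w):
    """1-D coordinate of each point on the new axis."""
    return [p[0] * w[0] + p[1] * w[1] for p in points]

# Hypothetical data: the classes overlap on X alone (both contain x = 2) and on Y alone.
class0 = [(0, 0), (1, 1), (2, 2), (1, 0), (1, 2)]
class1 = [(2, 0), (3, 1), (4, 2), (3, 0), (3, 2)]

w = fisher_direction(class0, class1)
print(w)                      # [1.0, -0.5]
print(project(class0, w))     # [0.0, 0.5, 1.0, 1.0, 0.0]
print(project(class1, w))     # [2.0, 2.5, 3.0, 3.0, 2.0]
```

On the new axis every point of the first class lies below every point of the second, so a single threshold separates them.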
Factor Analysis:

Factor analysis is a technique that reduces a large number of variables into a small number of factors;
this reduction is known as factoring the data, and deciding which variables to retain in the dataset also
comes under factor analysis. It is a purely statistical approach that describes the variability among
observed, correlated variables in terms of a potentially lower number of unobserved variables called
factors.

The factor analysis technique extracts the maximum common variance from all the variables and puts
them into a common score. It is a theory that is used in training the machine learning model and so it is
quite related to data mining. The belief behind factor analytic techniques is that the information gained
about the interdependencies between observed variables can be used later to reduce the set of variables
in a dataset.

Factor analysis is a very effective tool for inspecting variable relationships for complex concepts such
as social status, economic status, dietary patterns, and psychological scales, and it is used in fields such
as biology, psychometrics, personality theories, marketing, product management, operations research,
finance, etc. It helps a researcher investigate concepts that are not easily measured directly by
collapsing a large number of variables into a few easily interpretable underlying factors.

Factor analysis in machine learning is used to reduce the number of variables in a given dataset to a
smaller, more manageable set of factors. Multiple machine learning algorithms work in this manner.

These algorithms are trained with large amounts of data in order to lead the way to new applications.
Factor analysis is an unsupervised machine learning approach that is commonly used for
dimensionality reduction. As a result, machine learning and factor analysis can be used together to
create data mining approaches and make data analysis much more efficient.

Advantages of factor analysis:

Cost-Effective
Data research and data mining algorithms can be extremely expensive, but the statistical model of factor
analysis is available at a surprisingly affordable cost. Moreover, you don't need too many resources to
perform factor analysis. Additionally, it can be performed by experienced professionals as well as
beginners.
Measurable
One of the major benefits of factor analysis is its measurable nature. This statistical model can be
applied to various attributes. Whether the attribute is subjective or objective, it works well.

Flexible
Several machine learning algorithms are limited to a single approach, but factor analysis is an exception
and offers a lot of flexibility. The flexible approach of the statistical model helps determine the
connections between different variables and their underlying components.
UNIT-V

REINFORCEMENT LEARNING: Introduction- Single State Case:K-Armed Bandit- Elements of


Reinforcement Learning- Model-Based Learning- Temporal DifferenceLearning- Generalization-
Partially Observable States

INTRODUCTION:

Reinforcement learning is an area of Machine Learning. It is about taking suitable action to maximize
reward in a particular situation. It is employed by various software and machines to find the best possible
behavior or path it should take in a specific situation. Reinforcement learning differs from supervised
learning in that, in supervised learning, the training data comes with the answer key, so the model
is trained with the correct answers, whereas in reinforcement learning there is no answer key and
the reinforcement agent decides what to do to perform the given task. In the absence of a training dataset,
it is bound to learn from its experience.

Reinforcement Learning (RL) is the science of decision making. It is about learning the optimal behavior
in an environment to obtain maximum reward. In RL, the data is accumulated from machine learning
systems that use a trial-and-error method. Data is not part of the input that we would find in supervised
or unsupervised machine learning.

Reinforcement Learning Problem

An agent interacts with its environment. The agent exists in an environment described by some set of
possible states S.
The agent can perform any of a set of possible actions A. Each time it performs an action a_t
in some state s_t, the agent receives a real-valued reward r_t that indicates the immediate
value of this state-action transition. This produces a sequence of states s_i, actions a_i,
and immediate rewards r_i as shown in the figure.

The agent's task is to learn a control policy, 𝝅: S → A, that maximizes the expected sum of these rewards,
with future rewards discounted exponentially by their delay.
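The discounted sum of rewards can be written as r_0 + γ·r_1 + γ²·r_2 + …, where γ (with 0 ≤ γ < 1) is the discount factor. A minimal sketch of this computation follows; the reward sequences are made-up examples.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward received after a delay of t steps is weighted by gamma**t."""
    total = 0.0
    # Work backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

print(discounted_return([1, 1, 1], gamma=0.5))    # 1 + 0.5 + 0.25 = 1.75
print(discounted_return([0, 0, 10], gamma=0.9))   # 10 * 0.9**2, about 8.1
```

A smaller γ makes the agent short-sighted, because delayed rewards count for less; a γ close to 1 makes it value future rewards almost as much as immediate ones.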
Terms used in Reinforcement Learning
o Agent: An entity that can perceive/explore the environment and act upon it.
o Environment: The situation in which an agent is present or by which it is surrounded. In RL, we
assume a stochastic environment, which means it is random in nature.
o Action: Actions are the moves taken by an agent within the environment.
o State: State is a situation returned by the environment after each action taken by the agent.
o Reward: Feedback returned to the agent from the environment to evaluate the action of the
agent.
o Policy: Policy is a strategy applied by the agent for the next action based on the current state.
o Value: The expected long-term return with the discount factor, as opposed to the short-term
reward.
o Q-value: It is mostly similar to the value, but it takes one additional parameter, the current
action (a).

Difference between Reinforcement learning and Supervised learning:

Reinforcement learning | Supervised learning
Reinforcement learning is all about making decisions sequentially. In simple words, the output depends on the state of the current input, and the next input depends on the output of the previous input. | In supervised learning, the decision is made on the initial input, or the input given at the start.
In reinforcement learning the decisions are dependent, so labels are given to sequences of dependent decisions. | In supervised learning the decisions are independent of each other, so labels are given to each decision.
Example: Chess game, text summarization | Example: Object recognition, spam detection

Reinforcement learning problem characteristics

1. Delayed reward: The task of the agent is to learn a target function 𝜋 that maps from the current
state s to the optimal action a = 𝜋 (s). In reinforcement learning, training information is not
available in the form (s, 𝜋 (s)). Instead, the trainer provides only a sequence of immediate reward
values as the agent executes its sequence of actions. The agent, therefore, faces the problem of
temporal credit assignment: determining which of the actions in its sequence are to be credited
with producing the eventual rewards.
2. Exploration: In reinforcement learning, the agent influences the distribution of training
examples by the action sequence it chooses. This raises the question of which experimentation
strategy produces most effective learning. The learner faces a trade-off in choosing whether to
favor exploration of unknown states and actions, or exploitation of states and actions that it has
already learned will yield high reward.
3. Partially observable states: Although it is convenient to assume that the agent's sensors can
perceive the entire state of the environment at each time step, in many practical situations
sensors provide only partial information. In such cases, the agent needs to consider its previous
observations together with its current sensor data when choosing actions, and the best policy
may be one that chooses actions specifically to improve the observability of the environment.
4. Life-long learning: A robot is often required to learn several related tasks within the same
environment, using the same sensors. For example, a mobile robot may need to learn how to
dock on its battery charger, how to navigate through narrow corridors, and how to pick up
output from laser printers. This setting raises the possibility of using previously obtained
experience or knowledge to reduce sample complexity when learning new tasks.
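The exploration versus exploitation trade-off is easiest to see in the single-state case, the k-armed bandit. One common strategy, used here as an illustrative choice rather than the only option, is epsilon-greedy: with probability epsilon the agent tries a random arm, otherwise it pulls the arm with the highest estimated reward. The win probabilities, the seed, and the function name below are assumptions made for the example.

```python
import random

def epsilon_greedy_bandit(win_probabilities, steps, epsilon, seed=0):
    """Simulate an epsilon-greedy agent on a bandit with 0/1 rewards."""
    rng = random.Random(seed)
    k = len(win_probabilities)
    counts = [0] * k          # how often each arm was pulled
    estimates = [0.0] * k     # running average reward of each arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                              # explore
        else:
            arm = max(range(k), key=lambda a: estimates[a])     # exploit
        reward = 1.0 if rng.random() < win_probabilities[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the average: Q <- Q + (r - Q) / n
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

counts, estimates = epsilon_greedy_bandit([0.2, 0.5, 0.8], steps=1000, epsilon=0.1)
print(counts)
print([round(q, 2) for q in estimates])
```

With epsilon = 0 the agent never explores and can get stuck on the first arm that pays anything; with epsilon = 1 it never uses what it has learned.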

Elements of Reinforcement Learning

There are four main elements of Reinforcement Learning, which are given below:

1. Policy
2. Reward Signal
3. Value Function
4. Model of the environment

1) Policy: A policy defines how an agent behaves at a given time. It maps the perceived
states of the environment to the actions taken in those states. A policy is the core element of RL, as it
alone can define the behavior of the agent. In some cases it may be a simple function or a lookup table,
whereas in other cases it may involve general computation such as a search process. It can be a
deterministic or a stochastic policy:
For a deterministic policy: a = π(s)
For a stochastic policy: π(a | s) = P[A_t = a | S_t = s]

2) Reward Signal: The goal of reinforcement learning is defined by the reward signal. At each state,
the environment sends an immediate signal to the learning agent, and this signal is known as a reward
signal. These rewards are given according to the good and bad actions taken by the agent. The agent's
main objective is to maximize the total reward it receives. The reward signal can
change the policy: if an action selected by the agent leads to low reward, the policy may
change to select other actions in the future.

3) Value Function: The value function gives information about how good the situation and action are
and how much reward an agent can expect. A reward indicates the immediate signal for each good and
bad action, whereas a value function specifies the good state and action for the future. The value
function depends on the reward as, without reward, there could be no value. The goal of estimating
values is to achieve more rewards.

4) Model: The last element of reinforcement learning is the model, which mimics the behavior of the
environment. With the help of the model, one can make inferences about how the environment will
behave. For example, if a state and an action are given, the model can predict the next state and reward.

The model is used for planning, which means it provides a way to take a course of action by considering
all future situations before actually experiencing those situations. The approaches for solving the RL
problems with the help of the model are termed as the model-based approach. Comparatively, an
approach without using a model is called a model-free approach.
How does Reinforcement Learning Work?

To understand the working process of the RL, we need to consider two main things:

o Environment: It can be anything, such as a room, a maze, a football ground, etc.
o Agent: An intelligent agent, such as an AI robot.

Types of Reinforcement Learning

Positive Reinforcement: Positive reinforcement occurs when an event that follows a specific behavior
increases the strength and frequency of that behavior.

Negative Reinforcement: Negative reinforcement is the strengthening of a behavior because a negative
condition is stopped or avoided.

What is the K-Armed Bandit Problem?

The K-armed bandit (also known as the Multi-Armed Bandit problem) is a simple, yet powerful example
of allocation of a limited set of resources over time and under uncertainty. It was initially studied
by Thompson (1933), who suggested a heuristic for navigating the exploration-exploitation dilemma.
The problem has also been studied in the fields of computer science, operations research, probability
theory, and economics, and is well suited for exploring with the tools of reinforcement learning.

In its basic form, the problem considers a gambler standing in front of a row of K slot machines (also
known as one-armed bandits) and trying to conceive a strategy for which machine to play, for how
many times, and when to switch machines in order to increase the chances of making a profit.
What makes this premise interesting is that each of the bandits dispenses rewards according to a
probability distribution, which is specific to the bandit and is initially unknown to the gambler.
The optimal strategy, therefore, would involve striking a balance between learning more about the
individual probability distributions (exploration) and maximising the profits based on the information
acquired so far (exploitation).
Formalizing the K-armed Bandit Problem
Now let's formalise the K-armed bandit problem, so we can use it to introduce some of the tools
and techniques used in reinforcement learning. Let's say we are playing $K$ bandits and each
game consists of $T$ turns. Let $\mathcal{A}$ be the set of all possible actions in the game. As there
are $K$ arms to select from, it is clear that $|\mathcal{A}| = K$. We will also use $a_t$ to denote the action
from $\mathcal{A}$ taken at time $t$. Note that we are using the term time in a discrete sense and
interchangeably with turn.

We start by exploring a variant of the problem where each bandit dispenses rewards according
to an assigned Bernoulli distribution with success probability $p_k$, $k = 1, \dots, K$. In other words, the reward
of each arm is in {0,1}, and is given by the following probability mass function:

$$f(r; p_k) = \begin{cases} p_k & \text{if } r = 1 \\ 1 - p_k & \text{if } r = 0 \end{cases}$$
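
The Bernoulli bandit described here can be sketched in a few lines of Python. This is only an illustration: the class name `BernoulliBandit`, the three success probabilities and the seed are assumptions chosen for the example, not values from these notes.

```python
import random


class BernoulliBandit:
    """K-armed bandit whose arm k pays 1 with probability probs[k], else 0."""

    def __init__(self, probs, seed=0):
        self.probs = list(probs)        # hidden from the gambler in a real game
        self.rng = random.Random(seed)  # seeded so the sketch is reproducible

    @property
    def k(self):
        return len(self.probs)

    def pull(self, arm):
        """Play one turn on the given arm and return a reward in {0, 1}."""
        return 1 if self.rng.random() < self.probs[arm] else 0


def estimate_means(bandit, turns_per_arm):
    """Pure exploration: play every arm equally often and average the rewards."""
    return [
        sum(bandit.pull(arm) for _ in range(turns_per_arm)) / turns_per_arm
        for arm in range(bandit.k)
    ]


# Illustrative probabilities (assumed): arm 2 is the best arm.
bandit = BernoulliBandit([0.2, 0.5, 0.8])
estimates = estimate_means(bandit, turns_per_arm=2000)
best_arm = max(range(bandit.k), key=lambda a: estimates[a])
```

Playing every arm equally often is pure exploration; a practical strategy would shift turns toward `best_arm` as the estimates become reliable.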

Model-Based Learning:

Model-based Reinforcement Learning refers to learning optimal behavior indirectly by learning a model
of the environment by taking actions and observing the outcomes that include the next state and the
immediate reward. The models predict the outcomes of actions and are used in lieu of or in addition to
interaction with the environment to learn optimal policies.
Model: Anything the agent can use to predict how the environment will respond to its actions;
concretely, the state transition function T(s'|s,a) and the reward function R(s,a).
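
One simple way to learn such a model, sketched here under the assumption of small discrete state and action sets, is to count observed transitions and average observed rewards. The class `TabularModel` and the state and action names are hypothetical.

```python
from collections import defaultdict


class TabularModel:
    """Estimates T(s'|s,a) and R(s,a) from observed (s, a, r, s') samples."""

    def __init__(self):
        self.next_counts = defaultdict(lambda: defaultdict(int))
        self.reward_sum = defaultdict(float)
        self.visits = defaultdict(int)

    def update(self, s, a, r, s_next):
        """Record one interaction with the environment."""
        self.next_counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def transition_prob(self, s_next, s, a):
        """Empirical estimate of T(s'|s,a); 0.0 if (s, a) was never tried."""
        n = self.visits[(s, a)]
        return self.next_counts[(s, a)][s_next] / n if n else 0.0

    def expected_reward(self, s, a):
        """Empirical estimate of R(s,a); 0.0 if (s, a) was never tried."""
        n = self.visits[(s, a)]
        return self.reward_sum[(s, a)] / n if n else 0.0


# Hypothetical experience: "right" from s0 usually leads to s1, sometimes to s2.
model = TabularModel()
for _ in range(3):
    model.update("s0", "right", 0.0, "s1")
model.update("s0", "right", 1.0, "s2")
```

Once the estimates are good enough, the agent can plan with `transition_prob` and `expected_reward` instead of querying the real environment.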

Model-learning:

Why choose model-based reinforcement learning?


Model-based RL has the strong advantage of being sample efficient. Many systems behave approximately
linearly, at least locally, so very few samples are needed to learn a model of them. Once the model and
the cost function are known, we can plan the optimal controls without further sampling.
Components of model-based reinforcement learning

Model based learning algorithms: Model-based learning (also known as structure-based or eager
learning) takes a different approach by constructing models from the training data that can generalize
better than instance-based methods. This involves using algorithms like linear regression, logistic
regression, random forest, etc.

RL algorithms can be mainly divided into two categories – model-based and model-free.

Model-based, as it sounds, has an agent trying to understand its environment and creating a model for
it based on its interactions with this environment. In such a system, preferences take priority over the
consequences of the actions i.e. the greedy agent will always try to perform an action that will get the
maximum reward irrespective of what that action may cause.

On the other hand, model-free algorithms seek to learn the consequences of their actions through
experience via algorithms such as Policy Gradient, Q-Learning, etc. In other words, such an algorithm
will carry out an action multiple times and will adjust the policy (the strategy behind its actions) for
optimal rewards, based on the outcomes.

We can formulate a reinforcement learning problem via a Markov Decision Process (MDP). The
essential elements of such a problem are the environment, state, reward, policy, and value.

A policy is a mapping from states to actions. Finding an optimal policy leads to generating the maximum
reward. Given an MDP environment, we can use dynamic programming algorithms to compute optimal
policies, which lead to the highest possible sum of future rewards at each state.

Dynamic programming algorithms work on the assumption that we have a perfect model of the
environment's MDP. So, we're able to use a one-step look-ahead approach and compute rewards for all
possible actions.
In this section, we'll discuss how to find an optimal policy for a given MDP. More specifically, we'll
learn about two dynamic programming algorithms: value iteration and policy iteration. Then, we'll
discuss these algorithms' advantages and disadvantages over each other.

Policy Iteration

In policy iteration, we start by choosing an arbitrary policy $\pi$. Then, we iteratively
evaluate and improve the policy until convergence.

We evaluate a policy $\pi(s)$ by calculating the state value function $V(s)$:

$$V(s) = \sum_{s', r'} p(s', r' \mid s, \pi(s)) \left[ r' + \gamma V(s') \right]$$

Then, we calculate the improved policy by using one-step look-ahead to replace the initial policy $\pi(s)$:

$$\pi(s) = \arg\max_{a} \sum_{s', r'} p(s', r' \mid s, a) \left[ r' + \gamma V(s') \right]$$

Here, $r'$ is the reward generated by taking the action $a$, $\gamma$ is a discount factor for future rewards,
and $p$ is the transition probability.
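
A minimal sketch of policy iteration follows. The two-state MDP `TOY_MDP`, its action names and the dictionary layout `P[s][a] = [(probability, next_state, reward), ...]` are assumptions made for illustration.

```python
# Hypothetical two-state MDP: P[s][a] is a list of (probability, next_state, reward).
TOY_MDP = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "back": [(1.0, "A", 0.0)]},
}


def policy_iteration(P, gamma=0.9, tol=1e-8):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = {s: next(iter(P[s])) for s in P}  # arbitrary start: first listed action
    V = {s: 0.0 for s in P}

    def q(s, a):
        # One-step look-ahead value of taking action a in state s.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        # Policy evaluation: sweep until V stops changing for the current policy.
        while True:
            delta = 0.0
            for s in P:
                v_new = q(s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                break
        # Policy improvement: switch to a strictly better action where one exists.
        stable = True
        for s in P:
            best = max(P[s], key=lambda a: q(s, a))
            if q(s, best) > q(s, policy[s]) + 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return policy, V


pi_policy, pi_values = policy_iteration(TOY_MDP)
```

For this toy MDP the procedure settles on "go" in state A and "stay" in state B, with values close to 19 and 20 respectively (2 / (1 - 0.9) = 20 and 1 + 0.9 * 20 = 19).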
Value Iteration

In value iteration, we compute the optimal state value function by iteratively updating the
estimate $V(s)$. We start with a random value function $V(s)$. At each step, we update it:

$$V(s) = \max_{a} \sum_{s', r'} p(s', r' \mid s, a) \left[ r' + \gamma V(s') \right]$$
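
Value iteration can be sketched in the same style. The toy MDP and the layout `P[s][a] = [(probability, next_state, reward), ...]` are illustrative assumptions, and the sketch starts from an all-zero value function rather than a random one.

```python
# Hypothetical two-state MDP: P[s][a] is a list of (probability, next_state, reward).
TOY_MDP = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "back": [(1.0, "A", 0.0)]},
}


def value_iteration(P, gamma=0.9, tol=1e-8):
    """Repeat the max-over-actions update until V converges, then read off a policy."""
    V = {s: 0.0 for s in P}  # assumed starting point: all zeros

    def q(s, a):
        # One-step look-ahead value of taking action a in state s.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        delta = 0.0
        for s in P:
            v_new = max(q(s, a) for a in P[s])  # evaluation and improvement in one step
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break

    # Greedy policy with respect to the converged value function.
    policy = {s: max(P[s], key=lambda a: q(s, a)) for s in P}
    return policy, V


vi_policy, vi_values = value_iteration(TOY_MDP)
```

On this example the result matches what policy iteration finds on the same MDP: "go" in A, "stay" in B, with values near 19 and 20.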

Policy Iteration vs. Value Iteration

Policy iteration and value iteration are both dynamic programming algorithms that find an optimal policy
in a reinforcement learning environment. They both employ variations of Bellman updates and exploit
one-step look-ahead:

In policy iteration, we start with a fixed policy. Conversely, in value iteration, we
begin by selecting the value function. Then, in both algorithms, we iteratively improve until we reach
convergence.

The policy iteration algorithm updates the policy. The value iteration algorithm iterates over the value
function instead. Still, both algorithms implicitly update the policy and state value function in each
iteration.

In each iteration, the policy iteration function goes through two phases. One phase evaluates the
policy, and the other one improves it. The value iteration function covers these two phases by taking a
maximum over the utility function for all possible actions.

The value iteration algorithm is straightforward. It combines two phases of the policy iteration into a
single update operation. However, the value iteration function runs through all possible actions at once
to find the maximum action value. Consequently, the value iteration algorithm is computationally
heavier.

Both algorithms are guaranteed to converge to an optimal policy in the end. Yet, the policy iteration
algorithm converges within fewer iterations. As a result, the policy iteration is reported to conclude faster
than the value iteration algorithm.
TEMPORAL DIFFERENCE LEARNING
Temporal difference (TD) learning refers to a class of model-free reinforcement learning methods which
learn by bootstrapping from the current estimate of the value function. These methods sample from the
environment, like Monte Carlo methods, and perform updates based on current estimates, like dynamic
programming methods.

Temporal Difference Learning is an unsupervised learning technique that is very commonly used in
reinforcement learning for the purpose of predicting the total reward expected over the future. TD methods
can, however, be used to predict other quantities as well. It is essentially a way to learn how to predict a
quantity that depends on the future values of a given signal. It is a method that is used to compute
the long-term utility of a pattern of behaviour from a series of intermediate rewards.

Essentially, Temporal Difference Learning (TD Learning) focuses on predicting a variable's future value
in a sequence of states. Temporal difference learning was a major breakthrough in solving the problem
of reward prediction. You could say that it employs a mathematical trick that allows it to replace
complicated reasoning with a simple learning procedure that can be used to generate the very same
results.

The trick is that rather than attempting to calculate the total future reward, temporal difference learning
just attempts to predict the combination of immediate reward and its own reward prediction at the next
moment in time. Now when the next moment comes and brings fresh information with it, the new
prediction is compared with the expected prediction. If these two predictions are different from each
other, the Temporal Difference Learning algorithm will calculate how different the predictions are
from each other and make use of this temporal difference to adjust the old prediction toward the new
prediction.
The temporal difference algorithm always aims to bring the expected prediction and the new prediction
together, thus matching expectations with reality and gradually increasing the accuracy of the entire
chain of prediction.

Temporal Difference Learning aims to predict a combination of the immediate reward and its own
reward prediction at the next moment in time.

In TD Learning, the training signal for a prediction is a future prediction. This method is a
combination of the Monte Carlo (MC) method and the Dynamic Programming (DP) method. Monte
Carlo methods adjust their estimates only after the final outcome is known, but temporal difference
methods tend to adjust predictions to match later, more accurate, predictions for the future, much
before the final outcome is clear and known. This is essentially a type of bootstrapping.

Temporal difference learning in machine learning got its name from the way it uses changes, or
differences, in predictions over successive time steps for the purpose of driving the learning process.

The prediction at any particular time step gets updated to bring it nearer to the prediction of the same
quantity at the next time step.
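
As a rough illustration of this idea, the one-step TD(0) rule moves V(s) toward r + γ V(s') by a step size α. The three-state chain, the step size and the discount factor below are assumptions made for the example.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update to V[s] and return the TD error.

    s_next is None when the episode ends; the value after the end is taken as 0.
    """
    v_next = 0.0 if s_next is None else V[s_next]
    td_error = r + gamma * v_next - V[s]  # newer prediction minus older prediction
    V[s] += alpha * td_error              # nudge the old prediction toward the new one
    return td_error


# Hypothetical chain 0 -> 1 -> 2 -> end, with reward 1 only on the last step.
episode = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, None)]
V = {0: 0.0, 1: 0.0, 2: 0.0}
for _ in range(1000):
    for s, r, s_next in episode:
        td0_update(V, s, r, s_next)
```

After many passes the estimates approach the true discounted values of this chain, V(2) = 1, V(1) = 0.9 and V(0) = 0.81, even though each update only ever looks one step ahead.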

Benefits of temporal difference learning

The advantages of temporal difference learning in machine learning are:
• TD learning methods are able to learn at each step, online or offline.
• These methods are capable of learning from incomplete sequences, which means that they can also be
used in continuous problems.
• Temporal difference learning can function in non-terminating environments.
• TD learning has less variance than the Monte Carlo method, because it depends on one random
action, transition, and reward.
• It tends to be more efficient than the Monte Carlo method.
• Temporal difference learning exploits the Markov property, which makes it more effective in Markov
environments.

Different algorithms in temporal difference learning:

There are predominantly three different categories of TD algorithms, which are as follows:

1. TD(1) Algorithm
2. TD(0) Algorithm
3. TD(λ) Algorithm

Exploration Strategies:

Exploitation versus exploration is a critical topic in Reinforcement Learning. We'd like the RL agent
to find the best solution as fast as possible. However, in the meantime, committing to solutions too
quickly without enough exploration sounds pretty bad, as it could lead to local minima or total failure.
Modern RL algorithms that optimize for the best returns can achieve good exploitation quite efficiently,
while exploration remains more like an open topic.
Several common exploration strategies in deep RL are discussed here. As this is a very big
topic, this section by no means covers all the important subtopics.

Classic Exploration Strategies


As a quick recap, let's first go through several classic exploration algorithms that work out pretty well
in the multi-armed bandit problem or simple tabular RL.

• Epsilon-greedy: The agent does random exploration occasionally with probability $\epsilon$ and takes the
optimal action most of the time with probability $1 - \epsilon$.
• Upper confidence bounds: The agent selects the greediest action to maximize the upper confidence
bound $\hat{Q}_t(a) + \hat{U}_t(a)$, where $\hat{Q}_t(a)$ is the average reward associated with action $a$ up to time $t$ and
$\hat{U}_t(a)$ is a function inversely proportional to how many times action $a$ has been taken.
• Boltzmann exploration: The agent draws actions from a Boltzmann distribution (softmax) over the
learned Q values, regulated by a temperature parameter $\tau$.
• Thompson sampling: The agent keeps track of a belief over the probability of optimal actions and
samples from this distribution.
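
Three of these classic strategies can be sketched as small action-selection functions over a list of estimated action values. The function names are hypothetical, and the bonus term sqrt(c ln t / N(a)) with c = 2 is the common UCB1 choice, used here as an assumption since the notes do not fix a particular bonus.

```python
import math
import random


def epsilon_greedy(q, epsilon, rng):
    """With probability epsilon pick a random arm, otherwise the greedy arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])


def ucb(q, counts, t, c=2.0):
    """Pick the arm maximizing q[a] + sqrt(c * ln(t) / counts[a]) (UCB1-style bonus)."""
    for a, n in enumerate(counts):
        if n == 0:  # try every arm once before trusting the bonus
            return a
    return max(range(len(q)),
               key=lambda a: q[a] + math.sqrt(c * math.log(t) / counts[a]))


def boltzmann(q, tau, rng):
    """Sample an arm from a softmax over q with temperature tau."""
    m = max(q)  # subtract the max for numerical stability
    weights = [math.exp((v - m) / tau) for v in q]
    return rng.choices(range(len(q)), weights=weights)[0]


rng = random.Random(0)
q_estimates = [0.1, 0.9, 0.3]  # illustrative value estimates for three arms
greedy_arm = epsilon_greedy(q_estimates, epsilon=0.0, rng=rng)
rarely_tried = ucb([0.5, 0.5], counts=[100, 1], t=101)
cold_arm = boltzmann(q_estimates, tau=0.001, rng=rng)
```

With `epsilon=0.0` epsilon-greedy is purely greedy, UCB prefers the arm it has tried least when the estimates tie, and a very low temperature makes Boltzmann exploration behave almost greedily; raising `epsilon` or `tau` increases exploration.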
The following strategies could be used for better exploration in deep RL training when neural
networks are used for function approximation:

Entropy loss term: Add an entropy term into the loss function, encouraging the policy to take diverse
actions.
Noise-based Exploration: Add noise into the observation, action or even parameter space (Fortunato,
et al. 2017, Plappert, et al. 2017).
Key Exploration Problems
Good exploration becomes especially hard when the environment rarely provides rewards as feedback
or the environment has distracting noise. Many exploration strategies are proposed to solve one or both
of the following problems.

The Hard-Exploration Problem


The “hard-exploration” problem refers to exploration in an environment with very sparse or even
deceptive reward. It is difficult because random exploration in such scenarios can rarely discover
successful states or obtain meaningful feedback.

Montezuma's Revenge is a concrete example of the hard-exploration problem. It remains one of the few
challenging Atari games for deep RL (DRL) to solve. Many papers use Montezuma's Revenge to benchmark
their results.

The Noisy-TV Problem


The “Noisy-TV” problem started as a thought experiment in Burda, et al. (2018). Imagine that an RL
agent is rewarded for seeking novel experience; a TV with uncontrollable and unpredictable random
noise outputs would be able to attract the agent's attention forever. The agent obtains new rewards from
the noisy TV consistently, but it fails to make any meaningful progress and becomes a “couch potato”.
Learning Task
•Consider a Markov decision process (MDP) where the agent can perceive a set S of distinct states of its
environment and has a set A of actions that it can perform.
•At each discrete time step t, the agent senses the current state st, chooses a current action at, and
performs it.
•The environment responds by giving the agent a reward rt = r(st, at) and by producing the succeeding
state st+1 = δ(st, at). Here the functions δ(st, at) and r(st, at) depend only on the current state and
action, and not on earlier states or actions.

The task of the agent is to learn a policy, π: S → A, for selecting its next action at based on the current
observed state st; that is, π(st) = at.

How shall we specify precisely which policy π we would like the agent to learn?

1. One approach is to require the policy that produces the greatest possible cumulative reward for the
robot over time.
•To state this requirement more precisely, define the cumulative value Vπ(st) achieved by following
an arbitrary policy π from an arbitrary initial state st as follows:

$$V^{\pi}(s_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \equiv \sum_{i=0}^{\infty} \gamma^i r_{t+i}$$

•Here the sequence of rewards rt+i is generated by beginning at state st and by repeatedly using the
policy π to select actions.
•Here 0 ≤ γ < 1 is a constant that determines the relative value of delayed versus immediate rewards. If
we set γ = 0, only the immediate reward is considered. As we set γ closer to 1, future rewards are given
greater emphasis relative to the immediate reward.
•The quantity Vπ(st) is called the discounted cumulative reward achieved by policy π from initial state
st. It is reasonable to discount future rewards relative to immediate rewards because, in many cases, we
prefer to obtain the reward sooner rather than later.

2. Another definition of total reward is the finite horizon reward,

$$\sum_{i=0}^{h} r_{t+i}$$

which considers the undiscounted sum of rewards over a finite number h of steps.

3. Another approach is the average reward,

$$\lim_{h \to \infty} \frac{1}{h} \sum_{i=0}^{h} r_{t+i}$$

which considers the average reward per time step over the entire lifetime of the agent.
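
The three reward criteria can be compared on a short reward sequence. The helper names below are hypothetical, and `average_reward` is only a finite-sample stand-in for the limit in the average reward definition.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**i * rewards[i], accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def finite_horizon_return(rewards, h):
    """Undiscounted sum of the rewards r_t ... r_{t+h}."""
    return sum(rewards[: h + 1])


def average_reward(rewards):
    """Mean reward per time step over the observed sequence."""
    return sum(rewards) / len(rewards)


rewards = [0, 100, 0, 0]  # illustrative sequence: reward 100 on the second step
g_discounted = discounted_return(rewards, gamma=0.9)
g_finite = finite_horizon_return(rewards, h=1)
g_average = average_reward(rewards)
```

With γ = 0.9 the discounted value of this sequence is 0 + 0.9 × 100 = 90, whereas the finite horizon sum over the first two steps is 100 and the average reward is 25 per step, so the three criteria can rank policies differently.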

We require that the agent learn a policy π that maximizes Vπ(s) for all states s. Such a policy is called
an optimal policy and is denoted by π*:

$$\pi^{*} \equiv \arg\max_{\pi} V^{\pi}(s), \quad (\forall s)$$

We refer to the value function Vπ*(s) of such an optimal policy as V*(s). V*(s) gives the maximum
discounted cumulative reward that the agent can obtain starting from state s.

Example:
A simple grid-world environment is depicted in the diagram
• The six grid squares in this diagram represent six possible states, or locations, for the agent.
• Each arrow in the diagram represents a possible action the agent can take to move from one state to
another.
• The number associated with each arrow represents the immediate reward r(s, a) the agent receives if it
executes the corresponding state-action transition
• The immediate reward in this environment is defined to be zero for all state-action transitions except
for those leading into the state labelled G. We can think of G as the goal state; the agent can receive
reward only by entering this state.

Once the states, actions, and immediate rewards are defined and a value is chosen for the discount factor γ,
we can determine the optimal policy π* and its value function V*(s).

Let's choose γ = 0.9. The diagram at the bottom of the figure shows one optimal policy for this setting.

Values of V*(s) and Q(s, a) follow from r(s, a) and the discount factor γ = 0.9. An optimal policy,
corresponding to actions with maximal Q values, is also shown.

The discounted future reward from the bottom centre state is

0 + γ·100 + γ²·0 + γ³·0 + ... = 90
Q LEARNING:
How can an agent learn an optimal policy π * for an arbitrary environment?
The training information available to the learner is the sequence of immediate rewards r(si, ai)
for i = 0, 1, 2, .... Given this kind of training information, it is easier to learn a numerical evaluation
function defined over states and actions, and then implement the optimal policy in terms of this evaluation
function.

What evaluation function should the agent attempt to learn?

One obvious choice is V*. The agent should prefer state s1 over state s2 whenever V*(s1) > V*(s2),
because the cumulative future reward will be greater from s1.
The optimal action in state s is the action a that maximizes the sum of the immediate reward r(s, a)
plus the value V* of the immediate successor state, discounted by γ:

$$\pi^{*}(s) = \arg\max_{a} \left[ r(s, a) + \gamma V^{*}(\delta(s, a)) \right] \qquad (3)$$

The Q Function
The value of the evaluation function Q(s, a) is the reward received immediately upon executing action a
from state s, plus the value (discounted by γ) of following the optimal policy thereafter:

$$Q(s, a) \equiv r(s, a) + \gamma V^{*}(\delta(s, a)) \qquad (4)$$

Rewriting Equation (3) in terms of Q(s, a) gives

$$\pi^{*}(s) = \arg\max_{a} Q(s, a) \qquad (5)$$

As Equation (5) makes clear, the agent need only consider each available action a in its current state s and
choose the action that maximizes Q(s, a).
An Algorithm for Learning Q
• Learning the Q function corresponds to learning the optimal policy.
• The key problem is finding a reliable way to estimate training values for Q, given only a sequence
of immediate rewards r spread out over time. This can be accomplished through iterative approximation.

Since V*(s) is the maximum of Q(s, a') over all actions a', the definition of Q can be rewritten recursively as

$$Q(s, a) = r(s, a) + \gamma \max_{a'} Q(\delta(s, a), a')$$

Q learning algorithm:

• The Q learning algorithm assumes deterministic rewards and actions. The discount factor γ may be
any constant such that 0 ≤ γ < 1.

• The symbol Q̂ refers to the learner's estimate, or hypothesis, of the actual Q function.
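
A minimal sketch of tabular Q learning for the deterministic case is given below. It repeatedly applies the update Q̂(s, a) ← r + γ max over a' of Q̂(s', a'). The 2 × 3 grid layout with the goal G in the top-right corner, the reward of 100 for entering G, the random action selection and the number of episodes are all assumptions chosen to mirror the grid-world example, not details stated in these notes.

```python
import random

ROWS, COLS, GOAL = 2, 3, (0, 2)  # assumed layout: row 0 is the top row, G is top-right
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def actions(s):
    """Moves that keep the agent inside the grid."""
    return [a for a, (dr, dc) in MOVES.items()
            if 0 <= s[0] + dr < ROWS and 0 <= s[1] + dc < COLS]


def step(s, a):
    """Deterministic transition delta(s, a) and reward r(s, a)."""
    dr, dc = MOVES[a]
    s_next = (s[0] + dr, s[1] + dc)
    return s_next, (100 if s_next == GOAL else 0)


def q_learning(episodes=500, gamma=0.9, seed=0):
    rng = random.Random(seed)
    states = [(r, c) for r in range(ROWS) for c in range(COLS)]
    q = {(s, a): 0.0 for s in states for a in actions(s)}  # Q-hat starts at zero
    starts = [s for s in states if s != GOAL]
    for _ in range(episodes):
        s = rng.choice(starts)
        while s != GOAL:  # G is treated as an absorbing state
            a = rng.choice(actions(s))  # explore with uniformly random actions
            s_next, r = step(s, a)
            best_next = 0.0 if s_next == GOAL else max(
                q[(s_next, a2)] for a2 in actions(s_next))
            q[(s, a)] = r + gamma * best_next  # deterministic Q-learning update
            s = s_next
    return q


q = q_learning()
v_star = {s: max(q[(s, a)] for a in actions(s))
          for s in [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2)]}
```

Under these assumptions the learned values approach 100 for the states next to G, 90 for states two steps away (including the bottom centre state), and 81 for the bottom-left state. Random action selection is used only so that every state-action pair keeps being visited.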

Deterministic rewards and actions



Deterministic policy

This part explains what “deterministic” means in the context of a policy.
A deterministic policy means that there is always exactly one action that you take in a
given situation; there are no other possibilities. Let's understand this through a learning
example.

A policy defines how an agent acts from a specific state. For a deterministic policy, it is the action taken
at a specific state. For a stochastic policy, it is the probability of taking an action a given the state s.

Rewards
Reward r(s, a) defines the reward collected by taking the action a at state s. Our objective is to maximize
the total rewards of a policy. A reward can be the added score in a game, successfully turning a doorknob,
or winning a game.

Deterministic policy : Learning Example

You are a goalkeeper in a football team. There is a penalty for the opposing team. Your coach tells you
before the match that if the player taking the penalty shoots with his left foot, you should dive to the
left. On the other hand, if the player taking the penalty shoots with his right foot, you should dive to the
right. Assume that you fully trust your coach and always follow his instructions.

This is an example of a deterministic policy, because in each of the two situations you may face,
there is always one action that you take: “If the player taking the penalty shoots
with his left foot, you should dive to the left. On the other hand, if the player taking the penalty shoots
with his right foot, you should dive to the right”. Let's take each situation on its own.

The first situation is that the player taking the penalty shoots with his left foot, so you dive to the left. The
number of actions that you can choose from is one; there is no possibility of any other action. The same
goes for the second situation.

Whenever the number of actions to take in a certain situation is not more than one, we say that the policy
including the instructions for our actions is a deterministic policy.

Stochastic policy
A stochastic policy is the opposite of a deterministic policy. What differentiates the two
is that in a stochastic policy, it is possible to have more than one action to
choose from in a certain situation. Again, let's understand more through a learning example.

Stochastic policy : Learning Example

Again, you are a goalkeeper in a football team and there is a penalty for the opposing team. However,
this time the coach tells you that you can dive either left or right. You decide you will randomly choose
whether you will dive to the left or the right.

This is an example of a stochastic policy, because in every penalty you face, there is
more than one action you can choose from: “You can dive either left or right”. This means that in every
situation (penalty), there are two actions to choose from. In addition, you choose randomly which
action to take.

Whenever the number of actions to take in a certain situation is more than one and the choice
of which action to take is based on randomness and probabilities, we say that the policy including the
instructions for our actions is a stochastic policy.
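
The two goalkeeper examples can be written as two small functions. The state and action labels and the equal 50/50 probabilities are assumptions made for this sketch.

```python
import random


def deterministic_policy(state):
    """The coach's rule: exactly one action for each situation."""
    rule = {"left_foot": "dive_left", "right_foot": "dive_right"}
    return rule[state]


def stochastic_policy(state, rng):
    """pi(a|s): sample an action from a probability distribution over actions.

    Here the distribution is assumed to be the same 50/50 split in every state.
    """
    probs = {"dive_left": 0.5, "dive_right": 0.5}
    return rng.choices(list(probs), weights=list(probs.values()))[0]


rng = random.Random(0)
always = {deterministic_policy("left_foot") for _ in range(100)}
varied = {stochastic_policy("left_foot", rng) for _ in range(200)}
```

Calling the deterministic policy repeatedly in the same state always yields the same dive, whereas repeated calls to the stochastic policy produce both dives over time.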
Nondeterministic Rewards and Actions

• We have seen Q learning in deterministic environments.
• In the nondeterministic case, the reward r(s,a) and the transition function δ(s,a) have probabilistic
outcomes.
• Examples
1. In Backgammon, action outcomes are probabilistic
• Because each move involves a roll of the dice
2. In robot problems with noisy sensors and effectors, it is appropriate to model actions and rewards
as probabilistic
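Non-deterministic rewards and actions
Also known as a non-deterministic Markov Decision Process (MDP).
Value of a policy: The value Vπ of a policy is the same as before:

$$V^{\pi}(s_t) \equiv \sum_{i=0}^{\infty} \gamma^i r_{t+i}$$

But now with an expected value:

$$V^{\pi}(s_t) \equiv E\left[ \sum_{i=0}^{\infty} \gamma^i r_{t+i} \right]$$

Q equation. Remember:

$$Q(s, a) \equiv r(s, a) + \gamma V^{*}(\delta(s, a))$$

In the nondeterministic case this also becomes an expected value:

$$Q(s, a) \equiv E\left[ r(s, a) + \gamma V^{*}(\delta(s, a)) \right]$$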
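
Because a single noisy sample can no longer be trusted, a common training rule for the nondeterministic case blends the old estimate with each new sample using a step size that decays with the number of visits. The rule sketched here, Q̂ ← (1 − α) Q̂ + α [r + γ max over a' of Q̂(s', a')] with α = 1 / (1 + visits(s, a)), is not shown in the notes above and is included as an assumption; the function name and arguments are hypothetical.

```python
def q_update(q, visits, s, a, r, s_next, next_actions, gamma=0.9):
    """Blend the old estimate with one sampled target, using a decaying step size.

    q and visits are dicts keyed by (state, action); missing entries count as 0.
    next_actions lists the actions available in s_next (empty if s_next is terminal).
    """
    visits[(s, a)] = visits.get((s, a), 0) + 1
    alpha = 1.0 / (1 + visits[(s, a)])  # step size shrinks as (s, a) is revisited
    best_next = max((q.get((s_next, a2), 0.0) for a2 in next_actions), default=0.0)
    target = r + gamma * best_next      # one noisy sample of the Q value
    q[(s, a)] = (1 - alpha) * q.get((s, a), 0.0) + alpha * target
    return q[(s, a)]


q, visits = {}, {}
first = q_update(q, visits, "s", "a", 10.0, "end", [])   # alpha = 1/2
second = q_update(q, visits, "s", "a", 10.0, "end", [])  # alpha = 1/3
```

Each sample moves the estimate only part of the way toward its target, so occasional unlucky rewards or transitions are averaged out over many visits instead of overwriting the estimate.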

Non-determinism in Backgammon

• Action – Move based on 2 dice


• State: – 24 board positions, 15 pieces each
• 9D vector: {−1,0,3,−2,0,2,−3,0,1}
Example of Non-determinism: Robotics
– Noisy sensors and effectors
– Appropriate to model actions and rewards as nondeterministic

Recall Markov Decision Process


• Agent can perceive a set S of discrete states in its environment
• Has a set A of actions that it can perform
• At each discrete time step t the agent senses the current state st , chooses a current action at and
performs it
• Environment responds by giving it a reward rt =r(st ,at ) and by producing state st+1=δ(st ,at ) – The
function δ and r are part of the environment and not necessarily known to the agent
Nondeterministic MDP
• Functions r(s,a) and δ(s,a) can be viewed as
– First producing a probability distribution over outcomes based on s and a and
– Then drawing an outcome at random according to this distribution
• Nondeterministic Markov decision process
– When these probability distributions depend solely on s and a, i.e.,
• They do not depend on previous states or action

Eligibility traces

Eligibility traces are one of the basic mechanisms of reinforcement learning. For example, in the popular
TD(λ) algorithm, the λ refers to the use of an eligibility trace. Almost any temporal-difference (TD)
method, such as Q-learning or Sarsa, can be combined with eligibility traces to obtain a more
general method that may learn more efficiently. Eligibility traces unify and generalize TD and Monte
Carlo methods. When TD methods are augmented with eligibility traces, they produce a family of
methods spanning a spectrum that has Monte Carlo methods at one end (λ = 1) and one-step TD methods
at the other (λ = 0). In between are intermediate methods that are often better than either extreme
method. Eligibility traces also provide a way of implementing Monte Carlo methods online and on
continuing problems without episodes.

Here are the benefits of Eligibility Traces:

• Provide a way of implementing Monte Carlo in an online fashion (without waiting for the episode to
finish) and on problems without episodes.
• Provide an algorithmic mechanism that uses a short-term memory vector.
• Computational efficiency, by storing a single memory vector instead of a list of feature vectors.
• Learning is done continually rather than waiting for results at the end of an episode.

The Forward View

Remember that Temporal Difference and Monte Carlo methods update a state's value based on future
rewards. This is done either by looking directly one step ahead or by waiting for the episode to finish.

This approach is called the Forward View.

In TD(0) we look one step ahead, while in Monte Carlo we look ahead until the episode is
terminated and we collect the discounted rewards. However, there is a middle ground, in which we look
n steps ahead.

So let's define an average return over all these n-step look-aheads as follows:

$$G(\lambda, t) = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G(t, t+n)$$

where G(𝜆, t) is the weighted average of all returns G(t, t+n), the n-step returns that
start at t and end at t+n, for n going from 1 to infinity.

𝜆 is a weight with a value in [0, 1].

As in every weighted average, the sum of the weights must be one, which is the case (for 𝜆 < 1) since

$$(1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} = 1$$
Backward View TD( 𝜆):

Suppose an agent walks randomly in an environment and finds a treasure. It then stops and looks
backwards in an attempt to work out what led it to this treasure.
Naturally, the steps that are close to the treasure have more merit in finding it than the steps that are
miles away. So closer locations are more valuable than distant ones, and thus they are assigned bigger
values.
This is realized through a vector E called the eligibility trace.
Concretely, the eligibility trace is a function of state, E(s), or of state and action, E(s,a), and holds a
decaying record of how recently each state (or state-action pair) was visited.
So how do we move from the Forward View to the Backward View, and what is the role of eligibility traces in
that?

The Backward View propagates the TD error δ to previously visited states.
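
A minimal sketch of the backward view for state values follows, using an accumulating trace: on every step the trace of the current state is increased by one, every state's value is moved by α δ E(s), and all traces then decay by γλ. The three-state chain and the parameter values are assumptions made for illustration.

```python
def td_lambda_episode(V, episode, alpha=0.1, gamma=0.9, lam=0.8):
    """Run one episode of backward-view TD(lambda) with accumulating traces.

    episode is a list of (state, reward, next_state); next_state is None at the end.
    """
    E = {s: 0.0 for s in V}  # eligibility trace, reset at the start of the episode
    for s, r, s_next in episode:
        v_next = 0.0 if s_next is None else V[s_next]
        delta = r + gamma * v_next - V[s]  # TD error at this step
        E[s] += 1.0                        # the current state becomes fully eligible
        for x in V:
            V[x] += alpha * delta * E[x]   # credit every state by its eligibility
            E[x] *= gamma * lam            # older visits fade away
    return V


# Hypothetical chain 0 -> 1 -> 2 -> end, with the "treasure" (reward 1) at the end.
V = td_lambda_episode({0: 0.0, 1: 0.0, 2: 0.0},
                      [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, None)])
```

After this single episode all three states have received some credit for the final reward, with the most recent state updated by 0.1, the one before it by 0.072 and the first by about 0.052, mirroring the treasure-hunt intuition that closer steps deserve more credit.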

Algorithms Using Eligibility Traces:

The following are a few algorithms that use eligibility traces.


GENERALIZATION


Reinforcement Learning (RL) could be used in a range of applications such as autonomous vehicles
and robotics, but to fulfil this potential we need RL algorithms that can be used in the real world.
Reality is varied, non-stationary and open-ended, and to handle this, algorithms need to be robust to
variation in their environments, and be able to transfer and adapt to unseen (but similar) environments
during their deployment. Generalisation in RL is all about creating methods that can tackle these
difficulties, challenging a common assumption in previous RL research that the training and testing
environments are identical.

The goal in RL is usually described as that of learning a policy for a Markov Decision Process (MDP)
that maximizes some objective function, such as the expected discounted sum of rewards. An MDP is
characterized by a set of states S, a set of actions A, a transition function P and a reward function R.
When we discuss generalization, we can propose a different formulation, in which we wish our policy
to perform well on a distribution of MDPs. Using such a setup, we can now let the agent train on a set
of MDPs and reserve some other MDPs as a test set.
In what way can these MDPs differ from each other? There are three key possible differences:

1. The states are different in some way between MDPs, but the transition function is the same. An
example of this is playing different versions of a video game in which the colors and textures might
change, but the behavior of the policy should not change as a result.

2. The underlying transition function differs between MDPs, even though the states might seem similar.
An example of this is some robotic manipulation tasks, in which various physical parameters such as
friction coefficients and mass might change, but we would like our policy to be able to adapt to these
changes, or otherwise be robust to them if possible.

3. The MDPs vary in size and apparent complexity, but there is some underlying principle that enables
generalizing to problems of different sizes. Examples of this might be some types of combinatorial
optimization problems such as the Traveling Salesman Problem, for which we would like a policy that
can solve instances of different sizes.

In my opinion these represent the major sources of generalization challenge, but of course it's possible
to create problems that combine more than one such source. In what follows, I am going to focus on
the first type.

Recently, researchers have begun to systematically explore generalization in RL by developing novel
simulated environments that make it possible to create a distribution of MDPs and split it into disjoint
training and testing instances. An example of such an environment is CoinRun, introduced by OpenAI in the paper
“Quantifying Generalization in Reinforcement Learning”. This environment can produce a large variety
of levels with different layouts and visual appearance, and thus serves as a nice benchmark for
generalization.
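
The idea of training on some MDPs and testing on others can be sketched as follows. This is only an illustration of the evaluation protocol, not the CoinRun API: make_level, split_levels and generalization_gap are hypothetical helpers, and a "level" here is just a small dictionary whose contents are fixed by a random seed.

```python
import random

def make_level(seed, size=5):
    """Hypothetical level generator: the seed fixes the layout and the colour."""
    rng = random.Random(seed)
    return {
        "seed": seed,
        "goal": (rng.randrange(size), rng.randrange(size)),
        "colour": rng.choice(["red", "green", "blue"]),
    }

def split_levels(num_train, num_test, start_seed=0):
    """Return disjoint training and test levels drawn from the same generator."""
    first_test_seed = start_seed + num_train
    train = [make_level(s) for s in range(start_seed, first_test_seed)]
    test = [make_level(s) for s in range(first_test_seed, first_test_seed + num_test)]
    return train, test

def generalization_gap(evaluate, train_levels, test_levels):
    """Mean score on training levels minus mean score on unseen test levels."""
    train_score = sum(evaluate(level) for level in train_levels) / len(train_levels)
    test_score = sum(evaluate(level) for level in test_levels) / len(test_levels)
    return train_score - test_score

train_levels, test_levels = split_levels(num_train=100, num_test=20)
```

Here evaluate stands for any function that runs a trained policy on one level and returns its score. A large positive gap suggests that the policy has memorised the training levels rather than learned behaviour that transfers.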

PARTIALLY OBSERVABLE STATES:


A partially observable Markov decision process (POMDP) is a combination of a regular Markov
Decision Process, which models the system dynamics, with a hidden Markov model that connects unobservable
system states probabilistically to observations.
The agent can perform actions that affect the system (i.e., may cause the system state to change)
with the goal of maximizing the expected future rewards, which depend on the sequence of system states and
the agent's actions in the future. The goal is to find the optimal policy that guides the agent's actions.
Unlike in MDPs, in POMDPs the agent cannot directly observe the complete system state, but the
agent makes observations that depend on the state. The agent uses these observations to form a belief
about which state the system is currently in. This belief is called a belief state and is expressed as a
probability distribution over all possible states. The solution of the POMDP is a policy prescribing which
action to take in each belief state. Note that belief states are continuous, resulting in an infinite state set,
which makes POMDPs much harder to solve than MDPs.
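A discrete-time POMDP can formally be described as a 7-tuple
P = (S, A, T, R, Ω, O, γ),
where S is the set of states, A is the set of actions, T is the set of conditional transition
probabilities between states, R is the reward function, Ω is the set of observations, O is the set of
conditional observation probabilities, and γ is the discount factor.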
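
With these elements, the belief state can be updated after every action and observation using Bayes' rule. The formula below is the standard update, written under the assumption that T(s' | s, a) denotes the probability of moving from state s to s' under action a, and O(o | s', a) the probability of observing o after action a has led to state s'. Here b is the current belief and b' the updated one.

```latex
b'(s') = \eta \, O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a) \, b(s),
\qquad
\eta = \frac{1}{\sum_{s' \in S} O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a) \, b(s)}
```

The factor η only normalizes the result so that b' sums to one over all states.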
Defining a POMDP Problem
In the R package pomdp, the POMDP() function has the following arguments, each corresponding to one of
the elements of a POMDP.

str(args(POMDP))

The Tiger Problem Example


We will demonstrate how to use the package with the Tiger Problem (Cassandra, Kaelbling,
and Littman 1994). The problem is defined as follows:
An agent is facing two closed doors. A tiger is put with equal probability behind one of the two doors,
represented by the states tiger-left and tiger-right, while treasure is put behind the other door. The
possible actions are listening for tiger noises or opening a door (actions open-left and open-right). Listening
is neither free (the action has a reward of -1) nor entirely accurate. There is a 15% probability that
the agent hears the tiger behind the left door while it is actually behind the right door, and vice versa. If
the agent opens the door with the tiger, it will get hurt (a negative reward of -100), but if it opens the door
with the treasure, it will receive a positive reward of 10. After a door is opened, the problem is reset (i.e.,
the tiger is randomly assigned to a door with a 50/50 chance) and the agent gets another try.
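
To see how listening changes the agent's belief, the numbers above can be put into a small Python sketch. This is not the package's interface: the observation names hear-left and hear-right and the helper functions are made up for this illustration, and the code assumes that listening does not move the tiger, so only the observation probabilities enter the update.

```python
import numpy as np

STATES = ["tiger-left", "tiger-right"]
OBSERVATIONS = ["hear-left", "hear-right"]  # hypothetical observation names

# Reward of each action in each state, ordered as in STATES
REWARD = {
    "listen": np.array([-1.0, -1.0]),
    "open-left": np.array([-100.0, 10.0]),
    "open-right": np.array([10.0, -100.0]),
}

# P(observation | state) when listening: rows are states, columns observations
LISTEN_OBS = np.array([[0.85, 0.15],
                       [0.15, 0.85]])

def update_belief_after_listen(belief, observation):
    """Bayes update of the belief; assumes listening leaves the state unchanged."""
    o = OBSERVATIONS.index(observation)
    unnormalised = LISTEN_OBS[:, o] * belief
    return unnormalised / unnormalised.sum()

def expected_reward(belief, action):
    """Expected immediate reward of an action under the current belief."""
    return float(REWARD[action] @ belief)

belief = np.array([0.5, 0.5])                              # tiger equally likely behind each door
belief = update_belief_after_listen(belief, "hear-left")   # becomes [0.85, 0.15]
print(belief, expected_reward(belief, "open-right"))       # expected reward is -6.5
```

After one listen the agent is 85% sure the tiger is on the left, yet opening the right door still has a negative expected immediate reward (-6.5), so listening again (reward -1) is the better immediate choice. A second hear-left raises the belief to about 97% and the expected reward of open-right to about 6.68.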
