R20 ML Notes (AUTONOMOUS)
III B.Tech. – II Sem.
L T P C: 3 - - 3
UNIT-I
Machine learning is a growing technology which enables computers to learn automatically from past
data. Machine learning uses various algorithms for building mathematical models and making
predictions using historical data or information. Currently, it is being used for various tasks such
as image recognition, speech recognition, email filtering, Facebook auto-tagging, recommender
system, and many more.
Machine learning is a subset of artificial intelligence that is mainly concerned with the development of algorithms which allow a computer to learn from data and past experiences on its own. The term machine learning was first introduced by Arthur Samuel in 1959. We can define it in a summarized way as:
Machine learning enables a machine to automatically learn from data, improve performance from
experiences, and predict things without being explicitly programmed.
With the help of sample historical data, which is known as training data, machine learning algorithms
build a mathematical model that helps in making predictions or decisions without being explicitly
programmed. Machine learning brings computer science and statistics together for creating predictive
models. Machine learning constructs or uses algorithms that learn from historical data. The more information we provide, the better the performance.
A machine has the ability to learn if it can improve its performance by gaining more data.
Machine learning is a subfield of artificial intelligence that involves training computers to learn from
data without being explicitly programmed. In other words, machine learning algorithms use statistical
techniques to find patterns in data and use these patterns to make predictions or take actions.
A machine learning system learns from historical data, builds prediction models, and, whenever it receives new data, predicts the output for it. The accuracy of the predicted output depends on the amount of data: a larger amount of data helps build a better model that predicts the output more accurately.
Suppose we have a complex problem where we need to make some predictions. Instead of writing code for it, we just need to feed the data to generic algorithms, and with the help of these algorithms, the machine builds the logic as per the data and predicts the output. Machine learning has changed our way of thinking about such problems. The block diagram below explains the working of a machine learning algorithm:
The need for machine learning is increasing day by day. The reason is that it is capable of doing tasks that are too complex for a person to implement directly. As humans, we have limitations: we cannot access and process huge amounts of data manually, so we need computer systems, and this is where machine learning comes in to make things easy for us.
We can train machine learning algorithms by providing them with huge amounts of data and letting them explore the data, construct models, and predict the required output automatically. The performance of a machine learning algorithm depends on the amount of data, and it can be measured by the cost function. With the help of machine learning, we can save both time and money.
The importance of machine learning can be easily understood by its use cases. Currently, machine learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestions by Facebook, etc. Various top companies such as Netflix and Amazon have built machine learning models that use vast amounts of data to analyze user interests and recommend products accordingly.
A few decades ago (about 40-50 years), machine learning was science fiction, but today it is part of our daily life. Machine learning is making our day-to-day life easy, from self-driving cars to Amazon's virtual assistant "Alexa". However, the idea behind machine learning is quite old and has a long history.
Machine learning has now advanced greatly in its research, and it is present everywhere around us, such as in self-driving cars, Amazon Alexa, chatbots, recommender systems, and many more. It includes supervised, unsupervised, and reinforcement learning with clustering, classification, decision tree, and SVM algorithms, etc.
Modern machine learning models can be used for making various predictions, including weather
prediction, disease prediction, stock market analysis, etc.
Prerequisites
Before learning machine learning, you should have some basic background knowledge so that you can easily understand the concepts of machine learning.
Machine learning is a buzzword in today's technology, and it is growing very rapidly day by day. We use machine learning in our daily life even without knowing it, such as in Google Maps, Google Assistant, Alexa, etc. Below are some of the most trending real-world applications of machine learning:
1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, and places in digital images. A popular use case of image recognition and face detection is automatic friend tagging suggestions:
Facebook provides a feature of automatic friend tagging suggestions. Whenever we upload a photo with our Facebook friends, we automatically get a tagging suggestion with names, and the technology behind this is machine learning's face detection and recognition algorithm.
It is based on the Facebook project named "DeepFace," which is responsible for face recognition and person identification in pictures.
2. Speech Recognition
While using Google, we get the option "Search by voice"; this comes under speech recognition, a popular application of machine learning.
Speech recognition is the process of converting voice instructions into text, and it is also known as "speech to text" or "computer speech recognition." At present, machine learning algorithms are widely used in various speech recognition applications. Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow voice instructions.
3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions.
It predicts the traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, in two ways:
o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time of day
Everyone using Google Maps helps make the app better: it takes information from users and sends it back to its database to improve performance.
4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies such as Amazon, Netflix, etc., for product recommendations to the user. Whenever we search for a product on Amazon, we start getting advertisements for the same product while surfing the internet in the same browser, and this is because of machine learning.
Google understands the user's interests using various machine learning algorithms and suggests products as per the customer's interests.
Similarly, when we use Netflix, we find recommendations for entertainment series, movies, etc., and this is also done with the help of machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars. Machine learning plays a significant role in self-driving cars. Tesla, a popular car manufacturing company, is working on self-driving cars. It uses machine learning methods to train the car models to detect people and objects while driving.
6. Email Spam and Malware Filtering:
Whenever we receive a new email, it is filtered automatically as important, normal, or spam. We always receive important mail in our inbox, marked with the important symbol, and spam emails in our spam box, and the technology behind this is machine learning. Below are some spam filters used by Gmail:
o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters
Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve Bayes
classifier are used for email spam filtering and malware detection.
7. Virtual Personal Assistant:
We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As the name suggests, they help us find information using our voice instructions. These assistants can help us in various ways just through our voice instructions, such as playing music, calling someone, opening an email, scheduling an appointment, etc.
These assistants record our voice instructions, send them to a server in the cloud, decode them using ML algorithms, and act accordingly.
8. Online Fraud Detection:
Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, there are various ways a fraudulent transaction can take place, such as fake accounts, fake IDs, and money being stolen in the middle of a transaction. To detect this, a feed-forward neural network helps us by checking whether a transaction is genuine or fraudulent.
For each genuine transaction, the output is converted into hash values, and these values become the input for the next round. Genuine transactions follow a specific pattern that changes for fraudulent transactions; the network detects this change and makes our online transactions more secure.
9. Stock Market Trading:
Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in shares, so machine learning's Long Short-Term Memory (LSTM) neural network is used for the prediction of stock market trends.
10. Medical Diagnosis:
In medical science, machine learning is used for disease diagnosis. With this, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain. It helps in finding brain tumors and other brain-related diseases easily.
11. Automatic Language Translation:
Nowadays, if we visit a new place and are not aware of the language, it is not a problem at all; machine learning helps us here too by converting text into languages we know. Google's GNMT (Google Neural Machine Translation) provides this feature; it is a neural machine translation system that translates text into our familiar language, and this is called automatic translation.
The technology behind automatic translation is a sequence-to-sequence learning algorithm, which is used with image recognition to translate text from one language to another.
SUPERVISED MACHINE LEARNING
Supervised learning is a type of machine learning in which machines are trained using well-labelled training data, and on the basis of that data, machines predict the output. Labelled data means some input data is already tagged with the correct output.
In supervised learning, the training data provided to the machines works as the supervisor that teaches the machines to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data to the machine
learning model. The aim of a supervised learning algorithm is to find a mapping function to map the
input variable(x) with the output variable(y).
In the real-world, supervised learning can be used for Risk Assessment, Image classification, Fraud
Detection, spam filtering, etc.
In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on the basis of test data (a held-out subset of the dataset), and then it predicts the output.
The working of supervised learning can be easily understood by the example below:
Suppose we have a dataset of different types of shapes, including squares, rectangles, triangles, and polygons. The first step is to train the model on each shape:
o If the given shape has four sides, and all the sides are equal, then it will be labelled as
a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the
shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of the number of sides and predicts the output.
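As a rough illustrative sketch (not part of the original notes), the code below trains a classifier that uses only the number of sides as a feature; the data and labels are invented for this example:
```python
# A minimal sketch of the shape example, assuming each shape is described
# only by its number of sides. Data and labels are made up for illustration.
from sklearn.tree import DecisionTreeClassifier

# Training data: [number_of_sides] -> shape label
X_train = [[4], [4], [3], [3], [6], [6]]
y_train = ["square", "square", "triangle", "triangle", "hexagon", "hexagon"]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# A new, unseen shape with 3 sides should be classified as a triangle.
print(model.predict([[3]]))  # expected: ['triangle']
```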
1. Regression
Regression algorithms are used if there is a relationship between the input variable and the output
variable. It is used for the prediction of continuous variables, such as Weather forecasting, Market
Trends, etc. Below are some popular Regression algorithms which come under supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, meaning there are two or more classes, such as Yes/No, Male/Female, True/False, etc. A common example is spam filtering. Below are some popular classification algorithms which come under supervised learning:
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
UNSUPERVISED MACHINE LEARNING
In the previous topic, we learned about supervised machine learning, in which models are trained using labeled data under supervision. But there may be many cases in which we do not have labeled data and need to find the hidden patterns in the given dataset. To solve such types of cases in machine learning, we need unsupervised learning techniques.
Unsupervised learning is a machine learning technique in which models are not supervised using a training dataset. Instead, the models themselves find the hidden patterns and insights in the given data. It can be compared to the learning that takes place in the human brain while learning new things. It can be defined as:
Unsupervised learning is a type of machine learning in which models are trained using unlabeled
dataset and are allowed to act on that data without any supervision.
Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike supervised learning, we have the input data but no corresponding output data. The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.
Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of different types of cats and dogs. The algorithm is never trained on the given dataset, which means it does not have any idea about the features of the dataset. The task of the unsupervised learning algorithm is to identify the image features on its own. The algorithm will perform this task by clustering the image dataset into groups according to the similarities between images.
Below are some main reasons which describe the importance of Unsupervised Learning:
o Unsupervised learning is helpful for finding useful insights from data.
o Unsupervised learning is much like how a human learns to think through their own experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes it all the more important.
o In the real world, we do not always have input data with corresponding output, so to solve such cases, we need unsupervised learning.
Once the suitable algorithm is applied, it divides the data objects into groups according to the similarities and differences between the objects.
The unsupervised learning algorithm can be further categorized into two types of problems:
o Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in a group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities.
o Association: An association rule is an unsupervised learning method used for finding relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make marketing strategies more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of an association rule is Market Basket Analysis.
Unsupervised Learning algorithms:
Below is the list of some popular unsupervised learning algorithms:
o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
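As a minimal illustrative sketch (assuming scikit-learn is available), the code below runs k-means, the first algorithm in the list above, on a small synthetic dataset:
```python
# A minimal k-means clustering sketch. The two blobs of 2-D points are
# synthetic, chosen only for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Two clusters of points centred roughly at (0, 0) and (5, 5).
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 5])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # approximate centres of the two groups
print(kmeans.labels_[:5])        # cluster assignment of the first 5 points
```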
Supervised and unsupervised learning are two techniques of machine learning, but they are used in different scenarios and with different datasets. Below, both learning methods are explained along with a comparison of their differences.
The main differences between Supervised and Unsupervised learning are given below:
o A supervised learning model takes direct feedback to check whether it is predicting the correct output or not; an unsupervised learning model does not take any feedback.
o A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in data.
o In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
o The goal of supervised learning is to train the model so that it can predict the output when given new data; the goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.
o Supervised learning needs supervision to train the model; unsupervised learning does not need any supervision.
o A supervised learning model produces an accurate result; an unsupervised learning model may give a less accurate result in comparison.
o Supervised learning is not close to true artificial intelligence, as we first train the model for each data point, and only then can it predict the correct output; unsupervised learning is closer to true artificial intelligence, as it learns similarly to how a child learns daily routine things from experience.
Reinforcement learning:
Reinforcement learning is a feedback-based learning method in which an agent learns by performing actions and seeing the results. Consider a robot, a diamond, and fire. The goal of the robot is to get the reward, which is the diamond, and avoid the hurdles, which are the fire. The robot learns by trying all the possible paths and then choosing the path which gives it the reward with the least hurdles. Each right step gives the robot a reward, and each wrong step subtracts from the robot's reward. The total reward is calculated when it reaches the final reward, the diamond.
Main points in Reinforcement learning –
• Input: The input should be an initial state from which the model will start
• Output: There are many possible outputs as there are a variety of solutions to a particular
problem
• Training: The training is based upon the input. The model will return a state, and the user will decide whether to reward or punish the model based on its output.
• The model continues to learn.
• The best solution is decided based on the maximum reward.
1. Positive –
Positive Reinforcement is defined as when an event, occurs due to a particular behavior,
increases the strength and the frequency of the behavior. In other words, it has a positive effect
on behavior.
Advantages of positive reinforcement:
• Maximizes performance
• Sustains change for a long period of time
Disadvantage: Too much reinforcement can lead to an overload of states, which can diminish the results.
2. Negative –
Negative Reinforcement is defined as strengthening of behavior because a negative condition is
stopped or avoided.
Advantages of negative reinforcement:
• Increases behavior
• Provides defiance to a minimum standard of performance
Disadvantage: It only provides enough to meet the minimum behavior.
MODEL SELECTION AND GENERALIZATION
Model selection refers to the process of selecting the best model from a set of candidate models based
on their performance on a given task. This process typically involves splitting the available data into
training and validation sets, using the training set to train each candidate model, and then evaluating
their performance on the validation set. The model with the best performance on the validation set is
selected as the final model.
Generalization refers to the ability of a model to perform well on new, unseen data. When a model is trained on a dataset, it may overfit the training data by memorizing specific patterns in the data that are not representative of the underlying distribution. This can lead to poor performance on new data.
To ensure good generalization, it is important to evaluate a model's performance on a separate test set
that was not used during model selection or training.
To improve generalization, techniques such as regularization, early stopping, and data augmentation
can be used. Regularization involves adding a penalty term to the loss function to discourage complex
models that are prone to overfitting. Early stopping involves monitoring the validation error during
training and stopping the training process when the error begins to increase. Data augmentation
involves generating new training examples by applying transformations to existing examples, which
can increase the size and diversity of the training set and help prevent overfitting.
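As a hedged sketch of the regularization idea (assuming scikit-learn; the penalty strength alpha is an arbitrary illustrative choice), ridge regression adds an L2 penalty that discourages overly complex models:
```python
# A minimal regularization sketch on synthetic data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = X @ np.array([1.5, -2.0, 0.0, 0.0, 3.0]) + rng.randn(200) * 0.5

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# alpha adds an L2 penalty on the coefficients, shrinking them toward zero
# and improving generalization on the held-out test set.
model = Ridge(alpha=1.0).fit(X_train, y_train)
print("test R^2:", model.score(X_test, y_test))
```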
Overall, model selection and generalization are crucial aspects of machine learning that help ensure
that models are accurate and reliable, and can be applied successfully to new data.
Fig: Model Selection
1. Collecting Data:
As you know, machines initially learn from the data that you give them. It is of the utmost importance
to collect reliable data so that your machine learning model can find the correct patterns. The quality of
the data that you feed to the machine will determine how accurate your model is. If you have incorrect
or outdated data, you will have wrong outcomes or predictions which are not relevant.
Make sure you use data from a reliable source, as it will directly affect the outcome of your model. Good
data is relevant, contains very few missing and repeated values, and has a good representation of the
various subcategories/classes present.
2. Preparing the Data:
After you have your data, you have to prepare it. You can do this by:
• Putting together all the data you have and randomizing it. This helps make sure that data is evenly
distributed, and the ordering does not affect the learning process.
• Cleaning the data to remove unwanted data, missing values, rows and columns, and duplicate values, performing data type conversion, etc. You might even have to restructure the dataset and change the rows and columns or the indexing of rows and columns.
• Visualize the data to understand how it is structured and understand the relationship between
various variables and classes present.
• Splitting the cleaned data into two sets: a training set and a testing set. The training set is the set your model learns from. A testing set is used to check the accuracy of your model after training (see the sketch after this list).
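A minimal sketch of the randomize-and-split step above, assuming scikit-learn and using a synthetic dataset in place of real collected data:
```python
# Hold out part of the data for testing; shuffling randomizes the ordering.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)

print(len(X_train), "training rows,", len(X_test), "testing rows")
```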
Figure 3: Cleaning and Visualizing Data
3. Choosing a Model:
A machine learning model determines the output you get after running a machine learning algorithm
on the collected data. It is important to choose a model which is relevant to the task at hand. Over the
years, scientists and engineers developed various models suited for different tasks like speech
recognition, image recognition, prediction, etc. Apart from this, you also have to see if your model is
suited for numerical or categorical data and choose accordingly.
4. Training the Model:
Training is the most important step in machine learning. In training, you pass the prepared data to your machine learning model to find patterns and make predictions. It results in the model learning from the data so that it can accomplish the task set for it. Over time, with training, the model gets better at predicting.
5. Evaluating the Model:
After training your model, you have to check how it's performing. This is done by testing the model's performance on previously unseen data. The unseen data used is the testing set that you split your data into earlier. If testing is done on the same data used for training, you will not get an accurate measure, as the model is already used to that data and finds the same patterns in it as it did before. This will give you disproportionately high accuracy.
When used on testing data, you get an accurate measure of how your model will perform and its speed.
Figure 6: Evaluating a model
6. Parameter Tuning:
Once you have created and evaluated your model, see if its accuracy can be improved in any way. This
is done by tuning the parameters present in your model. Parameters are the variables in the model that
the programmer generally decides. At a particular value of your parameter, the accuracy will be the
maximum. Parameter tuning refers to finding these values.
7. Making Predictions
In the end, you can use your model on unseen data to make predictions accurately.
How to Implement Machine Learning Steps in Python?
You will now see how to implement a machine learning model using Python.
In this example, data collected is from an insurance company, which tells you the variables that come
into play when an insurance amount is set. Using this, you will have to predict the insurance amount
for a person. This data was collected from Kaggle.com, which has many reliable datasets.
You need to start by importing the necessary modules.
Now, clean your data by removing duplicate values, and transforming columns into numerical values
to make them easier to work with.
As you need to predict a numeral value based on some parameters, you will have to use Linear
Regression. The model needs to learn on your training set. This is done by using the '.fit' command.
Now, predict your testing dataset and find how accurate your predictions are.
Figure 15: Predicting using your model
1.0 is the highest level of accuracy you can get. Now, get your parameters. The learned parameters (coefficients) show how the various variables in your dataset affect the prediction.
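The notes describe this pipeline through screenshots that are not reproduced here. The sketch below reassembles the described steps; the file name insurance.csv and the column names (sex, smoker, region, charges) are assumptions based on the well-known Kaggle insurance dataset, not confirmed by the original:
```python
# A hedged sketch of the insurance example: load, clean, train, evaluate.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("insurance.csv")          # load the collected data (assumed file)
df = df.drop_duplicates()                  # remove duplicate rows
# Transform text columns into numerical values (assumed column names).
df = pd.get_dummies(df, columns=["sex", "smoker", "region"], drop_first=True)

X = df.drop(columns=["charges"])           # input variables
y = df["charges"]                          # insurance amount to predict
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)                # the '.fit' command from the notes

print(model.score(X_test, y_test))         # R^2 score; 1.0 is the maximum
print(model.coef_)                         # learned parameters per variable
```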
AI & ML Differences
AI is a bigger concept to create intelligent machines that can simulate human thinking capability and
behavior, whereas, machine learning is an application or subset of AI that allows machines to learn
from data without being programmed explicitly.
Below are some main differences between AI and machine learning, along with an overview of artificial intelligence and machine learning.
Artificial Intelligence
Artificial intelligence is a field of computer science which makes a computer system that can mimic human intelligence. It comprises two words, "Artificial" and "intelligence", meaning "a human-made thinking power." Hence we can define it as:
Artificial intelligence is a technology using which we can create intelligent systems that can simulate
human intelligence.
An artificial intelligence system does not need to be pre-programmed; instead, it uses algorithms that can work with their own intelligence. This involves machine learning algorithms such as reinforcement learning and deep learning neural networks. AI is used in many places, such as Siri, Google's AlphaGo, AI in chess playing, etc.
Currently, we are working with weak AI and general AI. The future of AI is strong AI, which it is said will be more intelligent than humans.
Machine learning
Machine learning is about extracting knowledge from the data. It can be defined as,
Machine learning is a subfield of artificial intelligence, which enables machines to learn from past
data or experiences without being explicitly programmed.
o Artificial intelligence is a technology which enables a machine to simulate human behavior; machine learning is a subset of AI which allows a machine to automatically learn from past data without being explicitly programmed.
o The goal of AI is to make a smart computer system, like humans, to solve complex problems; the goal of ML is to allow machines to learn from data so that they can give accurate output.
o In AI, we make intelligent systems to perform any task like a human; in ML, we teach machines with data to perform a particular task and give an accurate result.
o Machine learning and deep learning are the two main subsets of AI; deep learning is a main subset of machine learning.
o AI has a very wide scope; machine learning has a limited scope.
o AI works to create an intelligent system which can perform various complex tasks; machine learning works to create machines that can perform only those specific tasks for which they are trained.
o An AI system is concerned with maximizing the chances of success; machine learning is mainly concerned with accuracy and patterns.
o The main applications of AI are Siri, customer support chatbots, expert systems, online game playing, humanoid robots, etc.; the main applications of machine learning are online recommender systems, Google search algorithms, Facebook auto friend tagging suggestions, etc.
o On the basis of capabilities, AI can be divided into three types: weak AI, general AI, and strong AI; machine learning can be divided into three main types: supervised learning, unsupervised learning, and reinforcement learning.
o AI includes learning, reasoning, and self-correction; machine learning includes learning and self-correction when introduced to new data.
o AI deals with structured, semi-structured, and unstructured data; machine learning deals with structured and semi-structured data.
UNIT-II
CLASSIFICATION:
• Data Collection: Collecting labeled data to train the classification model is the first step. The
labeled data consists of input data points and their corresponding output labels.
• Data Preprocessing: The collected data is preprocessed to remove any noise or outliers and to
convert it into a suitable format for the machine learning algorithm.
• Feature Extraction: Features that are relevant to the problem are extracted from the input data
points. Feature extraction involves selecting the most important and informative features for the
classification task.
• Model Selection: Choosing a suitable classification algorithm is an important step. There are
various algorithms available for classification, such as logistic regression, decision trees, k-
nearest neighbors, support vector machines, and neural networks.
• Model Training: The selected model is trained on the labelled data, and the algorithm learns to
predict the correct output label for each input data point.
• Model Evaluation: The performance of the trained model is evaluated on a test dataset that
was not used during training. The evaluation metrics may include accuracy, precision, recall, and
F1-score.
• Model Deployment: The final step is to deploy the trained model to make predictions on new,
unseen data.
Overall, classification in supervised learning is a valuable technique for many applications, including
image and speech recognition, fraud detection, and spam filtering.
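A minimal end-to-end sketch of the steps above (labeled data, training, and evaluation with accuracy, precision, recall, and F1-score), using a synthetic dataset as a stand-in for collected data:
```python
# Train a classifier on labeled data and report the evaluation metrics
# named above. The synthetic dataset is an illustrative assumption.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)   # model training
y_pred = model.predict(X_test)                       # predictions on test data
print(classification_report(y_test, y_pred))         # precision/recall/F1
```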
CLASSIFICATION TECHNIQUES
There are four different types of classification tasks in machine learning, as follows:
• Binary Classification
• Multi-Class Classification
• Multi-Label Classification
• Imbalanced Classification
Binary Classification
Those classification jobs with only two class labels are referred to as binary classification.
Examples include email spam detection and medical test results. For instance, the normal condition is "not spam," while the abnormal state is "spam." Another illustration is a task involving a medical test, with a normal condition of "cancer not detected" and an abnormal state of "cancer detected."
Class label 0 is given to the class in the normal state, whereas class label 1 is given to the class in the
abnormal condition.
A model that forecasts a Bernoulli probability distribution for each case is frequently used to represent
a binary classification task.
The discrete probability distribution known as the Bernoulli distribution deals with the situation where
an event has a binary result of either 0 or 1. In terms of classification, this indicates that the model
forecasts the likelihood that an example would fall within class 1, or the abnormal state.
Popular algorithms that can be used for binary classification include:
• Logistic Regression
• Naive Bayes
• Decision Trees
Some algorithms, such as Support Vector Machines and Logistic Regression, were created expressly for
binary classification and do not by default support more than two classes.
Multi-Class Classification
The multi-class classification does not have the idea of normal and abnormal outcomes, in contrast to
binary classification. Instead, instances are grouped into one of several well-known classes.
In some cases, the number of class labels could be rather high. In a facial recognition system, for
instance, a model might predict that a shot belongs to one of thousands or tens of thousands of faces.
Text translation models and other problems involving word prediction could be categorized as a particular case of multi-class classification. Each word in the sequence of words to be predicted requires a multi-class classification, where the vocabulary size determines the number of possible classes that may be predicted and may range from tens of thousands to hundreds of thousands of words.
Multiclass classification tasks are frequently modeled using a model that forecasts a Multinoulli
probability distribution for each example.
An event that has a categorical outcome k in {1, 2, 3, ..., K} is covered by the Multinoulli distribution, which is a discrete probability distribution. In terms of classification, this implies that the model forecasts the likelihood that a given example will belong to a certain class label.
Popular algorithms that can be used for multi-class classification include:
• Gradient Boosting
• Decision Trees
• K-Nearest Neighbors
• Random Forest
• Naive Bayes
Multi-class problems can also be solved using algorithms created for binary classification. This involves fitting multiple binary classification models using one of two strategies:
One-vs-Rest: Fit a single binary classification model for each class versus all other classes.
One-vs-One: Fit a single binary classification model for each pair of classes.
The following binary classification algorithms can apply these multi-class classification techniques:
• Logistic Regression
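A minimal sketch of the one-vs-rest strategy, assuming scikit-learn: its OneVsRestClassifier wraps a binary learner and fits one model per class (scikit-learn also provides an analogous OneVsOneClassifier):
```python
# One binary model is fitted per class (that class versus all others).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)
print(len(ovr.estimators_))   # 3 underlying binary classifiers
print(ovr.predict(X[:5]))     # multi-class predictions
```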
Multi-Label Classification
Multi-label classification problems are those that feature two or more class labels and allow for the
prediction of one or more class labels for each example.
Think about the photo classification example. Here a model can predict the existence of many known objects in a photo, such as "person", "apple", "bicycle", etc. A particular photo may have multiple objects in the scene.
In multi-label classification, we have several labels that are the outputs for a given prediction. When
making predictions, a given input may belong to more than one label. For example, when predicting
a given movie category, it may belong to horror, romance, adventure, action, or all simultaneously.
This greatly contrasts with multi-class classification and binary classification, which anticipate a single
class label for each occurrence.
Multi-label classification problems are frequently modelled using a model that forecasts many outcomes,
with each outcome being forecast as a Bernoulli probability distribution. In essence, this approach
predicts several binary classifications for each example.
Classification methods designed for multi-class or binary problems cannot be directly applied to multi-label problems. So-called multi-label versions of the algorithms, which are specialized versions of the conventional classification algorithms, include multi-label decision trees, multi-label random forests, and multi-label gradient boosting machines.
Imbalanced Classification
The term "imbalanced classification" describes classification jobs where the distribution of examples
within each class is not equal.
A majority of the training dataset's instances belong to the normal class, while a minority belong to the
abnormal class, making imbalanced classification tasks binary classification tasks in general.
Examples include:
• Outlier detection
• Fraud detection
Although they could need unique methods, these issues are modeled as binary classification jobs.
By oversampling the minority class or undersampling the majority class, specialized strategies can be employed to alter the sample composition of the training dataset.
Examples comprise -
• SMOTE Oversampling
It is possible to utilize specialized modelling techniques, like the cost-sensitive machine learning
algorithms, that give the minority class more consideration when fitting the model to the training
dataset.
Examples include cost-sensitive logistic regression, cost-sensitive decision trees, and cost-sensitive support vector machines.
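A minimal cost-sensitive sketch, assuming scikit-learn: setting class_weight='balanced' makes logistic regression give the minority class more consideration, as described above. (Resampling methods such as SMOTE live in the separate imbalanced-learn package.)
```python
# A synthetic 9:1 imbalanced dataset, for illustration only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=0)

# 'balanced' reweights examples inversely to class frequency.
model = LogisticRegression(class_weight="balanced").fit(X, y)
print(model.score(X, y))
```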
Since reporting the classification accuracy may be deceptive, alternate performance indicators may be
necessary.
Examples comprise -
• F-Measure
• Recall
• Precision
There are many classification techniques in supervised learning, each with its own strengths and
weaknesses. Here are some of the most popular techniques:
• Logistic Regression: A simple and widely used classification algorithm that works well with
linearly separable data. It models the relationship between the input features and the output
label using a logistic function.
• Decision Trees: A tree-based algorithm that recursively partitions the feature space into smaller
subsets based on the input features, creating a tree-like structure. Each internal node of the tree
represents a decision based on a specific feature, and the leaf nodes represent the
predicted output labels.
• Random Forest: A popular ensemble method that combines multiple decision trees, each
trained on a random subset of the input features and training samples. It reduces overfitting
and improves accuracy and generalization.
• Support Vector Machines (SVM): A powerful algorithm that finds the hyperplane that
maximally separates the different classes in the feature space. SVM can handle non-linearly
separable data using kernel functions.
• K-Nearest Neighbors (K-NN): A non-parametric algorithm that classifies new data points by
finding the k nearest training examples and using the majority vote of their output labels.
• Naive Bayes: A probabilistic algorithm that models the relationship between the input features
and the output labels using Bayes' theorem. It assumes that the input features are independent,
which makes it computationally efficient and scalable.
• Artificial Neural Networks (ANN): A complex and powerful algorithm that simulates the
behaviour of biological neurons and learns complex representations of the input features. It can
handle non-linearly separable data and has achieved state-of-the-art performance in many
classification tasks.
These are just a few examples of the many classification techniques available in supervised learning.
The choice of technique depends on the problem at hand, the size and complexity of the data, and the
desired performance metrics.
DECISION TREE CLASSIFICATION ALGORITHM
o Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision
nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the
output of those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands
on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
o The diagram below explains the general structure of a decision tree:
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
There are various algorithms in Machine learning, so choosing the best algorithm for the given dataset
and problem is the main point to remember while creating a machine learning model. Below are the two
reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
• Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing the unwanted branches from the tree.
• Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.
How does the Decision Tree algorithm Work?
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node
of the tree. This algorithm compares the values of root attribute with the record (real dataset) attribute
and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree. The complete process can be better understood using the algorithm below:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values for the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot classify the nodes further; call the final node a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide whether to accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM). The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels. The next decision node further splits into one decision node (cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer).
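A hedged sketch of this job-offer example, with invented yes/no features (good salary, near office, cab facility) encoded as 0/1; scikit-learn is assumed:
```python
# Train a small tree on made-up candidate decisions and print its rules.
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [salary_good, distance_near, cab_facility] (1 = yes, 0 = no)
X = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 0, 0]]
y = ["Accept", "Accept", "Accept", "Decline", "Decline", "Decline"]

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(
    tree, feature_names=["salary_good", "distance_near", "cab_facility"]))
```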
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
o The Gini index can be calculated using the formula:
Gini Index = 1 - Σj (Pj)², where Pj is the proportion of samples belonging to class j.
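A tiny sketch of that formula in code:
```python
# Gini = 1 - sum of squared class probabilities.
def gini_index(class_probabilities):
    return 1.0 - sum(p ** 2 for p in class_probabilities)

print(gini_index([1.0, 0.0]))   # pure node     -> 0.0
print(gini_index([0.5, 0.5]))   # 50/50 split   -> 0.5 (max for 2 classes)
```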
A too-large tree increases the risk of overfitting, while a small tree may not capture all the important features of the dataset. A technique that decreases the size of the learning tree without reducing accuracy is known as pruning. There are mainly two types of tree pruning technique used: cost complexity pruning and reduced error pruning.
Advantages of the Decision Tree:
o It is simple to understand, as it follows the same process a human follows while making a decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes of a problem.
o There is less requirement for data cleaning compared to other algorithms.
Disadvantages of the Decision Tree:
o For more class labels, the computational complexity of the decision tree may increase.
There are three different analysis techniques. These are:
• Univariate analysis
• Bivariate analysis
• Multivariate analysis
The selection of the data analysis technique depends on the number of variables, the types of data, and the focus of the statistical inquiry. The following sections describe the three different levels of data analysis.
A univariate tree is a decision tree that considers only one input variable (i.e., one feature) at
each decision node. The tree splits the data into two subsets based on a threshold value for the chosen
feature. The process continues recursively until a stopping criterion is met, such as reaching a maximum
tree depth or a minimum number of data points in a leaf node. Univariate trees are simple
and computationally efficient, but they may not capture complex interactions between different
input variables.
Here is one example of univariate analysis: in a survey of a classroom, the researcher may be looking to count the number of boys and girls. In this instance, the data would simply reflect the number, i.e., a single variable and its quantity. The key objective of univariate analysis is to simply describe the data to find patterns within it. This is done by looking at the mean, median, mode, dispersion, variance, range, standard deviation, etc.
Univariate analysis is conducted in several ways, which are mostly descriptive in nature:
• Frequency Distribution Tables
• Histograms
• Frequency Polygons
• Pie Charts
• Bar Charts
Bivariate analysis
Bivariate analysis is slightly more analytical than univariate analysis. When the dataset contains two variables and researchers aim to undertake comparisons between them, bivariate analysis is the right technique.
Here is one simple example of bivariate analysis –
In a survey of a classroom, the researcher may be looking to analyze the ratio of students who scored above 85% with respect to their gender. In this case, there are two variables: gender = X (independent variable) and result = Y (dependent variable). A bivariate analysis will measure the correlation between the two variables.
Bivariate analysis is conducted using –
• Correlation coefficients
• Regression analysis
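A minimal sketch of a correlation coefficient, using invented temperature and ice-cream-sales figures (the same bivariate example used later in these notes):
```python
# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is r(x, y).
import numpy as np

temperature = np.array([20, 24, 28, 31, 35])
sales       = np.array([100, 130, 170, 200, 240])

print(np.corrcoef(temperature, sales)[0, 1])   # near +1: strong positive
```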
A multivariate tree, on the other hand, considers multiple input variables (i.e., multiple features) at
each decision node. Instead of choosing a single feature to split the data, the tree selects a subset of
features that best separates the data into different classes. This subset can be determined using various
methods, such as information gain or Gini impurity. The tree then recursively splits the data using the
selected features, and the process continues until a stopping criterion is met. Multivariate trees are
more complex and computationally intensive than univariate trees, but they can capture
complex interactions between input variables and improve the accuracy of the model.
In practice, the choice of a univariate or multivariate tree depends on the characteristics of the data and
the complexity of the problem. For simple problems with few input variables, a univariate tree may be
sufficient and faster to train. For more complex problems with many input variables, a multivariate
tree may be necessary to capture the interactions between features and achieve higher accuracy.
In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm and is typically used in the machine learning and natural language processing domains. At each node, ID3 chooses the attribute with the highest information gain, computed from the entropy of the class labels.
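A minimal sketch of the entropy and information-gain quantities ID3 relies on; the labels and split are invented for illustration:
```python
# Shannon entropy of labels, and the gain from splitting them into groups.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def information_gain(labels, groups):
    # Parent entropy minus the weighted entropy of the child groups.
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder

labels = ["yes", "yes", "no", "no"]
# A split that separates the classes perfectly has maximal gain:
print(information_gain(labels, [["yes", "yes"], ["no", "no"]]))  # 1.0
```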
Multivariate analysis
Multivariate analysis is a more complex form of statistical analysis and is used when there are more than two variables in the dataset.
Here is an example of multivariate analysis –
A doctor has collected data on cholesterol, blood pressure, and weight. She has also collected data on the eating habits of the subjects (e.g., how many ounces of red meat, fish, dairy products, and chocolate they consume per week). She wants to investigate the relationship between the three measures of health and eating habits.
In this instance, a multivariate analysis would be required to understand the relationship of each
variable with each other.
1. Univariate data –
This type of data consists of only one variable. The analysis of univariate data is thus the simplest form of analysis, since the information deals with only one quantity that changes. It does not deal with causes or relationships, and the main purpose of the analysis is to describe the data and find patterns that exist within it. An example of univariate data is height.
Suppose the heights of seven students in a class are recorded (figure 1); there is only one variable, height, and it does not deal with any cause or relationship. The patterns found in this type of data can be described using measures of central tendency (mean, median and mode), dispersion or spread of data (range, minimum, maximum, quartiles, variance and standard deviation), and by using frequency distribution tables, histograms, pie charts, frequency polygons and bar charts.
2. Bivariate data
This type of data involves two different variables. The analysis of this type of data deals with causes and relationships, and the analysis is done to find the relationship between the two variables. An example of bivariate data is temperature and ice cream sales in the summer season.
Suppose temperature and ice cream sales are the two variables of a bivariate dataset (figure 2). Here, the relationship is visible from the table: temperature and sales are directly proportional to each other, and thus related, because as the temperature increases, the sales also increase. Thus, bivariate data analysis involves comparisons, relationships, causes and explanations. These variables are often plotted on the X and Y axes of a graph for a better understanding of the data, and one of these variables is independent while the other is dependent.
3. Multivariate data
When the data involves three or more variables, it is categorized as multivariate. An example of this type of data: suppose an advertiser wants to compare the popularity of four advertisements on a website; their click rates could be measured for both men and women, and the relationships between variables can then be examined. It is similar to bivariate data but contains more than one dependent variable.
The way to perform analysis on this data depends on the goals to be achieved. Some of the techniques are regression analysis, path analysis, factor analysis and multivariate analysis of variance (MANOVA).
There are lots of different tools, techniques and methods that can be used to conduct an analysis. You could use software libraries, visualization tools and statistical testing methods. Here, however, we compare univariate, bivariate and multivariate analysis.
o Univariate analysis summarizes only a single variable at a time; bivariate analysis summarizes two variables; multivariate analysis summarizes more than two variables.
o Univariate analysis does not deal with causes and relationships; bivariate analysis does deal with causes and relationships; multivariate analysis does not deal with causes and relationships, but analysis is done.
o The main purpose of univariate analysis is to describe; the main purpose of bivariate analysis is to explain; the main purpose of multivariate analysis is to study the relationship among the variables.
PRUNING
Pruning is a technique used in supervised learning to prevent overfitting of a model. Overfitting occurs
when a model learns the training data too well and performs poorly on new, unseen data. Pruning
involves removing some parts of the model that are not essential to its performance, with the aim of
reducing its complexity and improving its generalization ability.
There are two main types of pruning techniques: pre-pruning and post-pruning.
• Pre-pruning: Pre-pruning involves stopping the growth of a decision tree before it becomes
too complex and overfits the training data. This can be done by setting a maximum tree depth,
a minimum number of data points in a leaf node, or a threshold for the information gain at each
decision node. Pre-pruning is simple and computationally efficient, but it may not capture
complex relationships in the data.
• Post-pruning: Post-pruning involves growing a decision tree to its maximum depth and then
removing the unnecessary branches that do not improve the model's performance on the
validation data. This can be done by calculating a measure of impurity reduction or error rate
reduction for each subtree and removing the subtree that does not meet a certain criterion. Post-
pruning is more computationally intensive than pre-pruning, but it can capture complex
relationships and improve the accuracy of the model.
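A minimal sketch of both styles with scikit-learn decision trees (assumed available): max_depth and min_samples_leaf act as pre-pruning, while the ccp_alpha cost-complexity parameter gives a post-pruning effect by removing branches that do not justify their complexity:
```python
# Compare a pre-pruned tree with a cost-complexity-pruned tree.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Pre-pruning: stop growth early via depth and leaf-size limits.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                    random_state=0).fit(X, y)

# Cost-complexity pruning: trim subtrees whose gain is below ccp_alpha.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01,
                                     random_state=0).fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())
```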
Pruning can be applied to other supervised learning algorithms as well, such as neural networks and
support vector machines. In neural networks, pruning can involve removing some of the neurons or
connections that are not essential to the model's performance. In support vector machines, pruning can
involve removing some of the support vectors or adjusting the regularization parameter to control
the model's complexity.
Overall, pruning is an important technique in supervised learning to prevent overfitting and improve
the generalization ability of a model.
Bayesian Decision Theory is a framework for decision making in the presence of uncertainty. It provides
a way to make decisions by explicitly considering probabilities and the consequences of different actions.
The goal of Bayesian decision theory is to choose the action that maximizes the expected utility, which
is a measure of the desirability of different outcomes.
The basic idea of Bayesian decision theory is to model the problem as a probabilistic graphical model,
which captures the relationships between different variables and their probabilities. The model includes
a set of actions, a set of possible outcomes, and a set of features or observations that provide information
about the state of the system.
To make a decision, Bayesian decision theory calculates the expected utility of each action, which is the
sum of the utilities of all possible outcomes weighted by their probabilities. The utility function
represents the preferences of the decision maker and assigns a value to each outcome based on its
desirability.
Bayesian decision theory also incorporates prior beliefs about the probabilities and outcomes, which can
be updated based on new information using Bayes' theorem. This allows the decision maker to adapt to
changing circumstances and update their beliefs as new data becomes available.
Bayesian decision theory has applications in many areas of science and engineering, including
economics, finance, engineering, and artificial intelligence. It provides a principled way to make
decisions based on probabilities and expected utilities, which can help to improve the quality and
consistency of decision making.
Bayesian decision theory (i.e., the Bayesian decision rule) predicts the outcome not only based on previous observations but also by taking into account the current situation. The rule describes the most reasonable action to take based on an observation, using Bayes' theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X), where:
• P(Ci): Prior probability. This accounts for how many times the class Ci occurred independently of any conditions (i.e., regardless of the input X).
• P(X|Ci): Likelihood. Under some conditions X, how many times the outcome Ci occurred.
• P(X): Evidence. The number of times the conditions X occurred.
• P(Ci|X): Posterior. The probability that the outcome Ci occurs given some conditions X.
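A tiny numeric sketch of this rule; the spam-filter probabilities are invented for illustration:
```python
# Posterior P(Ci|X) = P(X|Ci) * P(Ci) / P(X) for a two-class problem.
prior = {"spam": 0.3, "not_spam": 0.7}        # P(Ci)
likelihood = {"spam": 0.8, "not_spam": 0.1}   # P(X|Ci) for an observed X

# Evidence P(X) = sum over classes of P(X|Ci) * P(Ci)
evidence = sum(likelihood[c] * prior[c] for c in prior)

posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)                          # P(Ci|X) for each class
print(max(posterior, key=posterior.get))  # decision: the most probable class
```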
PARAMETRIC METHODS
The basic idea is that there is a set of fixed parameters that determine a probability model.
Parametric methods are often those for which we know that the population is approximately normal, or
we can approximate using a normal distribution after we invoke the central limit theorem.
Parametric statistics are based on assumptions about the distribution of population from which
the sample was taken. Nonparametric statistics are not based on assumptions, that is, the data can be
collected from a sample that does not follow a specific distribution.
Parametric methods are a type of supervised learning algorithm that assumes the data follows a
particular distribution or functional form. The goal of a parametric method is to estimate the
parameters of this distribution or function based on the training data, and then use these parameters to
make predictions on new data.
Some common examples of parametric methods in supervised learning include linear regression, logistic
regression, and Naive Bayes. These methods make certain assumptions about the underlying distribution
or function, such as linearity in the case of linear regression, and then estimate the parameters that best
fit the data using maximum likelihood or other statistical techniques.
The main advantage of parametric methods is that they are often computationally efficient and can work
well with small to moderate-sized datasets. They also provide interpretable models that can help to
explain the relationships between the input variables and the output variable.
However, the main disadvantage of parametric methods is that they can be limited by the assumptions
made about the underlying distribution or function. If these assumptions are not valid, the model may
not accurately capture the true relationship between the input and output variables, and may lead to poor
predictions. In addition, parametric methods may not be able to capture complex nonlinear relationships
in the data.
Overall, parametric methods are a useful tool in supervised learning, particularly for simple and well-
understood problems where the assumptions made about the underlying distribution or function are
valid. However, for more complex problems or situations where the assumptions do not hold, other
methods such as non-parametric or semi-parametric methods may be more appropriate.
Some common examples of parametric machine learning algorithms are:
• Logistic Regression
• Linear Discriminant Analysis
• Perceptron
• Naive Bayes
• Simple Neural Networks
Limitations of Parametric Machine Learning Algorithms:
Constrained: By choosing a functional form these methods are highly constrained to the specified
form.
Limited Complexity: The methods are more suited to simpler problems.
Poor Fit: In practice the methods are unlikely to match the underlying mapping function.
MAXIMUM LIKELIHOOD ESTIMATION
Maximum Likelihood Estimation (MLE) is a method used in supervised learning to estimate the
parameters of a model that best explain the observed data. The goal of MLE is to find the set of parameter
values that maximize the likelihood of observing the data given the model.
In supervised learning, MLE is commonly used to estimate the parameters of a probabilistic model, such
as a Gaussian distribution or a logistic regression model. The likelihood function is defined as the
probability of observing the training data given the model parameters. The maximum likelihood estimate
of the parameters is the set of values that maximize the likelihood function.
The general MLE procedure is as follows:
• Choose a probability distribution or functional form that describes the relationship between the
input variables and the output variable.
• Define the likelihood function as the probability of observing the training data given the model
parameters.
• Take the logarithm of the likelihood function to simplify the calculations and convert the
product of probabilities to a sum of logarithms.
• Maximize the logarithm of the likelihood function with respect to the parameters using
optimization techniques such as gradient descent or Newton's method.
• Once the maximum likelihood estimate of the parameters is obtained, the model can be used to
make predictions on new data.
MLE is a powerful technique in supervised learning because it provides a way to estimate the parameters
of a model that best fit the observed data. However, it assumes that the model is correctly specified and
that the training data is representative of the population. If these assumptions do not hold, the MLE
estimate may be biased or have high variance. Therefore, it is important to carefully validate the model
and the data before using MLE for parameter estimation.
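As a minimal sketch of this procedure, the snippet below computes the MLE of a Gaussian's mean and variance, for which maximizing the log-likelihood has the closed-form answers shown in the comments; the data here is synthetic.

# Maximum likelihood estimation for a Gaussian (synthetic data)
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)

mu_mle = x.mean()                     # argmax of the log-likelihood in mu
var_mle = ((x - mu_mle) ** 2).mean()  # argmax in sigma^2 (divides by N, not N-1)
print(mu_mle, var_mle)                # close to the true values 5.0 and 4.0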
https://drive.google.com/file/d/1iWpQCYLJisBe8IXeWUmBVS2ehOKEUejW/view?usp=drive_link
Bias and Variance of an Estimator
Evaluating the bias and variance of an estimator is an important step in supervised learning to
assess the performance of the model and identify any issues that need to be addressed. Bias refers to
the systematic error of the model, while variance refers to the variability of the model's predictions.
In the context of supervised learning, bias and variance can be evaluated using a technique called
cross-validation. Cross-validation involves partitioning the data into training and validation sets, and
then fitting the model to the training set and evaluating its performance on the validation set. This process
is repeated several times, each time using a different partition of the data, to obtain an estimate of the
model's bias and variance.
The bias of the estimator can be estimated by comparing the predictions of the model to the true values
in the validation set. If the model consistently underestimates or overestimates the true values, it has a
bias. The bias can be reduced by using a more flexible model or increasing the size of the training set.
The variance of the estimator can be estimated by comparing the predictions of the model across different
partitions of the data. If the model's predictions vary widely across different partitions, it has high
variance. High variance can be reduced by using a simpler model or by regularizing the model to prevent
overfitting.
In addition to evaluating bias and variance, other metrics can also be used to assess the
performance of a model in supervised learning, such as accuracy, precision, recall, F1 score, and
area under the receiver operating characteristic (ROC) curve. These metrics can help to provide a
more comprehensive evaluation of the model's performance and identify areas for improvement.
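A rough sketch of such an evaluation is shown below: a linear model is repeatedly fit to resampled training sets drawn from a known nonlinear function, and the squared bias and variance of its predictions are measured. The data-generating function and all sizes are assumptions made only for this illustration.

# Estimating the bias and variance of a linear model across resampled fits
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 50).reshape(-1, 1)
true_test = np.sin(2 * np.pi * x_test).ravel()   # the "true" function

preds = []
for _ in range(100):                              # 100 resampled training sets
    x = rng.uniform(0, 1, (30, 1))
    y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.3, 30)
    preds.append(LinearRegression().fit(x, y).predict(x_test))

preds = np.array(preds)
bias_sq = ((preds.mean(axis=0) - true_test) ** 2).mean()  # systematic error
variance = preds.var(axis=0).mean()                       # spread across fits
print(bias_sq, variance)   # a straight line on a sine shows high bias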
Bayes Estimator
The Bayes' estimator is a method used in supervised learning to estimate the unknown parameter of a
statistical model based on Bayes' theorem. It is also known as the posterior mean or the conditional
expectation.
Bayes' theorem states that the posterior probability of a parameter given the data is proportional to the
likelihood of the data given the parameter and the prior probability of the parameter. Mathematically,
it can be written as:
P(θ|X) = P(X|θ) P(θ) / P(X)
where θ is the unknown parameter, X is the observed data, P(θ|X) is the posterior probability of θ
given X, P(X|θ) is the likelihood of X given θ, and P(θ) is the prior probability of θ.
The Bayes' estimator is the expected value of the parameter given the observed data, and can be
computed using the posterior distribution of the parameter. Mathematically, it can be written as:
θ^B = E[θ | X] = ∫ θ P(θ|X) dθ
where θ^B is the Bayes' estimator of θ, and the integral is taken over the entire parameter space.
The Bayes' estimator is often used in supervised learning when the prior distribution of the parameter
is known or can be assumed, and when the likelihood function is well-defined. It provides a way to
incorporate prior knowledge about the parameter into the estimation process, and can lead to more robust
and accurate estimates.
However, the Bayes' estimator requires the specification of a prior distribution, which may not always
be easy to determine. The choice of the prior can also affect the resulting estimate, and different priors
can lead to different estimates. Therefore, it is important to carefully choose the prior based on prior
knowledge or expert opinion, or use non-informative priors that do not affect the estimation process.
Linear Discriminant
Linear Discriminant Analysis (LDA) is a statistical method used in supervised learning to find a linear
combination of features that best separates two or more classes. It is commonly used in classification
problems, where the goal is to predict the class of a new observation based on its features.
LDA works by modeling the distribution of the features for each class, and then finding a linear boundary
that maximally separates the classes. Specifically, LDA seeks to find a linear discriminant function that
maximizes the between-class variance and minimizes the within-class variance.
To find the discriminant function, LDA first computes the mean and covariance matrix for each class.
It then computes a weighted average of the class covariance matrices, where the weights are proportional
to the number of observations in each class. This weighted average is used to estimate the overall
covariance matrix of the data.
Next, LDA computes the eigenvectors and eigenvalues of the overall covariance matrix. The
eigenvectors represent the directions of maximum variance in the data, and the eigenvalues represent the
variance along each eigenvector. LDA then selects the eigenvectors corresponding to the largest
eigenvalues, and uses them to form a linear discriminant function.
The linear discriminant function can be used to project new observations onto a lower-dimensional
space, where they can be classified based on their position relative to the linear boundary. Alternatively,
it can be used to assign a class probability to each observation, based on the distance between the
observation and the linear boundary.
LDA is a powerful and widely used method in supervised learning, and has been shown to perform well
on a variety of classification problems. However, it assumes that the data is normally distributed and
that the classes have equal covariance matrices, which may not always be true in practice.
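A minimal classification sketch with scikit-learn's LDA is given below; the dataset, split and n_components value are illustrative.

# Linear Discriminant Analysis for classification and projection
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis(n_components=2).fit(X_train, y_train)
print(lda.score(X_test, y_test))    # classification accuracy
print(lda.transform(X_test).shape)  # projection onto the 2 discriminant axes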
Gradient Descent
Gradient Descent is a popular optimization algorithm used in supervised learning to minimize the cost
function of a model. In supervised learning, the goal is to learn a model that can make accurate
predictions on new, unseen data. To achieve this, we need to optimize the parameters of the model to
minimize the difference between the predicted output and the actual output.
The cost function is a measure of the difference between the predicted output and the actual output,
and is typically defined as the mean squared error, cross-entropy, or another appropriate metric
depending on the problem. The goal of Gradient Descent is to find the set of parameters that minimizes
the cost function.
The basic idea behind Gradient Descent is to iteratively update the parameters in the direction of the
negative gradient of the cost function. The negative gradient of the cost function tells us the direction
of steepest descent, or the direction in which the cost function decreases the most. By taking small
steps in this direction, we can iteratively approach the optimal set of parameters.
There are several variants of Gradient Descent, including Batch Gradient Descent, Stochastic Gradient
Descent, and Mini-batch Gradient Descent. Batch Gradient Descent computes the gradient of the cost
function with respect to all the training examples at once and updates the parameters accordingly.
Stochastic Gradient Descent updates the parameters based on the gradient of the cost function with
respect to a single training example at a time. Mini-batch Gradient Descent is a compromise between
these two approaches, and updates the parameters based on the gradient of the cost function with respect
to a small batch of training examples at a time.
Gradient Descent is a powerful and widely used optimization algorithm in supervised learning, and is
used to train many popular models, including linear regression, logistic regression, and neural networks.
However, it can be sensitive to the choice of learning rate and can get stuck in local optima. To mitigate
these issues, various extensions and modifications to Gradient Descent have been proposed, including
adaptive learning rate methods and momentum-based methods.
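The sketch below implements plain batch gradient descent for a one-variable linear regression with a mean squared error cost; the learning rate, iteration count and synthetic data are illustrative choices.

# Batch gradient descent for y = w*x + b under an MSE cost (synthetic data)
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 0.1, 100)

w, b, lr = 0.0, 0.0, 0.5
for _ in range(1000):
    error = (w * x + b) - y
    grad_w = 2 * (error * x).mean()   # d(MSE)/dw
    grad_b = 2 * error.mean()         # d(MSE)/db
    w -= lr * grad_w                  # step against the gradient
    b -= lr * grad_b
print(w, b)   # approaches the true values 3.0 and 2.0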
Logistic Discrimination
Logistic Regression is a popular classification algorithm in supervised learning that is used to model the
probability of a binary response variable based on one or more predictor variables. It is a type of
discriminative model that seeks to learn a decision boundary that separates the two classes.
The decision boundary is represented by a linear function of the predictor variables, where the output
of the function is passed through the logistic function to obtain the predicted probability. The logistic
function, also known as the sigmoid function, maps any real-valued input to a value between 0 and 1,
which can be interpreted as the probability of belonging to the positive class.
The logistic regression model is trained using maximum likelihood estimation, where the goal is to find
the set of parameters that maximize the likelihood of the observed data. The likelihood function is
defined as the product of the conditional probabilities
of the response variable given the predictor variables, where the probabilities are modeled using the
logistic function.
The logistic regression model can be extended to handle multiclass classification problems by using a
one-vs-rest approach, where a separate logistic regression model is trained for each class against the
remaining classes. Alternatively, a multinomial logistic regression model, also known as softmax
regression, can be used to directly model the probabilities of each class.
The Logistic Regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get the Logistic Regression equation are given below:
o We know the equation of the straight line: y = b0 + b1x1 + b2x2 + ... + bnxn.
o In Logistic Regression y can only be between 0 and 1, so we consider the odds y / (1 − y),
which is 0 for y = 0 and infinity for y = 1.
o But we need a range between -[infinity] and +[infinity], so taking the logarithm of the equation it
becomes:
log[y / (1 − y)] = b0 + b1x1 + b2x2 + ... + bnxn
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered
types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".
Logistic Regression is a simple yet powerful algorithm that is widely used in supervised learning
for its interpretability and ease of use. However, it assumes that the decision boundary is linear in
the predictor variables, which may not always be true in practice. Various extensions to logistic
regression have been proposed to handle non-linear relationships, including polynomial regression
and kernel logistic regression.
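A minimal binary-classification sketch with scikit-learn's logistic regression follows; the dataset and max_iter value are illustrative.

# Logistic regression: a linear decision function passed through the sigmoid
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=10000).fit(X_train, y_train)
print(clf.score(X_test, y_test))       # accuracy
print(clf.predict_proba(X_test[:3]))   # sigmoid outputs, each row sums to 1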
Linear Regression
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent (y) and one or more
independent (x) variables, hence it is called linear regression. Since linear regression shows a
linear relationship, it finds how the value of the dependent variable changes according
to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship between the
variables. Consider the below image:
y = a0 + a1x + ε
Here, y is the dependent variable, x is the independent variable, a0 is the intercept of the line, a1 is
the linear regression coefficient, and ε is the random error.
The values for the x and y variables are training datasets for the Linear Regression model representation.
Linear regression can be further divided into two types of the algorithm:
o Simple Linear Regression: If a single independent variable is used to predict the value of a
numerical dependent variable, such a Linear Regression algorithm is called Simple Linear Regression.
o Multiple Linear Regression: If more than one independent variable is used to predict the value of a
numerical dependent variable, such a Linear Regression algorithm is called Multiple Linear Regression.
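As a short sketch of Simple Linear Regression, the snippet below estimates a0 and a1 by ordinary least squares on synthetic data; the closed forms in the comments follow from minimizing the squared error ε.

# Simple Linear Regression by ordinary least squares (synthetic data)
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 4.0 + 1.5 * x + rng.normal(0, 1.0, 50)      # y = a0 + a1*x + noise

a1 = np.cov(x, y, bias=True)[0, 1] / x.var()    # slope = cov(x, y) / var(x)
a0 = y.mean() - a1 * x.mean()                   # intercept
print(a0, a1)   # close to the true 4.0 and 1.5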
What is Backpropagation?
Backpropagation is the essence of neural network training. It is the method of fine-tuning the weights of
a neural network based on the error rate obtained in the previous epoch (i.e., iteration). Proper
tuning of the weights allows you to reduce error rates and make the model reliable by increasing its
generalization.
Backpropagation in neural network is a short form for “backward propagation of errors.” It is a standard
method of training artificial neural networks. This method helps calculate the gradient of a loss function
with respect to all the weights in the network.
The backpropagation algorithm in a neural network computes the gradient of the loss function for a single
weight by the chain rule. It efficiently computes one layer at a time, unlike a naive direct computation.
It computes the gradient, but it does not define how the gradient is used. It generalizes the computation
in the delta rule.
There are two types of backpropagation networks:
-Static back-propagation
-Recurrent backpropagation
Static back-propagation:
It is one kind of backpropagation network which produces a mapping of a static input for static output.
It is useful to solve static classification issues like optical character recognition.
Recurrent Backpropagation:
In recurrent backpropagation, activations are fed forward until a fixed value is achieved. After that,
the error is computed and propagated backward.
MULTILAYER PERCEPTRON
A multilayer perceptron (MLP) is a type of artificial neural network commonly used in supervised
learning for classification and regression tasks. It consists of multiple layers of interconnected nodes,
also known as neurons, that are organized into input, hidden, and output layers.
The input layer receives the input data, which is then passed through the hidden layers to the output
layer, where the final prediction is made. Each neuron in the hidden layers is connected to every
neuron in the previous layer, and each connection is associated with a weight, which determines the
strength of the connection.
During training, the weights of the MLP are updated using backpropagation, which involves computing
the gradient of the loss function with respect to the weights and adjusting the weights in the direction of
the gradient to minimize the loss. The loss function typically measures the difference between the
predicted output and the true output, and the goal of training is to find the set of weights that minimize
the loss on the training data.
MLPs are powerful models that can learn complex nonlinear relationships between the input and output
variables. However, they are prone to overfitting, where the model learns to memorize the training data
instead of generalizing to new data. Regularization techniques, such as L1 and L2 regularization,
dropout, and early stopping, can be used to prevent overfitting and improve the generalization
performance of the model.
MLPs have been successfully applied to a wide range of applications, including image classification,
speech recognition, natural language processing, and financial modeling.
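The compact sketch below ties the two previous sections together: a one-hidden-layer MLP trained with backpropagation to learn XOR. The layer sizes, learning rate, epoch count and seed are illustrative, and a different seed may need more epochs.

# Backpropagation for a tiny MLP learning XOR (illustrative sketch)
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros((1, 1))
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for _ in range(5000):
    h = sigmoid(X @ W1 + b1)             # forward pass: hidden layer
    out = sigmoid(h @ W2 + b2)           # forward pass: output layer
    d_out = (out - y) * out * (1 - out)  # backward pass: output-layer delta
    d_h = (d_out @ W2.T) * h * (1 - h)   # chain rule back to the hidden layer
    W2 -= 0.5 * h.T @ d_out; b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h;   b1 -= 0.5 * d_h.sum(axis=0)
print(out.round(2).ravel())              # approaches [0, 1, 1, 0]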
UNIT-III
Clustering
Clustering is the task of dividing the population or data points into a number of groups such that data
points in the same groups are more similar to other data points in the same group and dissimilar to the
data points in other groups. It is basically a collection of objects on the basis of similarity and
dissimilarity between them.
For example, the data points in the graph below clustered together can be classified into one single group.
We can distinguish the clusters, and we can identify that there are 3 clusters in the below picture.
Clustering Methods:
Density-Based Methods: These methods consider the clusters as the dense region having some
similarities and differences from the lower dense region of the space. These methods have good accuracy
and the ability to merge two clusters.
Example: DBSCAN (Density-Based Spatial Clustering of Applications with Noise),
Hierarchical Based Methods: The clusters formed in this method form a tree-type structure based on
the hierarchy. New clusters are formed using the previously formed one.
It is divided into two categories-
• Agglomerative (bottom-up approach)
• Divisive (top-down approach)
Partitioning Methods: These methods partition the objects into k clusters, and each partition forms
one cluster. These methods optimize an objective criterion similarity function, for example when
distance is a major parameter.
Example: K-means, CLARANS (Clustering Large Applications based upon Randomized Search), etc.
Grid-based Methods: In this method, the data space is formulated into a finite number of cells that
form a grid-like structure. All the clustering operations done on these grids are fast and independent of
the number of data objects
Example: STING (Statistical Information Grid), wave cluster, CLIQUE (CLustering In Quest), etc.
Clustering Algorithms
The clustering algorithm is based on the kind of data that we are using. Such as, some algorithms need
to guess the number of clusters in the given dataset, whereas some are required to find the minimum
distance between the observations of the dataset.
Mainly popular Clustering algorithms that are widely used in machine learning are:
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It
classifies the dataset by dividing the samples into different clusters of equal variances. The
number of clusters must be specified in this algorithm. It is fast with fewer computations
required, with the linear complexity of O(n).
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the smooth density of
data points. It is an example of a centroid-based model, that works on updating the candidates
for centroid to be the center of the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications with Noise.
It is an example of a density-based model similar to the mean-shift, but with some remarkable
advantages. In this algorithm, the areas of high density are separated by the areas of low density.
Because of this, the clusters can be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an alternative
for the k-means algorithm or for those cases where k-means can fail. In GMM, it is
assumed that the data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical algorithm performs the
bottom-up hierarchical clustering. In this, each data point is treated as a single cluster at the outset
and then successively merged. The cluster hierarchy can be represented as a tree- structure.
6. Affinity Propagation: It is different from other clustering algorithms as it does not require
specifying the number of clusters. In this, each data point sends a message between the pair of
data points until convergence. It has O(N²T) time complexity, which is the main drawback of
this algorithm.
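To make the contrast concrete, the sketch below runs two of these algorithms on the same two-moons data, a shape K-means handles poorly and DBSCAN handles well; the dataset and parameter values are illustrative.

# K-means vs. DBSCAN on non-convex clusters (illustrative sketch)
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)  # arbitrary shapes
print(set(km_labels), set(db_labels))   # DBSCAN marks noise points as -1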
Applications of Clustering
Below are some commonly known applications of clustering technique in Machine Learning:
o In Identification of Cancer Cells: The clustering algorithms are widely used for the identification
of cancerous cells. It divides the cancerous and non-cancerous data sets into different groups.
o In Search Engines: Search engines also work on the clustering technique. The search result
appears based on the closest object to the search query. It does it by grouping similar data
objects in one group that is far from the other dissimilar objects. The accurate result of a query
depends on the quality of the clustering algorithm used.
o Customer Segmentation: It is used in market research to segment the customers based on their
choice and preferences.
o In Biology: It is used in the biology stream to classify different species of plants and animals
using the image recognition technique.
o In Land Use: The clustering technique is used in identifying areas of similar land use in the
GIS database. This can be very useful for finding the purpose for which a particular area of
land is most suitable.
Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to
many thousands of dimensions. Such high-dimensional spaces of data are often encountered in areas
such as medicine, where DNA microarray technology can produce many measurements at once, and
the clustering of text documents, where, if a word-frequency vector is used, the number of dimensions
equals the size of the vocabulary.
Four problems need to be overcome for clustering in high-dimensional data:
• Multiple dimensions are hard to think in, impossible to visualize, and, due to the exponential
growth of the number of possible values with each dimension, complete enumeration of all
subspaces becomes intractable with increasing dimensionality. This problem is known as the
curse of dimensionality.
• The concept of distance becomes less precise as the number of dimensions grows, since the
distances between any two points in a given dataset converge towards each other. In particular,
discriminating between the nearest and the farthest point becomes meaningless.
• A cluster is intended to group objects that are related, based on observations of their attribute's
values. However, given a large number of attributes some of the attributes will usually not be
meaningful for a given cluster.
For example, in newborn screening a cluster of samples might identify new-borns that share
similar blood values, which might lead to insights about the relevance of certain blood values for
a disease. But for different diseases, different blood values might form a cluster, and other values
might be uncorrelated. This is known as the local feature relevance problem: different clusters
might be found in different subspaces, so a global filtering of attributes is not sufficient.
• Given a large number of attributes, it is likely that some attributes are correlated. Hence, clusters
might exist in arbitrarily oriented affine subspaces.
Mixture densities
In unsupervised learning, the calculation of the mixture density involves estimating the parameters of a
mixture model from the observed data. The mixture density represents the probability density function
(PDF) of the observed data, which is a combination of multiple component densities.
Here's a general overview of how the mixture density is calculated in unsupervised learning:
o Choose the Mixture Model: Select the type of mixture model that best suits the data distribution.
Common choices include Gaussian Mixture Models (GMMs) or other types of mixture models like
Dirichlet Process Mixtures.
o Specify the Number of Components: Determine the number of components (clusters) in the mixture
model. This can be done based on prior knowledge or using techniques such as model selection
criteria (e.g., AIC, BIC) or cross-validation.
o Initialize the Model Parameters: Initialize the parameters of the mixture model, including the
mixing proportions and the parameters of each component distribution (e.g., mean, covariance for
Gaussian components).
o E-step (Expectation Step): Given the current parameter estimates, calculate the posterior
probabilities or responsibilities of each component for each data point. This step is often computed
using Bayes' theorem or the posterior probability of the latent variables given the observed data.
o M-step (Maximization Step): Update the model parameters based on the responsibilities obtained in
the E-step. This typically involves maximizing the likelihood or maximizing the expected
complete-data log-likelihood.
o Iterative Optimization: Iterate between the E-step and M-step until convergence. The convergence
criteria can be based on the change in log-likelihood or a predetermined number of iterations.
o Compute the Mixture Density: Once the mixture model parameters have converged, the mixture
density can be computed by combining the densities of each component, weighted by the
corresponding mixing proportions.
o Utilize the Mixture Density: The calculated mixture density can be used for various purposes, such
as clustering, density estimation, anomaly detection, or generating new samples from the learned
distribution.
It's important to note that the specific algorithms and techniques used for the estimation and calculation
of the mixture density may vary depending on the chosen mixture model and the inference method
employed (e.g., EM algorithm, variational inference, etc.).
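A minimal sketch of the whole pipeline with scikit-learn's Gaussian Mixture Model (which runs EM internally) is shown below; the synthetic data and component count are illustrative.

# Fitting a GMM by EM and evaluating the learned mixture density
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM inside
print(gmm.weights_)                      # mixing proportions
print(gmm.predict_proba(X[:3]))          # E-step responsibilities
print(np.exp(gmm.score_samples(X[:3])))  # mixture density at these points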
Gaussian Mixture Models (GMMs) assume that there are a certain number of Gaussian distributions,
and each of these distributions represents a cluster. Hence, a Gaussian Mixture Model tends to group the
data points belonging to a single distribution together.
Gaussian Mixture Models are probabilistic models and use the soft clustering approach for distributing
the points in different clusters.
Here, we have three clusters that are denoted by three colors – Blue, Green, and Cyan. Let's take the
data point highlighted in red. The probability of this point being a part of the blue cluster is 1, while the
probability of it being a part of the green or cyan clusters is 0.
Now, consider another point – somewhere in between the blue and cyan (highlighted in the below
figure). The probability that this point is a part of cluster green is 0, right? And the probability that
this belongs to blue and cyan is 0.2 and 0.8 respectively.
In a one-dimensional space, the probability density function of a Gaussian distribution is given by:
f(x | μ, σ²) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²))
In higher dimensions, x is the input vector, μ is the mean vector (2-D in this example), and Σ is the
2×2 covariance matrix. The covariance defines the shape of the curve, and the same form generalizes
to d dimensions.
K-Means Clustering
The K-means clustering algorithm computes centroids and repeats until the optimal centroid is found.
It is presumptively known how many clusters there are. It is also known as the flat clustering
algorithm. The number of clusters found from data by the method is denoted by the letter 'K' in
K-means.
The elbow method is a graphical way of finding the optimal 'K' in K-means clustering. It works by
computing the WCSS (Within-Cluster Sum of Squares), i.e. the sum of the squared distances between the
points in a cluster and the cluster centroid. The elbow graph shows WCSS values (on the y-axis)
corresponding to the different values of K (on the x-axis). When we see an elbow shape in the graph, we
pick the K-value where the elbow gets created; we can call this point the Elbow point. Beyond the Elbow
point, increasing the value of 'K' does not lead to a significant reduction in WCSS.
Example:
Cluster the following eight points (with (x, y) representing locations) into three clusters:
A1 (2, 10), A2 (2, 5), A3 (8, 4), A4 (5, 8), A5 (7, 5), A6 (6, 4), A7 (1, 2), A8 (4, 9)
Initial cluster centres are: A1 (2, 10), A4 (5, 8) and A7 (1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is the Manhattan distance:
d(a, b) = |x2 − x1| + |y2 − y1|
Iteration-01:
Point       Distance from C1(2, 10)   Distance from C2(5, 8)   Distance from C3(1, 2)   Cluster
A1(2, 10)   0                         5                        9                        C1
A2(2, 5)    5                         6                        4                        C3
A3(8, 4)    12                        7                        9                        C2
A4(5, 8)    5                         0                        10                       C2
A5(7, 5)    10                        5                        9                        C2
A6(6, 4)    10                        5                        7                        C2
A7(1, 2)    9                         10                       0                        C3
A8(4, 9)    3                         2                        10                       C2
Cluster-01:
First cluster contains points-
• A1(2, 10)
Cluster-02:
Second cluster contains points-
• A3(8, 4)
• A4(5, 8)
• A5(7, 5)
• A6(6, 4)
• A8(4, 9)
Cluster-03:
Third cluster contains points-
• A2(2, 5)
• A7(1, 2)
Now,
• We re-compute the new cluster centers.
• The new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01:
We have only one point A1 (2, 10) in Cluster-01.
• So, cluster center remains the same.
For Cluster-02:
Center of Cluster-02
= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)
= (6, 6)
For Cluster-03:
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)
Iteration-02 (with new centers C1(2, 10), C2(6, 6), C3(1.5, 3.5)):
Point       Distance from C1(2, 10)   Distance from C2(6, 6)   Distance from C3(1.5, 3.5)   Cluster
A1(2, 10)   0                         8                        7                            C1
A2(2, 5)    5                         5                        2                            C3
A3(8, 4)    12                        4                        7                            C2
A4(5, 8)    5                         3                        8                            C2
A5(7, 5)    10                        2                        7                            C2
A6(6, 4)    10                        2                        5                            C2
A7(1, 2)    9                         9                        2                            C3
A8(4, 9)    3                         5                        8                            C1
Repeat the procedure until the cluster centers no longer change, i.e. until the clusters converge.
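The short sketch below replays the worked example with NumPy, using the same Manhattan distance and the same initial centres; it assumes, as in the example, that no cluster ever ends up empty.

# K-means on the eight points above, with Manhattan distance
import numpy as np

pts = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centres = pts[[0, 3, 6]].copy()            # A1, A4, A7

for _ in range(10):
    d = np.abs(pts[:, None, :] - centres[None, :, :]).sum(axis=2)  # Manhattan
    labels = d.argmin(axis=1)              # assign points to the nearest centre
    new = np.array([pts[labels == k].mean(axis=0) for k in range(3)])
    if np.allclose(new, centres):          # stop when centres stop moving
        break
    centres = new
print(labels)
print(centres)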
Expectation-Maximization Algorithm
The Expectation-Maximization algorithm aims to use the available observed data of the dataset
to estimate the missing data of the latent variables and then using that data to update the values of the
parameters in the maximization step.
Let us understand the EM algorithm in a detailed manner:
• Initialization Step: In this step, we initialized the parameter values with a set of initial values,
then give the set of incomplete observed data to the system with the assumption that the observed
data comes from a specific model i.e, probability distribution.
• Expectation Step: In this step, we use the observed data to estimate or guess the values of the
missing or incomplete data; these estimates are then used to update the latent variables.
• Maximization Step: In this step, we use the complete data generated in the “Expectation” step
to update the values of the parameters i.e, update the hypothesis.
• Checking of convergence Step: Now, in this step, we check whether the values are converging;
if yes, we stop, otherwise we repeat the “Expectation” and “Maximization” steps until
convergence occurs.
Applications of EM Algorithm
The latent variable model has several real-life applications in Machine learning:
• Used to calculate the Gaussian density of a function.
• Helpful to fill in the missing data during a sample.
• It finds plenty of use in different domains such as Natural Language Processing
(NLP), Computer Vision, etc.
• Used in image reconstruction in the field of Medicine and Structural Engineering.
• Used for estimating the parameters of the Hidden Markov Model (HMM) and also for some
other mixed models like Gaussian Mixture Models, etc.
• Used for finding the values of latent variables.
Hierarchical Clustering
Hierarchical clustering is another unsupervised learning algorithm that is used to group together the
unlabelled data points having similar characteristics.
The closest distance between the two clusters is crucial for the hierarchical clustering. There are
various ways to calculate the distance between two clusters, and these ways decide the rule for
clustering. These measures are called Linkage methods. Some of the popular linkage methods are
given below:
1. Single Linkage: It is the Shortest Distance between the closest points of the clusters. Consider
the below image:
2. Complete Linkage: It is the farthest distance between the two points of two different clusters. It
is one of the popular linkage methods as it forms tighter clusters than single-linkage.
3. Average Linkage: It is the linkage method in which the distance between each pair of datasets
is added up and then divided by the total number of datasets to calculate the average distance between
two clusters. It is also one of the most popular linkage methods.
4. Centroid Linkage: It is the linkage method in which the distance between the centroid of the
clusters is calculated.
Divisive hierarchical algorithms − On the other hand, in divisive hierarchical algorithms, all the data
points are treated as one big cluster and the process of clustering involves dividing (Top-downapproach)
the one big cluster into various small clusters.
A dendrogram, which is a tree like structure, is used to represent hierarchical clustering. Individual
objects are represented by leaf nodes and the clusters are represented by root nodes. A representation
of dendrogram is shown in this figure:
Agglomerative Algorithm: Single Link
Single-nearest distance or single linkage is the agglomerative method that uses the distance between
the closest members of the two clusters. We will now solve a problem to understand it better:
Question Find the clusters using a single link technique. Use Euclidean distance and draw the
dendrogram.
Sample No. X Y
P1 0.40 0.53
P2 0.22 0.38
P3 0.35 0.32
P4 0.26 0.19
P5 0.08 0.41
P6 0.45 0.30
So we have to find the Euclidean distance between each and every point, say we first find the
Euclidean distance between P1 and P2
Distance Matrix:
Similarly, find the Euclidean distance for every pair of points. Note that the diagonal of the
distance matrix is special: the distances above and below the diagonal are the same. For example,
d(P2, P5) is equivalent to d(P5, P2), so we only need to fill in the lower section of the matrix.
Therefore, the updated Distance Matrix will be:
Step 2: Merging the two closest members of the two clusters and finding the minimum element in
distance matrix. Here the minimum value is 0.10 and hence we combine P3 and P6 (as 0.10 came in
the P6 row and P3 column). Now, form clusters of elements corresponding to the minimum value and
update the distance matrix. To update the distance matrix:
min ((P3,P6), P1) = min ((P3,P1), (P6,P1)) = min (0.22,0.24) = 0.22
min ((P3,P6), P2) = min ((P3,P2), (P6,P2)) = min (0.14,0.24) = 0.14
min ((P3,P6), P4) = min ((P3,P4), (P6,P4)) = min (0.13,0.22) = 0.13
min ((P3,P6), P5) = min ((P3,P5), (P6,P5)) = min (0.28,0.39) = 0.28
Now we will repeat the same process. Merge two closest members of the two clusters and find the
minimum element in distance matrix. The minimum value is 0.13 and hence we combine P3, P6 and
P4. Now, form the clusters of elements corresponding to the minimum values and update the Distance
matrix. In order to find, what we have to update in distance matrix,
min (((P3,P6) P4), P1) = min (((P3,P6), P1), (P4,P1)) = min (0.22,0.37) = 0.22
min (((P3,P6), P4), P2) = min (((P3,P6), P2), (P4,P2)) = min (0.14,0.19) = 0.14
min (((P3,P6), P4), P5) = min (((P3,P6), P5), (P4,P5)) = min (0.28,0.23) = 0.23
Again repeating the same process: The minimum value is 0.14 and hence we combine P2 and P5.
Now, form cluster of elements corresponding to minimum value and update the distance matrix. To
update the distance matrix:
min ((P2,P5), P1) = min ((P2,P1), (P5,P1)) = min (0.23, 0.34) = 0.23
min ((P2,P5), (P3,P6,P4)) = min ((P2, (P3,P6,P4)), (P5, (P3,P6,P4))) = min (0.14, 0.23) = 0.14
Again repeating the same process: The minimum value is 0.14 and hence we combine P2,P5 and
P3,P6,P4. Now, form cluster of elements corresponding to minimum value and update the distance
matrix. To update the distance matrix:
min ((P2,P5,P3,P6,P4), P1) = min (((P2,P5), P1), ((P3,P6,P4), P1)) = min (0.23, 0.22) = 0.22
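The merges above can also be checked mechanically with SciPy's single-linkage routine, as in the sketch below; note that SciPy computes exact Euclidean distances, so near-tied merges may come out in a slightly different order than the rounded hand computation. The coordinates are the ones from the table.

# Single-linkage hierarchical clustering of P1..P6 with SciPy
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

pts = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
                [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])
Z = linkage(pts, method="single", metric="euclidean")
print(Z)          # each row: the two clusters merged and their distance
# dendrogram(Z)   # draws the tree when matplotlib is available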
A latent variable model is a statistical model that relates a set of observable variables (also called
manifest variables or indicators) to a set of latent variables.
                          Manifest variables
Latent variables          Continuous                  Categorical
Continuous                Factor analysis             Item response theory
Categorical               Latent profile analysis     Latent class analysis
In machine learning, mixture models are a class of latent variable models that are used to represent
complex distributions by combining simpler component distributions. Latent variable models involve
unobserved variables (latent variables) that are used to capture hidden patterns or structure in the data.
Let's consider an example of a mixture of Gaussian distributions, which is one of the most commonly
used types of mixture models. In this case, the observed data is assumed to come from a combination
of several Gaussian distributions.
Model Representation: Latent Variables: We introduce a set of latent variables, often called "mixture
indicators" or "cluster assignments," denoted as z. Each latent variable z corresponds to a specific
component of the mixture.
Parameters: We have a set of parameters for the mixture model, including the mixing proportions π
and the parameters (mean and covariance) of each Gaussian component.
Data Generation: Sample Cluster: For each data point, we first sample a latent variable z from a
categorical distribution according to the mixing proportions π. This determines the component from
which the data point will be generated.
Generate Data: Given the selected component, we sample the data point x from the corresponding
Gaussian distribution.
Model Inference: Given observed data points x, the goal is to infer the latent variables z and the
model parameters.
Inference can be done using various techniques such as Expectation-Maximization (EM) algorithm,
variational inference, or Markov chain Monte Carlo (MCMC) methods.
Model Learning: The model parameters, including the mixing proportions π and the Gaussian
parameters, are learned from the observed data using the chosen inference algorithm.
The learning process involves iteratively updating the model parameters until convergence, maximizing
the likelihood or posterior probability of the observed data.
Model Utilization: Once the model is learned, it can be used for various tasks such as clustering, density
estimation, and anomaly detection.
To calculate the distance between two points in clustering techniques, there are several metrics,
as follows:
Distance Metrics
• Euclidean Distance
• Manhattan Distance
• Minkowski Distance
• Hamming Distance
Euclidean Distance
Euclidean Distance represents the shortest distance between two vectors. It is the square root of the sum
of squares of differences between corresponding elements.
Manhattan Distance
Manhattan distance between two points in two dimensions is the sum of absolute differences of their
Cartesian coordinates. Manhattan distance is also called with different names such as rectilinear
distance, L1 distance, L1 norm, snake distance, city block distance, etc.
Minkowski Distance
Minkowski distance can be considered as a generalized form of both the Euclidean distance and the
Manhattan distance.
The Minkowski distance of order p (where p is an integer) between two points X = (x1, x2, ..., xn) and
Y = (y1, y2, ..., yn) is given by:
D(X, Y) = ( Σ_{i=1..n} |xi − yi|^p )^(1/p)
Hamming Distance
It is named after Richard Hamming. The hamming distance between two strings of equal length is the
number of positions at which the corresponding symbols are different. The strings can be letters, bits, or
decimal digits, etc.
Cosine Similarity
Given two vectors A and B, the cosine similarity, cos(θ), is represented using a dot product and
magnitude as below:
cos(θ) = (A · B) / (||A|| ||B||)
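The snippet below computes each of these metrics with NumPy on small made-up inputs; the vectors and strings are only examples.

# Distance metrics and cosine similarity on toy inputs
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(((a - b) ** 2).sum())           # L2 distance
manhattan = np.abs(a - b).sum()                     # L1 / city-block distance
p = 3
minkowski = (np.abs(a - b) ** p).sum() ** (1 / p)   # order-p Minkowski
hamming = sum(c1 != c2 for c1, c2 in zip("karolin", "kathrin"))  # equals 3
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(euclidean, manhattan, minkowski, hamming, cosine)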
UNIT-IV
Histogram Estimator
It is the oldest and the most popular method used to estimate the density, where the input space is divided
into equal-sized intervals called bins. Given the training set X = {x^t}, t = 1, ..., N, an origin x0
and the bin width h, the histogram density estimator function is:
p̂(x) = #{x^t in the same bin as x} / (N h)
The density of a sample is dependent on the number of training samples present in that bin. In
constructing the histogram of densities we choose the origin and the bin width, the position of origin
affects the estimation near the boundaries.
Kernel Density Estimator (KDE)
The kernel estimator is used to smoothen the probability density function (pdf) and cumulative
distribution function (CDF) graphics. The kernel is nothing but a weight; the Gaussian kernel is the
most popular kernel:
K(u) = (1 / √(2π)) exp(−u² / 2)
and the resulting density estimate is p̂(x) = (1 / (N h)) Σ_t K((x − x^t) / h).
As you can observe, as |x − x^t| increases, i.e. as a training sample gets farther from the given
point, its kernel value decreases. Hence we can say that the contribution of a farther sample is
smaller when compared to the nearest training samples. There are many more kernels: Gaussian,
Rectangular, Triangular, Biweight, Uniform, Cosine, etc.
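A minimal kernel density estimation sketch with SciPy's Gaussian-kernel estimator follows; the bimodal sample and the evaluation grid are synthetic.

# Gaussian kernel density estimation (synthetic sample)
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(2, 1.0, 200)])

kde = gaussian_kde(sample)        # bandwidth h picked by Scott's rule
grid = np.linspace(-5, 5, 11)
print(kde(grid))                  # estimated density at the grid points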
KNN Estimator
Let us have an example data sample and estimate the density at a point using nonparametric density
estimation functions.
Note: Points marked with 'x' are the given data samples. Unlike the above estimation methods, we
do not fix the bin size/width; instead, this density estimation method is based on the k value. We
observe sharp, high density values when k is small, and a smoother, lower estimate as the value of
k increases.
Nonparametric classification:
Non-parametric classification is a type of machine learning algorithm that does not make explicit
assumptions about the functional form or distribution of the underlying data. Instead of estimating
parameters, non-parametric methods directly learn the patterns and relationships from the data. These
methods are particularly useful when the underlying data distribution is complex or unknown.
k-Nearest Neighbors (k-NN): This algorithm classifies new data points based on the class labels of their
nearest neighbors in the training set. The value of k determines the number of neighbors considered.
Decision Trees: Decision trees recursively split the feature space based on different features to form a
hierarchical structure. Each internal node represents a feature, and each leaf node represents a class label.
Random Forests: Random forests are an ensemble method that combines multiple decision trees. Each
tree is trained on a random subset of the data and features, and the final prediction is obtained by
aggregating the predictions of individual trees.
Support Vector Machines (SVM): SVMs map the data into a higher-dimensional space and find the
optimal hyperplane that maximally separates the classes. The decision boundary is determined by a
subset of the training samples called support vectors.
Neural Networks: While neural networks are often considered parametric, certain architectures such as
deep neural networks with a large number of layers and parameters can be considered non-parametric
due to their ability to learn complex functions without explicit assumptions.
Naive Bayes: Naive Bayes classifiers are probabilistic models that assume independence between
features given the class. Although they have certain parametric assumptions, they are often considered
non-parametric in practice due to their simplicity and effectiveness.
Non-parametric classification algorithms are generally flexible and can capture complex patterns in the
data. However, they may be computationally intensive and require more training data compared to
parametric methods. It is important to note that the choice of algorithm depends on the specific problem
and the characteristics of the dataset.
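As a small sketch of the first of these methods, the snippet below classifies with the 3-NN rule using scikit-learn; the dataset, split and k are illustrative.

# k-nearest neighbours classification (k = 3)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(knn.score(X_test, y_test))   # accuracy of the 3-NN rule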
Training Set Consistency: A set is said to be training set consistent if on running KNN on the dataset
with the classifiers as the points in store, we get the same classification as when KNN was run on the
entire dataset.
i.e.
3-NN with S as the dataset classifies the dark blue point as blue. However, according to the complete
dataset, that point must be red, because out of its 3 nearest neighbors, 2 are red, i.e.
3-NN on the complete dataset classifies the dark blue point as red.
Step 4: We select a random point from the dataset to add in store such that the inconsistency with the
classification of dark blue point is solved keeping the prediction of the dataset as gold standard
i.e. select a random point xᵢ from the dataset such that on adding xᵢ we have,
We add the red square (outlined with blue dotted square) in store S
DIMENSIONALITY REDUCTION:
Dimensionality reduction technique can be defined as, "It is a way of converting the higher dimensions
dataset into lesser dimensions dataset ensuring that it provides similar information." These techniques
are widely used in machine learning for obtaining a better fit predictive model while solving the
classification and regression problems.
It is commonly used in fields that deal with high-dimensional data, such as speech recognition, signal
processing, bioinformatics, etc. It can also be used for data visualization, noise reduction, cluster
analysis, etc.
A feature is an attribute that has an impact on a problem or is useful for the problem, and
choosing the important features for the model is known as feature selection.
Supervised Feature selection techniques consider the target variable and can be used for the labelled
dataset.
Unsupervised Feature selection techniques ignore the target variable and can be used for the unlabelled
dataset.
As most data scientists know, dimensionality is a curse; it is not only the number of dimensions but
also the quality of each feature in a specific dimension that matters. Dimensionality reduction is a
set of techniques that try to transform the input space into a space with fewer dimensions while
keeping the meaning and value of the features. Below, we walk through a greedy algorithm (greedy in
the sense that it does not guarantee finding the optimal answer) that generates a selection of
features: subset selection.
Dimensionality reduction algorithms can be classified as feature selection methods or feature extraction
methods. Feature selection methods are interested in reducing the number of initial features to the ones
that give us the most information. On the other hand, feature extraction methods are interested in
finding a new set of features, different from the initial ones, and with fewer dimensions.
Subset selection
Subset selection is a feature selection algorithm that can vary between forward selection and
backward selection. Both methods consist of finding a subset of the initial features that contains the least
number of dimensions that most contribute to accuracy. A naive approach would be to try all the 2^n
possible subset combinations but if the number of dimensions is too big it would take forever. Instead,
based on a heuristic function (error function) we add or remove features. The performance of subset
selection depends highly on the model we choose and our pruning selection algorithm.
Forward selection
In forward selection we start with an empty set of features, for each feature that is not in the set we train
the model with it and test its performance; we then select the feature with the least amount of error. We
continue adding new features for the model to train until the error is low enough or until we have
reached the desired number of features.
Backward selection
Backward selection works in the same way as forward selection, but instead of starting with an empty set and
adding features one by one, we start with a full set and remove features one by one. Thus we remove the
features that cause the most error.
PRINCIPAL COMPONENT ANALYSIS (PCA):
PCA generally tries to find a lower-dimensional surface onto which to project the high-dimensional data.
PCA works by considering the variance of each attribute, because high variance indicates a good split
between the classes, and hence it reduces the dimensionality. Some real-world applications of PCA are
image processing, movie recommendation system, optimizing the power allocation in various
communication channels. It is a feature extraction technique, so it contains the important variables and
drops the least important variable.
o Dimensionality: It is the number of features or variables present in the given dataset. More
easily, it is the number of columns present in the dataset.
o Correlation: It signifies that how strongly two variables are related to each other. Such as if one
changes, the other variable also gets changed. The correlation value ranges from -1 to +1. Here,
-1 occurs if variables are inversely proportional to each other, and +1 indicates that variables are
directly proportional to each other.
o Orthogonal: It defines that variables are not correlated to each other, and hence the correlation
between the pair of variables is zero.
o Eigenvectors: If there is a square matrix M and a non-zero vector v, then v is an eigenvector of
M if Mv is a scalar multiple of v.
o Covariance Matrix: A matrix containing the covariance between the pair of variables is called
the Covariance Matrix.
As described above, the transformed new features, or the output of PCA, are the Principal Components.
The number of these PCs is either equal to or less than the number of original features present in the
dataset. Some properties of these principal components are given below:
o The principal component must be the linear combination of the original features.
o These components are orthogonal, i.e., the correlation between a pair of variables is zero.
o The importance of each component decreases when going from 1 to n; the 1st PC has the
most importance, and the nth PC has the least importance.
2. Representing data into a structure
Now we will represent our dataset in a structure, such as a two-dimensional matrix of the
independent variables X. Here each row corresponds to a data item, and each column
corresponds to a feature. The number of columns gives the dimensionality of the dataset.
3. Standardizing the data
In this step, we will standardize our dataset. For example, in a particular column, the features
with high variance are considered more important than the features with lower variance. If
the importance of features is to be independent of the variance of the feature, we divide each
data item in a column by the standard deviation of the column. Here we will name the matrix Z.
4. Calculating the covariance of Z
We compute the covariance matrix of the standardized matrix Z.
5. Calculating the eigenvalues and eigenvectors
We compute the eigenvalues and eigenvectors of this covariance matrix; the eigenvectors are
the directions of the axes carrying the most information.
6. Sorting the eigenvectors
In this step, we take all the eigenvalues and sort them in decreasing order, which means
from largest to smallest, and simultaneously sort the eigenvectors accordingly in the matrix P
of eigenvalues. The resultant matrix will be named P*.
7. Calculating the new features or principal components.
Here we will calculate the new features. To do this, we will multiply the P* matrix to the Z. In
the resultant matrix Z*, each observation is the linear combination of original features. Each
column of the Z* matrix is independent of each other.
8. Remove less important features from the new dataset
The new feature set has been obtained, so we decide here what to keep and what to remove. We
keep only the relevant or important features in the new dataset, and the unimportant features
are removed.
o It can also be used for finding hidden patterns if data has high dimensions. Some fields where
PCA is used are Finance, data mining, Psychology, etc.
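The steps above map almost line-for-line onto NumPy, as the sketch below shows on synthetic correlated data; the matrix names Z, P* and Z* follow the steps, and the choice to keep two components is arbitrary.

# PCA by standardization, eigen-decomposition, sorting and projection
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # correlated features

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # step 3: standardize
cov = np.cov(Z, rowvar=False)              # step 4: covariance matrix of Z
eigvals, eigvecs = np.linalg.eigh(cov)     # step 5: eigenvalues/eigenvectors
order = np.argsort(eigvals)[::-1]          # step 6: sort largest-first
P_star = eigvecs[:, order]

Z_star = Z @ P_star                        # step 7: the principal components
Z_reduced = Z_star[:, :2]                  # step 8: keep the top 2 PCs
print(eigvals[order])
print(Z_reduced.shape)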
MULTIDIMENSIONAL SCALING:
Multidimensional scaling is the graphical or visual representation of the datasets in the form of a distance
or dissimilarities matrix between sets of objects. Here object term refers to anything, for example,
jackets, perfumes, cars, bikes, etc. With the help of multidimensional scaling, we can calculate the
similarity between the objects.
With the distance or dissimilarity values, we can construct a representation of which objects are
similar to each other. The closer the distance or the smaller the dissimilarity between the objects,
the more similar they are; the bigger the distance, the less similar the objects are.
The word dimension here refers to the attribute of a dataset. If there are two attributes in a dataset or
matrix, then we will take a two-dimensional representation of the data, but this cannot be the case in
every dataset.
You might use multiple dimensions to represent the multiple attributes, but this can make our outcome
complex to represent visually, and we will need help comprehending it.
It is best to use at most three dimensions because our brain cannot visually process more than
that. Mathematically, however, we can work with any number of dimensions.
The term scaling represents the measurement of the object. It is like a scale of two numbers in which
one is higher and the other is lower that we can use to measure the preference or perception of the object
for a user.
For example, a scale from 1 to 5 represents a person's liking of street food.
Techniques of Multidimensional Scaling
There are multiple techniques available in multidimensional scaling that you can use. Their techniques
depend on the input data you use for multidimensional scaling.
Metric Multidimensional scaling
Metric Multidimensional Scaling can be considered a technique for visualizing data: you input a distance
matrix with the distances between a set number of data points, and the technique produces a graph
displaying those observations.
Example
We have a matrix of distances between different cities. For simplicity, let's name the cities A to E. The distances are in km.
City   A     B     C     D     E
A      0     222   240   131   190
B      222   0     230   97    89
C      240   230   0     306   311
D      131   97    306   0     55
E      190   89    311   55    0
From the matrix, we can observe the distance from one city to another: from A to B it is 222 km, from A to C it is 240 km, and so on. The 0 values on the diagonal are the distances from each city to itself.
Plotting a 2-D configuration from the given matrix and adding the directions north, south, east, and west, we can easily read it as a map.
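As a rough sketch, we can feed this distance matrix to scikit-learn's MDS class (assuming scikit-learn is available) and recover a 2-D configuration of the cities; the orientation of the result is arbitrary:

import numpy as np
from sklearn.manifold import MDS

# Distance matrix between cities A-E (km), taken from the table above
D = np.array([[  0, 222, 240, 131, 190],
              [222,   0, 230,  97,  89],
              [240, 230,   0, 306, 311],
              [131,  97, 306,   0,  55],
              [190,  89, 311,  55,   0]])

# dissimilarity='precomputed' tells MDS the input is already a distance matrix
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
coords = mds.fit_transform(D)    # 2-D coordinates, one row per city
for city, (x, y) in zip("ABCDE", coords):
    print(city, round(x, 1), round(y, 1))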
Euclidean distance
The Euclidean distance measures the distance between two vectors with real values. When computing
the distance between two rows of data with numerical values, such as a floating point or integer value,
you are most likely to use the Euclidean distance.
The closer the Euclidean distance between two objects on the graph, the more similar the objects are.
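As a formula, d(x, y) = √(Σᵢ (xᵢ − yᵢ)²). A quick check with NumPy (the two vectors below are made up):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])
d = np.linalg.norm(x - y)    # sqrt((1-4)^2 + (2-6)^2 + (3-3)^2) = 5.0
print(d)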
LINEAR DISCRIMINANT ANALYSIS:
Whenever there is a requirement to separate two or more classes having multiple features efficiently, the Linear Discriminant Analysis model is considered the most common technique to solve such classification problems. For example, if we have two classes with multiple features and need to separate them efficiently, classifying them using a single feature may show overlapping. To overcome the overlapping issue in the classification process, we would have to keep increasing the number of features.
Example:
Let's assume we have to classify two different classes having two sets of data points in a 2-dimensional plane, as shown in the image below:
It may be impossible to draw a straight line in the 2-D plane that separates these data points efficiently, but using Linear Discriminant Analysis we can reduce the 2-D plane to a 1-D plane. This technique also lets us maximize the separability between multiple classes.
Linear Discriminant Analysis is used as a dimensionality reduction technique in machine learning, with which we can easily transform a 2-D or 3-D graph into a 1-dimensional plane.
Let's consider an example where we have two classes in a 2-D plane with an X-Y axis, and we need to classify them efficiently. As seen in the above example, LDA enables us to draw a straight line that can completely separate the two classes of data points. Here, LDA uses the X-Y axis to create a new axis, separating the classes with a straight line and projecting the data onto the new axis.
Hence, we can maximize the separation between these classes and reduce the 2-D plane into 1-D.
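A minimal sketch of this reduction with scikit-learn's LinearDiscriminantAnalysis, projecting two-class 2-D data onto a single discriminant axis (the data here is synthetic and only for illustration):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic two-class data in a 2-D plane
rng = np.random.default_rng(0)
class0 = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
class1 = rng.normal(loc=[2, 2], scale=0.5, size=(50, 2))
X = np.vstack([class0, class1])
y = np.array([0] * 50 + [1] * 50)

# With two classes, LDA can produce at most one discriminant axis
lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)    # 2-D points projected onto a 1-D axis
print(X_1d.shape)                 # (100, 1)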
FACTOR ANALYSIS:
Factor analysis is a special technique for reducing a huge number of variables into a few factors; this reduction is known as factoring the data. It is a purely statistical approach that is also used to describe fluctuations among the observed, correlated variables in terms of a potentially lower number of unobserved variables called factors.
The factor analysis technique extracts the maximum common variance from all the variables and puts
them into a common score. It is a theory that is used in training the machine learning model and so it is
quite related to data mining. The belief behind factor analytic techniques is that the information gained
about the interdependencies between observed variables can be used later to reduce the set of variables
in a dataset.
Factor analysis is a very effective tool for inspecting variable relationships for complex concepts such as social status, economic status, dietary patterns, psychological scales, biology, psychometrics, personality theories, marketing, product management, operations research, finance, etc. It can help a researcher investigate concepts that are not easily measured directly, in a much easier and quicker way, by collapsing a large number of variables into a few easily interpretable fundamental factors.
Factor analysis in machine learning is used to reduce the number of variables in a given dataset and obtain a more accurate, enhanced collection of underlying factors. Several machine-learning algorithms work in this manner and are trained with large amounts of data in order to open the way to new applications.
Factor analysis is an unsupervised machine learning approach that is commonly used for dimensionality reduction. As a result, machine learning and factor analysis can be used together to create data-mining approaches and make data analysis much more efficient. Some advantages of factor analysis are given below.
Cost-Effective
Data research and data mining algorithms are extremely expensive, but the statistical model of factor analysis is available at a surprisingly affordable cost. Moreover, you don't need many resources to perform factor analysis, and it can be performed by experienced professionals as well as beginners.
Measurable
One of the major benefits of factor analysis is its measurable nature. This statistical model can work with various attributes: whether subjective or objective, it works well with everything.
Flexible
Several machine learning algorithms are limited to a single approach, but factor analysis is an exception and offers a lot of flexibility. The flexible approach of this statistical model helps determine the connections between different variables and their underlying factors.
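A brief sketch of factor analysis as a dimensionality reduction step, using scikit-learn's FactorAnalysis; the dataset and the choice of two factors are illustrative assumptions:

import numpy as np
from sklearn.decomposition import FactorAnalysis

# Illustrative data: 100 observations of 6 correlated variables
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))     # 2 hidden factors
loadings = rng.normal(size=(2, 6))     # how the factors drive the variables
X = latent @ loadings + 0.1 * rng.normal(size=(100, 6))

# Reduce the 6 observed variables to 2 underlying factors
fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X)     # factor scores per observation
print(scores.shape)              # (100, 2)
print(fa.components_.shape)      # (2, 6) factor loadings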
UNIT-V
INTRODUCTION:
Reinforcement learning is an area of Machine Learning. It is about taking suitable action to maximize
reward in a particular situation. It is employed by various software and machines to find the best possible
behavior or path to take in a specific situation. Reinforcement learning differs from supervised learning: in supervised learning, the training data comes with the answer key, so the model is trained with the correct answers themselves, whereas in reinforcement learning there is no answer and the reinforcement agent decides what to do to perform the given task. In the absence of a training dataset, it is bound to learn from its own experience.
Reinforcement Learning (RL) is the science of decision making. It is about learning the optimal behavior
in an environment to obtain maximum reward. In RL, the data is accumulated from machine learning
systems that use a trial-and-error method. Data is not part of the input that we would find in supervised
or unsupervised machine learning.
The agent's task is to learn a control policy, 𝝅: S → A, that maximizes the expected sum of these rewards,
with future rewards discounted exponentially by their delay.
Terms used in Reinforcement Learning
o Agent(): An entity that can perceive/explore the environment and act upon it.
o Environment(): A situation in which an agent is present or surrounded by. In RL, we assume
the stochastic environment, which means it is random in nature.
o Action(): Actions are the moves taken by an agent within the environment.
o State(): State is a situation returned by the environment after each action taken by the agent.
o Reward(): A feedback returned to the agent from the environment to evaluate the action of the
agent.
o Policy(): Policy is a strategy applied by the agent for the next action based on the current state.
o Value(): The expected long-term return with the discount factor, as opposed to the short-term reward.
o Q-value(): It is mostly similar to the value, but it takes one additional parameter as a current
action (a).
Reinforcement learning differs from other function approximation tasks in several important ways:
1. Delayed reward: The task of the agent is to learn a target function 𝜋 that maps from the current state s to the optimal action a = 𝜋(s). In reinforcement learning, however, training examples of the form (s, 𝜋(s)) are not available. Instead, the trainer provides only a sequence of immediate reward values as the agent executes its sequence of actions. The agent therefore faces the problem of temporal credit assignment: determining which of the actions in its sequence are to be credited with producing the eventual rewards.
2. Exploration: In reinforcement learning, the agent influences the distribution of training
examples by the action sequence it chooses. This raises the question of which experimentation
strategy produces most effective learning. The learner faces a trade-off in choosing whether to
favor exploration of unknown states and actions, or exploitation of states and actions that it has
already learned will yield high reward.
3. Partially observable states: Although it is convenient to assume that the agent's sensors can perceive the entire state of the environment at each time step, in many practical situations sensors provide only partial information. In such cases, the agent needs to consider its previous observations together with its current sensor data when choosing actions, and the best policy may be one that chooses actions specifically to improve the observability of the environment.
4. Life-long learning: A robot often needs to learn several related tasks within the same environment, using the same sensors. For example, a mobile robot may need to learn how to dock on its battery
charger, how to navigate through narrow corridors, and how to pick up output from laser printers.
This setting raises the possibility of using previously obtained experience or knowledge to reduce
sample complexity when learning new tasks.
There are four main elements of Reinforcement Learning, which are given below:
1. Policy
2. Reward Signal
3. Value Function
4. Model of the environment
1) Policy: A policy defines the way an agent behaves at a given time. It maps the perceived states of the environment to the actions to be taken in those states. The policy is the core element of RL, as it alone can define the behavior of the agent. In some cases it may be a simple function or a lookup table, whereas in other cases it may involve general computation such as a search process. A policy can be deterministic or stochastic:
2) Reward Signal: The goal of reinforcement learning is defined by the reward signal. At each state,
the environment sends an immediate signal to the learning agent, and this signal is known as a reward
signal. These rewards are given according to the good and bad actions taken by the agent. The agent's
main objective is to maximize the total number of rewards for good actions. The reward signal can
change the policy, such as if an action selected by the agent leads to low reward, then the policy may
change to select other actions in the future.
3) Value Function: The value function gives information about how good the situation and action are
and how much reward an agent can expect. A reward indicates the immediate signal for each good and
bad action, whereas a value function specifies the good state and action for the future. The value
function depends on the reward as, without reward, there could be no value. The goal of estimating
values is to achieve more rewards.
4) Model: The last element of reinforcement learning is the model, which mimics the behavior of the
environment. With the help of the model, one can make inferences about how the environment will
behave. Such as, if a state and an action are given, then a model can predict the next state and reward.
The model is used for planning, which means it provides a way to take a course of action by considering
all future situations before actually experiencing those situations. The approaches for solving the RL
problems with the help of the model are termed as the model-based approach. Comparatively, an
approach without using a model is called a model-free approach.
How does Reinforcement Learning Work?
To understand the working process of RL, we need to consider two main things: the environment in which the agent acts, and the agent itself.
The K-armed bandit (also known as the multi-armed bandit problem) is a simple yet powerful example of allocating a limited set of resources over time and under uncertainty. It was initially studied by Thompson (1933), who suggested a heuristic for navigating the exploration-exploitation dilemma. The problem has also been studied in computer science, operations research, probability theory, and economics, and is well suited for exploration with the tools of reinforcement learning.
In its basic form, the problem considers a gambler standing in front of a row of K slot machines (also known as one-armed bandits) and trying to conceive a strategy for which machine to play, how many times to play it, and when to switch machines in order to increase the chances of making a profit.
What makes this premise interesting is that each of the bandits dispenses rewards according to a
probability distribution, which is specific to the bandit and is initially unknown to the gambler.
The optimal strategy, therefore, would involve striking a balance between learning more about the
individual probability distributions (exploration) and maximising the profits based on the information
acquired so far (exploitation).
Formalizing the K-armed Bandit Problem
Now let's formalise the K-armed bandit problem, so we can use it to introduce some of the tools and techniques used in reinforcement learning. Let's say we are playing K bandits and each game consists of T turns. Let A be the set of all possible actions in the game. As there are K arms to select from, it is clear that |A| = K. We will also use aₜ to denote the action from A taken at time t. Note that we are using the term time in a discrete sense and interchangeably with turn.
We start by exploring a variant of the problem where each bandit i dispenses rewards according to an assigned Bernoulli distribution with parameter θᵢ ∈ [0, 1]. In other words, the reward of each arm is in {0, 1}, and is given by the probability mass function p(r = 1) = θᵢ and p(r = 0) = 1 − θᵢ.
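Under this Bernoulli-reward assumption, an ε-greedy agent for the K-armed bandit can be sketched in a few lines; the arm probabilities in theta are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.3, 0.5, 0.8])    # true (unknown) success probability of each arm
K, T, eps = len(theta), 5000, 0.1

Q = np.zeros(K)    # running estimate of each arm's mean reward
N = np.zeros(K)    # number of times each arm was pulled

for t in range(T):
    # Explore with probability eps, otherwise exploit the best current estimate
    a = rng.integers(K) if rng.random() < eps else int(np.argmax(Q))
    r = float(rng.random() < theta[a])    # Bernoulli reward in {0, 1}
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]             # incremental mean update

print("estimates:", np.round(Q, 2), "pulls:", N)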
Model-Based Learning:
Model-based Reinforcement Learning refers to learning optimal behavior indirectly by learning a model
of the environment by taking actions and observing the outcomes that include the next state and the
immediate reward. The models predict the outcomes of actions and are used in lieu of or in addition to
interaction with the environment to learn optimal policies.
Model: Anything the agent can use to predict how the environment will respond to its actions; concretely, the state-transition function T(s′|s, a) and the reward function R(s, a).
Model-based learning algorithms: Model-based learning (also known as structure-based or eager learning) takes a different approach, constructing models from the training data that can generalize better than instance-based methods. This involves algorithms like linear regression, logistic regression, random forest, etc.
RL algorithms can be mainly divided into two categories – model-based and model-free.
Model-based, as it sounds, has an agent trying to understand its environment and creating a model for
it based on its interactions with this environment. In such a system, preferences take priority over the
consequences of the actions i.e. the greedy agent will always try to perform an action that will get the
maximum reward irrespective of what that action may cause.
On the other hand, model-free algorithms seek to learn the consequences of their actions through
experience via algorithms such as Policy Gradient, Q-Learning, etc. In other words, such an algorithm
will carry out an action multiple times and will adjust the policy (the strategy behind its actions) for
optimal rewards, based on the outcomes.
We can formulate a reinforcement learning problem via a Markov Decision Process (MDP). The
essential elements of such a problem are the environment, state, reward, policy, and value.
A policy is a mapping from states to actions. Finding an optimal policy leads to generating the maximum
reward. Given an MDP environment, we can use dynamic programming algorithms to compute optimal
policies, which lead to the highest possible sum of future rewards at each state.
Dynamic programming algorithms work on the assumption that we have a perfect model of the environment's MDP. So, we're able to use a one-step look-ahead approach and compute rewards for all possible actions.
In this section, we'll discuss how to find an optimal policy for a given MDP. More specifically, we'll learn about two dynamic programming algorithms: value iteration and policy iteration. Then, we'll discuss these algorithms' advantages and disadvantages over each other.
Policy Iteration
In policy iteration, we start with an arbitrary initial policy π and evaluate it by computing its state-value function V(s). Then, we calculate the improved policy by using one-step look-ahead to replace the initial policy:
π′(s) = argmaxₐ Σₛ′ P(s′|s, a) [R(s, a, s′) + γ V(s′)]
Here, R(s, a, s′) is the reward generated by taking action a, γ is a discount factor for future rewards, and P(s′|s, a) is the transition probability.
Value Iteration
Policy iteration and value iteration are both dynamic programming algorithms that find an optimal policy in a reinforcement learning environment. They both employ variations of Bellman updates and exploit one-step look-ahead; value iteration applies the Bellman optimality update directly to the value function:
V(s) ← maxₐ Σₛ′ P(s′|s, a) [R(s, a, s′) + γ V(s′)]
The policy iteration algorithm updates the policy. The value iteration algorithm iterates over the value
function instead. Still, both algorithms implicitly update the policy and state value function in each
iteration.
In each iteration, the policy iteration function goes through two phases. One phase evaluates the
policy, and the other one improves it. The value iteration function covers these two phases by taking a
maximum over the utility function for all possible actions.
The value iteration algorithm is straightforward: it combines the two phases of policy iteration into a single update operation. However, the value iteration function runs through all possible actions at once to find the maximum action value, so each sweep of value iteration is computationally heavier.
Both algorithms are guaranteed to converge to an optimal policy in the end. Yet, the policy iteration
algorithm converges within fewer iterations. As a result, the policy iteration is reported to conclude faster
than the value iteration algorithm.
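A compact sketch of value iteration on a small random MDP; the transition tensor P[s, a, s'] and reward table R[s, a] below are generated purely for illustration:

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9

# Made-up MDP: P[s, a, s'] are transition probabilities, R[s, a] are rewards
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)    # normalize rows into distributions
R = rng.random((n_states, n_actions))

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality update: max over actions of the one-step look-ahead
    Q = R + gamma * P @ V                  # Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:   # stop when values have converged
        break
    V = V_new

policy = Q.argmax(axis=1)    # greedy policy w.r.t. the converged values
print("V*:", np.round(V, 3), "policy:", policy)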
TEMPORAL DIFFERENCE LEARNING
Temporal difference (TD) learning refers to a class of model-free reinforcement learning methods which
learn by bootstrapping from the current estimate of the value function. These methods sample from the
environment, like Monte Carlo methods, and perform updates based on current estimates, like dynamic
programming methods.
Temporal Difference Learning is an unsupervised learning technique that is very commonly used in reinforcement learning for the purpose of predicting the total reward expected over the future. It can, however, be used to predict other quantities as well. It is essentially a way to learn how to predict a quantity that depends on the future values of a given signal, and a method for computing the long-term utility of a pattern of behaviour from a series of intermediate rewards.
Essentially, Temporal Difference Learning (TD Learning) focuses on predicting a variable's future value
in a sequence of states. Temporal difference learning was a major breakthrough in solving the problem
of reward prediction. You could say that it employs a mathematical trick that allows it to replace
complicated reasoning with a simple learning procedure that can be used to generate the very same
results.
The trick is that rather than attempting to calculate the total future reward, temporal difference learning
just attempts to predict the combination of immediate reward and its own reward prediction at the next
moment in time. Now when the next moment comes and brings fresh information with it, the new
prediction is compared with the expected prediction. If these two predictions are different from each
other, the Temporal Difference Learning algorithm will calculate how different the predictions are
from each other and make use of this temporal difference to adjust the old prediction toward the new
prediction.
The temporal difference algorithm always aims to bring the expected prediction and the new prediction
together, thus matching expectations with reality and gradually increasing the accuracy of the entire
chain of prediction.
Temporal Difference Learning aims to predict a combination of the immediate reward and its own
reward prediction at the next moment in time.
In TD Learning, the training signal for a prediction is a future prediction. This method is a
combination of the Monte Carlo (MC) method and the Dynamic Programming (DP) method. Monte Carlo methods adjust their estimates only after the final outcome is known, but temporal difference methods adjust predictions to match later, more accurate predictions for the future, well before the final outcome is known. This is essentially a type of bootstrapping.
Temporal difference learning in machine learning got its name from the way it uses changes, or
differences, in predictions over successive time steps for the purpose of driving the learning process.
The prediction at any particular time step gets updated to bring it nearer to the prediction of the same
quantity at the next time step.
The main temporal difference algorithms are:
1. TD(1) Algorithm
2. TD(0) Algorithm
3. TD(λ) Algorithm
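The core TD(0) update described above can be sketched as follows; the states, reward, and learning rate below are placeholders:

# TD(0) update for one observed transition (s, r, s_next).
# V is a dict mapping states to value estimates; alpha is the learning rate.
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    td_target = r + gamma * V[s_next]    # bootstrapped estimate of the return
    td_error = td_target - V[s]          # the "temporal difference"
    V[s] += alpha * td_error             # move the old prediction toward the new one
    return td_error

V = {"A": 0.0, "B": 0.0}
td0_update(V, "A", 1.0, "B")    # observed: from A, got reward 1, landed in B
print(V)                        # {'A': 0.1, 'B': 0.0}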
Exploration Strategies:
Exploitation versus exploration is a critical topic in reinforcement learning. We'd like the RL agent to find the best solution as fast as possible. However, committing to solutions too quickly without enough exploration is risky, as it could lead to local minima or total failure.
Modern RL algorithms that optimize for the best returns can achieve good exploitation quite efficiently,
while exploration remains more like an open topic.
Several common exploration strategies in deep RL are discussed below. This is a very large topic, and the list by no means covers every important subtopic.
Epsilon-greedy: The agent does random exploration occasionally with probability ε and takes the optimal action most of the time with probability 1 − ε.
Upper confidence bounds: The agent selects the greediest action to maximize the upper confidence bound Q̂ₜ(a) + Ûₜ(a), where Q̂ₜ(a) is the average reward associated with action a up to time t and Ûₜ(a) is a function inversely proportional to how many times action a has been taken.
Boltzmann exploration: The agent draws actions from a Boltzmann distribution (softmax) over the learned Q values, regulated by a temperature parameter τ.
Thompson sampling: The agent keeps track of a belief over the probability of optimal actions and samples actions from this distribution.
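Two of the selection rules above, ε-greedy and Boltzmann exploration, can be sketched over a made-up vector of learned Q-values:

import numpy as np

rng = np.random.default_rng(0)
Q = np.array([1.0, 2.0, 0.5])    # learned Q-values for three actions

def epsilon_greedy(Q, eps=0.1):
    # random action with probability eps, greedy action otherwise
    return rng.integers(len(Q)) if rng.random() < eps else int(np.argmax(Q))

def boltzmann(Q, tau=0.5):
    # softmax over Q/tau: lower tau -> greedier, higher tau -> more uniform
    p = np.exp(Q / tau - np.max(Q / tau))    # subtract max for numerical stability
    p /= p.sum()
    return rng.choice(len(Q), p=p)

print(epsilon_greedy(Q), boltzmann(Q))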
The following strategies could be used for better exploration in deep RL training when neural
networks are used for function approximation:
Entropy loss term: Add an entropy term into the loss function, encouraging the policy to take diverse
actions.
Noise-based Exploration: Add noise into the observation, action or even parameter space (Fortunato,
et al. 2017, Plappert, et al. 2017).
Key Exploration Problems
Good exploration becomes especially hard when the environment rarely provides rewards as feedback
or the environment has distracting noise. Many exploration strategies are proposed to solve one or both
of the following problems.
Montezuma's Revenge is a concrete example of the hard-exploration problem. It remains one of the few challenging Atari games for deep RL to solve, and many papers use it to benchmark their results.
• The environment responds by giving the agent a reward rt = r(st, at) and by producing the succeeding state st+1 = δ(st, at). Here the functions δ(st, at) and r(st, at) depend only on the current state and action, not on earlier states or actions.
The task of the agent is to learn a policy, 𝝅: S → A, for selecting its next action at based on the current observed state st; that is, 𝝅(st) = at.
How shall we specify precisely which policy π we would like the agent to learn?
1. One approach is to require the policy that produces the greatest possible cumulative reward for the
robot over time.
• To state this requirement more precisely, define the cumulative value Vπ(st) achieved by following an arbitrary policy π from an arbitrary initial state st as follows:
Vπ(st) ≡ rt + γ rt+1 + γ² rt+2 + ... = Σᵢ₌₀^∞ γⁱ rt+i
• Here the sequence of rewards rt+i is generated by beginning at state st and repeatedly using the policy π to select actions.
•Here 0 ≤ γ ≤ 1 is a constant that determines the relative value of delayed versus immediate rewards. if
we set γ = 0, only the immediate reward is considered. As we set γ closer to 1, future rewards are given
greater emphasis relative to the immediate reward.
• The quantity Vπ(st) is called the discounted cumulative reward achieved by policy π from initial state st. It is reasonable to discount future rewards relative to immediate rewards because, in many cases, we prefer to obtain the reward sooner rather than later. An alternative measure instead considers the average reward per time step over the entire lifetime of the agent.
We require that the agent learn a policy π that maximizes Vπ(st) for all states s. Such a policy is called an optimal policy and is denoted by π*.
We refer to the value function Vπ*(s) of an optimal policy as V*(s); V*(s) gives the maximum discounted cumulative reward that the agent can obtain starting from state s.
Example:
A simple grid-world environment is depicted in the diagram
• The six grid squares in this diagram represent six possible states, or locations, for the agent.
• Each arrow in the diagram represents a possible action the agent can take to move from one state to
another.
• The number associated with each arrow represents the immediate reward r(s, a) the agent receives if it
executes the corresponding state-action transition
• The immediate reward in this environment is defined to be zero for all state-action transitions except
for those leading into the state labelled G. The state G as the goal state, and the agent can receive reward
by entering this state.
Once the states, actions, and immediate rewards are defined, choose a value for the discount factor γ,
determine the optimal policy π * and its value function V*(s).
Let's choose γ = 0.9. The diagram at the bottom of the figure shows one optimal policy for this setting. The values of V*(s) and Q(s, a) follow from r(s, a) and the discount factor γ = 0.9. An optimal policy, corresponding to the actions with maximal Q values, is also shown.
The Q Function
The evaluation function Q(s, a) is the reward received immediately upon executing action a from state s, plus the value (discounted by γ) of following the optimal policy thereafter:
Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
As this equation makes clear, the agent need only consider each available action a in its current state s and choose the action that maximizes Q(s, a).
An Algorithm for Learning Q
• Learning the Q function corresponds to learning the optimal policy.
• The key problem is finding a reliable way to estimate training values for Q, given only a sequence of immediate rewards r spread out over time. This can be accomplished through iterative approximation. Noting that V*(s) = maxₐ′ Q(s, a′), we can rewrite the Q equation as
Q(s, a) = r(s, a) + γ maxₐ′ Q(δ(s, a), a′)
Q learning algorithm:
• Q learning algorithm assuming deterministic rewards and actions. The discount factor γ may be
any constant such that 0 ≤ γ < 1
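A minimal sketch of this Q learning algorithm on a toy deterministic chain environment; the states, actions, and reward structure below are made up, with state 4 playing the role of the goal G:

import numpy as np

# Toy deterministic chain: states 0..4, actions 0 = left, 1 = right.
# Reaching state 4 (the goal G) yields reward 100; every other move yields 0.
def step(s, a):
    s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    return s_next, (100 if s_next == 4 else 0)

gamma, alpha, episodes = 0.9, 1.0, 200    # alpha = 1 suffices in the deterministic case
Q = np.zeros((5, 2))
rng = np.random.default_rng(0)

for _ in range(episodes):
    s = 0
    while s != 4:
        a = rng.integers(2)    # explore with random actions
        s_next, r = step(s, a)
        # Q learning update: immediate reward plus discounted max over next actions
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(np.round(Q, 1))    # the greedy policy w.r.t. Q moves right toward G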
Deterministic policy
This part is dedicated to understanding what "deterministic" means in the context of a policy. A deterministic policy means that there is always exactly one action you can take in a certain situation; there are no other possibilities. Let's understand this through a learning example.
A policy defines how an agent acts from a specific state. For a deterministic policy, it is the action taken
at a specific state.
For a stochastic policy, it is the probability of taking an action a given the state s.
Rewards
The reward r(s, a) defines the reward collected by taking action a at state s. Our objective is to maximize the total rewards of a policy. A reward can be the added score in a game, successfully turning a doorknob, or winning a game.
You are a goalkeeper in a football team, and there is a penalty for the opposing team. Your coach tells you before the match that if the player taking the penalty shoots with his left foot, you should dive to the left, and if he shoots with his right foot, you should dive to the right. Assume you fully trust your coach and always follow his instructions.
This is an example of a deterministic policy, because in each of the two situations you could face there is exactly one action to take. Let's take each situation on its own.
In the first situation, if the player taking the penalty shoots with his left foot, you dive to the left; the number of actions you can choose from is one, with no possibility of any other action. The same goes for the second situation.
Whenever the number of actions to take in a certain situation is never more than one, we say that the policy containing the instructions for our actions is a deterministic policy.
Stochastic policy
A stochastic policy is the opposite of a deterministic policy. What differentiates a stochastic policy from a deterministic one is that under a stochastic policy it is possible to have more than one action to choose from in a certain situation. Again, let's understand this through a learning example.
Again, you are a goalkeeper in a football team and there is a penalty for the opposing team. However, this time the coach tells you that you can dive either left or right, and you decide to choose randomly whether to dive to the left or the right.
This is an example of a stochastic policy: in every penalty you face, there is more than one action to choose from ("you can dive either left or right"), and you choose among them randomly.
Whenever the number of actions to take in a certain situation is more than one, and the choice of action is based on randomness and probabilities, we say that the policy containing the instructions for our actions is a stochastic policy.
Nondeterministic Rewards and Actions
In a nondeterministic environment, the reward function r(s, a) and the transition function δ(s, a) have probabilistic rather than fixed outcomes. The Q equation then generalizes to an expectation over outcomes:
Q(s, a) ≡ E[r(s, a)] + γ Σₛ′ P(s′|s, a) maxₐ′ Q(s′, a′)
A classic example of such non-determinism is backgammon, where dice rolls make the outcome of each move probabilistic.
Eligibility traces
Eligibility traces are one of the basic mechanisms of reinforcement learning. For example, in the popular TD(λ) algorithm, the λ refers to the use of an eligibility trace. Almost any temporal-difference (TD) method, such as Q-learning or Sarsa, can be combined with eligibility traces to obtain a more general method that may learn more efficiently. Eligibility traces unify and generalize TD and Monte Carlo methods: when TD methods are augmented with eligibility traces, they produce a family of methods spanning a spectrum with Monte Carlo methods at one end and one-step TD methods at the other, and with intermediate methods in between that are often better than either extreme. Eligibility traces also provide a way of implementing Monte Carlo methods online and on continuing problems without episodes.
• Provide a way of implementing Monte Carlo in online fashion (does not wait for the episode to
finish) and on problems without episodes.
• Learning is done continually rather than waiting results at the end of an episode.
Remember that temporal difference and Monte Carlo methods update a state based on future rewards, either by looking directly one step ahead or by waiting for the episode to finish. In TD(0) we look one step ahead, while in Monte Carlo we look ahead until the episode terminates and collect the discounted returns. However, there is a middle ground, in which we look n steps ahead.
So let's define an average return over all these n-step look-aheads as follows:
G(λ, t) = (1 − λ) Σₙ₌₁^∞ λⁿ⁻¹ G(t, t+n)
where G(λ, t) is the weighted average of all the n-step returns G(t, t+n), each of which starts at t and looks n steps ahead, for n going from 1 to infinity.
As in any weighted average, the sum of the weights must be one, which is the case here since (1 − λ) Σₙ₌₁^∞ λⁿ⁻¹ = 1.
Suppose an agent randomly walking in an environment finds a treasure. It then stops and looks backwards, trying to work out what led it to this treasure.
Naturally, the steps close to the treasure deserve more credit for finding it than steps that are miles away; closer locations are more valuable than distant ones and are therefore assigned bigger values.
This is materialized through a vector E called the eligibility traces.
Concretely, the eligibility trace is a function of state, E(s), or of state and action, E(s, a), and holds the decaying credit assigned to each state when updating V(s).
So how do we transition from the forward view to the backward view, and what is the role of eligibility traces in that?
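The backward view can be sketched as a single TD(λ) update step with accumulating traces; the tabular arrays and the example transition below are placeholders:

import numpy as np

n_states, alpha, gamma, lam = 5, 0.1, 0.9, 0.8
V = np.zeros(n_states)    # value estimates
E = np.zeros(n_states)    # eligibility traces, one per state

def td_lambda_step(s, r, s_next):
    # One backward-view TD(lambda) update for the transition (s, r, s_next)
    delta = r + gamma * V[s_next] - V[s]    # TD error
    E[s] += 1.0                             # accumulating trace: bump the visited state
    V[:] += alpha * delta * E               # credit ALL recently visited states
    E[:] *= gamma * lam                     # decay every trace toward zero

td_lambda_step(2, 1.0, 3)    # example transition with reward 1
print(np.round(V, 3), np.round(E, 3))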
GENERALISATION IN RL:
Reinforcement Learning (RL) could be used in a range of applications such as autonomous vehicles
and robotics, but to fulfil this potential we need RL algorithms that can be used in the real world.
Reality is varied, non-stationarity and open-ended, and to handle this algorithms need to be robust to
variation in their environments, and be able to transfer and adapt to unseen (but similar) environments
during their deployment. Generalisation in RL is all about creating methods that can tackle these
difficulties, challenging a common assumption in previous RL research that the training and testing
environments are identical.
The goal in RL is usually described as that of learning a policy for a Markov Decision Process (MDP)
that maximizes some objective function, such as the expected discounted sum of rewards. An MDP is
characterized by a set of states S, a set of actions A, a transition function P and a reward function R.
When we discuss generalization, we can propose a different formulation, in which we wish our policy to perform well on a distribution of MDPs. Using such a setup, we can let the agent train on a set of MDPs and reserve some other MDPs as a test set.
In what way can these MDPs differ from each other? I see three key possible differences:
1. The states are different in some way between MDPs, but the transition function is the same. An
example of this is playing different versions of a video game in which the colors and textures might
change, but the behavior of the policy should not change as a result.
2. The underlying transition function differs between MDPs, even though the states might seem similar. An example of this is some robotic manipulation tasks, in which various physical parameters such as friction coefficients and mass might change, but we would like our policy to adapt to these changes, or otherwise be robust to them if possible.
3. The MDPs vary in size and apparent complexity, but there is some underlying principle that enables
generalizing to problems of different sizes. Examples of this might be some types of combinatorial
optimization problems such as the Traveling Salesman Problem, for which we would like a policy that
can solve instances of different sizes. (I have previously written on RL for combinatorial optimization)
In my opinion, these represent the major sources of generalization challenge, but of course it's possible to create problems that combine more than one such source. In what follows, I am going to focus on the first type.